Opening — Why this matters now

The AI agent market is beginning to look like an overconfident airport duty-free shop: everything claims to be premium, every label promises capability, and somehow the thing you need is still hard to find.

That matters because the next phase of business automation will not be built from one general chatbot sitting politely in a browser tab. It will involve agent ecosystems: finance agents, customer-support agents, coding agents, compliance agents, research agents, scheduling agents, procurement agents, and a thousand microscopic “I can do that” assistants wrapped in glossy product pages.

The practical question is brutally simple: when a business has a real task, how does it find the right agent?

The paper behind this article, AgentSearchBench: A Benchmark for AI Agent Search in the Wild, treats that question as a search and ranking problem — but with a twist. Searching for agents is not the same as searching for documents, APIs, or tools. A document only needs to match the query. A tool usually has a defined function. An agent, unfortunately, has ambition.

Agent capability is compositional, context-dependent, and often only visible after execution. Two agents may describe themselves almost identically and perform very differently. Another agent may describe itself badly and still solve the task. In other words: marketing copy is not a benchmark. Shocking, I know.

AgentSearchBench formalizes this problem using nearly 10,000 real-world agents, execution-grounded relevance labels, and experiments across executable task queries and high-level task descriptions.[1] Its central lesson is clear: agent discovery needs behavioral evidence. Retrieval based only on text descriptions is a fragile foundation for enterprise automation.

Background — Context and prior art

Traditional search assumes that relevance can be inferred from content. If a document contains the right terms or has the right semantic embedding, it probably belongs somewhere near the top of the results. Tool retrieval works similarly: a tool has a name, description, schema, input-output pattern, and a relatively bounded purpose.

Agents are messier.

An agent may include a language model, tools, memory, planning behavior, external APIs, browsing, code execution, multimodal input, and hidden prompt logic. Its “capability” is not merely what its profile says. Capability emerges from how it behaves when asked to solve a task.

This is why agent search differs from conventional retrieval:

| Search target | What is indexed | What relevance usually means | Why agent search is harder |
|---|---|---|---|
| Document | Text content | Topical match | Static content is usually enough |
| API/tool | Description, schema, endpoint | Functional match | Functionality is bounded and explicit |
| LLM/model | Model card, benchmark scores | Capability estimate | Broad capability, but usually evaluated as a model |
| AI agent | Description, tools, examples, behavior | Task performance under execution | Capability is compositional, inconsistent, and task-dependent |

Existing tool and agent benchmarks often assume a cleaner world: well-specified tools, executable queries, controlled candidate pools, or small numbers of agents. AgentSearchBench deliberately walks into the messier room. Its agent collection comes from public agent ecosystems, including the GPT Store, Google Cloud Marketplace, and AgentAI Platform. The result is not laboratory purity; it is closer to the marketplace businesses will actually face.

The paper compares AgentSearchBench with earlier retrieval and agent benchmarks:

| Benchmark | Target | Candidate count | Realistic source? | Task type |
|---|---|---|---|---|
| ToolBench | Tool | 16,464 | Yes | Executable |
| ToolRet | Tool | 43,215 | Yes | Executable |
| TREC 2025 | LLM | 1,131 | Yes | Executable |
| AgentSquare | Agent | 16 | No | Executable |
| OKC Bench | Agent | 127 | No | Executable |
| AgentSearchBench | Agent | 9,759 | Yes | Executable + non-executable |

The important addition is not merely size. It is the decision to evaluate agents using execution-grounded performance rather than textual similarity. That turns agent search from “which profile sounds relevant?” into “which agent actually performs?”

A small distinction. Also the whole point.

Analysis — What the paper does

AgentSearchBench frames agent discovery as two connected problems (a minimal sketch of this two-stage shape follows the list):

  1. Retrieval: find candidate agents from a large repository.
  2. Reranking: order the retrieved candidates according to expected task performance.
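
Here is a minimal Python sketch of that two-stage shape, assuming a toy term-overlap retriever and an injected scoring function standing in for real execution evidence. All names are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    agent_id: str
    description: str

def retrieve(query: str, repository: list[AgentProfile], k: int = 20) -> list[AgentProfile]:
    """Stage 1: recall candidates cheaply. Naive term overlap stands in
    here for a dense or tool-aware retriever."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(a.description.lower().split())), a) for a in repository]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [a for _, a in scored[:k]]

def rerank(query: str, candidates: list[AgentProfile], expected_perf) -> list[AgentProfile]:
    """Stage 2: reorder the shortlist by a richer signal, ideally expected
    task performance rather than textual similarity."""
    return sorted(candidates, key=lambda a: expected_perf(query, a), reverse=True)
```

The interesting design question is what `expected_perf` should be. The rest of the paper argues it should be grounded in execution, not in description similarity.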

The benchmark evaluates both under two kinds of user input.

The first is an executable task query: a concrete instruction that can be run and judged. For example, “summarize this article into a hierarchical mind map.” The second is a high-level task description: a broader, non-executable goal such as “monitor, summarize, and compare news and web sources.” This is much closer to how business users speak. They describe outcomes, not clean benchmark prompts.

The paper’s formulation is elegant because it does not pretend that a high-level task can be judged directly. Instead, each task description is associated with multiple executable task queries. Relevance is then measured through performance across those concrete instances. That matters because a business does not want an agent that succeeds once by accident. It wants an agent with repeatable capability.
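
Read that way, description-level relevance is an aggregate of execution scores over the description's concrete queries. A minimal sketch, assuming scores in [0, 1] and mean aggregation (the paper's exact aggregation may differ):

```python
from statistics import mean

def description_relevance(agent_scores: dict[str, list[float]]) -> dict[str, float]:
    """Map each agent to its mean execution score across the executable
    queries derived from one high-level task description."""
    return {agent: mean(scores) for agent, scores in agent_scores.items()}

# One lucky success is not repeatable capability:
scores = {
    "agent_a": [0.9, 0.1, 0.2],   # succeeded once, mostly fails
    "agent_b": [0.7, 0.8, 0.75],  # consistently adequate
}
print(description_relevance(scores))  # agent_b wins on repeatability
```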

The benchmark construction pipeline has four important steps:

| Step | What happens | Why it matters |
|---|---|---|
| Agent collection | Gather real-world agents from public platforms | Captures messy documentation, overlapping capabilities, and inconsistent metadata |
| Task query construction | Generate executable tasks from agent documentation | Produces concrete tasks that can be run and evaluated |
| Task description construction | Abstract higher-level objectives from clusters of task queries | Simulates realistic business requests that are not directly executable |
| Relevance annotation | Execute agents and judge outputs with performance scores | Grounds relevance in behavior, not description text |

The paper also standardizes heterogeneous agent profiles into a unified schema. This schema includes metadata, capability descriptions, usage guidance, and availability constraints. That is a useful design move for enterprises because agent discovery is not just about capability. It is also about provenance, versioning, access, pricing, modality, and deployment feasibility.
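
As a sketch, such a unified profile might look like the following. The field names are illustrative stand-ins, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedAgentProfile:
    # Metadata: identity and provenance
    agent_id: str
    provider: str
    version: str
    # Capability description: what the agent claims to do
    description: str
    tools: list[str] = field(default_factory=list)
    modalities: list[str] = field(default_factory=list)
    # Usage guidance: how to invoke it well
    usage_examples: list[str] = field(default_factory=list)
    # Availability constraints: can it actually be deployed here?
    pricing: str = "unknown"
    executable_interface: bool = False
```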

Here is the benchmark at a glance:

| Benchmark statistic | Value |
|---|---|
| Total agents | 9,759 |
| Agents with executable interfaces | 7,867 |
| Total tasks | 3,211 |
| Single-agent task queries | 2,452 |
| Multi-agent task queries | 500 |
| High-level task descriptions | 259 |
| Average executable queries per description | 10 |
| Average evaluated agents per query | 20 |
| Total execution runs | 66,740 |

This design gives the benchmark its business relevance. It tests not just whether a search model can find something plausible, but whether it can find agents that actually complete the work.

Findings — Results with visualization

The paper’s results are not flattering to simple agent search. They are useful precisely because they are inconvenient.

Finding 1: Text retrieval finds plausible agents, not necessarily competent ones

Across executable task queries, tool-aware retrieval methods perform best. That makes sense: concrete tasks look closer to tool-use problems, so tool-specialized retrievers have an advantage. On high-level task descriptions, dense retrievers become more competitive because they are better at semantic abstraction.

But the absolute numbers show the problem. Even the best retrieval methods struggle to recover complete capability sets, especially for abstract descriptions.

| Retrieval setting | Strong method highlighted in the paper | NDCG@5 | Completeness@20 | Business interpretation |
|---|---|---|---|---|
| Executable task query | ToolRet | 37.52 | 57.53 | Concrete prompts are searchable, though far from solved |
| High-level task description | BGE-Large v1.5 | 23.08 | 3.37 | Abstract business goals break ordinary retrieval |
| High-level task description | ToolRet | 21.15 | 3.37 | Tool-aware search also struggles when requirements are implicit |

Completeness is the enterprise-sensitive metric here. A search result can look relevant and still fail to cover all required subtasks. For a business workflow, partial capability is not always partial success. Sometimes it is just a politely delayed failure.

Consider an AI procurement workflow. An agent may summarize vendor contracts well, but fail to flag renewal clauses. Another may extract price tables, but miss indemnity language. Both are “contract agents.” Neither should be deployed alone for contract review.

Agent search must therefore optimize for coverage, not merely topical similarity.
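
One way to make coverage concrete is to treat each task as a set of required subtasks and score a ranked list by how much of that set the top-k agents jointly cover. A minimal Completeness@k sketch, assuming per-agent subtask coverage is known from execution results (the paper's metric may differ in detail):

```python
def completeness_at_k(ranked_agents: list[str],
                      covered: dict[str, set[str]],
                      required: set[str],
                      k: int) -> float:
    """Fraction of required subtasks jointly covered by the top-k agents."""
    union: set[str] = set()
    for agent in ranked_agents[:k]:
        union |= covered.get(agent, set())
    return len(union & required) / len(required)

# The contract-review example above, in miniature:
required = {"summarize_contract", "flag_renewal", "extract_prices", "check_indemnity"}
covered = {
    "contract_bot": {"summarize_contract", "extract_prices"},
    "legal_scan":   {"flag_renewal"},
}
print(completeness_at_k(["contract_bot", "legal_scan"], covered, required, k=2))  # 0.75
```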

Finding 2: Reranking helps, but ordering is not the same as assurance

Reranking improves the ordering of candidate agents. Larger rerankers and LLM-based rankers can infer latent requirements better than simple similarity methods, especially under high-level task descriptions.

Yet the benchmark shows that even strong rerankers often fail on completeness. The paper reports that GPT-5.2-based RankGPT achieves strong NDCG on task descriptions, but completeness at shallow ranks remains poor. Translation: the ranking looks intelligent, but the selected set may still not cover the full job.

| Reranking observation | What it means |
|---|---|
| Rerankers improve NDCG over random ordering | Better reasoning helps order plausible candidates |
| LLM-based reranking performs strongly on high-level descriptions | Larger models can infer unstated requirements |
| Completeness remains limited | Better ordering does not guarantee full workflow coverage |
| Oracle rankings remain far ahead | Many strong agents are still buried too low |

This is the semantic-performance gap at the center of the paper. Documentation-based matching does not reliably predict execution performance. Agent profiles are a weak proxy for competence.

For business automation, that gap becomes operational risk. A workflow orchestrator that picks agents based on profile similarity may route work to agents that sound correct but fail under real conditions. That is not automation. That is outsourcing decision quality to a brochure.

Finding 3: Behavioral probing helps when probes are diagnostic

The most practically important part of the paper is its execution-aware probing analysis.

The authors test whether lightweight behavioral signals can improve ranking. They consider two sources of signal: richer indexing that includes usage examples, and explicit probing, where agents are asked probing queries and their responses become ranking signals.

The result: behavioral evidence helps. Full-document indexing generally improves performance over description-only indexing, suggesting that usage examples contain valuable behavioral clues. Explicit probes can also improve reranking, especially when probe responses vary enough to distinguish stronger agents from weaker ones.

| Method | Baseline NDCG@5 | With probing | Relative change |
|---|---|---|---|
| BGE Reranker v2 | 57.93 | 58.16 | +0.40% |
| Tool-Rank 8B | 60.82 | 61.71 | +1.46% |
| Qwen Reranker 4B | 60.96 | 61.91 | +1.56% |
| RankGPT (GPT-5.2) | 61.25 | 59.60 | -2.69% |

The gains are not magical. Good. Magic is usually what vendors call variance before finance calls it budget leakage.

The lesson is more sober: execution signals are useful, but they need to be designed carefully. Low-variance probes do not help much because every agent responds similarly. Useful probes are diagnostic. They expose behavioral differences.
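
That points to a simple filter when assembling a probe set: keep only probes whose scores actually spread agents apart. A minimal sketch, assuming per-agent scores are already available for each candidate probe (the threshold is illustrative, not the paper's procedure):

```python
from statistics import pvariance

def diagnostic_probes(probe_scores: dict[str, list[float]],
                      min_variance: float = 0.01) -> list[str]:
    """Keep probes whose per-agent score variance is high enough to
    discriminate between agents; low-variance probes carry little signal."""
    return [probe for probe, scores in probe_scores.items()
            if pvariance(scores) >= min_variance]

probes = {
    "greet_user":      [0.9, 0.9, 0.9, 0.9],  # every agent passes: useless
    "parse_edge_case": [0.9, 0.2, 0.7, 0.1],  # spreads agents apart: keep
}
print(diagnostic_probes(probes))  # ['parse_edge_case']
```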

That insight maps directly onto enterprise deployment. If a company is building an internal agent marketplace, it should not simply index descriptions. It should maintain a library of diagnostic test tasks, run agents against them, store results, and use those results in routing decisions.

Implications — What this means for business automation

AgentSearchBench points toward a future where agent marketplaces need something closer to agent due diligence.

A business does not need 500 agents that all claim to “automate operations.” It needs a reliable way to answer five questions:

  1. What can this agent actually do?
  2. Under what task conditions does it fail?
  3. Which subtasks does it cover, and which does it leave exposed?
  4. How stable is its performance over time?
  5. Is it safe, compliant, affordable, and auditable enough for this workflow?

The paper does not solve all of that. It does something more useful: it defines the missing layer. Agent discovery should combine retrieval, reranking, execution evaluation, coverage measurement, and behavioral probing.

For an enterprise, that suggests a practical architecture:

| Layer | Purpose | Example business implementation |
|---|---|---|
| Agent registry | Store identity, version, provider, access, model, tools, modality | Internal catalog of approved agents and external candidates |
| Static index | Search descriptions, tags, examples, documentation | First-pass retrieval for candidate selection |
| Task library | Define executable probes for common workflows | Invoice extraction tests, support-ticket triage tests, compliance memo tests |
| Execution evaluator | Score agent outputs using rubrics, human review, or LLM judges | Accuracy, completeness, risk flags, style compliance, escalation quality |
| Coverage model | Track which subtasks are satisfied | Ensures multi-step workflows are not covered by one superficially relevant agent |
| Routing policy | Select agents based on performance, cost, latency, and risk | “Use Agent A for extraction, Agent B for exception review, human approval for high-risk cases” |
| Monitoring loop | Re-test agents after updates and production failures | Prevents silent degradation after model, prompt, or provider changes |
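
The routing policy layer is where the execution evidence pays off. A minimal sketch of a performance-aware router, assuming stored evaluation scores and simple cost and risk gates (all names and thresholds illustrative):

```python
from dataclasses import dataclass

@dataclass
class AgentRecord:
    agent_id: str
    score: float            # measured execution performance on this task type, in [0, 1]
    cost_per_call: float    # in dollars
    high_risk_approved: bool

def route(records: list[AgentRecord], min_score: float = 0.7,
          budget: float = 0.05, high_risk: bool = False) -> str:
    """Pick the best-performing agent that clears the performance, cost,
    and risk gates; fall back to a human when nothing qualifies."""
    eligible = [r for r in records
                if r.score >= min_score
                and r.cost_per_call <= budget
                and (r.high_risk_approved or not high_risk)]
    if not eligible:
        return "escalate_to_human"
    return max(eligible, key=lambda r: r.score).agent_id
```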

This turns agent selection from a one-time search problem into a living evaluation system. That is where the real ROI sits.

A small firm may not need a full benchmark suite. But even a modest version is better than blind selection:

| Maturity level | Agent selection practice | Risk profile |
|---|---|---|
| Level 0 | Choose agents by name, popularity, or profile description | Decorative automation; high failure risk |
| Level 1 | Search by description and manually test a few examples | Better, but ad hoc and hard to scale |
| Level 2 | Maintain standard test tasks and output rubrics | Repeatable validation for common workflows |
| Level 3 | Use execution results in agent routing and orchestration | Performance-aware automation |
| Level 4 | Continuously monitor agent behavior, drift, cost, and compliance | Operationally governed agent ecosystem |

The sharp business implication is that agent marketplaces will need to move from discovery UX to capability assurance. Search bars are not enough. Star ratings are not enough. “Powered by advanced AI” is not enough, though it remains a useful warning label for people who enjoy disappointment.

Where the paper is especially useful

The paper is valuable because it addresses a problem that many agent discussions politely avoid: selection.

Most agent writing focuses on what agents can do after they are chosen. But enterprise workflows require choosing, combining, replacing, and auditing agents. That is a routing problem, a governance problem, and an evaluation problem.

AgentSearchBench is especially useful in three ways.

First, it separates semantic relevance from execution relevance. This distinction should become standard in agent procurement and orchestration. An agent that sounds relevant is not necessarily useful.

Second, it introduces high-level task descriptions as first-class search inputs. That is critical because managers do not usually ask for “a single-agent executable query with fully specified inputs.” They ask for outcomes: reduce claims backlog, summarize weekly sales risks, monitor supplier issues, prepare audit evidence. The benchmark acknowledges that abstraction is not an edge case. It is the normal case.

Third, it highlights completeness. In real workflows, missing one subtask can undermine the whole chain. A customer-support agent that classifies sentiment but misses refund eligibility is not “mostly correct.” It is a future escalation ticket wearing a nice hat.

Limits and open questions

The paper is strong, but its own design points to hard unresolved issues.

The first is evaluation dependency. AgentSearchBench uses LLM-as-judge scoring and validates it against human judgments. That is reasonable for scale, but enterprise systems will still need domain-specific rubrics, audit logs, and human review for high-risk tasks.

The second is execution cost. Probing thousands of agents across many tasks is expensive. Businesses will need selective probing strategies: test high-risk workflows more deeply, sample low-risk workflows more lightly, and rerun tests when agents change.
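
A minimal sketch of such a risk-tiered probing budget, assuming each workflow carries a risk label (tier names and probe counts are illustrative):

```python
def probing_plan(workflow_risk: dict[str, str]) -> dict[str, int]:
    """Assign probe depth by risk tier: deep testing for high-risk
    workflows, light sampling for low-risk ones."""
    depth = {"high": 50, "medium": 15, "low": 3}
    return {wf: depth[tier] for wf, tier in workflow_risk.items()}

plan = probing_plan({
    "contract_review": "high",
    "ticket_triage":   "medium",
    "meeting_notes":   "low",
})
print(plan)  # {'contract_review': 50, 'ticket_triage': 15, 'meeting_notes': 3}
```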

The third is privacy and security. Public benchmark tasks are one thing. Real enterprise tasks may involve contracts, customer data, medical records, invoices, or employee information. Execution-aware search must be designed with data boundaries, sandboxing, and logging from the start.

The fourth is drift. Agents are not static artifacts. Providers update models, prompts, tools, and policies. An agent that performed well last month may behave differently today. Capability assurance must therefore be continuous, not ceremonial.
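
Drift checks can start small: re-run a fixed probe set on a schedule and compare against a stored baseline. A minimal sketch, assuming per-agent score histories (the tolerance is illustrative):

```python
from statistics import mean

def drifted(baseline: list[float], current: list[float],
            tolerance: float = 0.05) -> bool:
    """Flag an agent whose mean probe score dropped more than `tolerance`
    below its stored baseline; a flag should trigger re-evaluation, not panic."""
    return mean(baseline) - mean(current) > tolerance

print(drifted(baseline=[0.82, 0.80, 0.85], current=[0.71, 0.69, 0.74]))  # True
```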

These are not objections to the paper. They are reasons the paper matters.

Conclusion — The search engine for agents needs a test bench

AgentSearchBench makes a simple but consequential argument: AI agent search cannot rely on description matching alone. The right unit of evidence is not what an agent says it can do. It is what the agent does when tested.

For businesses, this changes how agentic automation should be evaluated. Selecting agents is not a procurement checkbox or a marketplace browsing exercise. It is an operational discipline: define tasks, test behavior, measure coverage, rerank by execution, monitor drift, and keep humans in the loop where failure has real cost.

The agent economy will not be won by the firms with the longest list of agents. It will be won by the firms that know which agents can actually perform, under which conditions, at what cost, and with what residual risk.

Search is useful. Receipts are better.

Cognaptus: Automate the Present, Incubate the Future.


  1. Bin Wu, Arastun Mammadli, Xiaoyu Zhang, and Emine Yilmaz, “AgentSearchBench: A Benchmark for AI Agent Search in the Wild,” arXiv:2604.22436, 2026. https://arxiv.org/abs/2604.22436