The Big Picture

Text-based descriptions are a poor proxy for real ability: searching agents by running small, targeted tests (lightweight probes) finds better assistants than matching on documentation alone.

The Evidence

Real agent marketplaces contain many overlapping and partially capable agents, so similar descriptions often hide wide performance differences. Retrieval methods that rely on textual similarity regularly miss high-performing agents, especially when users start from vague, high-level requests. Adding short, execution-based probes (a few tiny test runs) provides behavioral signals that substantially improve ranking and discovery.
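To make the idea concrete, here is a minimal sketch of blending a description-match score with a probe result when reranking candidates. Everything in it is an assumption for illustration: the `Candidate` fields, the example scores, and the 0.7 probe weight are hypothetical and are not taken from AgentSearchBench.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text_similarity: float  # hypothetical description-match score in [0, 1]
    probe_pass_rate: float  # hypothetical fraction of probes passed, in [0, 1]


def blended_score(c: Candidate, probe_weight: float = 0.7) -> float:
    """Weighted mix of text similarity and probe results (the weight is an assumption)."""
    return (1.0 - probe_weight) * c.text_similarity + probe_weight * c.probe_pass_rate


def rerank(candidates: list[Candidate], probe_weight: float = 0.7) -> list[Candidate]:
    """Order candidates from highest to lowest blended score."""
    return sorted(candidates, key=lambda c: blended_score(c, probe_weight), reverse=True)


# Made-up candidates: agent_a has the best description match but the worst probe results.
candidates = [
    Candidate("agent_a", text_similarity=0.92, probe_pass_rate=0.20),
    Candidate("agent_b", text_similarity=0.75, probe_pass_rate=0.90),
    Candidate("agent_c", text_similarity=0.80, probe_pass_rate=0.50),
]
ranked_names = [c.name for c in rerank(candidates)]
text_only_names = [c.name for c in rerank(candidates, probe_weight=0.0)]
```

With these made-up numbers, the text-only ordering puts agent_a first, while the blended ordering moves agent_b to the top. A linear blend is only one simple option; a learned reranker could consume the same two signals.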

Data Highlights

1. AgentSearchBench collects ~9,760 real-world agents, of which 7,867 provide executable interfaces.
2. The benchmark includes 2,952 executable task queries and 259 high-level task descriptions (about 10 queries per description on average).
3. The evaluation ran 66,740 executions (top-20 agents per query) to produce execution-grounded relevance labels for retrieval and ranking.

What This Means

Engineers building agent marketplaces or orchestration systems: incorporate execution signals into search and ranking to find better agents for users. Platform operators and technical leads evaluating third-party assistants: use small, automated probes to validate capability claims before composing or recommending agents.
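As a rough sketch of what a small, automated probe could look like, the snippet below runs a handful of input/check pairs against an agent and reports the fraction passed. The agent is modeled as a plain callable from string to string, which is an assumption; real marketplace agents expose their own interfaces, and `toy_agent` and the probes here are hypothetical stand-ins.

```python
from typing import Callable

# A probe pairs a small input with a checker for the agent's output (hypothetical shape).
Probe = tuple[str, Callable[[str], bool]]


def probe_pass_rate(agent: Callable[[str], str], probes: list[Probe]) -> float:
    """Fraction of probes the agent passes; an exception counts as a failure."""
    if not probes:
        return 0.0
    passed = 0
    for prompt, check in probes:
        try:
            if check(agent(prompt)):
                passed += 1
        except Exception:
            # A crashing agent simply fails this probe.
            pass
    return passed / len(probes)


def toy_agent(prompt: str) -> str:
    """Stand-in agent that claims to uppercase text and rejects empty input."""
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


probes = [
    ("hello", lambda out: out == "HELLO"),
    ("abc", lambda out: out == "abc"),        # fails: the agent uppercases its input
    ("", lambda out: True),                   # fails: the agent raises on empty input
    ("ok", lambda out: out.startswith("O")),
]
rate = probe_pass_rate(toy_agent, probes)
```

In practice each probe execution would also need a timeout and a cost budget, since it calls third-party code.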

Key Figures

Fig 1: Task and Relevance Label Generation Pipeline of AgentSearchBench.
Fig 3: (a) Relevant agents per query.
Fig 4: (a) Golden ranking accumulated agent scores on 2,452 Single-Agent Task Queries.
Fig 5: (a) NDCG@5: Comparison between Realistic and Synthetic Single-agent Task Queries.

Yes, But...

AgentSearchBench is built from public platform agents, and the particular query set and metrics affect which agents look best, so results may not fully generalize to closed or proprietary agents. Running probes requires executing third-party agents, which has cost, latency, and safety implications that must be managed. The study focuses on single-agent discovery and ranking; multi-agent composition and long-running behavior remain open challenges.

Methodology & More

AgentSearchBench is collected from real public marketplaces. It assembles about 9,760 agents (7,867 with runnable interfaces) and creates two query types: executable task queries and higher-level, non-executable task descriptions. For each executable query, the benchmark retrieves a candidate set (the top 20 agents) and runs those agents on task instances, turning the outcomes into graded relevance labels. These execution-grounded labels are used to evaluate retrieval (finding capable agents) and reranking (ordering by quality).

Experiments show a clear gap between semantic similarity and actual task performance: methods that score agents using descriptions alone often fail to surface the best-performing agents, especially when the user's request is abstract. Introducing lightweight behavioral probes (a few short executions designed to reveal capability) produces additional signals that markedly improve ranking quality. The implication for practitioners is practical: add small, automated tests to your discovery pipeline, and index their results, rather than relying solely on documentation or metadata to recommend agents in open ecosystems.
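The figures report NDCG@5, a metric that rewards rankings which place agents with high graded relevance near the top. The sketch below shows one common formulation (linear gain with a log2 rank discount). Whether the paper uses this exact variant is not stated here, so treat the formula choice, and the made-up labels in the example, as assumptions.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: list[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant agent in the pool, so the score is defined as 0
    return dcg_at_k(ranked_relevances, k) / ideal


# Graded labels listed in the order a retriever ranked the agents (made-up values).
description_only = [0, 2, 1, 0, 3]  # the best agent is buried at rank 5
probe_reranked = [3, 2, 1, 0, 0]    # the same agents sorted by observed behavior

ndcg_description = ndcg_at_k(description_only, k=5)
ndcg_probe = ndcg_at_k(probe_reranked, k=5)
```

Here the ideal ordering is computed from the labeled candidate pool itself, which fits a setup where only the retrieved candidates carry execution-grounded labels.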
Credibility Assessment:

The authors are affiliated with University College London (a well-regarded research institution). Their h-indexes are modest, but the strong institutional backing raises credibility even though the paper is an arXiv preprint.