
At a Glance

Parallel pools of simple agents find quick, reliable improvements at scale, while small coordinated expert teams produce fewer but more diverse and complex code changes; combining both yields the best results.

What They Found

Parallel "subagents" that run many short experiments in isolation deliver rapid, stable improvements and the largest volume of valid proposals under tight time limits. Coordinated expert teams spend more time deliberating before running experiments, so they complete fewer successful runs but often push through multi-part, structural changes that single agents or subagent pools miss. A single-agent baseline quickly plateaus and struggles to escape local tweaks without external coordination.
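The subagent pattern described above can be sketched as a pool of isolated workers plus a coordinator that keeps only improving results. This is a minimal illustration, not the paper's implementation: `run_subagent` and its random stub stand in for a real sandboxed experiment.

```python
import concurrent.futures
import random

def run_subagent(worker_id: int, baseline: float) -> float:
    """Hypothetical short experiment in an isolated sandbox: propose a small
    tweak and return the resulting validation metric (lower is better).
    A random perturbation stands in for a real training run."""
    return baseline * random.uniform(0.95, 1.05)

def subagent_round(baseline: float, n_workers: int = 8) -> float:
    """Run n_workers isolated proposals in parallel; a central coordinator
    keeps only the best result, and only if it improves on the baseline."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(run_subagent, range(n_workers), [baseline] * n_workers))
    best = min(results)
    return best if best < baseline else baseline  # reject non-improving proposals
```

Because each worker runs in isolation, one failed or non-improving proposal never blocks the others, which is the property the article credits for the subagents' stability under tight budgets.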

Data Highlights

1. Under a 300-second budget and 50 rounds, subagent mode produced 7 effective improvements while agent teams produced 3.
2. Experiments used two fixed time budgets (300 s and 600 s) to compare early-stage speed against deeper deliberation.
3. A prior large multi-agent research system ran for 417 hours, consumed 21.6 billion tokens, and cost about $186,000 to generate 166 papers, illustrating the large cost gap between lightweight and heavy setups.

What This Means

Engineers building automated research or continuous-model-tuning systems should use subagents for broad hyperparameter sweeps and fast validation because they maximize successful, low-risk proposals. Technical leads and researchers aiming for novel algorithmic changes should reserve coordinated expert teams for deeper refactors and coupled architectural edits, accepting higher failure risk and slower throughput. For guidance on iterative planning and coordination, consider the Planning Pattern.
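The routing advice above can be expressed as a small dispatch heuristic. This is a hypothetical sketch; `route_task` and its task fields (`files_touched`, `structural`) are invented for illustration and do not come from the study.

```python
def route_task(task: dict) -> str:
    """Hypothetical routing heuristic: send broad, low-risk searches to the
    subagent pool and coupled, structural edits to an expert team."""
    if task.get("structural", False) or task.get("files_touched", 1) > 1:
        return "expert_team"   # multi-part, coupled changes: slower but deeper
    return "subagent_pool"     # independent tweaks: fast, parallel, low risk
```

In practice the signal deciding the route (estimated coupling, files touched, risk tolerance) would itself come from a planner or triage agent rather than static fields.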

Key Figures

Figure 1: Multi-Agent Coordination Frameworks
Figure 2: Autoresearch Progress
Figure 3: Ratio of each phase.


Yes, But...

Results come from a controlled testbed that optimizes a single training script and may not generalize to all codebases or research domains. Experiments ran on modest GPU hardware and used a specific pair of large language models, so absolute numbers could change with different models or more compute. Communication was limited to text diffs and handoffs; richer inter-agent protocols or safeguards could alter the trade-offs observed. For a discussion of integration and protocol considerations, see the Agent-to-Agent Protocol.

Methodology & More

A reproducible execution testbed compared three organization styles for automated machine-learning research under strict wall-clock limits: a single-agent baseline that proposes and tests sequentially; a subagent architecture with many isolated workers that propose in parallel and a central coordinator that merges wins; and an expert-team setup where specialized agents deliberate before running experiments. The task was to iteratively edit a training script to reduce validation bits-per-byte, accepting only patches that lower that metric. Budgets of 300 and 600 seconds per run enforced a focus on efficiency rather than absolute capability.

Key findings show a clear trade-off. Subagents produce many more valid proposals, recover quickly from individual failures (isolated worktrees prevent one crash from halting progress), and achieve faster early gains, making them ideal for broad, shallow searches. Expert teams, by investing context and calls in pre-execution discussion, produce fewer successful runs, but those runs are more diverse and often combine multiple coordinated changes (for example, attention patterns, scheduling, and embedding sizes in one patch). Stability metrics show that subagents have lower preflight and crash rates, while teams are more fragile because sequential edits compound integration risk.

Practical takeaway: route work dynamically. Use subagents for high-throughput exploration and call in expert teams for high-reward structural innovations, while investing in better communication and validation to reduce team fragility. For robustness concerns like potential deadlocks, consider the Coordination Deadlock.
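The testbed's acceptance rule (keep a patch only if it strictly lowers validation bits-per-byte, stop at the wall-clock budget) can be sketched as a greedy loop. The hook names `eval_metric`, `propose_patch`, and `apply_patch` are assumptions for illustration, not the paper's API.

```python
import time

def optimize(eval_metric, propose_patch, apply_patch, budget_s: float = 300.0):
    """Greedy hill-climbing loop matching the testbed's acceptance rule:
    a patch is kept only if it strictly lowers the validation metric
    (e.g. bits-per-byte), and the search stops at the wall-clock budget.
    eval_metric/propose_patch/apply_patch are caller-supplied hooks."""
    best = eval_metric()                 # score the unmodified training script
    deadline = time.monotonic() + budget_s
    accepted = 0
    while time.monotonic() < deadline:
        patch = propose_patch()
        candidate = eval_metric(patch)   # evaluate the patched script
        if candidate < best:             # accept only strict improvements
            apply_patch(patch)
            best = candidate
            accepted += 1
    return best, accepted
```

Under this rule, the organization styles differ only in how `propose_patch` is produced: many cheap independent proposals (subagents) versus fewer deliberated multi-part patches (expert teams).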
Credibility Assessment

No notable affiliations are listed, and the authors have low h-indices (mostly below 10), indicating limited or still-emerging credibility signals.