The Big Picture

Co-evolving many evaluation functions and weighting their votes by how well they agree with one another produces a stable objective for automated discovery. In the reported run, it helped an AI pipeline find a solver that is ~67× faster on large instances.

The Evidence

A four-agent AI loop that co-evolves solution designs and many proxy evaluation functions produced a robust consensus objective that resisted misleading metrics. Using correlation-weighted voting (giving more weight to objectives that agree with others) and an age-decay on older objectives, the system stayed stable while objectives shifted as research progressed. Applied to a hard combinatorial problem, the pipeline explored hundreds of designs and discovered a solver whose scaling improved dramatically, yielding a large speedup on big instances.

Data Highlights

1. Explored 414 distinct solver designs under search guidance.
2. Used 42 co-evolving, LLM-generated proxy objectives to form the consensus ranking.
3. Best solver reduced scaling from ~N^2.51 to ~N^1.33, giving a ~67× speedup at N = 1810 (median steps: 6,369,516 → 95,503).
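
As a quick sanity check on the last highlight, the short sketch below recomputes the speedup from the two reported median step counts and shows how the gap would change with problem size if both fitted power laws continued to hold. The helper name `speedup_growth` is hypothetical, and extrapolating beyond N = 1810 is an assumption, not a result from the source.

```python
# Median step counts at N = 1810, as reported in the source.
baseline_steps = 6_369_516  # baseline DMM solver
best_steps = 95_503         # best discovered solver (design 340)

speedup = baseline_steps / best_steps
print(f"speedup at N = 1810: {speedup:.1f}x")

# Fitted scaling exponents, as reported in the source.
baseline_exp = 2.51
best_exp = 1.33


def speedup_growth(n_from: float, n_to: float) -> float:
    """Factor by which the speedup changes from n_from to n_to.

    Assumption: both solvers keep following their fitted power laws,
    so the step ratio scales as N ** (baseline_exp - best_exp).
    """
    return (n_to / n_from) ** (baseline_exp - best_exp)


# Hypothetical extrapolation: doubling the problem size.
growth = speedup_growth(1810, 3620)
print(f"speedup would grow by a factor of about {growth:.2f} if N doubled")
```

The ratio of the two medians comes out just under 67, which matches the rounded figure in the highlight.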

What This Means

Engineers building automated research or multi-agent pipelines can use consensus objective aggregation to reduce reward-hacking and get steadier progress. Technical leaders evaluating AI-driven discovery tools will find this approach useful for improving reliability and for tracking which evaluation signals truly matter. Researchers in algorithm discovery can adopt the closed-loop design to explore large design spaces more efficiently.

Key Figures

Figure 1: Framework overview. The system consists of four LLM agents in an iterative cycle. Starting from a high-level human-designed research goal, the meta-agent sets the research strategy, guiding objective generation and analyzing objective quality. The objective agent proposes proxy objective functions reflecting different aspects of solution quality; these feed into a consensus objective that aggregates rankings through correlation-weighted voting. The planner agent uses Monte Carlo Graph Search (MCGS) under the consensus objective to identify strategic research directions. The designer agent turns those directions into concrete implementations, which are tested by a multi-fidelity execution engine that allocates computational budget to the most promising designs. A hyperparameter optimization module periodically tunes the leading design’s parameters.
Figure 2: Consensus objective aggregation. (a) Kendall’s τ correlation matrix across 42 LLM-generated objectives. Most objectives are positively correlated (red), but a few outliers have negative correlations with the majority (blue), showing that LLM-generated objectives can indeed be misleading sometimes. (b) Consensus weights after correlation-weighted voting with age decay (λ = 0.9). Newer objectives are weighted higher due to age decay, while some older objectives still carry a large weight from the correlated majority. Outlier objectives with low agreement are suppressed. (c) PCA of objectives based on pairwise τ correlations, colored by objective ID (creation order); larger dots denote higher weight. Earlier objectives form a small, isolated cluster, while later objectives converge towards another larger cluster, reflecting the shift in the research goal as research progresses. Outlier objectives are visually separated and get suppressed in the consensus ranking.
Figure 3: Algorithm discovery for 3-SAT DMM solvers. (a) Design genealogy graph. Nodes are solver designs colored by ID (chronological); larger nodes rank higher under the consensus. Directed edges represent the reference weights. Sub-tree structures arise from merging several exploratory workspaces, with diverse lineages converging toward the best design 340. (b) Scaling of median solution steps with problem size N. The baseline DMM solver scales as N^(2.51 ± 0.06); the best solver (design 340) scales as N^(1.33 ± 0.07), a ~67× speedup at N = 1810 (95,503 vs. 6,369,516 median steps). Each median is computed over 100 random 3-SAT instances at fixed N; reported uncertainties on the exponents and the shaded 1σ envelopes around the fit lines come from the log-log linear-regression covariance. Background points cover all 414 designs, colored by ID to show progressive improvement across the search. Note that earlier designs frequently perform better on smaller N, but later designs achieve much stronger performance on larger N.
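
The Figure 3 caption says the exponents and their uncertainties come from a log-log linear regression. A minimal sketch of that kind of fit follows, assuming ordinary least squares on log(N) versus log(median steps). The function name `fit_power_law` and the data points are hypothetical: the points are synthetic values generated from an exact power law, not measurements from the paper.

```python
import math


def fit_power_law(sizes, median_steps):
    """Fit log(steps) = a + b * log(N) by ordinary least squares.

    Returns (b, stderr_b): the scaling exponent and its standard error,
    taken from the usual regression variance of the slope.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(s) for s in median_steps]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(ys) / count

    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual variance with (count - 2) degrees of freedom.
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)
    )
    slope_variance = residual_ss / (count - 2) / sxx
    return slope, math.sqrt(slope_variance)


# Synthetic illustration only: steps follow 3 * N**1.33 exactly.
sizes = [100, 200, 400, 800, 1600]
steps = [3.0 * n ** 1.33 for n in sizes]

slope, stderr = fit_power_law(sizes, steps)
print(f"fitted exponent: {slope:.3f} +/- {stderr:.3g}")
```

With noise-free synthetic data the standard error is essentially zero. With real medians, it would play the role of the ±0.06 and ±0.07 quoted in the caption.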
Figure S1: Gating components of the release term in design 340. (a) Long-term memory gate gate_{x_l}: activates only when x_{l,m} exceeds the threshold x_{l,thr} = 1000 (dashed red line). (b) Weak-satisfaction band: peaks when c_m falls between η and γ; note that the optimized values η = 0.336 > γ = 0.282 produce a narrow, low-amplitude response (peak ≈ 0.16). (c) Bounded upward push: linearly decreasing in x_{s,m}, reaching zero at x_s^* = 0.091. (d) Tail-safety gate at three levels of gate_{x_l}, showing how the x_l-dependent shift extends the safe operating range. (e) Amplitude normalization with power-law decay and clause-state-driven floor at three levels of floor_gate. (f) Combined release magnitude in the (c_m, x_{l,m}) plane at x_{s,m} = 0.05, showing the narrow region where all gates are simultaneously open.

Considerations

Results come from a single end-to-end run on a specific planted 3-SAT problem, so run-to-run variance and cross-domain generality are not yet quantified. All objectives were generated by the same underlying language model, which can create shared blind spots: when every objective shares a bias, agreement reinforces it rather than correcting it. The system also required structured guardrails to avoid failure modes like goal drift or evaluation-script tampering, so fully autonomous deployments may need stronger safeguards.

Methodology & More

A four-role AI loop (meta-agent, objective agent, planner, designer) iteratively searches design space while evolving how it judges designs. The objective agent produces many proxy objective functions (different ways to score experiments). Pairwise agreement between objectives is measured with a rank correlation (Kendall’s tau, a metric that checks how similarly two rankings order items); objectives that agree with the majority get higher weight, and older objectives fade via an exponential age decay. The weighted preferences are combined with a ranked-voting method (a Borda count) to produce a single consensus ranking that guides a Monte Carlo search for promising directions. Promising designs undergo multi-fidelity testing and periodic hyperparameter tuning to focus compute on the best candidates.

Applied to a hard combinatorial benchmark (3-SAT), this meta-optimization loop produced concrete gains: over 414 designs and 42 objectives, the system found a solver (design 340) whose median step-scaling moved from roughly N^2.51 to N^1.33, yielding about a 67× speedup at the largest tested size. The consensus mechanism suppressed outlier or misleading objectives and allowed the evaluation criteria to shift smoothly as the search matured. While promising for broader automated discovery, the approach still needs repeated runs across domains to measure robustness, and teams should watch for shared model biases and the trade-offs between structured safety and exploratory freedom.
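
The aggregation step described above (Kendall's tau agreement, exponential age decay, weighted Borda count) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the exact weighting formula is not given in this summary, so the sketch assumes that an objective's weight is its mean tau with the other objectives, clipped at zero, multiplied by λ raised to its age. All function names and the toy scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length score lists."""
    pairs = list(combinations(range(len(a)), 2))
    balance = 0  # concordant pairs minus discordant pairs
    for i, j in pairs:
        product = (a[i] - a[j]) * (b[i] - b[j])
        balance += (product > 0) - (product < 0)
    return balance / len(pairs)


def consensus_weights(scores, decay=0.9):
    """scores[k][d] = score objective k gives design d (k in creation order).

    Assumed rule: weight = max(0, mean tau with other objectives) * decay**age.
    """
    n_obj = len(scores)
    weights = []
    for k in range(n_obj):
        taus = [kendall_tau(scores[k], scores[m]) for m in range(n_obj) if m != k]
        agreement = max(0.0, sum(taus) / len(taus))
        age = n_obj - 1 - k  # newest objective has age 0
        weights.append(agreement * decay ** age)
    total = sum(weights)
    if total == 0:
        return [1.0 / n_obj] * n_obj  # fall back to uniform weights
    return [w / total for w in weights]


def borda_points(objective_scores):
    """Each design earns one point per design it strictly outscores."""
    return [sum(other < s for other in objective_scores) for s in objective_scores]


def consensus_ranking(scores, decay=0.9):
    """Return (design indices from best to worst, normalized weights)."""
    weights = consensus_weights(scores, decay)
    totals = [0.0] * len(scores[0])
    for weight, objective_scores in zip(weights, scores):
        for design, points in enumerate(borda_points(objective_scores)):
            totals[design] += weight * points
    order = sorted(range(len(totals)), key=lambda d: -totals[d])
    return order, weights


# Toy example: four designs, four objectives; objective 2 is a reversed outlier.
scores = [
    [0.1, 0.4, 0.6, 0.9],  # objective 0 (oldest)
    [1, 3, 2, 4],          # objective 1
    [4, 3, 2, 1],          # objective 2 (disagrees with the majority)
    [10, 20, 30, 40],      # objective 3 (newest)
]
ranking, weights = consensus_ranking(scores)
print("ranking (best first):", ranking)
print("weights:", [round(w, 3) for w in weights])
```

In this toy case the reversed objective gets zero weight because its mean agreement is negative, and the newest objective outweighs the oldest one even though both agree equally with the rest. Clipping negative agreement at zero is one simple way to suppress outliers; other schemes, such as a softmax over agreement, would be equally plausible.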
Credibility Assessment:

Includes Massimiliano Di Ventra, a well-known established researcher; mixed author h-indices but presence of a senior, cited researcher raises credibility.