
The Big Picture

Structured, courtroom-style debates with iterative, query-driven retrieval raise claim-verification accuracy by about 10 percentage points (to 81.7%), mainly because expanding evidence during debate and mixing model roles prevents premature consensus and offsets individual model errors.

The Evidence

Framing claim checking as an adversarial courtroom — with Plaintiff, Defense, Judge, Critic, and Expert Witness roles — forces focused, contestable arguments and an explicit evidence-admission process. Iterative retrieval that adapts queries each round (progressive retrieval) is the biggest single reason for gains: it finds fresh, relevant evidence and stops debates from getting stuck on the same biased sources. Using a mix of different model types (heterogeneous panels) further improves results because models make complementary mistakes; combining them yields better group decisions than any single model. Adding role-switching and self-reflection also speeds debates and reduces token use while preserving accuracy.
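The progressive-retrieval idea can be sketched as a loop that reformulates the query from the latest debate turns and admits only passages not already in the evidence pool. This is a minimal illustration, not the paper's implementation: the helper names (`reformulate_query`, `progressive_retrieve`) and the keyword-matching retrieval are placeholders standing in for real query rewriting and a real retriever.

```python
# Minimal sketch of progressive retrieval (P-RAG); all helpers are
# illustrative stand-ins, not the paper's actual components.

def reformulate_query(claim, debate_turns):
    """Toy reformulation: fold contested terms from the latest turns
    into the query so each round searches for fresh evidence."""
    contested = {w for turn in debate_turns[-2:] for w in turn.split() if len(w) > 6}
    return claim + " " + " ".join(sorted(contested))

def progressive_retrieve(claim, corpus, debate_turns, evidence_pool):
    """One P-RAG round: retrieve with a debate-conditioned query and
    admit only passages not already in the evidence pool."""
    query = reformulate_query(claim, debate_turns)
    hits = [doc for doc in corpus
            if any(term in doc for term in query.lower().split())]
    fresh = [doc for doc in hits if doc not in evidence_pool]
    evidence_pool.extend(fresh)
    return fresh
```

Because each round conditions retrieval on what was just argued and filters out already-admitted passages, the evidence pool keeps expanding instead of the debate recycling the same sources.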

Data Highlights

1. Aggregate majority-vote accuracy = 81.7% (oracle ceiling 95.8%); overall, structured deliberation produced a +10.0 percentage point gain versus unstructured multi-agent debate.
2. Progressive retrieval (P-RAG) contributed +7.5 percentage points; removing it increased inter-judge agreement (κ from 0.468 to 0.599) but reduced accuracy by 7.5 points, showing confident but wrong convergence without dynamic evidence expansion.
3. Panel design improves efficiency and robustness: heterogeneous model panels gave +3.3 points vs a single judge; self-reflection cut debate rounds by 29% (7.06 → 5.47) and token usage by 17% while keeping accuracy within 0.8 points.
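The panel-diversity effect can be illustrated with a toy majority vote: three judges that each err on a different claim are only 2/3 accurate alone, yet the panel votes correctly on every claim. The judge behaviors and claim labels below are invented for illustration and are not data from the paper.

```python
from collections import Counter

# Toy illustration of complementary errors: each judge is wrong on a
# different claim, so the majority vote recovers the right label.
truth = {"c1": "supported", "c2": "refuted", "c3": "supported"}

judges = [
    {"c1": "refuted",   "c2": "refuted",   "c3": "supported"},  # errs on c1
    {"c1": "supported", "c2": "supported", "c3": "supported"},  # errs on c2
    {"c1": "supported", "c2": "refuted",   "c3": "refuted"},    # errs on c3
]

def majority_vote(claim):
    votes = Counter(j[claim] for j in judges)
    return votes.most_common(1)[0][0]

panel_acc = sum(majority_vote(c) == v for c, v in truth.items()) / len(truth)
solo_accs = [sum(j[c] == v for c, v in truth.items()) / len(truth) for j in judges]
# panel_acc is 1.0 even though every solo judge is only ~0.67 accurate
```

The gain depends on the errors being complementary; three copies of the same model making the same mistakes would vote confidently and wrongly, which is the failure mode heterogeneous panels are meant to avoid.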

What This Means

Engineers building AI agents for high-stakes verification (medical, regulatory, legal) should consider structured adversarial workflows and iterative retrieval to avoid evidence stagnation. Product leaders and evaluators should add role-based checks and model diversity to their evaluation pipelines to reduce confidently wrong outcomes and surface trust signals.

Key Figures

Fig 1: Novelty decay across rounds.
Fig 2: Reflection score trajectories across plateau, judicial, and critic resolution patterns.


Considerations

Computational cost is high: multiple roles, role-switching, and iterative retrieval consume many calls and tokens. Experiments use a curated PubMed abstract corpus and binary COVID-19 claims, so results may not transfer directly to open-web sources, nuanced claims, or non-medical domains. The evaluation enforces a legal-style burden of refutation (inconclusive defaults to supported), which affects how errors manifest and may not match every real-world policy setting.

Methodology & More

The system treats each claim like a court case: the claim is decomposed into independently testable premises, then opposing counsel agents (Plaintiff and Defense) build evidence-backed arguments while Judges, Critics, and Expert Witnesses evaluate admissibility and logical completeness. A progressive retrieval loop (P-RAG) runs each round, reformulating queries based on the debate so new, relevant passages get added to the evidence pool instead of repeatedly reusing the same sources. Agents score argument completeness and self-reflect to decide whether to continue, which supports early exits and reduces wasted rounds.
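The control flow described above can be sketched as a round loop with a self-reflection early exit. This is a hedged sketch: the role names follow the paper, but every helper below is a trivial stub standing in for an LLM agent call, and the completeness scoring and threshold are invented placeholders.

```python
# Sketch of the courtroom debate loop; every helper is a stub for an
# LLM agent call, and all scores/thresholds are illustrative only.

def retrieve_new_evidence(transcript):
    return [f"passage-{len(transcript)}"]           # stub for a P-RAG round

def argue(role, evidence_pool):
    return f"{role}: cites {len(evidence_pool)} passages"

def completeness_score(transcript):
    return min(1.0, 0.25 * len(transcript))         # toy score; grows per round

def refutation_established(transcript):
    return False                                    # stub: no refutation found

def run_case(max_rounds=7, threshold=0.9):
    """Debate a claim round by round, exiting early once the agents
    judge the arguments complete (self-reflection)."""
    evidence_pool, transcript = [], []
    for rnd in range(1, max_rounds + 1):
        evidence_pool += retrieve_new_evidence(transcript)   # expand evidence
        transcript += [argue("plaintiff", evidence_pool),
                       argue("defense", evidence_pool)]
        if completeness_score(transcript) >= threshold:      # early exit
            break
    # Legal-style burden of refutation: inconclusive defaults to supported.
    verdict = "refuted" if refutation_established(transcript) else "supported"
    return verdict, rnd
```

With these stubs the loop exits after two rounds instead of running all seven, which is the mechanism behind the reported 29% reduction in debate rounds; the final default-to-supported line mirrors the burden-of-refutation policy noted under Considerations.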
Credibility Assessment:

One author has a moderate h-index (~16), but affiliations are unspecified and the venue is a preprint (arXiv); overall, a recognized but not top-tier credibility signal.