
The Big Picture

Using different language models for each stakeholder drastically reduces artificial agreement in policy simulations, and a separate quality check on each agent’s reasoning further improves outcome reliability while sometimes narrowing diversity.

The Evidence

Running each stakeholder on a different model cuts the tendency of agents to all pick the same option, preserving genuine disagreement. Adding a coherence check (a single stronger model that scores whether each agent’s argument matches its assigned value) further refines results by downweighting low-fidelity agreements. The combination reveals a tradeoff: higher judged quality can slightly reduce surface-level diversity, but it surfaces which perspectives are reliably modeled and which are not.
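As a minimal sketch of the heterogeneous setup, the snippet below gives each stakeholder perspective its own model from a pool. The function name `assign_models`, the perspective labels, and the model identifiers are all hypothetical placeholders, not names from the paper.

```python
def assign_models(perspectives, model_pool):
    """Map each perspective to a distinct model (illustrative helper)."""
    # Drop duplicate model names while keeping the pool's order.
    distinct_models = list(dict.fromkeys(model_pool))
    if len(distinct_models) < len(perspectives):
        raise ValueError("need at least one distinct model per perspective")
    # zip stops at the shorter sequence, so extra models are simply unused.
    return dict(zip(perspectives, distinct_models))


# Hypothetical labels; the source does not list perspective or model names.
perspectives = ["safety", "autonomy", "equity"]
model_pool = ["model-a", "model-b", "model-c", "model-d"]
assignment = assign_models(perspectives, model_pool)
```

The homogeneous baseline corresponds to mapping every perspective to the same model, which is the condition the heterogeneous assignment is compared against.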

Data Highlights

1. First-choice concentration fell from 70.9% to 46.1% when moving from one shared model to a different model per perspective (child welfare scenario).
2. Coherence validation reduced first-choice concentration further from 46.1% to 40.8% and lowered Borda margin from 0.218 to 0.184.
3. Effective distinct perspectives rose from 0.74 to 1.07 with heterogeneity, and to 1.12 after coherence weighting (statistically significant, medium–large effects).
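The summary does not define these metrics, so the sketch below uses assumed definitions: first-choice concentration as the share of evaluators whose top pick is the most common top pick, and Borda margin as the gap between the two highest normalized Borda scores. The paper may compute them differently; the function names are illustrative.

```python
from collections import Counter


def first_choice_concentration(rankings):
    """Share of ballots whose top pick is the most common top pick (assumed definition)."""
    tops = Counter(ranking[0] for ranking in rankings)
    return max(tops.values()) / len(rankings)


def borda_scores(rankings):
    """Borda score per option, normalized to [0, 1]."""
    n_options = len(rankings[0])
    totals = {}
    for ranking in rankings:
        for position, option in enumerate(ranking):
            # First place earns n_options - 1 points, last place earns 0.
            totals[option] = totals.get(option, 0) + (n_options - 1 - position)
    max_total = len(rankings) * (n_options - 1)
    return {option: score / max_total for option, score in totals.items()}


def borda_margin(rankings):
    """Gap between the two highest normalized Borda scores (assumed definition)."""
    scores = sorted(borda_scores(rankings).values(), reverse=True)
    return scores[0] - scores[1]


# Seven hypothetical evaluators ranking three options.
rankings = [["A", "B", "C"]] * 5 + [["B", "A", "C"]] * 2
concentration = first_choice_concentration(rankings)  # 5 of 7 ballots put A first
margin = borda_margin(rankings)
```

Under these definitions, a lower concentration and a smaller margin both indicate that evaluators are less clustered on a single option.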

What This Means

Engineers building multi-stakeholder agent simulations should care because using diverse models is an effective, implementable way to avoid false consensus and surface real trade-offs. Technical leaders and product teams evaluating agent governance or agent-to-agent evaluation should use these methods to add trust signals and identify which stakeholder perspectives need stronger models. For broader applicability, see how Multi-Agent Government Services can benefit from such governance practices.

Considerations

Results come from two policy scenarios and one seven-model pool, so generalization to other domains and model families isn’t guaranteed. Coherence validation currently depends on a single frontier model call, which breaks a fully local-only workflow, and its scores may partly reflect vocabulary overlap between options and perspectives. Lower-capability models can produce low-fidelity reasoning, creating a tradeoff where preserving diversity may require upgrading weaker models. For a related discussion of reasoning patterns, see the Chain of Thought Pattern.

Methodology & More

A multi-agent deliberation setup simulated three champion advocates and seven evaluators across two policy scenarios (child welfare and housing). Three experimental conditions were compared: all agents using the same model, each perspective using a different model, and the heterogeneous setup plus a coherence validation stage where a stronger model scored how well each evaluator’s reasoning matched its assigned value. Champions produced structured rounds of position statements, critiques, and defenses; evaluators then ranked options and provided reasoning.

Key findings: using a different model per value perspective substantially reduced artificial consensus (roughly a 25 percentage point drop in top-choice concentration in the child welfare scenario), and a single-call coherence validator further lowered concentration while increasing the measured number of effective perspectives.

Practical implications are straightforward: default to heterogeneous model pools for normative deliberation, add a reasoning-quality check to flag low-fidelity votes, and interpret divergences between weighted and unweighted outputs as signals about which perspectives are reliably modeled.

Limitations include restricted scenario variety, a single local model pool, and dependency on a frontier model for validation; future work should test more domains, larger and varied model sets, and cheaper local validators. For a broader view of how organizations can orchestrate collaborations, explore Orchestrator-Worker Pattern and Semantic Capability Matching Pattern.
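One way to picture the coherence-weighting step is sketched below: each evaluator vote carries a coherence score from the validator, and top-choice shares are computed with and without those weights. The `Vote` type, the 0 to 1 score range, the 0.5 flagging threshold, and the sample data are assumptions for illustration; the summary does not specify how the paper applies the scores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Vote:
    perspective: str   # assigned value perspective (hypothetical labels)
    top_choice: str    # option the evaluator ranked first
    coherence: float   # validator score, assumed to lie in [0, 1]


def top_choice_shares(votes, weighted):
    """Share of (optionally coherence-weighted) first-place votes per option."""
    totals = defaultdict(float)
    for vote in votes:
        totals[vote.top_choice] += vote.coherence if weighted else 1.0
    grand_total = sum(totals.values())
    return {option: value / grand_total for option, value in totals.items()}


def flag_low_fidelity(votes, threshold=0.5):
    """Perspectives whose reasoning scored below the (assumed) threshold."""
    return [vote.perspective for vote in votes if vote.coherence < threshold]


votes = [
    Vote("safety", "A", 0.9),
    Vote("autonomy", "A", 0.2),
    Vote("equity", "B", 0.8),
    Vote("cost", "A", 0.3),
    Vote("family", "C", 0.9),
]

unweighted = top_choice_shares(votes, weighted=False)
weighted = top_choice_shares(votes, weighted=True)
# A large gap for an option suggests its support leans on low-coherence votes.
divergence = {opt: unweighted[opt] - weighted[opt] for opt in unweighted}
flagged = flag_low_fidelity(votes)
```

In this toy data, option A loses share once weights are applied because two of its three supporters scored low on coherence, which is the kind of weighted versus unweighted divergence the text suggests reading as a signal about model fidelity.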
Credibility Assessment:

Single-author arXiv preprint with no affiliation or h-index provided and zero citations, so there are minimal identifiable credibility signals.