
At a Glance

Simulating each person as a proxy agent and letting those agents debate produces deliverables that better preserve diverse viewpoints and help teams converge, compared to simply averaging inputs.

Key Findings

Representing individual preferences as proxy agents, running a structured discussion among those agents, and synthesizing the transcript into an editable deliverable helps surface trade-offs and preserves minority or conditional viewpoints. Across two teamwork tasks (text and multimodal), reviewers judged TeamFusion outputs more representative of participants’ reasons and more consensus-inducing than direct aggregation. The evaluation used 100 team settings, and the reported results generalize across backbone models and team sizes. The system also keeps the provenance of arguments, which makes outputs easier for humans to audit and refine.

Data Highlights

1. Evaluation covered two open-ended teamwork tasks and 100 distinct team settings.
2. Agreement distributions shifted toward higher-agreement bins after TeamFusion’s discussion phase (agreement measured across five value ranges; dataset-wide move reported).
3. Human-facing results (e.g., image selection and commentary ratings) were reported with 95% confidence intervals and showed consistently higher preference for TeamFusion outputs than for direct aggregation baselines.
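
The two measurements mentioned above, agreement scores grouped into five value ranges and proportions reported with 95% confidence intervals, can be illustrated with a minimal sketch. The source gives neither the bin boundaries nor the interval method, so the equal-width edges, the bin labels, the `[0, 1]` score range, and the choice of the Wilson score interval are all assumptions made for illustration, and the scores are made up.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges and labels: the source states only that agreement
# is grouped into five value ranges, not where the boundaries fall.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]
BIN_LABELS = ["very low", "low", "moderate", "high", "very high"]


def bin_agreement(scores):
    """Count agreement scores (assumed to lie in [0, 1]) per value range."""
    counts = {label: 0 for label in BIN_LABELS}
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        # bisect_right places a score equal to an edge in the higher bin.
        counts[BIN_LABELS[bisect_right(BIN_EDGES, score)]] += 1
    return counts


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


# Made-up scores for illustration only, not data from the evaluation.
before = bin_agreement([0.1, 0.3, 0.5, 0.5, 0.7])
after = bin_agreement([0.5, 0.7, 0.7, 0.9, 0.9])

# Example: an output preferred in 60 of 100 comparisons (hypothetical counts).
low, high = wilson_interval(successes=60, trials=100)
```

Comparing `before` and `after` bin by bin is one way to see a shift toward higher-agreement ranges; the interval shows how much uncertainty a preference rate carries at a given sample size.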

What This Means

Engineers building collaboration or decision-support tools can use this pattern to produce deliverables that keep traceable rationales and diverse views instead of a single averaged answer. Product managers and team leads running consultative or stakeholder-driven decisions can use TeamFusion-like systems to speed up consensus while preserving minority concerns for audit and revision.

Key Figures

Figure 1: Illustration of TeamFusion versus baselines. While human discussion is slow and direct aggregation loses nuance, TeamFusion uses agent-based discussion to combine fast execution with high representation fidelity and maximum consensus.
Figure 2: Overview of the TeamFusion framework. It consists of four phases: (1) Represent: human preference labels are extracted as agents; (2) Discussion: the agents abstracted from human preferences engage in a structured discussion; (3) Remix: the discussion transcript and the task context are remixed into a final deliverable used directly for downstream decisions; (4) Critique and Refine: an agent or human leaves critiques of the generated deliverable, and the system iterates to improve it.
Figure 3: The distribution of agreement scores to measure dataset-wide agreement before and after TeamFusion’s execution. The data is categorized into five value ranges to interpret agreement strength. Agreements across 100 team settings after running TeamFusion show a dataset-wide move towards higher agreement.
Figure 4: The rate of TeamFusion-generated images appearing in the final top-ranked selections. Error bars represent the 95% confidence interval.

Considerations

TeamFusion currently assumes a flat team structure and does not model role hierarchies or seniority, so outcomes may differ in real workplaces with power dynamics. The evaluation decouples preference collection from interaction to scale testing; live synchronous dynamics could change results. TeamFusion is designed to support, not replace, human judgment, so prompts, provenance logs, and final edits should remain transparent and under human control.

The Details

TeamFusion turns each team member’s stated preferences into a proxy agent, runs a structured multi-agent discussion to make agreements and disagreements explicit, and remixes the resulting transcript with the task context into an editable deliverable. The pipeline has four phases: represent (extract preferences), discussion (agent-to-agent exchange), remix (synthesize a deliverable grounded in the discussion), and critique and refine (iterative improvement by agents or humans). That design preserves the chain of reasoning, so outputs are both attributable and auditable.

The framework was tested on two open-ended tasks (civic comment synthesis and a multimodal design task) across 100 team settings and multiple backbone models. TeamFusion outperformed direct aggregation baselines on metrics for viewpoint coverage and consensus strength, and human reviewers preferred its deliverables and agent commentary.

Practical implications: use agent-based discussion when you need transparent trade-offs and editable outputs; keep humans in the loop for final judgments; and extend the model if your teams have hierarchical roles or asymmetric decision power.
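
The four phases described above (represent, discussion, remix, critique and refine) can be sketched as a small pipeline. This is an illustrative skeleton under stated assumptions, not the authors' implementation: every class and function name here is hypothetical, and the model call a real proxy agent would make is replaced by a deterministic stub that simply states the member's next preference.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


@dataclass
class ProxyAgent:
    member: str
    preferences: list

    def speak(self, transcript) -> Optional[Turn]:
        # Stub for a model call: state this member's next unstated preference.
        said = {t.text for t in transcript if t.speaker == self.member}
        for pref in self.preferences:
            if pref not in said:
                return Turn(self.member, pref)
        return None


@dataclass
class Deliverable:
    task: str
    points: dict  # point text -> members who raised it (provenance)
    revisions: list = field(default_factory=list)


def represent(team_preferences):
    """Phase 1: turn each member's stated preferences into a proxy agent."""
    return [ProxyAgent(m, list(p)) for m, p in team_preferences.items()]


def discuss(agents, max_rounds=5):
    """Phase 2: round-robin exchange until no agent has anything new to add."""
    transcript = []
    for _ in range(max_rounds):
        spoke = False
        for agent in agents:
            turn = agent.speak(transcript)
            if turn is not None:
                transcript.append(turn)
                spoke = True
        if not spoke:
            break
    return transcript


def remix(transcript, task):
    """Phase 3: synthesize a deliverable that records who raised each point."""
    points = {}
    for turn in transcript:
        points.setdefault(turn.text, []).append(turn.speaker)
    return Deliverable(task=task, points=points)


def critique_and_refine(deliverable, critiques):
    """Phase 4: attach critiques as revisions, leaving the original untouched."""
    return replace(deliverable, revisions=deliverable.revisions + list(critiques))


# Hypothetical team and task, for illustration only.
team = {"ana": ["dark theme", "large fonts"], "bo": ["dark theme"]}
transcript = discuss(represent(team))
draft = remix(transcript, task="landing page")
refined = critique_and_refine(draft, ["add a contrast check"])
```

Keeping the speaker list next to each point is what makes the deliverable auditable in this sketch: a minority view such as a preference raised by a single member stays visible with its source instead of being averaged away.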
Credibility Assessment:

All authors are affiliated with Adobe Research (a well-regarded industry lab), and several have mid-level h-indices. The venue is arXiv, but strong institutional backing supports high credibility.