The Big Picture
Pairing two vision-and-language agents usually increases success on household tasks, but benefits peak at moderate team sizes — adding more agents creates coordination overhead that cancels gains.
The Evidence
Collaboration generally improves embodied task completion: most models gain from adding a second agent. Performance follows an inverted-U as team size grows: a small team often helps, while larger teams eventually decline in effectiveness. Communication drives these gains: centralized leadership helps small teams, shared memory helps parallel work, and vision-enabled leaders help when the leader can use visual input. Two-agent teams are also more robust when location information is missing or when priors include false locations.
Data Highlights
1. Benchmark scale: MECoBench contains 96 tasks evaluated under parallel and sequential setups for a total of 192 test cases, with tasks having 2–7 subgoals each.
2. Evaluation budgets: parallel experiments use a shared 60-step budget (so each agent gets 60/n steps for an n-agent team); sequential two-agent tasks use an 80-step budget.
3. Noisy priors test: injecting 30% false object locations shows two-agent collaboration yields the largest relative gains over solo agents, demonstrating resilience to misleading information.
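The budget arithmetic in item 2 can be made concrete with a small sketch. The function name `per_agent_steps` and the use of floor division for team sizes that do not divide the budget evenly are assumptions for illustration; the summary only states the 60/n split.

```python
PARALLEL_BUDGET = 60  # shared step budget for parallel experiments (from the summary)


def per_agent_steps(total_budget: int, n_agents: int) -> int:
    """Split a shared step budget evenly across agents.

    Floor division is an assumption: the summary gives only 60/n and
    does not say how a remainder would be handled.
    """
    if n_agents < 1:
        raise ValueError("a team needs at least one agent")
    return total_budget // n_agents


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(f"{n} agent(s): {per_agent_steps(PARALLEL_BUDGET, n)} steps each")
```

Under this split, every added agent shortens each teammate's individual step allowance, so a larger team has to make up for that through better division of labor.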
What This Means
Engineers building multi-agent systems and product leads designing AI helpers should prefer small, well-coordinated teams and invest in communication protocols. Researchers and evaluators should use multimodal, embodied benchmarks like MECoBench to test how vision-and-language agents handle partial, changing observations and spatial constraints.
Key Figures

Figure 1: An illustration of realistic scenarios, where cross-modal multi-agent collaboration significantly improves efficiency compared with a single agent.

Figure 2: Data construction pipeline of MECoBench. Each task is first grounded from a high-level task into a concrete scene, then assigned a collaboration configuration for parallel or sequential execution.

Figure 3: Illustration of four protocols under three collaboration modes.

Figure 4: Overall workflow of evaluation. Agents receive task goals and prior information, perceive the environment, communicate under different protocols, reason with history and memory, and execute actions with feedback.
Yes, But...
MECoBench is built on an indoor household simulator, so results may not generalize to outdoor or long-range navigation tasks. Many experiments start with object-location priors (though ablations remove or corrupt them), which can alter difficulty compared with fully open search. Team performance depends on model ability—vision-augmented leadership helps only if the leader can actually interpret visuals well.
Methodology & More
MECoBench is a controlled benchmark and evaluation platform for studying how vision-and-language agents cooperate in embodied household environments. It includes 96 task templates (2–7 subgoals each) instantiated into 192 test cases, two spatial collaboration structures (parallel, where agents share space, and sequential, where agents operate in disjoint zones), variable team sizes, and multiple communication protocols (broadcast, centralized leader, shared memory, vision-augmented leadership). Experiments compare single-agent and multi-agent setups under fixed step budgets and include ablations on prior information (missing priors and 30% noisy priors). A wide range of multimodal language models were evaluated to surface general patterns rather than optimize a single system.
Key findings: adding a second agent typically improves task completion, but returns diminish and reverse as teams grow, producing an inverted-U relationship between team size and performance. Communication is the main enabler: removing messaging degrades success, especially on sequential tasks and larger teams. Centralized leaders help small teams avoid conflicts; shared memory improves efficiency for loosely coupled parallel tasks; and a leader given visual input can coordinate better if the leader model can use that input. Finally, two-agent teams are notably more robust when priors are missing or contain noise, suggesting distributed exploration plus communication helps recover from misleading information.
Practical takeaways are to prefer modest team sizes, prioritize clear communication mechanisms, and test systems under noisy or missing information to surface real-world failure modes.
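To make the noisy-priors ablation concrete, here is a minimal sketch of how 30% of object-location priors might be corrupted before a run. This is not MECoBench's actual implementation: the function `corrupt_priors`, the dictionary layout (object name mapped to a location string), the candidate-location list, and the fixed seed are all hypothetical choices made for illustration.

```python
import random


def corrupt_priors(priors, candidate_locations, noise_rate=0.3, seed=0):
    """Return a copy of `priors` with a fraction of locations made false.

    Hypothetical sketch: `priors` maps object name -> believed location,
    and each corrupted entry is replaced by a different candidate location.
    """
    if len(set(candidate_locations)) < 2:
        raise ValueError("need at least two distinct candidate locations")
    rng = random.Random(seed)  # seeded so the corrupted set is reproducible
    n_corrupt = round(noise_rate * len(priors))
    targets = rng.sample(sorted(priors), n_corrupt)
    noisy = dict(priors)
    for obj in targets:
        # Only pick locations that differ from the true one.
        wrong = [loc for loc in candidate_locations if loc != priors[obj]]
        noisy[obj] = rng.choice(wrong)
    return noisy


if __name__ == "__main__":
    true_priors = {f"object_{i}": "kitchen" for i in range(10)}
    rooms = ["kitchen", "bedroom", "bathroom"]
    noisy = corrupt_priors(true_priors, rooms)
    changed = [k for k in true_priors if noisy[k] != true_priors[k]]
    print(f"{len(changed)} of {len(true_priors)} priors corrupted")
```

Running the same agents against both `true_priors` and the corrupted copy is one simple way to follow the takeaway about testing under noisy information.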
Credibility Assessment:
Authors have low h-indices (2–6) and no institutional affiliations listed; venue is an arXiv preprint with zero citations. Signals point to emerging/limited credibility rather than established research.