At a Glance
Teams of language models speed up work only when tasks can be cleanly split up; otherwise coordination, communication, and error-handling often erase gains and add cost.
What They Found
Scaling multiple language model agents follows the same limits as classical distributed computing: tasks that are highly decomposable get real speedups, while tasks with serial dependencies do not. Letting agents assign work themselves (decentralized coordination) cuts performance further because of conflicting edits, extra messages, and idle agents. Applying ideas from distributed systems, such as clear task partitioning, scheduling, redundancy, and verification, helps predict and reduce these problems.
Data Highlights
1. Observed team speedup averaged 2.19× and was significantly below the theoretical Amdahl bound (Wilcoxon signed-rank p < 0.001).
2. If 95% of a task is parallelizable, Amdahl’s Law predicts up to 20× speedup; if only 50% is parallelizable, the maximum is just 2×, showing heavy sensitivity to task structure.
3. Speedup varied significantly across dependency types (Kruskal–Wallis H = 61.4, p < 0.001); decentralized self-coordination produced notably lower speedups and higher coordination overhead than preassigned tasks.
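The Amdahl bounds quoted above are straightforward to reproduce; a minimal sketch (the function name `amdahl_speedup` is ours, not the paper's):

```python
def amdahl_speedup(parallel_fraction: float, n_agents: int) -> float:
    """Amdahl's Law: S(n) = 1 / ((1 - p) + p / n), where p is the
    parallelizable fraction of the task and n is the team size."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_agents)

# The limits cited above, taking n large to approximate the asymptote:
print(round(amdahl_speedup(0.95, 10**6)))  # 20x ceiling when 95% is parallel
print(round(amdahl_speedup(0.50, 10**6)))  # 2x ceiling when 50% is parallel

# A realistic team of 4 agents on a 95%-parallel task:
print(round(amdahl_speedup(0.95, 4), 2))   # 3.48x, far below the 20x ceiling
```

The observed mean of 2.19× sits below even these modest finite-team bounds, consistent with the coordination overhead the study measured.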
What This Means
Engineers building multi-agent systems should use these findings to decide when to split a job across models and when to keep a single agent. Technical leads and SREs can apply scheduling, verification, and monitoring ideas from distributed systems to reduce wasted compute and puzzling failures. Researchers evaluating multi-agent approaches should include coordination costs and failure modes, not just accuracy.
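One way to operationalize the split-or-not decision is to fold a coordination-overhead term into the Amdahl estimate. This is a rough sketch under a simplifying assumption of linear per-agent overhead; the names `team_time` and `worth_splitting` are ours, not from the paper:

```python
def team_time(solo_time: float, parallel_fraction: float,
              n_agents: int, overhead_per_agent: float) -> float:
    """Estimated wall-clock time for an n-agent team: the serial part runs
    once, the parallel part divides evenly, and each extra agent adds a
    fixed coordination cost (a simplifying assumption)."""
    serial = solo_time * (1.0 - parallel_fraction)
    parallel = solo_time * parallel_fraction / n_agents
    return serial + parallel + overhead_per_agent * (n_agents - 1)

def worth_splitting(solo_time: float, parallel_fraction: float,
                    n_agents: int, overhead_per_agent: float) -> bool:
    """Distribute only if the estimated team time beats the single agent."""
    return team_time(solo_time, parallel_fraction, n_agents,
                     overhead_per_agent) < solo_time

# A 100-second task that is 95% parallel pays off for 4 agents even with
# 10s of coordination overhead per extra agent...
print(worth_splitting(100, 0.95, 4, 10.0))  # True  (58.75s vs 100s)
# ...but a mostly serial task does not.
print(worth_splitting(100, 0.10, 4, 10.0))  # False (122.5s vs 100s)
```

The point of the sketch is the asymmetry: for mostly serial tasks, even modest overhead makes a team slower than one agent.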
Key Figures

Figure 1: LLM Teams as Distributed Systems. Distributed computing provides a principled framework for analyzing and designing LLM teams. A. Both LLM team and distributed systems research pursue similar goals: leveraging scalability to improve performance and achieving fault tolerance through mechanisms such as redundancy, replication, and consensus. B. At the same time, LLM teams inherit fundamental complexities familiar from distributed systems but absent in single-agent settings, including consistency conflicts, architectural trade-offs, communication overhead, stragglers, task scheduling, and increased compute, energy, and monetary costs. C. LLM teams share four core properties with distributed systems: independence (each agent or node operates on local context without automatic access to global state); concurrency (multiple agents or nodes execute tasks simultaneously); communication (information is exchanged through message passing); and fallibility (agents or nodes may produce errors or undergo faults).

Figure 2: Scalability. A comparison of LLM team scalability to Amdahl’s Law, which predicts theoretical speedup based on the proportion of serial dependencies in a task. Teams of agents were given preassigned tasks of three types (coding a math utilities library, creating a data analysis pipeline, and SVG rendering) and three dependency structures (parallel, mixed, or serial). Each trial type was repeated five times to account for variance in API latency, and efficiency was measured using wall-clock time in seconds. Speedup represents how much faster a team completed their task compared to the one-agent baseline. Highly parallel tasks generally benefited more from scaling team size than mixed or serial tasks, as predicted by Amdahl’s Law, although results depended on model type.

Figure 3: Self-coordinating (decentralized) LLM teams. In Experiment 2, agents needed to not only complete tasks but also decide on assignments themselves. A. Scalability: Speedup is substantially lower for self-coordinating than preassigned teams due to consistency conflicts and communication overhead. This difference is especially stark for highly parallel tasks. B. Consistency conflicts: In self-coordinating teams, agents exhibit conflicts like writing to the same file simultaneously (pink), rewriting a file that another agent previously wrote (yellow), and attempting to complete a function before its dependencies have been finished (brown). These problems do not arise when tasks are preassigned by a central coordinator. C. Test failures: Failed test cases per round reveal that decentralized teams exhibit higher rates of intermediate failure due to these conflicts.

Figure 4: Coordination overhead. Decentralized teams introduce greater coordination overhead, which worsens with more collaborators. A. Communication costs: Each line represents the difference in the number of messages sent when tasks are preassigned versus decentralized. B. Idle costs: Each line represents the difference in agents remaining idle when tasks were preassigned versus decentralized. Importantly, these agents were still using tokens and sending messages; they just did not complete a task within an idle round.
Yes, But...
Tasks in the experiments used prespecified dependency structures, so results may differ when dependencies must be discovered dynamically. Teams in the study were homogeneous (same base model); heterogeneous teams might behave differently and could improve accuracy in some settings. Communication here is natural language and probabilistic, so classical distributed protocols need adaptation rather than direct transplant.
Methodology & More
Parallel speedups from assembling multiple language model agents track classic laws from distributed computing: when subtasks are independent, teams provide measurable speed gains; when subtasks are serial or tightly coupled, gains vanish. The study measured wall-clock completion time across coding tasks with three dependency patterns (parallel, mixed, serial) and compared one-agent baselines to N-agent teams. Even in highly parallel cases, measured speedups fell short of theoretical bounds (observed mean 2.19×), highlighting real coordination and overhead costs.
Letting agents coordinate themselves made matters worse. Decentralized teams showed more conflicting edits (e.g., multiple agents writing the same file), more messages, more idle-but-spending agents, and higher intermediate failure rates than centrally preassigned teams. The work connects these behaviors to familiar distributed-systems concepts (independence, communication, concurrency, and fallibility) and recommends applying established tools like task scheduling, load balancing, redundancy, verification, and consensus-style checks to design more efficient, reliable multi-agent systems. Practitioners should weigh the task’s parallelizability before distributing work and instrument teams to detect stragglers and consistency conflicts.
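The same-file conflicts described above are exactly what central preassignment prevents. A minimal sketch of one such mechanism, an exclusive per-file write claim tracked by a coordinator; `WriteLedger` and its methods are hypothetical names for illustration, not the paper's implementation:

```python
from collections import defaultdict

class WriteLedger:
    """Central ledger granting exclusive write claims on files, so two
    agents cannot write the same file concurrently (a sketch, not the
    paper's actual coordinator)."""

    def __init__(self):
        self.owner = {}                    # file path -> agent holding the claim
        self.conflicts = defaultdict(int)  # file path -> refused-claim count

    def claim(self, agent: str, path: str) -> bool:
        """Grant the claim if the file is free (or already ours)."""
        holder = self.owner.get(path)
        if holder is None or holder == agent:
            self.owner[path] = agent
            return True
        self.conflicts[path] += 1          # another agent owns this file
        return False

    def release(self, agent: str, path: str) -> None:
        """Give up a claim so other agents can write the file."""
        if self.owner.get(path) == agent:
            del self.owner[path]

ledger = WriteLedger()
assert ledger.claim("agent-1", "utils.py")          # granted
assert not ledger.claim("agent-2", "utils.py")      # refused: conflict logged
ledger.release("agent-1", "utils.py")
assert ledger.claim("agent-2", "utils.py")          # granted after release
```

Instrumenting the `conflicts` counter doubles as the kind of monitoring the authors recommend: a rising count flags coordination breakdown before it surfaces as failed tests.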
Credibility Assessment:
Mixed author h-indices (two around 12, others low), no strong institutional affiliations, and an arXiv preprint: credible researchers, but without top-tier credibility signals.