Stop Agent Teams from Getting Stuck: Automatically Find and Fix Coordination Bugs

At a Glance

Counterexample-driven model checking turns rare coordination bugs into concrete fixes: all 48 benchmark tasks passed verification within four repair iterations, and verified protocols run under a topology monitor reached 81.5% full task completion, versus 29.3% for a chat-only baseline.

What They Found

TraceFix converts a natural-language task into a checked coordination protocol, uses exhaustive model checking to produce minimal failing traces, and repairs the protocol until no safety violation remains under the chosen bounds. Many failures come from message-passing patterns (deadlocks, mismatched handshakes) rather than simple lock misuse. Because each counterexample pinpoints the failing schedule, repairs converge quickly. Verified protocols, combined with a runtime topology monitor that rejects out-of-protocol operations, substantially improve task completion under real-world faults.
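To make "exhaustive checking produces a minimal failing trace" concrete, here is a minimal sketch in Python. It is not the TraceFix implementation (the paper uses PlusCal and TLC); the agent names, channel names, and the `find_deadlock` helper are all hypothetical. Channels are modeled only as message counts, and breadth-first search over every interleaving guarantees that the first deadlock found has the shortest possible trace.

```python
from collections import deque

# Hypothetical two-agent protocol: each agent runs its (op, channel) steps in order.
BUGGY = {
    "planner": [("send", "req"), ("recv", "ack")],
    # Mismatched handshake: nobody ever sends on "confirm".
    "worker": [("recv", "req"), ("recv", "confirm"), ("send", "ack")],
}
FIXED = {
    "planner": [("send", "req"), ("recv", "ack")],
    "worker": [("recv", "req"), ("send", "ack")],
}


def find_deadlock(protocol):
    """Breadth-first search over all interleavings.

    Returns the shortest list of steps that ends in a deadlock, or None.
    """
    agents = sorted(protocol)
    channels = sorted({chan for steps in protocol.values() for _, chan in steps})
    start = (tuple(0 for _ in agents), tuple(0 for _ in channels))
    parent = {start: None}  # state -> (previous state, step label)
    frontier = deque([start])
    while frontier:
        state = frontier.popleft()
        pcs, queued = state
        successors = []
        for i, agent in enumerate(agents):
            if pcs[i] == len(protocol[agent]):
                continue  # this agent has finished
            op, chan = protocol[agent][pcs[i]]
            c = channels.index(chan)
            if op == "recv" and queued[c] == 0:
                continue  # blocked: nothing to receive yet
            delta = 1 if op == "send" else -1
            nxt = (
                pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:],
                queued[:c] + (queued[c] + delta,) + queued[c + 1:],
            )
            successors.append((f"{agent}: {op} {chan}", nxt))
        finished = all(pcs[i] == len(protocol[a]) for i, a in enumerate(agents))
        if not successors and not finished:
            # Deadlock: walk parent links back to the start to rebuild the trace.
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in successors:
            if nxt not in parent:
                parent[nxt] = (state, label)
                frontier.append(nxt)
    return None


print(find_deadlock(BUGGY))  # ['planner: send req', 'worker: recv req']
print(find_deadlock(FIXED))  # None
```

The two-step trace shows exactly where the toy protocol stalls: after the request is delivered, the worker waits on a channel that no one writes to while the planner waits for an acknowledgement. A real model checker explores far richer state (message contents, locks, bounded counters), but the shape of the result is the same.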

Key Data

1. 100% of the 48 benchmark tasks reached verification after at most four repair iterations.

2. 62.5% passed verification on the first attempt; the other 37.5% required 1–4 counterexample-driven repairs (most fixed in one iteration).

3. Runtime with topology-aware monitoring achieved 89.4% average completion and 81.5% full completion, compared with only 29.3% full completion for a chat-only baseline.
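The percentages above follow directly from the task and repair counts given in the captions of Figures 2 and 3 on this page. The short Python check below recomputes them; the variable names are illustrative only.

```python
TOTAL_TASKS = 48
REPAIRED_TASKS = 18  # tasks that needed at least one repair (Figure 3)

# Repair attempts by root cause (Figure 2).
ROOT_CAUSES = {
    "deadlock": 20,
    "TLC state-space timeout": 5,
    "PlusCal syntax error": 3,
    "undrained channel": 1,
}

first_pass_rate = (TOTAL_TASKS - REPAIRED_TASKS) / TOTAL_TASKS
repair_rate = REPAIRED_TASKS / TOTAL_TASKS
total_attempts = sum(ROOT_CAUSES.values())
deadlock_share = ROOT_CAUSES["deadlock"] / total_attempts

print(f"first-attempt pass rate: {first_pass_rate:.1%}")  # 62.5%
print(f"tasks needing repair:    {repair_rate:.1%}")      # 37.5%
print(f"repair attempts:         {total_attempts}")       # 29
print(f"deadlock share:          {deadlock_share:.1%}")   # 69.0%
```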

Implications

Engineers building systems where multiple language agents interact should care because coordination, not individual agent ability, is a dominant source of failures and can be fixed systematically. Platform owners and technical leads can use this workflow to catch rare interleaving bugs before deployment and raise overall multi-agent trust and reliability.

Key Figures

Figure 1. TraceFix pipeline overview. At design time (Stages 1–4), an orchestration agent synthesizes a protocol topology IR, generates PlusCal coordination logic, and iteratively repairs the protocol using TLC counterexamples until verification succeeds. At runtime (Stages 5–6), verified process bodies are compiled into per-agent prompts and executed under a topology monitor that rejects out-of-protocol coordination operations.

Figure 2. Root-cause distribution of the 29 repair attempts across all 48 tasks. Deadlocks dominate (20 attempts, 69.0%), followed by TLC state-space timeouts (5), PlusCal syntax errors (3), and undrained channels (1). The dominant failures are structural coordination hazards (nondeterministic receives, early hub termination) rather than fundamental protocol design flaws.

Figure 3. Distribution of repair-requiring tasks (18 of 48) by repair iteration count and difficulty tier. Most tasks need a single repair; only one task (10H, Parallel Build Hard) requires 4 iterations. Hard tasks dominate (12 of 18) but 2 Easy and 4 Medium tasks also need repair, indicating that topology structure matters alongside the coarse difficulty label.

Figure 4. Cumulative TLC verification pass rate as a function of repair iteration, stratified by difficulty. 62.5% of tasks require zero repairs; the curve reaches 100% by iteration 4, with Easy and Medium converging by iterations 1–3. Takeaway: the generate/check/repair loop is both necessary (37.5% fail initially) and fast-converging; counterexample traces provide sufficient signal to close all remaining violations within a small iteration budget.

Yes, But...

Verification is bounded: the checks are exhaustive within the modeled limits (bounded queues, counters, agents) but guarantee nothing about behavior outside those bounds. The runtime monitor enforces topology conformance (who can talk to whom and who holds locks) but not fine-grained step ordering, so agents can still deviate in ways the monitor does not catch. When the model checker times out on very complex cases, simplifying the model speeds verification but may weaken fidelity to the original coordination intent, producing semantic drift.
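The gap between topology conformance and step ordering is easier to see in code. The sketch below is a hypothetical monitor written for illustration, not the paper's implementation: the `TopologyMonitor` class, its method names, and the agent, channel, and lock names are all assumptions. It checks only whether an agent is a declared endpoint of a channel and whether a lock is free or held by the caller.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class TopologyMonitor:
    """Hypothetical monitor: checks who may use which channel or lock."""

    channels: Dict[str, Tuple[str, str]]  # channel -> (sender, receiver)
    locks: Set[str]
    holders: Dict[str, str] = field(default_factory=dict)  # lock -> agent

    def allow_send(self, agent: str, channel: str) -> bool:
        return self.channels.get(channel, (None, None))[0] == agent

    def allow_recv(self, agent: str, channel: str) -> bool:
        return self.channels.get(channel, (None, None))[1] == agent

    def acquire(self, agent: str, lock: str) -> bool:
        if lock not in self.locks or lock in self.holders:
            return False  # undeclared lock, or already held
        self.holders[lock] = agent
        return True

    def release(self, agent: str, lock: str) -> bool:
        if self.holders.get(lock) != agent:
            return False  # only the current holder may release
        del self.holders[lock]
        return True


monitor = TopologyMonitor(
    channels={"req": ("planner", "worker"), "ack": ("worker", "planner")},
    locks={"report"},
)

print(monitor.allow_send("worker", "req"))     # False: worker is not the sender on "req"
print(monitor.acquire("planner", "report"))    # True
print(monitor.acquire("worker", "report"))     # False: lock is already held
# The monitor has no notion of protocol position, so an acknowledgement
# sent before any request was received is still accepted.
print(monitor.allow_send("worker", "ack"))     # True
```

The last call is the verification-to-enforcement gap in miniature: the operation is within the declared topology, so it passes, even though the verified protocol would never perform it at that point.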

Methodology & More

TraceFix treats a proposed coordination protocol as a hypothesis and attacks it with a model checker to surface concrete failing schedules. The pipeline has four steps: (1) synthesize a protocol topology from a natural-language task, (2) compile that topology into a step-by-step, model-checkable specification, (3) run exhaustive safety checks under bounded assumptions to find minimal counterexample traces, and (4) use those traces to drive automated repairs. Steps 3 and 4 repeat until no safety violation is found within bounds. After verification, per-agent process bodies are compiled into prompts, and a runtime topology monitor enforces the declared channels and locks.

On a 48-task benchmark spanning domains such as planning, research writing, and manufacturing, the loop proved both necessary and efficient: 37.5% of tasks failed the first check, but all converged within four repair iterations, with deadlocks responsible for 69% of repair attempts. Message-passing patterns (conditional receives, retry loops, early coordinator termination) caused most failures; simple lock ordering rarely failed. At runtime, verified protocols plus topology monitoring raised average completion to 89.4% and full completion to 81.5%, while the unverified chat-only baseline reached only 29.3% full completion.

Key trade-offs remain. Verification covers safety properties (deadlock freedom, mutual exclusion, drained channels) under bounds and does not prove liveness or the semantic correctness of task outputs. Runtime enforcement is intentionally lightweight to stay practical, which leaves a verification-to-enforcement gap for future work.
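The check-and-repair cycle described above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the TraceFix code: `verify_with_repairs`, `find_unmatched_recv`, and `drop_step` are hypothetical names, the checker is a simple static stand-in for a real model checker such as TLC, and the repair step is a fixed rule standing in for a counterexample-guided rewrite by an orchestration agent. The iteration budget of four mirrors the maximum observed on the benchmark.

```python
def verify_with_repairs(protocol, check, repair, max_repairs=4):
    """Run check/repair until `check` finds no violation.

    Returns (verified protocol, number of repairs applied).
    Raises RuntimeError if the repair budget is exhausted.
    """
    for repairs in range(max_repairs + 1):
        counterexample = check(protocol)
        if counterexample is None:
            return protocol, repairs
        if repairs == max_repairs:
            break  # budget spent and the protocol still fails
        protocol = repair(protocol, counterexample)
    raise RuntimeError(f"still failing after {max_repairs} repairs")


def find_unmatched_recv(protocol):
    """Toy checker (assumption): flag a receive on a channel nobody sends to."""
    sent = {chan for steps in protocol.values() for op, chan in steps if op == "send"}
    for agent in sorted(protocol):
        for index, (op, chan) in enumerate(protocol[agent]):
            if op == "recv" and chan not in sent:
                return agent, index
    return None


def drop_step(protocol, counterexample):
    """Toy repair (assumption): remove the step named by the counterexample."""
    agent, index = counterexample
    repaired = {name: list(steps) for name, steps in protocol.items()}
    del repaired[agent][index]
    return repaired


draft = {
    "planner": [("send", "req"), ("recv", "ack")],
    "worker": [("recv", "req"), ("recv", "confirm"), ("send", "ack")],
}

verified, repairs = verify_with_repairs(draft, find_unmatched_recv, drop_step)
print(repairs)             # 1
print(verified["worker"])  # [('recv', 'req'), ('send', 'ack')]
```

Passing the checker and the repairer in as functions keeps the loop itself independent of how violations are found or fixed, which is the property that lets a real pipeline swap in a model checker and an LLM-based repair step.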

Credibility Assessment:

All authors have very low h-indexes, no institutional affiliations are provided, and the work is an arXiv preprint with no citations, so the authors' identifiable reputation is limited.
