
In Brief

Having full, step-by-step execution logs plus a replayable environment makes it far easier to pinpoint which component caused a multi-agent system to fail, raising agent-level attribution accuracy to about 66% and substantially improving step-level localization.

Key Findings

Providing developers with complete execution narratives (inputs, intermediate messages, tool calls, metadata) plus a replayable environment substantially improves the ability to assign blame in multi-agent systems. With full traces, automated methods reach about 66% accuracy at identifying the failing component and about 30% at identifying the decisive failure step; adding replay further improves step-level localization. Finer-grained (step-level) attribution is much more sensitive to missing information than agent-level attribution, and performance varies across different system architectures and agent roles.
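To make the observability levels concrete, below is a minimal sketch of the kind of per-step record a "full execution trace" implies, together with the two attribution targets (failing agent, decisive step). The field names are illustrative assumptions, not TraceElephant's actual schema.

```python
# Illustrative per-step trace record and attribution targets.
# Field names are assumptions, not TraceElephant's actual schema.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraceStep:
    step_id: int
    agent: str                       # which agent/module produced this step
    role: str                        # e.g. "planner", "executor", "verifier"
    inputs: dict[str, Any]           # the prompt/context the agent actually received
    output: str                      # the agent's visible output
    tool_calls: list[dict] = field(default_factory=list)   # name, arguments, result
    metadata: dict[str, Any] = field(default_factory=dict)  # model, config, timing


@dataclass
class FailureAnnotation:
    failing_agent: str               # agent-level target (~66% accuracy with full traces)
    decisive_step: int               # step-level target (~30% accuracy with full traces)
```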

Data Highlights

1. 65.9% average agent-level attribution accuracy when full execution traces are available (TraceElephant benchmark).
2. 30.3% average step-level attribution accuracy with full traces, a 76% relative improvement over output-only logs.
3. Agent-level accuracy improves by ~22% with full observability vs. output-only; providing a replayable environment adds another ~10% to step-level accuracy.

Implications

Engineers building multi-agent systems and platform teams responsible for reliability should care because richer logs and instrumentation enable faster and more precise debugging. Tool and observability designers can use these findings to prioritize instrumentation (inputs, tool calls, intermediate messages) that most helps developers assign responsibility and fix failures.
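As a rough illustration of that instrumentation priority, the sketch below logs each agent step's inputs and outputs plus every tool call into an append-only trace. It is a minimal example under assumed names (`TRACE`, `traced`, `log_tool_call` are hypothetical), not any specific library's API.

```python
# Minimal instrumentation sketch: capture inputs, outputs, and tool calls per
# agent step, not just final outputs. All names here are hypothetical.
import functools
import time

TRACE = []  # in practice, write records to durable storage so runs can be replayed


def traced(agent_name):
    """Wrap an agent's step function so the inputs it actually received are logged."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {
                "agent": agent_name,
                "timestamp": time.time(),
                "inputs": {"args": [repr(a) for a in args],
                           "kwargs": {k: repr(v) for k, v in kwargs.items()}},
            }
            result = fn(*args, **kwargs)
            record["output"] = repr(result)
            TRACE.append(record)
            return result
        return wrapper
    return decorator


def log_tool_call(agent_name, tool_name, arguments, result):
    """Record each tool invocation so step-level attribution can see it later."""
    TRACE.append({"agent": agent_name, "tool": tool_name,
                  "arguments": arguments, "result": repr(result)})
```

An agent's step function would then be decorated, e.g. `@traced("planner")`, and any tool wrapper would call `log_tool_call` after each invocation.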

Key Figures

Figure 1: A failure case (from the Who&When benchmark) illustrating the limitation of partial observability. When only agent outputs are visible and critical inputs are absent, localizing the decisive failure step becomes difficult.
Figure 2: Overview of TraceElephant.
Figure 3: Comparison under different backbone LLMs.
Figure 4: Distribution of failure agent in TraceElephant.

Yes, But...

The benchmark and results assume a developer-facing setting where full traces and re-execution are available; they do not directly apply to black-box or restricted-deployment scenarios. The dataset covers 220 failure traces from three representative systems, so findings may not generalize to all architectures or future designs. The notion of the "decisive" failure step is role-aware (it attributes failures to the module expected to catch or fix upstream errors), which may differ from purely chronological error definitions. Finally, re-execution may be necessary to probe certain hypotheses, so the reported step-level gains depend on replay being feasible.
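To make the role-aware definition concrete, here is a toy sketch (an illustration, not the paper's annotation procedure) of how it differs from a purely chronological rule: the decisive step goes to the latest module expected to catch or repair the error, falling back to the step that introduced it.

```python
# Toy illustration of role-aware vs. chronological attribution. The boolean
# flags 'introduced_error' and 'should_have_caught_error' are an assumed schema.
def role_aware_decisive_step(trace):
    """Return the index of the decisive failure step under a role-aware rule."""
    first_error = next(
        (i for i, step in enumerate(trace) if step.get("introduced_error")), None)
    if first_error is None:
        return None  # no failure in this trace
    # Blame the latest downstream module whose role was to catch or fix the error
    # (e.g. a verifier), rather than the step where the error first appeared.
    for i in range(len(trace) - 1, first_error, -1):
        if trace[i].get("should_have_caught_error"):
            return i
    return first_error


trace = [
    {"role": "planner"},
    {"role": "executor", "introduced_error": True},
    {"role": "verifier", "should_have_caught_error": True},
]
assert role_aware_decisive_step(trace) == 2  # the verifier, not the executor at step 1
```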

Methodology & More

TraceElephant is a developer-facing benchmark built to measure how well methods can attribute failures in systems made of multiple language-based agents or modular components. Each of the 220 instances includes a full, step-by-step execution trace (inputs, outputs, tool calls, agent configurations, and metadata) and a reproducible environment, so runs can be replayed or probed with "what if" counterfactuals. Responsible components and the decisive failure step are annotated under a role-aware definition: a failure is attributed to the component that had the responsibility to detect or fix an earlier mistake, not merely the first visible error. Using TraceElephant, the study compares several attribution approaches, including prompt-based LLM queries and lightweight agentic procedures that can re-run parts of the trace, under two observability modes: static (full trace) and dynamic (full trace plus replay). Results show clear benefits from full observability: agent-level attribution rises to about 66% (a ~22% gain over output-only logs) and step-level attribution rises to about 30% (a ~76% gain). Allowing re-execution of the system yields another ~10% gain in step-level accuracy. Practical takeaways are straightforward: capture inputs and intermediate states, provide replay where possible, and design agents and verification roles so errors remain recoverable or clearly attributable. The work also highlights the need for architecture-aware attribution techniques and more diverse benchmarks covering other deployment scenarios.
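The sketch below illustrates the two observability modes under assumed interfaces: `call_llm` (an LLM client assumed to return parsed JSON) and `replay_from_step` (re-runs the system from a given step) are hypothetical stand-ins, and the prompts are simplified; this is not the paper's implementation.

```python
# Sketch of attribution under static (full trace) vs. dynamic (trace + replay)
# observability. `call_llm` and `replay_from_step` are hypothetical stand-ins.

def attribute_static(trace: list[dict], call_llm) -> dict:
    """Prompt-based attribution over the full trace (static observability)."""
    prompt = (
        "You are debugging a failed multi-agent run. Given the full execution "
        "trace below (inputs, outputs, tool calls, metadata), identify the agent "
        "responsible for the failure and the decisive failure step.\n\n"
        + "\n".join(f"[step {i}] {step}" for i, step in enumerate(trace))
        + '\n\nAnswer as JSON: {"failing_agent": ..., "decisive_step": ...}'
    )
    return call_llm(prompt)  # assumed to return the parsed JSON answer


def attribute_dynamic(trace, call_llm, replay_from_step) -> dict:
    """Static attribution plus a replay probe: if correcting the suspected step's
    input makes the re-run succeed, that step is a stronger candidate."""
    candidate = attribute_static(trace, call_llm)
    step = candidate["decisive_step"]
    corrected = call_llm(f"Propose a corrected input for step {step}: {trace[step]}")
    candidate["verified_by_replay"] = bool(replay_from_step(step, corrected))
    return candidate
```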
Credibility Assessment:

Multiple authors with moderate h-indexes (up to ~15) indicate competent researchers, but affiliations are unspecified and the paper is only on arXiv with no citations, so credibility is solid but not top-tier.