The Big Picture
Multi-agent LLM systems often drop in quality under messy inputs, but their degraded responses can still reveal structured signals that an adaptive system could learn from.
The Evidence
In tests of five multi-agent coordination styles on a banking-risk task, average judged quality fell by roughly one third under controlled semantic stress. Even so, every architecture produced a measurable expansion in its effective stress distribution, a statistical signature the authors call a positive distributional Jensen Gap. That expansion means stress created patterned variation (not random collapse) that a future adaptive mechanism could potentially exploit to improve behavior.
Key Data
1. Mean judged quality fell by about 33% across architectures under perturbation (clean vs. stressed prompts).
2. Evaluation used 50 clean prompts and 10 perturbed variants each (500 stressed prompts total) across four stress dimensions.
3. All five architectures showed positive distributional Jensen Gaps with 95% bootstrap confidence intervals above zero, indicating convex-expansive deformation of observed stress.
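To make the third point concrete, here is a minimal sketch of how a gap and its bootstrap interval could be computed. It rests on assumptions not stated in this summary: the convex stress potential is taken to be the squared Euclidean norm, the gap is taken to be the mean potential of observed stress minus the mean potential of expected stress, and the interval is a plain percentile bootstrap. The function names and the synthetic data are hypothetical and are not the authors' implementation.

```python
import random

def convex_potential(stress_vector):
    # Assumed convex potential: squared Euclidean norm of one stress vector.
    return sum(x * x for x in stress_vector)

def jensen_gap(expected, observed):
    # Assumed gap definition: mean potential of observed minus expected stress.
    mean_obs = sum(convex_potential(v) for v in observed) / len(observed)
    mean_exp = sum(convex_potential(v) for v in expected) / len(expected)
    return mean_obs - mean_exp

def bootstrap_gap_ci(expected, observed, n_boot=1000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample each set of potentials with replacement.
    rng = random.Random(seed)
    pot_exp = [convex_potential(v) for v in expected]
    pot_obs = [convex_potential(v) for v in observed]
    gaps = []
    for _ in range(n_boot):
        e = rng.choices(pot_exp, k=len(pot_exp))
        o = rng.choices(pot_obs, k=len(pot_obs))
        gaps.append(sum(o) / len(o) - sum(e) / len(e))
    gaps.sort()
    low = gaps[int((alpha / 2) * n_boot)]
    high = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Synthetic illustration only: 500 stressed prompts, four stress dimensions.
data_rng = random.Random(42)
expected = [[data_rng.random() for _ in range(4)] for _ in range(500)]
# Observed stress is modeled as expected stress plus zero-mean noise (a spread).
observed = [[x + data_rng.gauss(0.0, 0.3) for x in v] for v in expected]

gap = jensen_gap(expected, observed)
ci_low, ci_high = bootstrap_gap_ci(expected, observed)
print(f"gap={gap:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions, an interval whose lower bound sits above zero is what "confidence intervals above zero" would look like for a single architecture.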
Implications
For engineers building multi-agent pipelines and technical leads deciding where to invest in adaptive controllers, CAFE (the measurement framework introduced in the paper) indicates whether stress is worth learning from before complex adaptation is built. Researchers developing routing, memory, or stress-aware selection mechanisms can use CAFE to find domains and architectures where adaptation may pay off.
Key Figures

Figure 1: Agentic architectures evaluated in CAFE.

Figure 2: Distributional Jensen Gap by architecture. Error bars denote bootstrap 95% confidence intervals, and dashed lines mark the resilience tolerance.

Figure 3: Representative expected-to-observed marginal deformations. These examples show how different coordination mechanisms expose different stress modes: adversarial debate expands conflict, meta-adaptive control expands drift, and ensemble consensus expands load.

Figure 4: Marginal stress deformation diagnostics for A0 Flat.
Yes, But...
A positive gap detects useful variation; it does not prove that an architecture will perform better after adaptation. Experiments used a controlled banking-risk benchmark and four specific judge signals; results may differ in other domains or with different evaluators. CAFE itself does not implement adaptation. It only flags where building adaptive mechanisms, such as a stress-aware router or a memory of failure modes, could be promising.
Methodology & More
CAFE is a measurement framework that asks a focused question: does semantic stress make a multi-agent system’s behavior more informative in a way that adaptation could exploit? Instead of expecting immediate performance gains, CAFE measures whether the system’s observed effective stress distribution expands relative to the expected stress inputs along four semantic axes: conflict, load, ambiguity, and temporal drift.
The method has four steps. It runs many perturbed prompts; records a vector of judge signals (coherence, grounded novel inference, contradiction resolution, structural preservation); fits a multi-output polynomial response model that maps judge outputs back to effective stress; and compares expected and observed stress using a distributional Jensen Gap computed under a convex stress potential. A positive gap means the observed distribution is convex-expansive. In plain terms, stress produces structured, learnable variation rather than just noise or collapse.
On a banking-risk assessment task, the five representative coordination architectures (Market-Based Coordination Pattern among others) showed a mixed picture: mean judged quality fell by about one third under stress, yet every architecture showed a positive gap. The practical takeaway is that even when answers get worse, stressed multi-agent systems can expose diagnostic patterns that a downstream adaptive controller (for example, a stress-aware router, memory of failure modes, or targeted retraining) could use to improve future performance. The next step is building those adaptive layers and validating whether positive gaps translate into concrete gains in calibration, factuality, or task accuracy.
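The response-model step described above can be sketched as an ordinary least-squares fit. This is an illustration under assumptions, not the paper's model: it uses degree-2 features without cross terms, fits all four stress outputs in a single `numpy.linalg.lstsq` solve, and runs on synthetic judge signals that degrade linearly with stress. The helper names `quadratic_features`, `fit_response_model`, and `effective_stress` are hypothetical.

```python
import numpy as np

def quadratic_features(signals):
    # Columns: bias, each judge signal, each squared signal (cross terms omitted).
    signals = np.asarray(signals, dtype=float)
    ones = np.ones((signals.shape[0], 1))
    return np.hstack([ones, signals, signals ** 2])

def fit_response_model(signals, stress):
    # One least-squares solve fits every stress output column at once.
    design = quadratic_features(signals)
    coef, *_ = np.linalg.lstsq(design, np.asarray(stress, dtype=float), rcond=None)
    return coef

def effective_stress(coef, signals):
    # Map judge signals back to estimated stress along each axis.
    return quadratic_features(signals) @ coef

# Synthetic illustration only: 200 prompts, four stress axes, four judge signals.
rng = np.random.default_rng(0)
stress = rng.uniform(0.0, 1.0, size=(200, 4))
# Assumed toy relationship: each judge signal falls linearly as one stress axis rises.
signals = 1.0 - 0.5 * stress + rng.normal(0.0, 0.01, size=stress.shape)

coef = fit_response_model(signals, stress)
predicted = effective_stress(coef, signals)
mae = float(np.mean(np.abs(predicted - stress)))
print(f"coefficient matrix shape: {coef.shape}, mean absolute error: {mae:.4f}")
```

In a real run, the estimated effective stress would then be compared with the expected stress inputs to compute the gap; the choice of polynomial degree and features here is a simplification for readability.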
Credibility Assessment:
All authors have low reported h-indexes (max 2), no affiliations given, arXiv preprint with no citations — limited signals of established reputation.