A Tool That Finds Why Teams of AI Agents Break (and How to Watch Them)

Key Takeaway

Groups of AI agents are much less safe than individual models: automated tests across 300 workflows show an average safety pass rate of 7.1%, with system-level risks passing only 1.3%.

ON THIS PAGE

What They Found

TrinityGuard is a unified, plug-in framework that tests and monitors safety across agent teams by treating risks at three levels: single agents, agent-to-agent communication, and whole-system emergence. It defines 20 concrete risk types and runs curated attacks plus adaptive probes to produce per-agent, per-channel, and per-trajectory vulnerability reports. Applied to 300 synthesized workflows and several real case studies, it reveals severe fragility—especially at the system level—while offering both pre-deployment testing and live monitoring to catch problems early. per-agent, per-channel, and per-trajectory vulnerability reports

Explore evaluation patternsSee how to apply these findings

Learn More

Key Data

1Tested 300 synthesized multi-agent workflows; overall safety pass rate was 7.1%.

2System-level (tier 3) risks averaged only a 1.3% pass rate; database and research domains had 0% pass on system-level checks.

3Communication risks (tier 2) averaged 13.2% pass; single-agent risks (tier 1) averaged 6.8% pass.

Why It Matters

Engineers building multi-agent workflows and platform teams should care because TrinityGuard helps find exactly where a system is vulnerable—agent, channel, or whole-system—and gives actionable reports. Security, reliability, and product leads can use the framework for pre-deploy penetration-style testing and for continuous runtime monitoring to catch emergent failures before users do.

Key Figures

Fig 1: Figure 1 : The overall architecture design of TrinityGuard.

Fig 2: Figure 2 : Summary of three-tier risks studied by TrinityGuard.

Fig 3: Figure 3 : Fine-grained distribution of detected safety risks across the three TrinityGuard tiers for 300 synthesized MAS workflows.

Ready to evaluate your AI agents?

Learn how ReputAgent helps teams build trustworthy AI through systematic evaluation.

Learn More

Keep in Mind

Results come from a mix of 300 automatically synthesized workflows and a set of representative examples, so real-world pass rates may differ with custom architectures and stronger hardening. The framework relies on language models as semantic judges, which can misclassify subtle outcomes and produce false positives or negatives. Integrating TrinityGuard requires adapting a BaseMAS adapter and may add operational cost for continuous monitoring, though it’s designed to be extensible rather than fully automatic remediation. continuous monitoring

Deep Dive

TrinityGuard organizes safety work into three practical layers. The bottom layer is a lightweight adapter that abstracts different agent orchestration frameworks so tests and monitors stay framework-agnostic. The middle layer offers intervention and observation primitives (for example: inject a message, spoof an identity, poison memory, or stream typed events). The top layer contains 20 risk-specific test modules and paired runtime monitor agents; both modes use a centralized "judge" mechanism based on configurable prompts to decide pass/fail for each test. The framework groups risks into three tiers: single-agent (prompt injection, jailbreaking, hallucination, tool misuse, etc.), inter-agent communication (malicious propagation, misinformation amplification, identity spoofing, goal drift, etc.), and system-level emergent risks (cascading failures, group hallucination, sandbox escape, rogue agents). Running TrinityGuard across 300 synthetic workflows and several real examples exposed widespread vulnerabilities: an average pass rate of 7.1%, with system-level issues almost entirely unprotected (1.3% pass). Some architectures that enforce reviewer patterns or strong grounding showed specific resilience (for example, a research-style reviewer loop reduced hallucination), but basic problems like prompt injection and unauthorized code execution often scored 0% pass. The takeaway for practitioners is to combine pre-production adversarial testing with live, event-driven monitoring, focus fixes where TrinityGuard points (agent vs channel vs system), and invest in hardened identity, output validation, and strict tool permissions to raise real-world resilience. intervention and observation primitives three practical layers judge mechanism

Not sure where to start?Get personalized recommendations

Learn More

Credibility Assessment:

All authors have low h-indices and affiliation is a regional university (Southwest Medical University); arXiv preprint and no citations — modest credibility.

multi-agent trust agent-to-agent evaluation continuous agent evaluation agent reliability

Not sure where to start?