The Big Picture

A coordinated team of specialized AI agents can propose, implement, and validate network fixes for real devices inside an isolated emulator, reducing the need for constant expert supervision while avoiding risk to production. The multi-agent setup raised the rate of successful, safe mitigations to 46.7%, versus 10.7% for a single-agent approach (roughly a 4.4× improvement).

The Evidence

A multi-agent workflow splits the work into roles (suggest, implement, critique, judge, summarize) so that AI can generate deployable firewall, routing, and host commands and test them safely in a high-fidelity lab running real vendor firmware. Every candidate mitigation is tested by replaying the same scripted adversary, so defenders can measure whether the change actually reduces attacker progress while preserving LAN and internet connectivity. Role specialization and an iterative implement-critique loop were the main reasons the multi-agent system performed far better than a single-agent baseline. Stacking approved mitigations uncovered compound effects, and the cumulative reduction in attacker progress plateaued at about 52% by the fourth fix.

Related patterns: Hierarchical Multi-Agent Pattern and Role-Based Agent Pattern.
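The role split described above can be sketched as a small orchestration loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the function `run_pipeline`, the `Outcome` record, the `max_rounds` limit, and the toy agents are all hypothetical, and the command string is a placeholder rather than real vendor syntax.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Outcome:
    idea: str
    commands: List[str]
    approved: bool


def run_pipeline(
    incident: str,
    suggest: Callable[[str], List[str]],
    implement: Callable[[str], List[str]],
    critique: Callable[[List[str]], List[str]],
    judge: Callable[[List[str]], bool],
    summarize: Callable[[Outcome], str],
    max_rounds: int = 3,  # hypothetical cap on critique iterations
) -> List[str]:
    """Run each suggested idea through implement, critique, judge, summarize."""
    reports = []
    for idea in suggest(incident):
        commands = implement(idea)
        # Implement-critique loop: stop once the critic makes no further changes.
        for _ in range(max_rounds):
            revised = critique(commands)
            if revised == commands:
                break
            commands = revised
        approved = judge(commands)
        reports.append(summarize(Outcome(idea, commands, approved)))
    return reports


# Toy stand-ins for the agents; the command text is a placeholder, not vendor syntax.
demo_reports = run_pipeline(
    incident="data theft",
    suggest=lambda incident: ["block exfiltration port"],
    implement=lambda idea: ["  PLACEHOLDER-DENY port 4444  "],
    critique=lambda cmds: [c.strip() for c in cmds],
    judge=lambda cmds: all(c == c.strip() for c in cmds),
    summarize=lambda o: f"{o.idea}: {'approved' if o.approved else 'rejected'}",
)
print(demo_reports)
```

Passing each role in as a separate callable mirrors the idea that each agent has one narrow job; in a real system each callable would wrap a model call or an emulator run.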

Data Highlights

1. 46.7% overall mitigation success rate (MSR) for the multi-agent workflow across 1,782 mitigation attempts, versus 10.7% for the single-agent baseline (4.4× improvement).
2. Role specialization explains roughly 75% of the multi-agent improvement; the iterative implement–critique loop accounts for about 25% of the lift.
3. Stacking approved mitigations on a persistent emulated network reached a 52% cumulative reduction in attacker progress by the fourth defense.
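The arithmetic behind the headline figures can be checked in a few lines. The two MSR values come from the text; applying the 75% / 25% attribution to the absolute gap in percentage points is an assumption made here for illustration, since the summary does not say how the attribution was measured.

```python
multi_msr = 46.7   # multi-agent MSR in percent (from the text)
single_msr = 10.7  # single-agent MSR in percent (from the text)

ratio = multi_msr / single_msr       # relative improvement
gap_points = multi_msr - single_msr  # absolute gap in percentage points

# Assumption: the 75% / 25% attribution is applied to the absolute gap.
role_points = 0.75 * gap_points
loop_points = 0.25 * gap_points

print(f"ratio: {ratio:.1f}x")
print(f"gap: {gap_points:.1f} points")
print(f"role specialization: ~{role_points:.1f} points")
print(f"critique loop: ~{loop_points:.1f} points")
```

Under that assumption, the 36-point gap splits into roughly 27 points from role specialization and 9 points from the critique loop.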

What This Means

Security engineers and incident response teams can use this approach to prototype and validate per-incident network fixes without risking production systems. Technical leaders evaluating AI-driven operations should care because the multi-agent design produces markedly more reliable, testable outputs than a single monolithic agent. Researchers building agent workflows can borrow the role decomposition and the replay-based validation pattern.

Key Figures

Figure 1: Multi-agent automatic mitigation framework showing the overall architecture and agent interactions.
Figure 2: Single-agent baseline workflow for mitigation suggestion, implementation, and self-validation.
Figure 3: Evaluation workflow showing the parallel per-mitigation evaluations (independent, rolled-back) and the cumulative evaluation (persistent cumulative project with sequential mitigation replay). A demo video walking through this diagram can be found in Appendix 8.3.
Figure 4: MSR by runtime condition, pooled across attacks and topologies. The hatched bar shows the rate without the connectivity-regression check.
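The two evaluation modes named in the Figure 3 caption differ only in whether network state is rolled back between mitigations. A minimal sketch, assuming the network can be modelled as a set of applied mitigations and that a hypothetical `score` function returns attacker progress for a given state (neither is taken from the paper):

```python
from typing import Callable, FrozenSet, List, Tuple

State = FrozenSet[str]  # toy model: the set of mitigations currently applied


def evaluate_independent(
    base: State, mitigations: List[str], score: Callable[[State], float]
) -> List[Tuple[str, float]]:
    """Score each mitigation alone; the base state is never modified (rollback)."""
    return [(m, score(base | {m})) for m in mitigations]


def evaluate_cumulative(
    base: State, mitigations: List[str], score: Callable[[State], float]
) -> List[Tuple[str, float]]:
    """Apply mitigations one after another to a persistent state, scoring each step."""
    state = base
    results = []
    for m in mitigations:
        state = state | {m}
        results.append((m, score(state)))
    return results


# Toy score: each applied mitigation removes a quarter of attacker progress.
def toy_score(state: State) -> float:
    return max(0.0, 1.0 - 0.25 * len(state))


approved = ["fix-a", "fix-b"]  # hypothetical, already-approved mitigations
independent = evaluate_independent(frozenset(), approved, toy_score)
cumulative = evaluate_cumulative(frozenset(), approved, toy_score)
print(independent)
print(cumulative)
```

Because the independent runs start from the same base state, they can run in parallel; the cumulative run is inherently sequential, which is what lets it expose compound effects.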

Considerations

Results apply to scripted, replayable adversaries and to a lab with Linux endpoints and specific vendor firmware (FortiGate, Cisco IOS, Open vSwitch); adaptive attackers and Windows endpoints were not evaluated. All testing ran on an isolated emulator. Deployment to production remains a human-reviewed step and is outside the system's scope. The automation itself revealed new risks (a container-escape weakness surfaced during testing), so staging isolation and careful human oversight are required before any real deployment. Related pattern: Defense in Depth Pattern.

Methodology & More

COHORT uses a team of specialized AI agents to automate the whole loop of deriving and validating network-level mitigations after a breach. One agent proposes candidate defenses (firewall rules, routing changes, host hardening). A second translates the chosen idea into concrete device commands under a bounded command budget. A critic iteratively reviews and corrects the command text. A judge replays the original scripted adversary on the mitigated emulated network and compares step-by-step outcomes to a baseline. Finally, a summarizer produces a human-readable artifact.

Tests run in GNS3 with real vendor firmware and Ubuntu endpoints, and adversary behavior is replayed deterministically by a red-team-style emulator so that differences from the baseline can be attributed to the mitigation itself. A connectivity-regression check (LAN ping and an internet probe) rejects mitigations that break legitimate service, and approved mitigations can be cumulatively stacked to reveal compound effects.

Related patterns: Defense in Depth Pattern, A2A Protocol Pattern, Emergence-Aware Monitoring Pattern.

Across small, medium, and large enterprise topologies and four attack scenarios (data theft, ransomware, DNS exfiltration, lateral movement), the multi-agent pipeline achieved a 46.7% mitigation success rate versus 10.7% for a single-agent baseline. Ablation shows that most of the gain comes from role specialization, with the iterative critique loop contributing additional reliability. Stacking mitigations produced diminishing returns and plateaued at about a 52% reduction in attacker progress by the fourth approved fix.

Limitations include the scripted-adversary threat model, Linux-only endpoints, a small set of vendor firmware, and the use of a single model family. Final deployment remains a human decision, and the framework introduces its own test-surface risks that must be managed.
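The judge's accept-or-reject decision described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: attacker progress is modelled here as the fraction of scripted attack steps that succeed, and any positive reduction counts as approval. The function names and the result dictionary are hypothetical; the two connectivity probes (LAN ping, internet probe) are the ones named in the text.

```python
from typing import Dict, List


def attacker_progress(step_succeeded: List[bool]) -> float:
    """Assumed metric: fraction of scripted attack steps that succeeded."""
    return sum(step_succeeded) / len(step_succeeded) if step_succeeded else 0.0


def judge(
    baseline: List[bool], mitigated: List[bool], connectivity: Dict[str, bool]
) -> dict:
    """Reject on any connectivity regression, else compare replayed attack outcomes."""
    if not all(connectivity.values()):
        return {"approved": False, "reason": "connectivity regression", "reduction": 0.0}
    base = attacker_progress(baseline)
    after = attacker_progress(mitigated)
    reduction = (base - after) / base if base > 0 else 0.0
    if reduction > 0:
        return {"approved": True, "reason": "attacker progress reduced", "reduction": reduction}
    return {"approved": False, "reason": "no reduction in attacker progress", "reduction": reduction}


# Same scripted four-step attack replayed before and after a mitigation.
baseline_run = [True, True, True, True]
mitigated_run = [True, True, False, False]

ok = judge(baseline_run, mitigated_run, {"lan_ping": True, "internet_probe": True})
broken = judge(baseline_run, mitigated_run, {"lan_ping": True, "internet_probe": False})
print(ok)
print(broken)
```

Checking connectivity before scoring reflects the point that a mitigation which stops the attack by breaking legitimate service should not count as a success.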
Credibility Assessment:

arXiv-only preprint with no stated affiliations; the authors' reported h-indexes are low (<10). These signals point to emerging work with limited corroborating information.