Stop Rogue Agents: Let Real Behavior, Not Their Blurbs, Decide Who Answers

At a Glance

Choose agents based on measured behavior rather than their written descriptions to eliminate a common attack that tricks routers—reducing description-based hijacks to near zero while improving accuracy and speed.

ON THIS PAGE

What They Found

Mapping each agent’s real performance on trusted test queries into a fixed numerical signature lets the router pick the best agent without ever reading the agent’s natural-language description. That change blocks attackers who hide malicious instructions inside agent descriptions—attack success dropped from a majority for text-based routers to near-zero for the behavior-driven router. The method also makes routing faster and more accurate because selection becomes a single, cheap matrix-vector computation instead of a long text-based reasoning step. The defense assumes a trusted offline calibration pipeline and an unpoisoned benchmark dataset—if those are compromised the geometric operators can be poisoned. The method depends on a fixed embedding function, so vulnerabilities in the embedding stage remain a concern. Using a strict binary success/failure label simplifies separation but may miss nuanced trade-offs in complex multi-step workflows; richer scoring could be needed for more subtle routing goals. Defense in Depth Pattern

Need expert guidance?We can help implement this

Learn More

Key Data

1Attack success rate on a standard benchmark fell to 0.2%–2.4% for the behavior-based router (ANTAP) on MMLU-type tasks.

2A description-based baseline (AutoGen-style) had an attack success rate above 73% on comparable tests.

3Operator training is stable and efficient: registration reduces to an O(N) projection step and uses a regularization setting (λ) stable across 0.1–10.0, with λ=1.0 as the default.

Implications

Platform engineers and security teams running multi-agent systems should care because this approach removes a privileged text-based attack surface and makes routing decisions auditable and reproducible. Developers building agent marketplaces or service registries can use behavior-based registration to align incentives away from polishing descriptions and toward measurable competence. Emergence-Aware Monitoring Pattern

Key Figures

Figure 2 : Visualization of the ANTAP decision surface. This projection contrasts the router’s geometric logic against empirical ground truth for the Malicious agent. Green and red points represent actual successes and failures on the test set, respectively. The background regions visualize the router’s predicted competence zones, projected onto a hybrid discriminant space (LDA/PCA).

Fig 2: Figure 2 : Visualization of the ANTAP decision surface. This projection contrasts the router’s geometric logic against empirical ground truth for the Malicious agent. Green and red points represent actual successes and failures on the test set, respectively. The background regions visualize the router’s predicted competence zones, projected onto a hybrid discriminant space (LDA/PCA).

Figure 3 : Adaptive attack ASR as a function of the trigger length (1-16 tokens) for both ANTAP (blue) and EmbedLLM (red), including std along 7 runs (seeds 0-6)

Fig 3: Figure 3 : Adaptive attack ASR as a function of the trigger length (1-16 tokens) for both ANTAP (blue) and EmbedLLM (red), including std along 7 runs (seeds 0-6)

Ready to evaluate your AI agents?

Learn how ReputAgent helps teams build trustworthy AI through systematic evaluation.

Learn More

Yes, But...

The defense assumes a trusted offline calibration pipeline and an unpoisoned benchmark dataset—if those are compromised the geometric operators can be poisoned. The method depends on a fixed embedding function, so vulnerabilities in the embedding stage remain a concern. Using a strict binary success/failure label simplifies separation but may miss nuanced trade-offs in complex multi-step workflows; richer scoring could be needed for more subtle routing goals. Memory Poisoning

Methodology & More

Agents are normally registered with human-written descriptions and routers pick who should handle a user query by reading those texts. Attackers can slip malicious instructions into these descriptions and thereby hijack the router without ever touching the router’s code. Instead of parsing descriptions at inference time, register each agent by running it on a set of trusted benchmark queries and record whether it succeeds or fails. Turn those results into a numeric operator (a fixed matrix) that summarizes the agent’s behavioral signature in embedding space. At runtime, embed the user query once and score all agents by a single matrix-vector product to pick the best agent—no natural-language ingestion required. Model Context Protocol (MCP) Pattern Because the router only sees precomputed numeric operators and query embeddings at inference, descriptive prompt injections become mathematically inexpressible: a malicious sentence in an agent’s description cannot change the numeric operator used by the router. The paper shows this approach (ANTAP) reduces attack success rates from over 73% for description-based routing to about 0.2%–2.4% on common benchmarks, while also improving routing latency and accuracy. The method also captures sleeper/backdoor behavior during the offline benchmark: if an agent misbehaves on trigger-containing calibration queries it will be encoded as a failure and suppressed by the operator. The trade-offs are practical: you must protect the offline evaluation pipeline and decide whether binary success labels are sufficient for your routing goals, but for many deployments this gives a faster, auditable, and far more secure way to route requests in multi-agent systems. A2A Protocol Pattern

Need expert guidance?We can help implement this

Learn More

Credibility Assessment:

ArXiv preprint with no affiliations and all authors showing low h-indexes (~3). Fits emerging/limited-info category.

multi-agent trust agent track record agent reliability agent governance

Not sure where to start?