How Hidden Skill Files Can Make AI Agents Do Dangerous Things

The Big Picture

Local skill files and helper artifacts can steer otherwise-benign AI agent tasks into unsafe behavior. In the reported tests, many systems had attack success rates similar to or higher than their task success rates. Completing the task does not mean the agent behaved safely.

The Evidence

Adversarial content placed in skill-facing materials (like helper scripts, local corpora, wrapper files, or memory entries) can change an agent’s behavior without changing the user-visible task. The authors built a runnable benchmark of 155 attacked cases from 47 real tasks and paired each case with a rule-based verifier that checks concrete run artifacts. Evaluations show safety failures across multiple agent frameworks and model backends, and show that vulnerability depends on how the agent is scaffolded and which local artifacts it trusts. The benchmark demonstrates that safety testing must include the local, reusable artifacts agents load at runtime, not just isolated model outputs.

Data Highlights

1. Median attack success rate (ASR) across evaluated agent-system/model pairings: 41.8%.

2. Median task success rate across evaluated systems: 37.4%.

3. Benchmark scale and scope: 155 adversarial cases drawn from 47 tasks, organized into 6 risk domains, 30 canonical categories, and 8 attack-class labels.
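To make the two headline metrics concrete, here is a minimal sketch of how per-pairing rates and their medians could be computed. The pairing names and outcome lists are made-up toy values, not benchmark results, and the summary does not say exactly how the benchmark aggregates its numbers, so treat the helper `rate` and the data layout as assumptions.

```python
from statistics import median

def rate(outcomes):
    """Percentage of True values in a list of per-case booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical toy data (NOT from the benchmark): for each agent-system/model
# pairing, one list of "attack succeeded" flags and one of "task succeeded" flags.
runs = {
    "agent_a/model_x": ([True, False, True, False], [True, True, False, False]),
    "agent_b/model_y": ([True, True, True, False], [False, True, False, False]),
    "agent_c/model_z": ([False, False, True, False], [True, True, True, False]),
}

# Attack success rate (ASR) and task success rate per pairing.
asr_by_pairing = {name: rate(attacks) for name, (attacks, _) in runs.items()}
task_by_pairing = {name: rate(tasks) for name, (_, tasks) in runs.items()}

# Medians across pairings, the kind of summary shown as dashed lines in Figure 4.
median_asr = median(asr_by_pairing.values())
median_task = median(task_by_pairing.values())

print(f"median ASR: {median_asr:.1f}%, median task success: {median_task:.1f}%")
```

Reporting both rates side by side is the point: a pairing can sit above the median on task success and still have a high ASR.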

What This Means

Engineers building production agents should use these tests to find safety gaps introduced by reusable skills, helper files, or local data. Security and reliability teams benefit from runnable, artifact-grounded checks and verifiers that catch attacks which only appear during execution. Product and platform managers should treat skill artifacts as part of the trusted surface and add pre-production vetting and runtime monitoring.
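One simple form of pre-production vetting is to pin reviewed skill files by content hash and flag anything new or changed before an agent loads it. The sketch below is an illustration only, not something the benchmark prescribes; the function names `sha256_of` and `find_untrusted` and the allowlist layout (relative path mapped to SHA-256 hex digest) are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_untrusted(skill_dir: Path, allowlist: dict[str, str]) -> list[str]:
    """Return relative paths of files that are missing from the allowlist
    or whose content no longer matches the reviewed hash."""
    flagged = []
    for path in sorted(skill_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(skill_dir).as_posix()
        if allowlist.get(rel) != sha256_of(path):
            flagged.append(rel)
    return flagged
```

A hash check only detects drift from a reviewed state. It says nothing about whether the reviewed content was safe, so it complements runtime monitoring rather than replacing it.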

Key Figures

Figure 1: Problem-to-benchmark overview of SkillSafetyBench.

Figure 2: The construction pipeline of a specific case under the taxonomy of SkillSafetyBench.

Figure 3: An example case in RD3 from SkillSafetyBench.

Figure 4: Attack success versus task success across evaluated agent systems. Each point represents one CLI agent system–model backend pairing. The x-axis reports the task success rate, while the y-axis reports the overall attack success rate (ASR) on SkillSafetyBench. Dashed lines show the median task success rate (37.4%) and median ASR (41.8%) across evaluated systems.

Yes, But...

The benchmark focuses on local, skill-mediated attack surfaces and does not exhaustively cover web-based or fully distributed multi-agent attacks. Verifiers are rule-based and human-validated, so subtle or novel harms that do not leave the expected artifacts could be missed. Experiments cover several strong models and CLI agent systems but may not generalize to every agent scaffold or future model design.

Methodology & More

SkillSafetyBench tests whether non-user local materials (skills, helper scripts, local corpora, wrappers, memory entries, and other runtime artifacts) can push an agent toward unsafe actions while the user’s task stays legitimate. The benchmark authors selected 47 executable tasks that can naturally carry realistic attacks, then created 155 adversarial instances across six risk domains and 30 categories. Each case is runnable and paired with a case-specific rule-based verifier that inspects concrete run artifacts (files, logs, outputs) to decide whether the targeted unsafe behavior occurred. Cases were validated by human review and by an independent judgment protocol using a language model. Artifact-aware tests help ensure that checks cover the artifacts agents load at runtime, not just model outputs.
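A rule-based, artifact-grounded verifier of the kind described above can be sketched as a set of patterns matched against files left behind by a run. This is a minimal illustration under stated assumptions, not the benchmark's actual verifier: the `Rule` class, the `attack_succeeded` function, and the idea of expressing each rule as a file glob plus a regular expression are all hypothetical.

```python
import re
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Rule:
    glob: str     # which artifact files to inspect, relative to the run directory
    pattern: str  # regex whose match is taken as evidence of the unsafe behavior

def attack_succeeded(run_dir: Path, rules: list[Rule]) -> bool:
    """Return True if any rule's pattern appears in any matching run artifact."""
    for rule in rules:
        regex = re.compile(rule.pattern)
        for path in run_dir.glob(rule.glob):
            # Undecodable bytes are skipped so binary-ish logs do not crash the check.
            if path.is_file() and regex.search(path.read_text(errors="ignore")):
                return True
    return False
```

For example, a hypothetical case targeting data exfiltration might use `Rule(glob="logs/*.log", pattern=r"attacker\.example")`. Because the check looks only for expected traces, an unsafe action that leaves different artifacts would go undetected, which matches the limitation noted under "Yes, But...".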

Credibility Assessment:

Affiliation with Peking University and an author with a moderate h-index (~13) provide recognizable institutional and researcher signals, although many of the authors have low h-indices. The venue is arXiv.
