Make AI Skills Prove They’re Safe Before They Can Do Harm

Key Takeaway

Signed skill files are not proof they behave safely; require behavioral verification and keep humans in the loop for any irreversible action until a skill proves correct.

Key Findings

Signatures and policy checks alone are insufficient to trust a skill's runtime behavior.

A compact trust schema with explicit verification levels lets a runtime decide whether to require human approval before a skill performs irreversible actions.

An audit-driven correctness test (the biconditional criterion) ensures the runtime's log and approvals actually match what happened.

A reference implementation demonstrates these ideas with two lock-down modes and a small set of architectural rules operators can adopt.

This approach aligns with concepts in the Agent Registry Pattern.

Data Highlights

4 verification levels defined (unverified, declared, tested, formal) that determine how much the runtime relaxes human gating

100% of irreversible capability calls are routed to a human gate by default unless the skill's verification level permits otherwise

2 runtime modes in the reference implementation (open and enclaved) to choose between dev flexibility and strict lockdown
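
The gating rule in these highlights can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation: the names `VerificationLevel`, `RuntimeMode`, `GatePolicy`, and `requires_human_approval` are hypothetical, and two behaviors are assumptions the summary does not confirm: which level is high enough to skip the human gate, and that enclaved mode never relaxes gating.

```python
from dataclasses import dataclass
from enum import IntEnum


class VerificationLevel(IntEnum):
    # The four levels named in the summary, ordered weakest to strongest.
    UNVERIFIED = 0
    DECLARED = 1
    TESTED = 2
    FORMAL = 3


class RuntimeMode(IntEnum):
    OPEN = 0      # developer-friendly
    ENCLAVED = 1  # strict lockdown


@dataclass(frozen=True)
class GatePolicy:
    # Hypothetical knob: the lowest level allowed to skip the human gate
    # in open mode. The threshold is an assumption, not from the source.
    min_level_to_skip_gate: VerificationLevel = VerificationLevel.FORMAL


def requires_human_approval(
    level: VerificationLevel,
    irreversible: bool,
    mode: RuntimeMode,
    policy: GatePolicy = GatePolicy(),
) -> bool:
    """Return True when a call must be routed to a human decision point."""
    if not irreversible:
        # Only irreversible capability calls are gated.
        return False
    if mode == RuntimeMode.ENCLAVED:
        # Assumption: strict lockdown gates every irreversible call.
        return True
    # Default is to gate; only a sufficiently verified skill may skip.
    return level < policy.min_level_to_skip_gate


if __name__ == "__main__":
    for lvl in VerificationLevel:
        gated = requires_human_approval(lvl, True, RuntimeMode.OPEN)
        print(f"{lvl.name:<10} irreversible call gated in open mode: {gated}")
```

Making the threshold a policy field rather than a constant keeps the default conservative while letting an operator decide how much verification earns autonomy.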

What This Means

Engineers building agent runtimes and tool integrations should care because the schema gives a practical way to avoid stealthy prompt-injection and dangerous side-effects from third-party or self-generated skills. Security and operations teams benefit because the audit-driven gate turns vague confidence into verifiable records for incident review and safer automation. This aligns with principles from the Event-Driven Agent Pattern.

Yes, But...

The schema assumes a trusted root for signatures; signer revocation and in-flight eviction are still tricky operational problems. The biconditional correctness check only audits actions that are logged and observable; covert side channels or external effects outside the corpus could evade it. Declassification rules and fine-grained label composition remain research questions that may affect usability in complex, compositional workflows. Design considerations also touch on cognitive and workflow aspects from the Tree of Thoughts Pattern and practical evaluation approaches from Evaluation-Driven Development (EDDOps).

Full Analysis

A compact, practical trust schema treats each skill as a signed artifact plus a manifest that includes a classification label, declared capabilities, signer identity, a monotone version counter, and a verification level (unverified, declared, tested, formal). Signatures bind identity and authorship but do not imply behavioral correctness.

The runtime enforces a capability gate: every irreversible operation emitted while a skill with a low verification level is active is intercepted and sent to a human-in-the-loop (HITL) decision point. Higher verification levels give the runtime more latitude to let calls proceed without human approval.

To turn these rules into an auditable safety test, the biconditional correctness criterion compares what the runtime says it approved and did against the actual state changes in a designated corpus after a run. The paper proposes an adversarial-ensemble evaluation (a set of realistic files and integrity baselines) to exercise worst-case, destructive intents.

A reference implementation illustrates how to run in two fixed modes, open (developer-friendly) and enclaved (strict lockdown), and extracts a set of architectural guidelines operators can adopt. Overall, the approach is low-cost to adopt and catches the kinds of skill-induced incidents operators already find in postmortems, while leaving open work on revocation, declassification, and compositional label handling. This is related to the Planning Pattern.
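
One way to read the biconditional criterion is: a file in the corpus changed if and only if the audit log records an approved, executed action on it. The sketch below checks that reading against an integrity baseline of SHA-256 digests. It is an interpretation under stated assumptions, not the paper's procedure: `AuditEntry`, `snapshot`, `changed_paths`, and `biconditional_holds` are hypothetical names, the corpus is modeled as an in-memory mapping from path to bytes, and every logged action is assumed to change the file it targets.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEntry:
    # Hypothetical log record: which file an action touched, and whether
    # the runtime says it was approved and actually executed.
    path: str
    approved: bool
    executed: bool


def snapshot(corpus: dict[str, bytes]) -> dict[str, str]:
    """Integrity baseline: map each path to the SHA-256 of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in corpus.items()}


def changed_paths(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Paths that were created, deleted, or modified between two snapshots."""
    return {p for p in before.keys() | after.keys() if before.get(p) != after.get(p)}


def biconditional_holds(
    before: dict[str, str],
    after: dict[str, str],
    log: list[AuditEntry],
) -> tuple[bool, set[str], set[str]]:
    """Check: changed on disk <=> logged as approved and executed.

    Returns (ok, unlogged, phantom), where `unlogged` are changes with no
    approved+executed log entry and `phantom` are logged actions that left
    no observable change.
    """
    observed = changed_paths(before, after)
    claimed = {e.path for e in log if e.approved and e.executed}
    unlogged = observed - claimed
    phantom = claimed - observed
    return (not unlogged and not phantom), unlogged, phantom


if __name__ == "__main__":
    before = snapshot({"a.txt": b"keep", "b.txt": b"old"})
    after = snapshot({"a.txt": b"keep", "b.txt": b"new"})
    log = [AuditEntry("b.txt", approved=True, executed=True)]
    print(biconditional_holds(before, after, log))
```

Returning the two difference sets separately matters for incident review: unlogged changes point to actions that bypassed the gate, while phantom entries point to a log that overstates what ran. As the caveats note, a check like this sees only changes inside the designated corpus.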

Credibility Assessment:

Single-author arXiv preprint with no affiliations or citations and an unfamiliar author — limited identifiable credibility signals.

multi-agent trust, agent governance, agent-to-agent evaluation, agent reliability
