
At a Glance

Maintaining a skill library as a living software asset improves agent success while keeping runtime cost low: on ALFWorld, SkillOps reached 79.5% task success as a standalone planner, and its library-time maintenance pass uses nearly zero extra language-model calls.

Key Findings

Treat skill sets as managed ecosystems rather than static retrieval pools. Representing each skill with a simple contract (preconditions, operation, artifact, validator, failure modes) and connecting skills into a graph lets a maintenance loop diagnose redundancy, missing validators, and incompatibilities, then apply typed fixes such as merge, repair, or retire. Applied before agents run, this cleaning step boosts task success (especially for retrieval-based agents), stays stable as libraries grow larger and noisier, and keeps extra runtime token and call cost near zero. Related pattern for managing skills: Capability Discovery Pattern.
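The contract described above can be pictured as a small record type. This is a minimal sketch, not the paper's implementation: the class name `SkillContract`, its field names, the helper functions, and the example skill are all assumptions chosen to mirror the five contract parts named in the text.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SkillContract:
    """Hypothetical contract record; fields mirror the five parts named in the text."""
    name: str
    preconditions: List[str]   # facts that must hold before the skill runs
    operation: str             # what the skill does
    artifact: str              # what the skill produces
    validators: List[str] = field(default_factory=list)     # checks on the artifact
    failure_modes: List[str] = field(default_factory=list)  # known ways it breaks


def missing_validator(skill: SkillContract) -> bool:
    """A skill with no validator has a validation gap."""
    return not skill.validators


def can_run(skill: SkillContract, state: Set[str]) -> bool:
    """A skill may run only if every precondition is present in the current state."""
    return all(p in state for p in skill.preconditions)


# Illustrative skill (invented for this sketch, not taken from the paper).
heat = SkillContract(
    name="heat_object",
    preconditions=["holding(object)", "at(microwave)"],
    operation="heat the held object in the microwave",
    artifact="hot(object)",
    failure_modes=["microwave_closed"],
)

print(missing_validator(heat))             # True: no validator declared
print(can_run(heat, {"holding(object)"}))  # False: not at the microwave
```

Keeping the contract as plain data means a maintenance pass can inspect it with simple rules, without calling a language model.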

Data Highlights

1. 79.5% task success rate on ALFWorld when SkillOps acts as the standalone planner
2. +8.9 percentage points better than the strongest baseline (LLM_Skill_Planner) at the 200-skill scale
3. Retrieval-heavy agents gain between +0.68 and +2.90 percentage points from the maintenance pass; library-time maintenance uses nearly zero extra language-model calls

Implications

Engineers building multi-step AI agents: use a library maintainer to reduce repeated runtime failures and avoid fixing the same bug over and over. Platform and ML engineering leads: treating skill collections as managed assets lowers downstream risk and runtime cost and improves agent reliability. Researchers evaluating agent orchestration or agent-to-agent interactions: the paper shows that maintenance is a distinct, impactful layer worth measuring separately from planning and retrieval. Related topic: Multi-Agent Knowledge Management.

Key Figures

Figure 1: SkillOps System Architecture. The Hierarchical Skill Ecosystem Graph (HSEG) comprises two levels: (1) an Internal Skill Graph that models each skill as a contract graph over Precondition (P), Operation (O), Artifact (A), Validator (V), and Failure Mode (F) nodes; and (2) an External Graph-of-Graphs connecting skills via typed dependency (dep), compatibility (comp), redundancy (red), and alternative (alt) edges. Two alternating loops govern agent operation: the Task-Time Loop (left, blue) retrieves candidate skill subgraphs, verifies interface compatibility, inserts adapter/validator nodes as needed, and executes the assembled subgraph with local repair on failure; the Library-Time Loop (right, orange) mines skill contracts from execution logs, diagnoses library health across five dimensions (utility, redundancy, compatibility, failure-risk, validation-gap), and applies maintenance actions (merge, repair, retire, add_validator, add_adapter, instantiate) to keep the ecosystem sound.
Figure 3: Maintenance cost summary. The library-time maintenance pass uses nearly zero LLM calls at all scales, while task-time token changes are mostly neutral or negative.
Figure 4: Noise-graded library scaling. SkillOps remains stable as the library grows from 200 to 2000 skills, while retrieval-heavy baselines degrade under increasing noise.
Figure 5: Per-task-type success rate (SR). Results are reported for the 200-skill library, pooled over 3 seeds.
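
The external graph-of-graphs in the Figure 1 caption, with its four typed edges, can be sketched as a small adjacency structure. This is an illustrative sketch under stated assumptions: the class `SkillGraph`, the helper `needs_adapter`, and the skill names are hypothetical, and only the edge labels (dep, comp, red, alt) come from the caption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Edge labels taken from the Figure 1 caption.
EDGE_TYPES = {"dep", "comp", "red", "alt"}


class SkillGraph:
    """Hypothetical external graph: skills as nodes, typed edges between them."""

    def __init__(self) -> None:
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, src: str, kind: str, dst: str) -> None:
        if kind not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {kind}")
        self.edges[src].append((kind, dst))

    def neighbors(self, src: str, kind: str) -> List[str]:
        """Return the skills reachable from src over edges of one type."""
        return [dst for k, dst in self.edges.get(src, []) if k == kind]


def needs_adapter(produced_type: str, expected_type: str) -> bool:
    """Assumed interface check: an adapter is needed when the artifact type
    of one skill does not match the input type the next skill expects."""
    return produced_type != expected_type


graph = SkillGraph()
graph.add_edge("heat_object", "dep", "pick_up_object")   # must run after pick-up
graph.add_edge("heat_object", "alt", "heat_on_stove")    # interchangeable option
graph.add_edge("heat_object", "red", "heat_object_v2")   # suspected duplicate

print(graph.neighbors("heat_object", "dep"))   # ['pick_up_object']
print(needs_adapter("ObjectId", "ObjectId"))   # False
```

Typed edges let each loop ask a narrow question: the task-time loop follows `dep` and `alt` edges when assembling a plan, while the library-time loop looks at `red` edges when proposing merges.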

Yes, But...

The approach requires structured skill contracts and, in parts of the evaluation, gold-standard task arguments that may not exist in many real deployments. Experiments use ALFWorld and a partly synthetic library, so results may vary on other domains or on real long-running logs. The current rule-based maintenance can miss deep semantic redundancy or complex conflicts that need richer reasoning, and it may sometimes conflict with agents that self-repair at task time. This risk echoes the challenges addressed by the Guardrails Pattern.

The Details

SkillOps treats a skill library as a small software ecosystem: every callable skill is written as a contract listing when it can run, what it does, what it outputs, how to validate outputs, and known failure modes. Those contracts are linked into a Hierarchical Skill Ecosystem Graph that records typed edges such as dependencies, compatibility, and redundancy. A library-time maintenance loop inspects observable signals (usage logs, code-hash collisions, missing validators, type mismatches), scores library health along utility, redundancy, compatibility, failure-risk, and validation-gap axes, and applies typed actions like merge, repair, retire, add_validator, or add_adapter to produce a cleaner library that downstream agents can use without changing their planners. Related patterns: Blackboard Pattern, Model Context Protocol (MCP) Pattern.

The paper evaluates on the ALFWorld benchmark with libraries scaled from 200 to 2000 skills (including realistic degradations). As a standalone planner, SkillOps reached 79.5% success and beat a strong LLM-based planner by 8.9 percentage points. When used as a plug-in maintenance layer, it improved retrieval-heavy baselines by +0.68 to +2.90 percentage points and stayed stable as noise and library size increased. The rule-driven maintenance pass adds almost no extra language-model calls or runtime tokens, making maintenance a low-overhead architectural layer. Limitations include dependency on structured metadata, partly synthetic evaluation data, and the rule-based pass missing some semantic issues; richer, model-driven maintenance could catch those at higher cost.
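
A rule-based library-time pass of the kind described above can be sketched in a few lines. This is an assumption-laden illustration, not the paper's algorithm: the function `diagnose`, the dictionary layout of the library, the zero-usage retirement rule, and the example skills are all hypothetical. Only the signals (code-hash collisions, missing validators, usage logs) and the action names (merge, add_validator, retire) come from the text.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def diagnose(library: Dict[str, dict], usage: Dict[str, int]) -> List[Tuple[str, ...]]:
    """Propose typed maintenance actions from observable signals only (no LLM calls)."""
    actions: List[Tuple[str, ...]] = []

    # Redundancy signal: skills whose code hashes collide are merge candidates.
    by_hash: Dict[str, List[str]] = defaultdict(list)
    for name, skill in library.items():
        digest = hashlib.sha256(skill["code"].encode("utf-8")).hexdigest()
        by_hash[digest].append(name)
    for names in by_hash.values():
        if len(names) > 1:
            actions.append(("merge", *sorted(names)))

    for name, skill in library.items():
        # Validation-gap signal: no validator declared.
        if not skill.get("validators"):
            actions.append(("add_validator", name))
        # Utility signal (assumed rule): never used in the logs.
        if usage.get(name, 0) == 0:
            actions.append(("retire", name))
    return actions


# Illustrative library and usage counts, invented for this sketch.
library = {
    "open_drawer": {"code": "goto(d); open(d)", "validators": ["is_open"]},
    "open_drawer_v2": {"code": "goto(d); open(d)", "validators": ["is_open"]},
    "clean_mug": {"code": "goto(sink); clean(mug)", "validators": []},
}
usage = {"open_drawer": 12, "open_drawer_v2": 3, "clean_mug": 0}

for action in diagnose(library, usage):
    print(action)
```

Because every rule here reads logs or metadata, the pass costs no language-model calls, which matches the low overhead the summary reports. The same design also shows the stated limitation: two skills that do the same thing with different code would not collide on a hash and would go unmerged.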

Credibility Assessment

The authors have very low h-indices and list no institutional affiliations, and the paper has no strong venue, so credibility signals are limited.