
Key Takeaway

Reliability in AI agents comes less from ever-larger models and more from moving burdens—memory, procedures, and interaction rules—into persistent, inspectable infrastructure that the model uses.

Key Findings

Externalizing state (memory), procedural know-how (skills), and interaction contracts (protocols) into a surrounding harness makes agents far more stable and governable than relying on model parameters alone. Memory lets agents resume work and personalize behavior across sessions; skills turn ad hoc workflows into reusable procedures; protocols such as the Model Context Protocol (MCP) enforce well-formed exchanges with tools, services, and other agents. A properly designed harness also adds permissioning, control limits, and observability so these external pieces work together and can be audited.
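A minimal sketch of how these harness surfaces might compose in code. Every class and method name below is invented for illustration; the paper defines the concepts (permissioning, control limits, observability), not an API:

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative only: a harness that wraps skill invocation with
# permission checks (control) and an audit log (observability).
@dataclass
class Harness:
    memory: dict = field(default_factory=dict)                  # externalized state
    skills: dict[str, Callable] = field(default_factory=dict)   # reusable procedures
    allowed_tools: set[str] = field(default_factory=set)        # permissioning
    audit_log: list = field(default_factory=list)               # observability

    def register_skill(self, name: str, fn: Callable) -> None:
        self.skills[name] = fn

    def invoke(self, skill: str, *args):
        # Control limit: only explicitly permitted skills may run.
        if skill not in self.allowed_tools:
            self.audit_log.append(("denied", skill))
            raise PermissionError(f"skill {skill!r} not permitted")
        result = self.skills[skill](*args)
        self.audit_log.append(("ok", skill, result))  # auditable trace
        return result

h = Harness(allowed_tools={"summarize"})
h.register_skill("summarize", lambda text: text[:10])
print(h.invoke("summarize", "externalization"))  # -> externaliz
```

Because every call passes through `invoke`, the denial and success paths both leave entries in `audit_log`, which is the property that makes the agent's behavior inspectable after the fact.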

Data Highlights

3 main externalization dimensions: memory (state across time), skills (procedural expertise), and protocols (interaction contracts).
6 harness surfaces coordinate these modules: the three externalization modules plus permission, control, and observability.
4 memory types identified for agents: working context, episodic experience, semantic knowledge, and personalized memory.

What This Means

Engineers building production agents should use these patterns to reduce flaky behavior, recover from interruptions, and make systems auditable. Technical leaders and architects should treat agent capability as co-evolving between models and infrastructure when planning investments and governance. Researchers can use the taxonomy to prioritize empirical work on evaluation, distillation of experience into skills, and harness-level benchmarks.

Key Figures

Figure 1: Externalization as the organizing principle of LLM agent design. Upper panel: The arc of human cognitive externalization from thought through language, writing, printing, to digital computation. Middle panel: The corresponding externalization arc for LLM agents, from weights through three externalization dimensions—Memory (externalized state), Skills (externalized expertise), and Protocols (externalized interaction)—to the Harness that unifies them. Lower panel: A literature landscape mapping representative works onto three capability layers—Weights, Context, and Harness—illustrating how research threads have progressively migrated outward. The parallel between the two arcs encodes a recursive claim: LLM agents achieve reliable agency by externalizing cognitive burdens along the same representational dimensions that have driven human cognitive history.
Figure 2: Community theme evolution across three capability layers. The stacked layers—Weights, Context, and Harness—show how the center of gravity in the LLM agent community has shifted outward over time, from parametric knowledge and prompting toward harness-level infrastructure such as tool ecosystems, protocols, skills, and multi-agent orchestration.
Figure 3: Externalization architecture of a harnessed LLM agent. The Harness sits at the center; three externalization dimensions—Memory (working context, semantic knowledge, episodic experience, personalized memory), Skills (operational procedures, decision heuristics, normative constraints), and Protocols (agent–user, agent–agent, agent–tools)—orbit around it. Operational elements such as sandboxing, observability, compression, evaluation, approval loops, and sub-agent orchestration mediate the interaction between the harness core and the externalized modules.
Figure 4: Memory as externalized state. Raw context from the ephemeral context window and environment feedback is converted into four persistent memory dimensions—working context, episodic experience, semantic knowledge, and personalized memory. These dimensions are organized through progressively more managed architectures: monolithic context, retrieval stores, hierarchical orchestration (with extraction, consolidation, forgetting, and OS-style hot/cold swapping), and adaptive memory systems (with dynamic modules and feedback-based strategy optimization via MoE, RL, etc.). On the harness side, execution traces from skills and protocols flow into externalized memory, which in turn supplies task-relevant content back to the agent core through direct recall and curated snapshots.
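The memory lifecycle described in Figure 4 can be illustrated with a toy store. The four memory type names follow the paper's taxonomy, but the write/recall/consolidate API is invented for this sketch; naive keyword matching stands in for real retrieval, and truncation stands in for consolidation and forgetting:

```python
from collections import defaultdict

# The four memory dimensions from the paper's taxonomy.
MEMORY_TYPES = ("working", "episodic", "semantic", "personalized")

class MemoryStore:
    """Toy persistent memory: write, recall, and forget."""

    def __init__(self):
        self._store = defaultdict(list)

    def write(self, kind: str, item: str) -> None:
        if kind not in MEMORY_TYPES:
            raise ValueError(f"unknown memory type: {kind}")
        self._store[kind].append(item)

    def recall(self, kind: str, query: str) -> list[str]:
        # Naive keyword recall; a real system would use embedding retrieval.
        return [m for m in self._store[kind] if query.lower() in m.lower()]

    def consolidate(self, max_items: int = 100) -> None:
        # "Forgetting": keep only the most recent items per type.
        for kind in self._store:
            self._store[kind] = self._store[kind][-max_items:]

mem = MemoryStore()
mem.write("episodic", "Resolved ticket #42 by restarting the worker")
mem.write("semantic", "Workers restart via the supervisor API")
print(mem.recall("episodic", "ticket"))
```

The key structural point matches the figure: execution traces are written in as episodic or semantic items, and the agent core reads back only the task-relevant slice via `recall`, rather than carrying everything in the context window.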

Yes, But...

The paper is a unified conceptual review rather than an empirical benchmark suite, so concrete performance gains depend on implementation choices. Externalization introduces new costs and risks: storage overhead, stale or unsafe artifacts, and a larger attack surface to govern. The optimal split between model-internal capability and external infrastructure will shift as models improve and new use cases emerge.

Full Analysis

The work frames recent advances in AI agents as a systematic process of externalization: moving selected cognitive burdens out of model weights into persistent, inspectable artifacts. It defines three core externalization modules—memory (to persist state and history), skills (to package reusable procedures and norms), and protocols (to enforce structured interaction)—and shows how a harness combines them with operational surfaces for permission, control, and observability. The review synthesizes literature and architectures (illustrated across eight figures) to explain why many reliability gains in practice stem from better infrastructure rather than larger models.

Methodologically, the paper builds a conceptual taxonomy and maps representative works onto three capability layers—weights, context, and harness—tracing the community shift from model-centric solutions (2022 onward) toward infrastructure-heavy designs (2024–2026). It highlights cross-cutting flows (for example, how episodic memory can be distilled into reusable skills) and lists practical harness requirements: resumability, skill discovery and composition, protocol schemas, sandboxing, execution controls, and logging.

The main implication is strategic: teams should invest in external systems (memory stores, skill registries, protocol contracts, and observability) to make agents predictable, auditable, and easier to govern, while monitoring the evolving boundary between what belongs inside the model and what should be externalized.
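Of the practical harness requirements listed above, resumability is the most concrete. A minimal checkpoint-and-resume sketch, assuming a JSON file format and function names invented here (the paper names the requirement but does not prescribe an implementation):

```python
import json
import os
import tempfile

# Hypothetical checkpoint layout: {"step": int, "state": dict}.
def save_checkpoint(path: str, step: int, state: dict) -> None:
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def resume(path: str) -> tuple[int, dict]:
    if not os.path.exists(path):
        return 0, {}            # no checkpoint: start fresh
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

path = os.path.join(tempfile.gettempdir(), "agent_ckpt.json")
save_checkpoint(path, 3, {"task": "refactor", "files_done": ["a.py"]})
step, state = resume(path)
print(step, state["files_done"])  # -> 3 ['a.py']
```

Because the checkpoint lives outside the model and the context window, an interrupted agent can be restarted on a different process or machine and continue from `step` rather than replaying the whole task.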
Credibility Assessment:

Large author list with modest h-indices (mostly below 20); no strong top-tier affiliations, but the breadth of contributors suggests a coordinated community effort.