
The Big Picture

Treating memory management as a learnable skill and automating its improvement yields large gains: a 32B open model improved 2–4× on long-horizon games while the task model's weights stayed frozen.

Key Findings

Treat memory like a skill the model can practice: let the agent read, write, search, and organize files as first-class actions. A higher-capacity reviewer model inspects full multi-thousand-step episodes and automatically (1) revises the agent's memory scaffold (prompts, file schema, code) and (2) selects and curates memory actions to train a dedicated memory specialist. These two automated loops sharply reduce unproductive actions and redundant memory activity, producing 2–4× gains across three procedurally generated long-horizon games and bringing a 32B open model level with or beyond much larger proprietary systems on these tasks.
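To make "files as first-class actions" concrete, here is a minimal sketch of a file-backed memory with a small action dispatcher. The class name `FileMemory`, the function `apply_action`, the dict-shaped actions, and the file names are all hypothetical and chosen for illustration; the summary does not describe the paper's actual file schema or action format.

```python
import tempfile
from pathlib import Path


class FileMemory:
    """Hypothetical file-backed memory; one directory per episode."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, text: str) -> None:
        # Overwrite (or create) a memory file.
        (self.root / name).write_text(text, encoding="utf-8")

    def append(self, name: str, text: str) -> None:
        # Add one line to the end of a memory file, creating it if needed.
        with (self.root / name).open("a", encoding="utf-8") as f:
            f.write(text + "\n")

    def read(self, name: str) -> str:
        # Return the file's contents, or an empty string if it does not exist.
        path = self.root / name
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def search(self, query: str) -> list:
        # Case-insensitive substring match over every line of every file.
        hits = []
        for path in sorted(self.root.iterdir()):
            for line in path.read_text(encoding="utf-8").splitlines():
                if query.lower() in line.lower():
                    hits.append((path.name, line))
        return hits


def apply_action(memory: FileMemory, action: dict):
    """Dispatch one memory action expressed as a plain dict (assumed format)."""
    op = action["op"]
    if op == "write":
        return memory.write(action["file"], action["text"])
    if op == "append":
        return memory.append(action["file"], action["text"])
    if op == "read":
        return memory.read(action["file"])
    if op == "search":
        return memory.search(action["query"])
    raise ValueError(f"unknown memory op: {op}")


with tempfile.TemporaryDirectory() as tmp:
    mem = FileMemory(Path(tmp) / "episode_0")
    apply_action(mem, {"op": "append", "file": "map.md", "text": "water at (3, 7)"})
    apply_action(mem, {"op": "append", "file": "goals.md", "text": "craft a pickaxe"})
    hits = apply_action(mem, {"op": "search", "query": "WATER"})
    empty = apply_action(mem, {"op": "search", "query": "diamond"})
    print(hits)   # one match, from map.md
    print(empty)  # an empty search: nothing recorded about diamonds
```

Because every memory operation passes through one dispatcher, each call can be logged alongside game actions, which is what lets a reviewer inspect memory behavior over a full episode.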

By the Numbers

1. End-to-end performance gains on three long-horizon games of roughly 2× to 4× after scaffold optimization and memory training.
2. Unproductive game action rate (stuck or oscillating) dropped by 32–65% after scaffold improvements.
3. Redundant memory writes fell by 68–83%, while empty memory searches and per-step context size also fell substantially.

Why It Matters

Engineers building long-running AI agents and systems that need to manage lots of state (game AIs, simulation controllers, task planners) can use these ideas to improve performance without increasing model size. Technical leaders evaluating trade-offs between larger models and smarter tooling can treat memory management as a high-leverage optimization that narrows the gap to bigger proprietary models.

Key Figures

Figure 1: Memory skill optimization with Qwen2.5-32B-Instruct. Starting from a base agent equipped with file-system memory (v0), AutoMem progressively improves performance through memory scaffold optimization (v0–v5/v4/v2), followed by memory proficiency training (+train) that yields further gains on top of the optimized scaffold.
Figure 2: Long-horizon game environments for evaluating memory skills. All three environments are stochastic worlds, making each episode unique and minimizing the influence of pretraining knowledge. Crafter is an open-world survival game with crafting, combat, and resource management. MiniHack presents focused puzzle, navigation, and combat tasks within the NetHack engine. NetHack is among the most complex games: episodes span $10^{4}$–$10^{5}$ turns with a vast exploration space, and human players typically take years to master it.
Figure 3: Overview of AutoMem. Two automated outer loops optimize a shared inner-loop agent that uses the file system as its memory. Outer-loop #1 (top): a meta-LLM reviews full episode traces and iteratively revises the agent scaffold. Outer-loop #2 (bottom): a meta-LLM training engine jointly orchestrates data curation and finetuning configuration to train a dedicated memory specialist that handles memory operations, while the task model (frozen, unmodified) commits task actions. The two loops are complementary: loop #1 produces an optimized scaffold within which loop #2 trains the model to interact with its memory more effectively.
Figure 4: Effect of scaffold optimization on gameplay and memory behavior (v0 → v5 for Crafter, v0 → v4 for MiniHack, v0 → v2 for NetHack). Left: the unproductive game action rate (fraction of steps that are either stuck or oscillating) drops 32–65% across all three environments. Right three panels: memory operations become more efficient and better targeted: redundant writes drop sharply (−68% to −83%), the empty-search rate (memory SEARCHes returning nothing) falls (−13% to −50%), and per-step input context shrinks (−3% to −30%) as leaner memory compresses what the model must attend to. All values are v0 vs. final scaffold version; lower is better in every panel.
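The metrics in Figure 4 can be computed from an episode trace. The sketch below assumes definitions the summary does not spell out: a step is "stuck" when the state equals the previous state, "oscillating" when it equals the state two steps back (an A-B-A pattern), and a write is "redundant" when the same text was already written to the same file earlier in the episode. The function names, trace format, and sample numbers are hypothetical and are not the paper's data.

```python
def unproductive_rate(states):
    """Fraction of steps that are stuck or oscillating (assumed definitions)."""
    if len(states) < 2:
        return 0.0
    flagged = 0
    for i in range(1, len(states)):
        stuck = states[i] == states[i - 1]
        oscillating = i >= 2 and not stuck and states[i] == states[i - 2]
        if stuck or oscillating:
            flagged += 1
    return flagged / (len(states) - 1)


def memory_rates(ops):
    """Redundant-write and empty-search rates over a list of memory ops."""
    seen = set()
    writes = redundant = searches = empty = 0
    for op in ops:
        if op["op"] == "write":
            writes += 1
            key = (op["file"], op["text"])
            if key in seen:          # same text already written to this file
                redundant += 1
            seen.add(key)
        elif op["op"] == "search":
            searches += 1
            if op["hits"] == 0:      # search returned nothing
                empty += 1
    return {
        "redundant_write_rate": redundant / writes if writes else 0.0,
        "empty_search_rate": empty / searches if searches else 0.0,
    }


def relative_change(before, after):
    """Signed relative change, e.g. -0.75 means a 75% drop."""
    return (after - before) / before


# Illustrative trace: agent positions over six timesteps (five transitions).
states = [(0, 0), (0, 1), (0, 1), (0, 0), (0, 1), (1, 1)]
ops = [
    {"op": "write", "file": "map.md", "text": "water at (3, 7)"},
    {"op": "write", "file": "map.md", "text": "water at (3, 7)"},
    {"op": "search", "query": "water", "hits": 1},
    {"op": "write", "file": "map.md", "text": "tree at (2, 5)"},
    {"op": "write", "file": "map.md", "text": "tree at (2, 5)"},
    {"op": "search", "query": "diamond", "hits": 0},
]

print(unproductive_rate(states))     # 0.4: one stuck step and one oscillating step out of five
print(memory_rates(ops))             # both rates are 0.5 for this toy trace
print(relative_change(0.5, 0.125))   # -0.75
```

The figure reports relative changes between the v0 scaffold and the final scaffold, which is what `relative_change` expresses: a rate falling from 0.5 to 0.125 is a 75% drop, not a 37.5-point drop.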


Yes, But...

The study focuses on episodic memory where the file system resets each episode; persistent cross-episode memory was not tested. Evaluations used procedurally generated games (Crafter, MiniHack, NetHack), which are good proxies for long-horizon demands but are still simulated environments. The scaffold and trained memory specialist were optimized per environment, so transfer across very different tasks remains an open question.

Deep Dive

AutoMem reframes external memory as an actively learned skill rather than a fixed module. The agent operates with a file-system memory and can perform file operations (create, append, search, read) as regular actions. During an episode the agent alternates between two routines: LOG (decide what to record about recent events) and PLAN (decide which memory to consult before acting). Because every memory decision is explicit in the action trace, an automated reviewer can inspect full episodes and identify systematic failure modes.

Two automated outer loops run over a shared inner-loop agent. The first loop uses a stronger reviewer model to iterate on the agent scaffold (prompts, file schemas, and small code changes) that guides how memory is used. The second loop has the reviewer curate good memory decisions from many episodes and then finetune a separate memory specialist that handles memory operations while the task model stays frozen. Together these loops cut redundant writes, reduce empty searches and unproductive actions, and boost gameplay progression 2–4× on three long-horizon benchmarks. The result suggests that, for problems that demand long-term information tracking, better memory management can be more effective than simply scaling model size.
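The LOG/PLAN alternation described above can be sketched as a small inner loop. This is a toy under stated assumptions: the `Agent` class, the keyword rules inside `log` and `plan`, and the tuple-shaped trace entries are hypothetical stand-ins for model calls, and a dict replaces the file system to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    memory: dict = field(default_factory=dict)  # stand-in for files: name -> list of lines
    trace: list = field(default_factory=list)   # every memory and task decision is recorded

    def log(self, observation: str) -> None:
        # LOG: decide what (if anything) to record about the latest events.
        # A keyword rule stands in for the model's decision here.
        if "found" in observation:
            self.memory.setdefault("notes.md", []).append(observation)
            self.trace.append(("LOG", "append", observation))
        else:
            self.trace.append(("LOG", "skip", observation))

    def plan(self, goal: str) -> list:
        # PLAN: decide which memory to consult before acting.
        hits = [line for line in self.memory.get("notes.md", []) if goal in line]
        self.trace.append(("PLAN", "search", goal, len(hits)))
        return hits

    def act(self, goal: str, recalled: list) -> str:
        # Task action: in the described setup this would come from the frozen task model.
        action = f"go to {goal}" if recalled else "explore"
        self.trace.append(("ACT", action))
        return action


def run_episode(observations, goal):
    agent = Agent()  # fresh memory each episode, matching the episodic setting
    actions = []
    for obs in observations:
        agent.log(obs)
        recalled = agent.plan(goal)
        actions.append(agent.act(goal, recalled))
    return actions, agent.trace


actions, trace = run_episode(
    ["nothing nearby", "found tree at (2, 5)", "nothing nearby"], goal="tree"
)
print(actions)  # ['explore', 'go to tree', 'go to tree']
for entry in trace:
    print(entry)
```

The point of the structure is the trace: since LOG and PLAN decisions appear as explicit entries next to task actions, a reviewer (human or model) can scan for patterns such as searches that return zero hits or repeated appends of the same line.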
Credibility Assessment

The authors include a Stanford affiliation and one author with an h-index of about 21 (established). Although the work is an arXiv preprint, the strong institutional and author signals warrant a high credibility rating.