
At a Glance

Storing whole-screen screenshots as memory helps recognition but creates new mistakes. Storing small, action-focused image crops and retrieving them by subtask reduces all four observed failure modes and raises task success by about 6.8–9.1 percentage points on OSWorld.

What They Found

GUI-controlling models fail in four repeatable ways: misunderstanding the visible state, missing actions hidden behind other controls, getting the plan wrong, or aiming at the right element but clicking imprecisely. Adding full-screen screenshot memory helps with recognizing states but increases errors that come from hidden operations and imprecise targeting. Compressing memory into action-relevant image crops and narrowing retrieval to the current subtask, together with a recovery memory for detected mistakes, reduces all four failure types and improves end-to-end task success on benchmarks.
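One way to picture an "action-relevant image crop" is a fixed-size window around the point where the demonstrated action landed. The sketch below only computes the crop rectangle. The function name, the 256-pixel default size, and the centre-then-shift strategy are assumptions made for illustration; the summary does not say how AGMem chooses its crop regions.

```python
from typing import Tuple


def action_crop_box(
    click_x: int,
    click_y: int,
    screen_w: int,
    screen_h: int,
    crop_w: int = 256,  # assumed crop size, not taken from the paper
    crop_h: int = 256,
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop centred on the action
    point, shifted where needed so it stays inside the screenshot."""
    # Never ask for a crop larger than the screenshot itself.
    crop_w = min(crop_w, screen_w)
    crop_h = min(crop_h, screen_h)

    # Centre the window on the click, then clamp it to the screen bounds.
    left = max(0, min(click_x - crop_w // 2, screen_w - crop_w))
    top = max(0, min(click_y - crop_h // 2, screen_h - crop_h))
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    # A click near the top-left corner yields a crop pinned to that corner.
    print(action_crop_box(10, 10, 1920, 1080))
```

The returned tuple uses the (left, top, right, bottom) ordering that common image libraries accept for cropping, so it could be passed to a cropping call on the stored screenshot.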

By the Numbers

1. Full-image memory slightly reduced visual-state mistakes (73.1% → 69.6%) but increased hidden-operation blindness by 11.7 percentage points on OSWorld with GPT-5.4-mini.
2. Action-Grounded Visual Memory (AGMem) cut visual state misunderstanding by ~37 percentage points and reduced hidden-operation blindness by ~26 percentage points (OSWorld, GPT-5.4-mini).
3. AGMem raised end-to-end task accuracy by about 6.8–9.1 percentage points on OSWorld, outperforming both no-memory and naive full-image memory configurations.
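The figures above are stated in percentage points, which are absolute differences between two rates rather than relative changes. As a small worked example using only the 73.1% → 69.6% pair quoted in the list, the following sketch computes both readings. The helper names are illustrative and not from the source.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(after_pct - before_pct, 1)


def relative_change_pct(before_pct: float, after_pct: float) -> float:
    """Change expressed as a percentage of the starting rate."""
    return round(100.0 * (after_pct - before_pct) / before_pct, 1)


if __name__ == "__main__":
    # Visual-state failure rate with full-image memory, as quoted above.
    print(percentage_point_change(73.1, 69.6))  # -3.5 percentage points
    print(relative_change_pct(73.1, 69.6))      # -4.8 percent, relative
```

Read this way, the "slight" reduction is 3.5 percentage points, while the reported increase in hidden-operation blindness (11.7 points) is more than three times that size.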

What This Means

Engineers building AI that interacts with user interfaces should care because memory content and retrieval strategy directly change failure patterns and overall success. Technical leads evaluating agent reliability should consider action-focused memory and recovery examples as practical ways to reduce costly missteps in automation. Researchers studying multimodal agents can use the failure taxonomy and AGMem as a blueprint for improving visual memory design.

Key Figures

Figure 1: (a) Cognitive Failure
Figure 2: Failure mode distributions. Failure mode distribution of GPT-5.4-mini across OSWorld, AgentNetBench, and WebForge. Each benchmark exhibits a distinct dominant mode: hidden operation blindness on OSWorld, grounding error on WebForge, and visual state misunderstanding on AgentNetBench.
Figure 3: Per-mode failure rates on OSWorld with GPT-5.4-mini. Full-image memory reduces state-level failures (cognitive failure, visual state misunderstanding) but worsens action-level failures (hidden operation blindness, grounding error). AGMem is the only configuration that consistently reduces all failure modes.
Figure 4: Conceptual illustration of action-grounded visual memory. Without memory, an agent may select an incorrect UI target. Full-image visual memory can provide useful prior experience, but the action-relevant cue may remain small or spatially ambiguous within the full screenshot, causing incorrect grounding. Action-grounded visual memory focuses the retrieved example on the region associated with the demonstrated action, making the relevant target easier to identify and localize.

Keep in Mind

Results were measured on specific GUI benchmarks and primarily evaluated with a single model (GPT-5.4-mini), so absolute gains may vary with different models or UIs. AGMem depends on accurate subtask decomposition and reliable cropping and retrieval; poor crops or mismatched retrieval can limit benefits. Any persistent visual memory must be filtered for sensitive content before deployment to avoid privacy risks.

Methodology & More

Modern agents that control software by reading screenshots tend to fail in four repeatable ways: cognitive mistakes (wrong plan), visual state misunderstanding (misreading what’s on screen), hidden-operation blindness (missing controls that only appear after another action), and grounding errors (aiming at roughly the right area but clicking the wrong spot). Naively appending full screenshots from past steps helps with simple recognition problems but often distracts the agent with irrelevant visual clutter, causing more errors in hidden actions and precise targeting.

Action-Grounded Visual Memory (AGMem) addresses this by compacting memories into small image crops tied to the action they supported, narrowing retrieval to steps relevant to the current subtask, and maintaining a separate recovery memory for examples of corrective behavior. Experimental results on multiple GUI benchmarks show AGMem reduces each of the four failure types and raises end-to-end success by roughly 6.8–9.1 percentage points compared with vanilla and full-image memory setups. The practical implication is that what you store and how you fetch it both matter: focused, action-aligned visual examples are more useful than large, noisy screenshots, and recovery-aware retrieval helps when earlier mistakes create misleading states.
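The three ingredients described above (action-tied crops, subtask-scoped retrieval, and a separate recovery memory) can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and field names are hypothetical, crops are represented only by an identifier string, and retrieval uses an exact subtask-label match where a real system would more plausibly use some form of similarity search.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryEntry:
    subtask: str   # label of the subtask the step belonged to
    action: str    # textual description of the demonstrated action
    crop_id: str   # hypothetical handle to a stored image crop


@dataclass
class ActionGroundedMemory:
    # Regular examples and corrective examples are kept in separate stores.
    by_subtask: Dict[str, List[MemoryEntry]] = field(default_factory=dict)
    recovery: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    def add(self, entry: MemoryEntry, is_recovery: bool = False) -> None:
        store = self.recovery if is_recovery else self.by_subtask
        store.setdefault(entry.subtask, []).append(entry)

    def retrieve(
        self, subtask: str, mistake_detected: bool = False, k: int = 3
    ) -> List[MemoryEntry]:
        """Return up to k entries for the current subtask only.

        Recovery examples come first, and only when a mistake was detected.
        Exact label matching is an assumption made to keep the sketch short.
        """
        hits: List[MemoryEntry] = []
        if mistake_detected:
            hits.extend(self.recovery.get(subtask, []))
        hits.extend(self.by_subtask.get(subtask, []))
        return hits[:k]


if __name__ == "__main__":
    mem = ActionGroundedMemory()
    mem.add(MemoryEntry("export pdf", "click File > Export", "crop_001"))
    mem.add(MemoryEntry("export pdf", "press Esc to close dialog", "crop_002"),
            is_recovery=True)
    mem.add(MemoryEntry("rename file", "press F2", "crop_003"))
    for entry in mem.retrieve("export pdf", mistake_detected=True):
        print(entry.crop_id, entry.action)
```

Keeping the recovery store separate means corrective examples never crowd out ordinary ones during normal operation, and entries from unrelated subtasks are never returned, which mirrors the narrowing of retrieval described in the paragraph above.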
Credibility Assessment:

ArXiv preprint with mostly low h-index authors (highest h-index 8). No prominent affiliations or top-venue publication; suggests emerging work rather than established/reliable source.