The Big Picture
Coordinating specialized agents with a visual memory that saves milestone screenshots plus a web-aware tutorial retriever makes desktop automation much more reliable and better at handling unfamiliar tasks.
The Evidence
A central controller that coordinates a reflection-enabled memory and several specialist tools dramatically improves success on long, complex desktop tasks. The memory module keeps key screenshots and generates trajectory-level reflections to catch intent drift and repeated loops. A multimodal web searcher actively browses and retrieves visually aligned tutorials, letting the system handle out-of-distribution (unseen) problems. Together these pieces raise success rates across Ubuntu, Windows, and Mac benchmarks.
Data Highlights
65.8% success on OSWorld (up 2.4 percentage points over the prior state of the art).
63.5% success on WindowsAgentArena (up 6.9 percentage points).
46.0% success on MacOSArena (up 38.0 percentage points).
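Reading the three highlights as 65.8%, 63.5%, and 46.0%, the size of each gain is easier to judge next to the baseline it implies. The sketch below only does that subtraction. The baselines it prints are derived from the figures quoted here, not numbers reported separately in the source, and the helper name `implied_baseline` is made up for illustration.

```python
# Success rate and gain (in percentage points) for each benchmark,
# copied from the highlights above.
results = {
    "OSWorld": (65.8, 2.4),
    "WindowsAgentArena": (63.5, 6.9),
    "MacOSArena": (46.0, 38.0),
}


def implied_baseline(score: float, gain_pp: float) -> float:
    """Prior best score implied by a score and its gain in percentage points."""
    return round(score - gain_pp, 1)


for name, (score, gain) in results.items():
    base = implied_baseline(score, gain)
    # Relative gain = percentage-point gain divided by the implied baseline.
    print(f"{name}: implied prior best {base}%, relative gain {gain / base:.0%}")
```

The implied prior bests come out to 63.4%, 56.6%, and 8.0%. The MacOSArena result therefore starts from a very low baseline, which is why its jump looks so much larger than the others.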
What This Means
Engineers building automation agents should consider splitting work across specialized modules (planner, memory auditor, web searcher, coder) to improve robustness. Product and technical leaders evaluating automation tools can use these ideas to improve handling of unseen apps or versions without heavy manual data curation.
Key Figures

Figure 1: Current limitations in CUA framework.

Figure 2: Pipeline overview. OS-Symphony comprises three primary components: (1) The Orchestrator, acting as the system’s brain, responsible for task understanding and action prediction; (2) Tool Agents, consisting of Grounder, Coder, and Searcher, where the Searcher retrieves up-to-date tutorials in a human-like manner; and (3) The Reflection-Memory Agent, which compresses execution trajectories to maintain long-term memory and facilitate trajectory-level reflection.

Figure 3: Pipeline of RMA. At each step, RMA summarizes the previous action using pre- and post-action screenshots and the Orchestrator’s output, while evaluating the current GUI operation’s correctness. It then generates a reflection from all summaries and milestone screenshots, and determines whether the latest step is a milestone.
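The per-step bookkeeping in this caption can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the authors' implementation. In the described system a model produces the summary, the correctness judgment, and the milestone decision from screenshots. Here those three values are simply passed in, and the reflection is a plain heuristic that flags a repeated summary as a loop. The names `StepRecord`, `ReflectionMemory`, and `loop_window` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepRecord:
    summary: str        # short text summary of what the action did
    succeeded: bool     # whether the GUI operation was judged correct
    is_milestone: bool  # whether this step's screenshot is kept
    screenshot: bytes = b""


@dataclass
class ReflectionMemory:
    steps: List[StepRecord] = field(default_factory=list)
    loop_window: int = 3  # assumed threshold: identical summaries in a row

    def record_step(self, summary: str, succeeded: bool,
                    is_milestone: bool, screenshot: bytes = b"") -> str:
        """Store one step and return a reflection over the whole trajectory."""
        # Keep the screenshot only for milestones to bound memory use.
        kept = screenshot if is_milestone else b""
        self.steps.append(StepRecord(summary, succeeded, is_milestone, kept))
        return self.reflect()

    def reflect(self) -> str:
        """Heuristic stand-in for the model-generated reflection."""
        recent = [s.summary for s in self.steps[-self.loop_window:]]
        if len(recent) == self.loop_window and len(set(recent)) == 1:
            return "loop: same action repeated; try a different approach"
        if self.steps and not self.steps[-1].succeeded:
            return "last action failed; re-check the target element"
        return "on track"

    def milestones(self) -> List[StepRecord]:
        return [s for s in self.steps if s.is_milestone]
```

Storing text summaries for every step while keeping screenshots only for milestones is what compresses the trajectory: the history stays short enough to reflect over as a whole.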

Figure 4: The Pass@K results on OSWorld. All experiments are carried out with GPT-5 and a 100-step limit.
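The caption does not say how Pass@K is computed. One common definition, borrowed from code-generation evaluation, is the probability that at least one of K attempts succeeds, estimated from n recorded attempts of which c succeeded. The sketch below implements that estimator as an assumption about what the metric means here. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds.

    n: total attempts recorded for a task
    c: how many of those attempts succeeded
    k: number of attempts drawn (without replacement) from the n
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the draws that contain only failed attempts;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a task attempted four times with one success, this gives 0.25 at K=1 and 0.5 at K=2. A benchmark-level score would average the per-task values.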
Considerations
Results are reported only on desktop environments (Ubuntu, Windows, Mac); mobile platforms were not evaluated. The multi-agent design introduces substantial overhead: higher token use, more inter-agent communication, and execution that is currently tens of times slower than a human. Screen-based operation also raises privacy and safety concerns, so deployments must include strict permission controls and data sanitization.
Methodology & More
OS-Symphony is a modular system built around an Orchestrator (the decision maker) that coordinates a Reflection-Memory Agent and several Tool Agents (Searcher, Coder, Grounders). The memory agent retains milestone screenshots and compressed trajectory summaries, then audits past steps to produce high-level reflections that flag problems like repeated loops or intent drift. The Searcher actively browses web pages and retrieves visually aligned tutorials (not just text), enabling the orchestration layer to bring in external multimodal knowledge when the agent faces unfamiliar software or versions.
Evaluation across three desktop benchmarks shows consistent gains: meaningful improvements on Ubuntu (OSWorld) and larger jumps on cross-platform tests, especially Mac, where prior agents struggled. The trade-offs are clear: the approach improves generalization and long-horizon stability but incurs computational and latency costs and depends on careful privacy safeguards.
Future work should push faster coordinated reasoning, adapt the approach to mobile interfaces, and swap in improved memory or search components as they become available.
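The coordination described above can be sketched as a simple control loop. This is a hedged illustration of the general orchestrator pattern, not OS-Symphony's actual code. The function `run_task`, the `plan` and `reflect` callbacks, and the tool names are all hypothetical, and the planner and tools would be model-backed agents in a real system. The default `max_steps=100` mirrors the step limit quoted for the OSWorld experiments.

```python
from typing import Callable, Dict, List, Tuple

# A tool agent takes an instruction string and returns a result string.
ToolAgent = Callable[[str], str]
# The planner sees the task, the history, and the latest reflection,
# and returns (tool_name, argument); tool_name "done" ends the task.
Planner = Callable[[str, List[str], str], Tuple[str, str]]


def run_task(task: str,
             plan: Planner,
             tools: Dict[str, ToolAgent],
             reflect: Callable[[List[str]], str],
             max_steps: int = 100) -> List[str]:
    """Run the orchestrator loop and return the step history."""
    history: List[str] = []
    reflection = ""
    for _ in range(max_steps):
        tool_name, argument = plan(task, history, reflection)
        if tool_name == "done":
            break
        # Delegate the step to the chosen specialist (searcher, coder, ...).
        result = tools[tool_name](argument)
        history.append(f"{tool_name}({argument}) -> {result}")
        # Feed the trajectory back through the memory agent's reflection
        # so the next planning call can correct drift or loops.
        reflection = reflect(history)
    return history
```

Keeping the planner, tools, and reflection as separate callables reflects the modularity argument in the text: any one of them can be swapped for a better component without touching the loop.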
Credibility Assessment:
Affiliation with Nanjing University (a recognized institution) and moderate h-indices across the authors give the work reasonable credibility despite its arXiv venue.