The Big Picture
Simulating active users shows that current proactive assistants succeed on only 42% of tasks; the main failure points are realistic user constraints and execution reliability, not just goal understanding.
The Evidence
A realistic simulation in which users must navigate app screens while assistants call backend functions uncovers gaps that static tests miss. The Proactive Agent Research Environment (Pare) models users with stateful app interfaces and evaluates assistants on goal inference, timing, and multi-app coordination. Across 143 tasks, even top models reached only a 42% success rate, and smaller models fail mainly during the execution step rather than at goal recognition. Separating "observe" from "execute" (ask before acting) improves reliability but doesn't close the gap.
Data Highlights
1. Top frontier models achieved a 42% overall task success rate on Pare-Bench.
2. Pare-Bench includes 143 diverse proactive scenarios across communication, productivity, scheduling, and lifestyle apps.
3. Seven models were evaluated (4 closed-source, 3 open-weight); smaller models struggle primarily with executing actions.
What This Means
Engineers building assistants should use active-user simulation to test whether their agent's suggestions are actionable and accepted, not just plausible. Product and privacy leads should note the value of on-device inference and API-level observation for user privacy and practical deployment. Researchers can use Pare-Bench to measure improvements in goal inference, timing, and multi-app orchestration.
Key Figures

Figure 1: Overview of the Pare framework architecture. Pare consists of an event-based environment that models stateful app transitions. The user-environment interface exposes a selective set of tools based on the current state S_t of the active app, and user actions trigger app state transitions, whereas the agent-environment interface exposes all tools as a flat API structure so proactive assistants can gather information and execute tasks efficiently.

Figure 2: Comparison of tool-call chains for sending a message. Non-FSM agent frameworks allow direct API calls (left), while Pare's FSM-based design requires the user to navigate through sequential app screens (right), matching how real users interact with mobile apps.
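The asymmetry in Figures 1 and 2 can be sketched in a few lines: the simulated user only sees the tools exposed by the current app screen and must traverse screens to act, while the assistant calls a flat backend API directly. This is a minimal illustrative sketch; the class, tool names, and screen graph are assumptions for this example, not Pare's actual API.

```python
from dataclasses import dataclass

@dataclass
class AppFSM:
    """A stateful app: each screen (state) exposes only some tools."""
    state: str
    tools_by_state: dict   # screen -> tools visible on that screen
    transitions: dict      # (screen, tool) -> next screen

    def available_tools(self):
        return self.tools_by_state[self.state]

    def act(self, tool):
        # A user action is only valid if the current screen exposes the tool.
        if tool not in self.available_tools():
            raise ValueError(f"{tool} not available on screen {self.state}")
        # Tools without an entry keep the user on the same screen.
        self.state = self.transitions.get((self.state, tool), self.state)
        return self.state

# The user must navigate three screens to send a message...
messages = AppFSM(
    state="home",
    tools_by_state={
        "home": ["open_chat_list"],
        "chat_list": ["open_chat"],
        "chat": ["type_message", "send_message"],
    },
    transitions={
        ("home", "open_chat_list"): "chat_list",
        ("chat_list", "open_chat"): "chat",
    },
)
for step in ["open_chat_list", "open_chat", "send_message"]:
    messages.act(step)

# ...while the assistant sees every tool in one flat API and needs no navigation.
FLAT_API = {"open_chat_list", "open_chat", "type_message", "send_message"}
assert "send_message" in FLAT_API  # one direct call
```

Trying `messages.act("send_message")` from the `home` screen raises a `ValueError`, which is exactly the realistic user constraint the flat-API assistant never faces.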

Figure 3: Proposal rate, acceptance rate, and execution success of different models as a function of tool failure probability, evaluated across 4 runs.

Figure 4: Proposal rate, acceptance rate, and execution success of different models as a function of environment event noise, evaluated across 4 runs.
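The kind of robustness sweep shown in Figures 3 and 4 can be approximated by rerunning tasks several times while injecting tool failures at a chosen probability and averaging success over runs. The sketch below is a toy model of that setup, assuming tasks succeed only if every call in a fixed-length tool chain succeeds; the function names and the task model are illustrative assumptions, not Pare's evaluation code.

```python
import random

def run_task(p_fail, rng, n_calls=5):
    """A task succeeds only if every tool call in its chain succeeds."""
    return all(rng.random() >= p_fail for _ in range(n_calls))

def execution_success(p_fail, n_tasks=200, n_runs=4, seed=0):
    """Average per-run success rate across repeated runs (cf. '4 runs')."""
    rng = random.Random(seed)
    per_run = []
    for _ in range(n_runs):
        successes = sum(run_task(p_fail, rng) for _ in range(n_tasks))
        per_run.append(successes / n_tasks)
    return sum(per_run) / n_runs

for p in (0.0, 0.1, 0.3):
    print(f"p_fail={p}: execution success ~ {execution_success(p):.2f}")
```

Even this toy model shows why per-call failures compound: with a 5-call chain, a 10% per-call failure rate already cuts task-level success to roughly (0.9)^5 ≈ 0.59, which mirrors how execution success degrades quickly as tool failure probability rises.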
Keep in Mind
Simulated users are driven by language models and finite state machines, so their behavior may not perfectly match real human users or edge cases. Scenarios were generated by models and then human-verified, which helps scale but can introduce generation biases. The environment abstracts screen content as action events (API-level), so visual or accessibility interactions that matter on real phones may be underrepresented.
Methodology & More
Pare creates an asymmetric simulation in which a user agent must navigate realistic app screens (modeled as finite state machines) while a proactive assistant has flat backend access. The user agent can accept or reject proposals, and the assistant must infer goals from user actions and the event stream before proposing plans. The Observe-Execute architecture separates continuous monitoring from execution: the assistant proposes a plan and executes only after user approval, preserving user control. Using an LLM-driven scenario generator plus human validation, Pare-Bench assembles 143 tasks spanning messaging, calendars, shopping, and more. Evaluations use acceptance rate, proposal rate, and multiple success metrics (single-run success, repeated-run reliability, and average success). Testing seven models shows top systems reach only 42% success, with smaller models failing mostly at executing multi-step actions. The release of Pare and Pare-Bench encourages more realistic testing of proactive assistants, highlights the need to improve reliable execution and timing, and supports on-device deployment to protect user data.
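The Observe-Execute separation described above can be sketched as a small control loop: the assistant monitors the event stream, proposes a plan when it infers a goal, and executes only after explicit user approval. This is a minimal sketch under stated assumptions; the function names, event strings, and toy goal-inference rule are all illustrative, not the paper's implementation.

```python
from typing import Callable, Optional

def observe(events) -> Optional[list]:
    """Infer a goal from the event stream; return a plan or None.
    (Toy rule standing in for LLM-based goal inference.)"""
    if "user_opened_calendar" in events:
        return ["find_free_slot", "draft_invite"]
    return None

def observe_execute(events, ask_user: Callable[[list], bool], run_tool):
    plan = observe(events)
    if plan is None:
        return "idle"        # keep monitoring; no proposal made
    if not ask_user(plan):   # user rejects -> nothing is executed
        return "rejected"
    for tool in plan:        # execute only after explicit approval
        run_tool(tool)
    return "executed"

# Usage: an accepting user triggers execution; a rejecting user does not.
called = []
status = observe_execute(
    ["user_opened_calendar"],
    ask_user=lambda plan: True,
    run_tool=called.append,
)
assert status == "executed" and called == ["find_free_slot", "draft_invite"]
```

The key design property is that `run_tool` is unreachable without `ask_user` returning `True`, which is how the architecture preserves user control while still allowing continuous observation.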
Credibility Assessment:
The authors are from a recognizable university (UC Santa Barbara), and the team includes an author with an h-index of ~22 (established). Although the work is an arXiv preprint, author reputation and institutional affiliation raise its credibility.