The Big Picture
Simulating active users shows that current proactive assistants succeed on only 42% of tasks; the main failure points are realistic user constraints and execution reliability, not just goal understanding.
The Evidence
A realistic simulation in which users must navigate app screens while assistants call backend functions uncovers gaps that static tests miss. The Proactive Agent Research Environment (Pare) models users with stateful app interfaces and evaluates assistants on goal inference, timing, and multi-app coordination. Across 143 tasks, even top models reached only a 42% success rate, and smaller models fail mainly during the execution step rather than at goal recognition. Separating "observe" from "execute" (ask before acting) improves reliability but doesn't close the gap.
Data Highlights
1. Top frontier models achieved a 42% overall task success rate on Pare-Bench.
2. Pare-Bench includes 143 diverse proactive scenarios across communication, productivity, scheduling, and lifestyle apps.
3. Seven models were evaluated (4 closed-source, 3 open-weight); smaller models struggle primarily with executing actions.
What This Means
Engineers building assistants should use active-user simulation to test whether their agent's suggestions are actionable and accepted, not just plausible. Product and privacy leads should note the value of on-device inference and API-level observation for user privacy and practical deployment. Researchers can use Pare-Bench to measure improvements in goal inference, timing, and multi-app orchestration.
Key Figures

Figure 1: Overview of the Pare framework architecture. Pare consists of an event-based environment that models stateful app transitions. The user-environment interface exposes a selective set of tools based on the current state S_t of the active app, and user actions trigger app state transitions, whereas the agent-environment interface exposes all tools as a flat API structure so proactive assistants can gather information and execute tasks efficiently.

Figure 2: Comparison of tool-call chains for sending a message. Non-FSM agent frameworks allow direct API calls (left), while Pare's FSM-based design requires the user to navigate through sequential app screens (right), matching how real users interact with mobile apps.
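The asymmetry in Figures 1 and 2 can be sketched in a few lines: the simulated user only sees the tools exposed by the current app screen and must traverse screens to act, while the assistant calls a flat backend API directly. This is a minimal illustrative sketch; the class, tool names, and screen graph are assumptions for this example, not Pare's actual API.

```python
from dataclasses import dataclass

@dataclass
class AppFSM:
    """A stateful app: each screen (state) exposes only some tools."""
    state: str
    tools_by_state: dict   # screen -> tools visible on that screen
    transitions: dict      # (screen, tool) -> next screen

    def available_tools(self):
        return self.tools_by_state[self.state]

    def act(self, tool):
        # A user action is only valid if the current screen exposes the tool.
        if tool not in self.available_tools():
            raise ValueError(f"{tool} not available on screen {self.state}")
        # Tools without an entry keep the user on the same screen.
        self.state = self.transitions.get((self.state, tool), self.state)
        return self.state

# The user must navigate three screens to send a message...
messages = AppFSM(
    state="home",
    tools_by_state={
        "home": ["open_chat_list"],
        "chat_list": ["open_chat"],
        "chat": ["type_message", "send_message"],
    },
    transitions={
        ("home", "open_chat_list"): "chat_list",
        ("chat_list", "open_chat"): "chat",
    },
)
for step in ["open_chat_list", "open_chat", "send_message"]:
    messages.act(step)

# ...while the assistant sees every tool in one flat API and needs no navigation.
FLAT_API = {"open_chat_list", "open_chat", "type_message", "send_message"}
assert "send_message" in FLAT_API  # one direct call
```

Trying `messages.act("send_message")` from the `home` screen raises a `ValueError`, which is exactly the realistic user constraint the flat-API assistant never faces.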

Figure 3: Proposal rate, acceptance rate, and execution success of different models as a function of tool failure probability, evaluated across 4 runs.

Figure 4: Proposal rate, acceptance rate, and execution success of different models as a function of environment event noise, evaluated across 4 runs.
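The kind of robustness sweep shown in Figures 3 and 4 can be approximated by rerunning tasks several times while injecting tool failures at a chosen probability and averaging success over runs. The sketch below is a toy model of that setup, assuming tasks succeed only if every call in a fixed-length tool chain succeeds; the function names and the task model are illustrative assumptions, not Pare's evaluation code.

```python
import random

def run_task(p_fail, rng, n_calls=5):
    """A task succeeds only if every tool call in its chain succeeds."""
    return all(rng.random() >= p_fail for _ in range(n_calls))

def execution_success(p_fail, n_tasks=200, n_runs=4, seed=0):
    """Average per-run success rate across repeated runs (cf. '4 runs')."""
    rng = random.Random(seed)
    per_run = []
    for _ in range(n_runs):
        successes = sum(run_task(p_fail, rng) for _ in range(n_tasks))
        per_run.append(successes / n_tasks)
    return sum(per_run) / n_runs

for p in (0.0, 0.1, 0.3):
    print(f"p_fail={p}: execution success ~ {execution_success(p):.2f}")
```

Even this toy model shows why per-call failures compound: with a 5-call chain, a 10% per-call failure rate already cuts task-level success to roughly (0.9)^5 ≈ 0.59, which mirrors how execution success degrades quickly as tool failure probability rises.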
Keep in Mind
Simulated users are driven by language models and finite state machines, so their behavior may not perfectly match real human users or edge cases. Scenarios were generated by models and then human-verified, which helps scale but can introduce generation biases. The environment abstracts screen content as action events (API-level), so visual or accessibility interactions that matter on real phones may be underrepresented.
Methodology & More
Pare creates an asymmetric simulation in which a user agent must navigate realistic app screens (modeled as finite state machines) while a proactive assistant has flat backend access. The user agent can accept or reject proposals, and the assistant must infer goals from user actions and the event stream before proposing plans. The Observe-Execute architecture separates continuous monitoring from execution: the assistant proposes a plan and executes only after user approval, preserving user control. Using an LLM-driven scenario generator plus human validation, Pare-Bench assembles 143 tasks spanning messaging, calendars, shopping, and more. Evaluations use acceptance rate, proposal rate, and multiple success metrics (single-run success, repeated-run reliability, and average success). Testing seven models shows top systems reach only 42% success, with smaller models failing mostly at executing multi-step actions. The release of Pare and Pare-Bench encourages more realistic testing of proactive assistants, highlights the need to improve reliable execution and timing, and supports on-device deployment to protect user data.
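The Observe-Execute separation described above can be sketched as a small control loop: the assistant monitors the event stream, proposes a plan when it infers a goal, and executes only after explicit user approval. This is a minimal sketch under stated assumptions; the function names, event strings, and toy goal-inference rule are all illustrative, not the paper's implementation.

```python
from typing import Callable, Optional

def observe(events) -> Optional[list]:
    """Infer a goal from the event stream; return a plan or None.
    (Toy rule standing in for LLM-based goal inference.)"""
    if "user_opened_calendar" in events:
        return ["find_free_slot", "draft_invite"]
    return None

def observe_execute(events, ask_user: Callable[[list], bool], run_tool):
    plan = observe(events)
    if plan is None:
        return "idle"        # keep monitoring; no proposal made
    if not ask_user(plan):   # user rejects -> nothing is executed
        return "rejected"
    for tool in plan:        # execute only after explicit approval
        run_tool(tool)
    return "executed"

# Usage: an accepting user triggers execution; a rejecting user does not.
called = []
status = observe_execute(
    ["user_opened_calendar"],
    ask_user=lambda plan: True,
    run_tool=called.append,
)
assert status == "executed" and called == ["find_free_slot", "draft_invite"]
```

The key design property is that `run_tool` is unreachable without `ask_user` returning `True`, which is how the architecture preserves user control while still allowing continuous observation.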
Credibility Assessment:
The authors are from a recognizable university (UC Santa Barbara), and the team includes an author with an h-index of ~22 (established). Although the work is an arXiv preprint, author reputation and institutional affiliation raise its credibility.