Key Takeaway
Setting and checking short, verifiable mini-goals during both runtime and training gives web-controlling AI a dense progress signal, dramatically reducing mid-task “stuck” failures and improving long-horizon success.
What They Found
Automated failure analysis showed that the dominant failure mode in web navigation is poor mid-task planning: agents get stuck or pursue unreasonable long-range steps instead of logical next milestones. The approach generates compact subgoals with a stronger teacher model, uses them at inference to guide planning, and turns them into shaped rewards during offline training (MiRA), helping agents chain steps reliably. It improves both proprietary inference engines and open-source policies, reducing stuck behavior and enabling sequential subgoal completion across long tasks.
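The inference-time half of this idea can be sketched as a checklist the agent consults between actions. This is a minimal illustration, not the paper's implementation: the class names (`SubGoal`, `Checklist`) and the keyword-matching check are assumptions; the paper uses a reasoning model with trace reflection to judge progress.

```python
# Hypothetical sketch of an inference-time subgoal checklist.
# The paper judges progress with a reasoning model; a trivial
# keyword match stands in for that here.
from dataclasses import dataclass

@dataclass
class SubGoal:
    description: str   # compact, verifiable milestone
    done: bool = False

@dataclass
class Checklist:
    goals: list

    def update(self, observation: str) -> None:
        # Mark a milestone done if the observation mentions it.
        for g in self.goals:
            if not g.done and g.description.lower() in observation.lower():
                g.done = True

    def next_goal(self):
        # The first unfinished milestone guides the next planning step.
        return next((g for g in self.goals if not g.done), None)

checklist = Checklist([SubGoal("open search page"), SubGoal("submit query")])
checklist.update("The agent did open search page successfully.")
assert checklist.next_goal().description == "submit query"
```

The checklist gives the planner a concrete "next logical action" target instead of re-deriving the whole plan at each step, which is what reduces mid-task stalls.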
Data Highlights
1. A large foundation model scored 75% on general UI tasks but only 36% on a hard web-navigation benchmark, a sharp drop on open-ended web tasks.
2. An out-of-the-box proprietary model exhibited mid-task “stuck” behavior in nearly 50% of evaluation trajectories on WebArena-Lite.
3. A smaller open model still failed to progress in over 30% of cases after supervised fine-tuning; milestone-driven training reduced this dominant error mode.
Why It Matters
Engineers building agents for web automation or UI control should care because adding simple, verifiable subgoals can make agents recoverable and more reliable in long workflows. Technical leaders and researchers evaluating agent reliability can use subgoal-based inference and milestone-shaped training to raise success rates without vastly larger models or datasets.
Key Figures

Fig 1 (paper Figure 1): Overview of milestoning the agents.

Fig 2 (paper Figure 3): Failure-mode distribution of existing out-of-the-box models.

Fig 3 (paper Figure 5): Dynamic milestoning framework for enhanced LLM agent inference. The architecture depicts the real-time feedback loop in which the online agent's actions are monitored against a SubGoals checklist. The reasoning model uses trace reflection to determine progress (z_{t+1}), providing a dense, grounded signal that directs the agent's next planning step and enables self-correction.

Fig 4 (paper Figure 6): The MiRA-RL training pipeline. During the interaction phase, the agent generates trajectories, such as the successful "Add a bicycle" task or the failed "Find recent orders" task. These are evaluated by an Auto Rater (binary final success) and a SubGoal Checker (intermediate progress). This data trains two distinct critics: a Value Critic V_φ for final success and a Potential Critic P_ψ that models progress. The actor policy is updated using shaped rewards, with updates stabilized by actor perplexity filtering.
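The "shaped rewards" in this pipeline can be illustrated with classic potential-based shaping. This is a simplification under stated assumptions: here the potential is just the fraction of subgoals completed, whereas MiRA learns a Potential Critic P_ψ; the function names and the discount value are illustrative.

```python
# Minimal sketch of potential-based reward shaping over milestone progress.
# phi is a hand-written potential (fraction of subgoals done); the paper
# learns this function instead, so treat this as an illustration only.

def phi(completed: int, total: int) -> float:
    # Potential of a state: share of milestones completed so far.
    return completed / total if total else 0.0

def shaped_reward(r: float, done_t: int, done_t1: int,
                  total: int, gamma: float = 0.99) -> float:
    # r' = r + gamma * phi(s_{t+1}) - phi(s_t)
    return r + gamma * phi(done_t1, total) - phi(done_t, total)

# Completing one of four milestones yields a positive bonus even when
# the final-task reward r is still 0.
bonus = shaped_reward(0.0, 1, 2, 4)
assert bonus > 0
```

Because the shaping term telescopes along a trajectory, it densifies feedback without changing which policies are optimal for the final-success reward, which is why pairing it with a separate final-success critic is a sensible design.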
Considerations
Results are reported on a curated subset (WebArena-Lite, 165 tasks) rather than the full benchmark, so domain coverage is limited. The method depends on a strong teacher model to generate reliable subgoals; weaker teachers may produce noisy or misleading milestones. Shaped rewards risk overemphasizing intermediate milestones unless balanced by a final-goal value critic and filtering, so careful tuning is needed for different environments.
Deep Dive
An automated failure analyzer revealed that long web tasks fail mainly because agents lose track of useful intermediate steps and get stuck in non-productive loops. The fix is to generate compact, verifiable subgoals (milestones) from a stronger teacher model and use them in two ways: at inference, as a lightweight checklist that guides planning, and during offline policy training, as shaped rewards that provide denser feedback. The inference component helps the agent pick the next logical action, while the training component (MiRA) uses a learned potential function to turn subgoal progress into incremental rewards.

MiRA's training pipeline keeps two critics: one for final success and one that models progress across milestones. Trajectories are filtered and used to alternately refine the policy and the potential critic, producing a curriculum that focuses on the hard cases uncovered by the failure analyzer. On WebArena-Lite the approach reduced the dominant "stuck midway" error mode and helped open-source models match or outperform much larger proprietary backbones on long-horizon web tasks.

The takeaway: dense, structured milestones, used both at decision time and during learning, are a practical lever for making web-controlling agents more reliable without relying solely on model scale.
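The trajectory filtering step mentioned above can be sketched as a perplexity gate: before an update, drop trajectories that the current actor finds too surprising. This is a hypothetical sketch; the threshold, the per-action log-probability inputs, and the exponential form are assumptions, not details from the paper.

```python
# Hypothetical sketch of actor perplexity filtering: keep only
# trajectories whose actions the current policy assigns reasonable
# probability, stabilizing offline updates. Threshold is illustrative.
import math

def perplexity(action_log_probs):
    # Perplexity = exp(-mean log-probability of the taken actions).
    return math.exp(-sum(action_log_probs) / len(action_log_probs))

def filter_trajectories(trajs, max_ppl=5.0):
    # trajs: list of per-trajectory action log-prob lists under the actor.
    return [t for t in trajs if perplexity(t) <= max_ppl]

# A likely trajectory is kept; a highly surprising one is dropped.
kept = filter_trajectories([[-0.1, -0.2], [-4.0, -5.0]])
assert len(kept) == 1
```

The intuition is that wildly off-policy trajectories produce high-variance gradient estimates, so gating on the actor's own perplexity keeps the shaped-reward updates stable.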
Credibility Assessment:
The author list is mostly low h-index and the venue is arXiv, but it includes at least one recognizable researcher (Edward Grefenstette), which raises credibility relative to work from purely unknown authors.