Evaluation | Production Ready
ClawBench
by TIGER-AI-Lab
Real-world benchmark for browser agents across 153 live-web tasks
Python
Updated Jun 6, 2026
Summary
Benchmarks browser-based AI agents on 153 everyday online tasks across 144 live websites. It records 5 layers of interaction and uses DOM-match scoring together with an LLM judge to evaluate task success and behavior. The dataset and harness emphasize real-world web complexity rather than synthetic prompts.
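As a rough illustration of how a DOM-match check could be combined with a judge verdict, here is a minimal sketch using only the Python standard library. It is not ClawBench's actual harness or API: the function names (`dom_match`, `task_passed`), the rule that both signals must agree, and the sample page snapshot are all assumptions made for this example.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag in a page snapshot."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def dom_match(final_html, expected_tag, expected_attrs):
    """Return True if some element has the expected tag and all expected attributes."""
    collector = ElementCollector()
    collector.feed(final_html)
    return any(
        tag == expected_tag
        and all(attrs.get(key) == value for key, value in expected_attrs.items())
        for tag, attrs in collector.elements
    )


def task_passed(final_html, expected_tag, expected_attrs, judge_verdict):
    """Hypothetical rule: require both a DOM match and a positive judge verdict."""
    return dom_match(final_html, expected_tag, expected_attrs) and judge_verdict


# Hypothetical final page state after an agent fills in a form.
snapshot = '<form><input name="email" value="a@example.com"><button id="submit"></button></form>'
print(dom_match(snapshot, "input", {"name": "email", "value": "a@example.com"}))
```

In this sketch the judge verdict is passed in as a plain boolean, so the LLM call itself stays outside the example; a real harness would likely score behavior in more detail than a single pass/fail flag.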
Why It Matters
As agents act in the wild, surface-level benchmarks miss the real web failure modes that harm trust. ClawBench exposes practical reliability and failure patterns by running agents on live sites and capturing rich interaction traces. That evidence is essential for building agent track records, for A2A evaluation, and for meaningful continuous agent evaluation pipelines.
Ideal For
Researchers and engineers testing browser-based agents who need reproducible, real-world task evaluations and detailed interaction traces. This fits well with the Role-Based Agent Pattern for designing and validating agent capabilities.
How It's Used
- Evaluating browser agent reliability on real websites to reveal practical failure modes
- Collecting multi-layer interaction traces for debugging agent delegation and automation logic
- Benchmarking LLM-judged task success to compare agent versions or control strategies
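The version comparison in the last item can be sketched as a simple success-rate aggregation. This is an illustrative example only: the record layout `(version, task_id, passed)`, the `success_rates` helper, and the sample data are hypothetical and do not come from ClawBench.

```python
from collections import defaultdict


def success_rates(results):
    """Map each agent version to its fraction of tasks judged successful."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for version, _task_id, passed in results:
        totals[version] += 1
        passes[version] += int(passed)
    return {version: passes[version] / totals[version] for version in totals}


# Hypothetical records: (agent version, task id, judged success).
results = [
    ("v1", "task-001", True),
    ("v1", "task-002", False),
    ("v2", "task-001", True),
    ("v2", "task-002", True),
]
rates = success_rates(results)
print(rates)  # {'v1': 0.5, 'v2': 1.0}
```

Because live websites change between runs, a comparison like this is most meaningful when both versions are evaluated on the same tasks at around the same time.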
Topics
agent-evaluation, agentic-ai, ai-agent-benchmark, ai-agents, benchmark, browser-agent, browser-automation, browser-use, chrome-agent, chrome-extension, +10 more
Similar Tools
agent-playground, agent-arena
Keywords
agent-evaluation, browser-agent, A2A evaluation, agent track record