Evaluation, Production Ready

ClawBench

by TIGER-AI-Lab

Real-world benchmark for browser agents across 153 live-web tasks

Python
Updated Jun 6, 2026
381 Stars
22 Forks


Summary

Benchmarks browser-based AI agents on 153 everyday online tasks across 144 live websites. It records 5 layers of interaction and uses DOM-match scoring plus an LLM judge to evaluate task success and behavior. The dataset and harness emphasize real-world web complexity rather than synthetic prompts.
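
To make the idea of DOM-match scoring combined with a judge verdict concrete, here is a minimal sketch in Python. It is not ClawBench's implementation: the function names `dom_match_score` and `task_passed`, the `expected` format, and the pass threshold are all hypothetical, and the judge verdict is assumed to arrive as a plain boolean from a separate LLM call.

```python
from html.parser import HTMLParser


class _ElementCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every element in an HTML snapshot."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append((tag, dict(attrs)))


def dom_match_score(final_html, expected):
    """Return the fraction of expected elements found in the final DOM.

    `expected` is a hypothetical list of (tag, required_attrs) pairs. An
    element matches when the tag is equal and every required attribute
    has the given value.
    """
    if not expected:
        return 0.0
    collector = _ElementCollector()
    collector.feed(final_html)
    hits = 0
    for tag, required in expected:
        if any(
            found_tag == tag
            and all(found_attrs.get(k) == v for k, v in required.items())
            for found_tag, found_attrs in collector.elements
        ):
            hits += 1
    return hits / len(expected)


def task_passed(final_html, expected, judge_verdict, threshold=1.0):
    """Assumed rule: pass only if the DOM check and the judge both agree."""
    return bool(dom_match_score(final_html, expected) >= threshold and judge_verdict)
```

Requiring both signals is one possible design choice: the DOM check confirms the page reached the expected end state, while the judge verdict can cover behavior that a snapshot alone cannot show.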

Why It Matters

As agents act in the wild, surface-level benchmarks miss the real web failure modes that harm trust. ClawBench exposes practical reliability and failure patterns by running agents on live sites and capturing rich interaction traces. That evidence is essential for building agent track records, for A2A evaluation, and for meaningful continuous agent evaluation pipelines.

Ideal For

Researchers and engineers testing browser-based agents who need reproducible, real-world task evaluations and detailed interaction traces. This fits well with the Role-Based Agent Pattern for designing and validating agent capabilities.

How It's Used

  • Evaluating browser agent reliability on real websites to reveal practical failure modes
  • Collecting multi-layer interaction traces for debugging agent delegation and automation logic
  • Benchmarking LLM-judged task success to compare agent versions or control strategies
Topics
agent-evaluation, agentic-ai, ai-agent-benchmark, ai-agents, benchmark, browser-agent, browser-automation, browser-use, chrome-agent, chrome-extension, +10 more
Similar Tools
agent-playground, agent-arena
Keywords
agent-evaluation, browser-agent, A2A evaluation, agent track record