Evaluation

Production Ready

ClawBench

Name: ClawBench
Rating: 3.0 (381 reviews)
Author: TIGER-AI-Lab

Real-world benchmark for browser agents across 153 live-web tasks

Python

Updated Jun 6, 2026

Stars: 381

Summary

Benchmarks browser-based AI agents on 153 everyday online tasks across 144 live websites. Records 5 layers of interaction and uses DOM-match scoring plus an LLM judge to evaluate task success and behavior. The dataset and harness emphasize real-world web complexity rather than synthetic prompts.

Why It Matters

As agents act in the wild, surface-level benchmarks miss real web failure modes that harm trust. ClawBench exposes practical reliability and failure patterns by running agents on live sites and capturing rich interaction traces. That evidence is essential for building agent track records for A2A evaluation and for meaningful continuous agent evaluation pipelines.

Ideal For

Researchers and engineers testing browser-based agents who need reproducible, real-world task evaluations and detailed interaction traces. This fits well with the Role-Based Agent Pattern for designing and validating agent capabilities.

How It's Used

Evaluating browser agent reliability on real websites to reveal practical failure modes
Collecting multi-layer interaction traces for debugging agent delegation and automation logic
Benchmarking LLM-judged task success to compare agent versions or control strategies
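
The version-comparison use case above can be sketched in a few lines of Python. This is an illustrative sketch only, not ClawBench's actual API or output format: the `TaskResult` schema, the task ids, and the rule that a task counts as solved only when both the DOM match and the LLM judge pass are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One run of one task (hypothetical schema, not ClawBench's own)."""
    task_id: str
    dom_match: bool   # did the final page state match the expected DOM?
    judge_pass: bool  # did the LLM judge accept the run?


def succeeded(result: TaskResult) -> bool:
    # Assumption: a task is solved only if both signals pass.
    return result.dom_match and result.judge_pass


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(succeeded(r) for r in results) / len(results)


def regressions(old: list[TaskResult], new: list[TaskResult]) -> list[str]:
    """Ids of tasks solved by the old version but failed by the new one."""
    new_by_id = {r.task_id: r for r in new}
    return sorted(
        r.task_id
        for r in old
        if succeeded(r)
        and r.task_id in new_by_id
        and not succeeded(new_by_id[r.task_id])
    )


# Made-up results for two agent versions on four made-up tasks.
v1 = [
    TaskResult("book-flight", True, True),
    TaskResult("order-pizza", True, False),
    TaskResult("file-form", True, True),
    TaskResult("find-job", False, False),
]
v2 = [
    TaskResult("book-flight", True, True),
    TaskResult("order-pizza", True, True),
    TaskResult("file-form", False, True),
    TaskResult("find-job", False, False),
]

print(f"v1: {success_rate(v1):.0%}, v2: {success_rate(v2):.0%}")
print("regressions:", regressions(v1, v2))
```

Reporting regressions alongside the aggregate rate is a deliberate choice here: two versions can share the same success rate while failing on different tasks, which a single number would hide.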
