Evaluation

What It Means

A single assessment event where an agent's performance is measured against specific criteria.

Evaluation is a point-in-time measurement of agent capability. A single evaluation is valuable, but it tells you how an agent performed once, not whether that performance is reliable.
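To make the point concrete, here is a minimal sketch that compares one evaluation score with a summary over repeated runs. The scores and the `summarize` helper are hypothetical, invented purely for illustration; they do not come from any real agent or benchmark.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of evaluation scores."""
    if len(scores) < 2:
        raise ValueError("need at least two evaluations to estimate spread")
    return mean(scores), stdev(scores)


# Hypothetical scores from five runs of the same evaluation (0.0 to 1.0).
scores = [0.9, 0.5, 0.9, 0.6, 0.6]

single = scores[0]               # what one evaluation alone would report
avg, spread = summarize(scores)  # what repeated evaluations reveal

print(f"single run: {single:.2f}")
print(f"mean over {len(scores)} runs: {avg:.2f} (std dev {spread:.2f})")
```

In this made-up data the first run looks stronger than the average, which is exactly the kind of gap a single measurement cannot show.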

Types of Evaluation

  • Benchmark-based: Standardized test suites
  • Task-based: Real-world task completion
  • Adversarial: Red-team testing for failure modes
  • Comparative: Head-to-head against other agents

Relationship to Reputation

Evaluation is an event; reputation is a story. Each evaluation contributes evidence to an agent's overall reputation.
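One way to picture this relationship is to treat each evaluation as a record and reputation as an aggregate over many records. The sketch below is an assumption-laden illustration: the `Evaluation` and `EvaluationType` names, the normalized score field, and the per-type averaging in `reputation` are all hypothetical choices, not a prescribed reputation model.

```python
from dataclasses import dataclass
from enum import Enum


class EvaluationType(Enum):
    """The four evaluation types listed above."""
    BENCHMARK = "benchmark"
    TASK = "task"
    ADVERSARIAL = "adversarial"
    COMPARATIVE = "comparative"


@dataclass(frozen=True)
class Evaluation:
    """One assessment event (hypothetical record layout)."""
    agent_id: str
    kind: EvaluationType
    score: float  # assumed to be normalized to [0.0, 1.0]


def reputation(evaluations, agent_id):
    """Average score per evaluation type for one agent (a deliberately naive aggregate)."""
    by_kind = {}
    for ev in evaluations:
        if ev.agent_id == agent_id:
            by_kind.setdefault(ev.kind, []).append(ev.score)
    return {kind: sum(s) / len(s) for kind, s in by_kind.items()}


# Hypothetical evaluation history for two agents.
history = [
    Evaluation("agent-a", EvaluationType.BENCHMARK, 0.8),
    Evaluation("agent-a", EvaluationType.BENCHMARK, 0.6),
    Evaluation("agent-a", EvaluationType.ADVERSARIAL, 0.5),
    Evaluation("agent-b", EvaluationType.TASK, 0.9),
]

rep = reputation(history, "agent-a")
for kind, avg in rep.items():
    print(f"{kind.value}: {avg:.2f}")
```

A plain average is only the simplest possible aggregate; a real system might also weight recent evaluations more heavily or track variance, but those choices are outside what this entry specifies.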
