Evaluation: Definition & Context

Evaluation


What It Means

A single assessment event where an agent's performance is measured against specific criteria.

Evaluation is a point-in-time measurement of agent capability. A single evaluation is valuable, but it tells you how an agent performed once, not whether that performance is reliable.

Types of Evaluation

Benchmark-based: Standardized test suites
Task-based: Real-world task completion
Adversarial: Red-team testing for failure modes
Comparative: Head-to-head against other agents
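
As a rough illustration, one evaluation event could be recorded as a small immutable record tagged with one of the four types above. This is only a sketch: the names EvaluationType, Evaluation, agent_id, criterion, and score are hypothetical, and the 0.0 to 1.0 score range is an assumption, not something this glossary defines.

```python
from dataclasses import dataclass
from enum import Enum


class EvaluationType(Enum):
    BENCHMARK = "benchmark"      # standardized test suite
    TASK = "task"                # real-world task completion
    ADVERSARIAL = "adversarial"  # red-team testing for failure modes
    COMPARATIVE = "comparative"  # head-to-head against other agents


@dataclass(frozen=True)
class Evaluation:
    """One assessment event: a single agent measured against one criterion."""
    agent_id: str
    kind: EvaluationType
    criterion: str
    score: float  # assumption: normalized to the range [0.0, 1.0]

    def __post_init__(self) -> None:
        # Reject scores outside the assumed normalized range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be in [0.0, 1.0]")


# Hypothetical example record for a red-team style evaluation.
record = Evaluation(
    agent_id="agent-a",
    kind=EvaluationType.ADVERSARIAL,
    criterion="prompt-injection resistance",
    score=0.8,
)
```

Freezing the record reflects the idea that an evaluation is a fixed event: once it has happened, its result does not change.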

Relationship to Reputation

Evaluation is an event; reputation is a story. Each evaluation contributes evidence to an agent's overall reputation.
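
One way to picture how events add up to evidence is to summarize a history of scores by their count, average, and spread. The sketch below assumes scores are plain floats and uses a hypothetical summarize helper; it is not a definition of how reputation is or should be computed.

```python
from statistics import mean, pstdev


def summarize(scores: list[float]) -> dict[str, float]:
    """Summarize a history of evaluation scores (hypothetical helper)."""
    if not scores:
        raise ValueError("at least one evaluation score is required")
    return {
        "count": len(scores),      # how many evaluation events there are
        "mean": mean(scores),      # average performance across events
        "spread": pstdev(scores),  # population standard deviation of scores
    }


# A single evaluation versus a short history of three evaluations.
single = summarize([0.9])
history = summarize([0.9, 0.5, 0.7])
```

With a single score the spread is 0.0, which reflects a lack of data rather than consistency. Only the longer history shows how much performance varies from one evaluation to the next.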

