Evaluation is a point-in-time measurement of agent capability. A single evaluation is valuable, but it tells you only how an agent performed once, not whether that performance is reliable.
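To make the gap between one result and reliable performance concrete, the sketch below estimates a pass rate together with a Wilson score interval. The function name `pass_rate_with_interval` and the pass/fail encoding of outcomes are assumptions chosen for illustration; they are not part of any particular evaluation framework.

```python
import math


def pass_rate_with_interval(outcomes, z=1.96):
    """Return (pass_rate, low, high) for a list of pass/fail outcomes.

    Uses the Wilson score interval; z=1.96 corresponds to roughly 95%
    confidence. `outcomes` is a hypothetical list of booleans, one per
    evaluation run.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    p = sum(outcomes) / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# One passing run versus twenty passing runs: same observed pass rate,
# very different certainty about the underlying reliability.
single = pass_rate_with_interval([True])
repeated = pass_rate_with_interval([True] * 20)
```

Both calls report an observed pass rate of 1.0, but the lower bound for the single run is far below the lower bound for twenty runs, which is the sense in which one evaluation cannot establish reliability.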
Types of Evaluation
- Benchmark-based: Standardized test suites
- Task-based: Real-world task completion
- Adversarial: Red-team testing for failure modes
- Comparative: Head-to-head against other agents
Relationship to Reputation
Evaluation is an event; reputation is a story. Each evaluation contributes evidence to an agent's overall reputation.
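One way to picture evaluations as evidence that accumulates into reputation is the minimal sketch below. The names `EvalType`, `Evaluation`, and `Reputation` are hypothetical, and the scoring rule (the mean of a Beta distribution with a uniform prior) is an assumption chosen for simplicity, not a method prescribed by the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class EvalType(Enum):
    # Mirrors the four evaluation types listed above.
    BENCHMARK = "benchmark"
    TASK = "task"
    ADVERSARIAL = "adversarial"
    COMPARATIVE = "comparative"


@dataclass(frozen=True)
class Evaluation:
    """A single point-in-time result (the 'event')."""
    eval_type: EvalType
    passed: bool


@dataclass
class Reputation:
    """Accumulated evidence across many evaluations (the 'story')."""
    successes: int = 0
    failures: int = 0
    history: list = field(default_factory=list)

    def record(self, evaluation: Evaluation) -> None:
        # Keep the full history and update the running counts.
        self.history.append(evaluation)
        if evaluation.passed:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def score(self) -> float:
        # Assumed rule: mean of Beta(1 + successes, 1 + failures),
        # which starts at 0.5 when there is no evidence yet.
        return (1 + self.successes) / (2 + self.successes + self.failures)


rep = Reputation()
for ev in [
    Evaluation(EvalType.BENCHMARK, True),
    Evaluation(EvalType.TASK, True),
    Evaluation(EvalType.ADVERSARIAL, False),
    Evaluation(EvalType.COMPARATIVE, True),
]:
    rep.record(ev)
```

Under this assumed rule, no single evaluation determines the score; each one shifts it, and the shift shrinks as more evidence accumulates. A fuller design might weight evaluation types differently or discount older results.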