Held-Out Test Set

Definition

Evaluation data kept separate from training to assess how well an agent generalizes to unseen examples.

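Held-out sets do not prevent overfitting by themselves; they expose it, by testing on data the agent has never seen. A large gap between performance on training data and on the held-out set signals that the agent has memorized rather than generalized. This is fundamental to honest evaluation.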

Best Practices

Never use test data during development, including for tuning or model selection
Refresh test sets periodically
Use multiple held-out sets for robustness
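
One way to keep test data out of development is to fix the split once, deterministically, so an example never drifts between sets as the dataset grows. The following is a minimal sketch, not a prescribed method: it assumes each example is a dict with a stable `id` field, and the names `assign_split` and `split_examples` are hypothetical.

```python
import hashlib


def assign_split(example_id: str, test_fraction: float = 0.2) -> str:
    """Assign an example to 'train' or 'test' by hashing its ID.

    The same ID always lands in the same split, regardless of dataset
    order or size (assumption: IDs are stable and unique).
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_examples(examples, test_fraction: float = 0.2):
    """Partition examples (dicts with an 'id' key) into train and test lists."""
    train, test = [], []
    for ex in examples:
        if assign_split(ex["id"], test_fraction) == "test":
            test.append(ex)
        else:
            train.append(ex)
    return train, test
```

Because assignment depends only on the ID, adding new examples later never moves an existing example from the test set into training. Using a different salt or hash prefix per held-out set would be one way to carve out several independent sets.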

Contamination Risks

Test data leaked into training sets
Benchmark saturation over time
Indirect exposure through similar data
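
Direct leakage of test data into training sets can be screened for with a simple overlap check. The sketch below flags test examples whose word 5-grams largely appear in the training text. The function names, the 5-gram window, and the 0.5 threshold are illustrative assumptions, and a check like this catches only near-verbatim leakage, not indirect exposure through paraphrased or similar data.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def ngrams(text: str, n: int = 5) -> set:
    """Return the set of word n-grams in the normalized text."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(test_texts, train_texts, n: int = 5, threshold: float = 0.5):
    """Return (index, overlap) for test texts whose n-gram overlap with
    the training data meets the threshold (assumed value, tune as needed)."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)

    flagged = []
    for idx, text in enumerate(test_texts):
        grams = ngrams(text, n)
        if not grams:
            continue  # text shorter than n words: nothing to compare
        overlap = len(grams & train_grams) / len(grams)
        if overlap >= threshold:
            flagged.append((idx, overlap))
    return flagged
```

Flagged examples would typically be reviewed and then removed from either the training data or the test set before results are reported.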

evaluation, data, generalization
