Held-out sets do not prevent overfitting; they expose it, by evaluating on data the agent has never seen during training or development. This is fundamental to honest evaluation.
Best Practices
- Never use test data during development, including for tuning, model selection, or error analysis
- Refresh test sets periodically
- Use multiple held-out sets for robustness
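
One way to keep test data out of development is to make split membership a fixed function of each example's identifier, so an example can never drift from a held-out set into the development pool between runs. The sketch below is illustrative only: `assign_split`, the split names, and the 20% default are assumptions, not part of any particular framework.

```python
import hashlib


def assign_split(example_id, held_out_sets=("held_out_a", "held_out_b"),
                 held_out_fraction=0.2):
    """Deterministically map an example ID to 'dev' or one of several held-out sets.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Hash the ID so the assignment is stable across runs and machines.
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32

    if bucket >= held_out_fraction:
        return "dev"

    # Divide the held-out range evenly among the held-out sets.
    index = int(bucket / held_out_fraction * len(held_out_sets))
    return held_out_sets[min(index, len(held_out_sets) - 1)]


if __name__ == "__main__":
    for example_id in ["task-001", "task-002", "task-003"]:
        print(example_id, assign_split(example_id))
```

Because the assignment depends only on the ID, rerunning the split cannot leak a held-out example into development. Refreshing a test set still means collecting new examples; re-hashing existing ones would move previously seen development data into the held-out sets.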
Contamination Risks
- Test data leaked into training sets
- Benchmark saturation over time, as repeated reuse of a public benchmark lets results overfit to it
- Indirect exposure through similar data, such as near-duplicates or paraphrases of test items
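
Direct leakage of test data into training sets can be screened for with a simple n-gram overlap check. The following is a minimal sketch, assuming whitespace tokenization and an in-memory training corpus; `ngrams`, `contamination_rate`, and the default window of 8 tokens are hypothetical choices for illustration.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (empty if text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_item, training_corpus, n=8):
    """Fraction of the test item's n-grams that also occur in the training corpus.

    Hypothetical helper: 0.0 means no verbatim overlap, 1.0 means every
    n-gram of the test item appears somewhere in the training documents.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0

    # Collect every n-gram seen anywhere in the training documents.
    train_grams = set()
    for document in training_corpus:
        train_grams |= ngrams(document, n)

    return len(test_grams & train_grams) / len(test_grams)


if __name__ == "__main__":
    corpus = ["a quick brown fox jumps high"]
    item = "the quick brown fox jumps over the lazy dog"
    print(contamination_rate(item, corpus, n=3))
```

A check like this only catches verbatim or near-verbatim leaks. Indirect exposure through paraphrased or structurally similar data will generally pass it, so a low overlap score is not evidence that a test set is clean.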