Evaluation

Held-Out Test Set


Definition

Evaluation data kept separate from training to assess how well an agent generalizes to unseen examples.

Held-out sets reveal overfitting by testing on data the agent has never seen. This separation is fundamental to honest evaluation.
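A minimal sketch of creating a held-out split (the function name, fraction, and seed are illustrative, not from the source):

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=42):
    """Shuffle examples and split into train and held-out test sets.

    A fixed seed keeps the split reproducible so the same test set
    stays held out across runs.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled examples become the held-out set
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test

examples = [f"example-{i}" for i in range(100)]
train, test = holdout_split(examples)
assert not set(train) & set(test)  # splits must be disjoint
```

Shuffling before splitting avoids accidental ordering bias (e.g., newest examples all landing in the test set).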

Best Practices

  • Never use test data during development
  • Refresh test sets periodically
  • Use multiple held-out sets for robustness

Contamination Risks

  • Test data leaked into training sets
  • Benchmark saturation over time
  • Indirect exposure through similar data
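A simple way to catch the first risk, leaked test examples, is an exact-match overlap check. This sketch (all names are illustrative) fingerprints normalized text; note it only catches near-verbatim duplicates, not paraphrases or indirect exposure:

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits still match."""
    return " ".join(text.lower().split())

def fingerprint(text):
    return hashlib.sha256(normalize(text).encode()).hexdigest()

def contaminated(train_set, test_set):
    """Return test examples whose normalized text also appears in training."""
    train_hashes = {fingerprint(t) for t in train_set}
    return [t for t in test_set if fingerprint(t) in train_hashes]

train = ["The cat sat on the mat.", "Dogs bark loudly."]
test = ["the  cat sat on the mat.", "Fish swim in water."]
print(contaminated(train, test))  # flags the near-verbatim duplicate
```

Production deduplication pipelines typically go further, using n-gram or fuzzy matching to catch partial overlaps that exact hashing misses.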
Tags: evaluation, data, generalization