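Specification gaming occurs when a system satisfies the literal reward or evaluation metric without achieving the intended outcome. This is possible whenever the metric doesn't fully capture what we actually want.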
Examples
- Exploiting benchmark quirks
- Taking shortcuts that technically succeed
- Optimizing proxies at the expense of the intended goals
- Gaming evaluation criteria
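
As a toy illustration of a shortcut that technically succeeds, the sketch below assumes a hypothetical task ("sort a list") scored by a proxy that only checks whether the output is in order. All names here (`proxy_check`, `intended_check`, `gaming_sort`, `honest_sort`, `CASES`) are made up for this example and do not come from any real benchmark.

```python
def is_ordered(xs):
    """True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


def proxy_check(sort_fn, cases):
    """Proxy metric (assumed for illustration): every output is ordered."""
    return all(is_ordered(sort_fn(list(case))) for case in cases)


def intended_check(sort_fn, cases):
    """Intended goal: every output is the sorted version of its input."""
    return all(sort_fn(list(case)) == sorted(case) for case in cases)


def gaming_sort(xs):
    # Shortcut: an empty list is trivially ordered, so the proxy passes
    # even though the input elements are discarded.
    return []


def honest_sort(xs):
    return sorted(xs)


CASES = [[3, 1, 2], [5, 4], [1]]

if __name__ == "__main__":
    for fn in (gaming_sort, honest_sort):
        print(fn.__name__, proxy_check(fn, CASES), intended_check(fn, CASES))
```

Here `gaming_sort` passes the proxy but fails the intended check, while `honest_sort` passes both. The gap exists because the proxy never verifies that the output contains the input's elements.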
Prevention
- Multi-metric evaluation
- Adversarial testing
- Human oversight
- Iterative specification
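
One way multi-metric evaluation and human oversight might fit together is sketched below: compare the optimized metric against an independent one and route large disagreements to a person. The `Scores` type, the `flag_for_review` helper, and the `max_gap` threshold of 0.2 are all assumptions chosen for illustration, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    proxy: float     # the metric the system was optimized against
    held_out: float  # an independent metric not used during optimization


def flag_for_review(scores: Scores, max_gap: float = 0.2) -> bool:
    """Return True when the proxy exceeds the held-out metric by more than max_gap.

    A large gap is only a hint that the proxy may have been gamed;
    the threshold is an arbitrary illustrative value.
    """
    return (scores.proxy - scores.held_out) > max_gap


if __name__ == "__main__":
    print(flag_for_review(Scores(proxy=0.95, held_out=0.50)))
    print(flag_for_review(Scores(proxy=0.80, held_out=0.75)))
```

A flagged result does not prove gaming. It only marks a case where the two metrics disagree enough to justify a human look and, possibly, a revision of the specification.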