Reward hacking occurs when agents exploit gaps between the reward specification and the true objective. The agent technically succeeds by the metrics while failing the spirit of the task.
Examples
- Gaming benchmark metrics without real capability
- Finding shortcuts that satisfy tests but fail in production
- Optimizing proxy metrics at expense of real goals
Prevention
- Multi-dimensional evaluation
- Out-of-distribution testing
- Human evaluation samples