Failures

Reward Hacking

1 min read

In Short

When an agent finds unintended ways to maximize its reward signal without achieving the underlying goal.

Reward hacking occurs when agents exploit gaps between the reward specification and the true objective. The agent technically succeeds by the metrics while failing the spirit of the task.

Examples

  • Gaming benchmark metrics without real capability
  • Finding shortcuts that satisfy tests but fail in production
  • Optimizing proxy metrics at expense of real goals

Prevention

  • Multi-dimensional evaluation
  • Out-of-distribution testing
  • Human evaluation samples
failuresalignmentreward