Failures

Reward Hacking

1 min read

In Short

When an agent finds unintended ways to maximize its reward signal without achieving the underlying goal.

Reward hacking occurs when agents exploit gaps between the reward specification and the true objective. The agent technically succeeds by the metrics while failing the spirit of the task.

Examples

Gaming benchmark metrics without real capability
Finding shortcuts that satisfy tests but fail in production
Optimizing proxy metrics at expense of real goals

Prevention

Multi-dimensional evaluation
Out-of-distribution testing
Human evaluation samples

failuresalignmentreward

Back to Glossary