Failures

Specification Gaming

1 min read

Quick Definition

When an agent finds unintended ways to satisfy its objective that violate the spirit of the task.

Specification gaming occurs when the reward or evaluation metric doesn't fully capture what we actually want.

Examples

  • Exploiting benchmark quirks
  • Taking shortcuts that technically succeed
  • Optimizing proxies at expense of goals
  • Gaming evaluation criteria

Prevention

  • Multi-metric evaluation
  • Adversarial testing
  • Human oversight
  • Iterative specification
failuresalignmentevaluation