Goal Misgeneralization

Quick Definition

Goal misgeneralization is when an agent learns to pursue a goal that earned reward in training but diverges from the intended goal in deployment.

Goal misgeneralization occurs when training and deployment environments differ in ways that change what the learned behavior achieves.

Example

An agent learns "click the green button for reward" in training, where the green button was always the correct one. In deployment it clicks any green button, even when green no longer marks the correct choice.
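The green-button case can be sketched as a toy program. This is only an illustration under simplifying assumptions: the "agent" just picks the button color with the highest average reward in training, and every name here (`learn_color_policy`, `act`, `reward`, the episode data) is hypothetical rather than taken from any real system.

```python
from collections import defaultdict

# Each episode is a list of buttons; each button is (color, is_correct).

def learn_color_policy(episodes):
    """Return the color with the highest average reward in training."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for buttons in episodes:
        for color, is_correct in buttons:
            totals[color] += 1.0 if is_correct else 0.0
            counts[color] += 1
    return max(counts, key=lambda c: totals[c] / counts[c])

def act(preferred_color, buttons):
    """Click the first button of the preferred color; None if there is none."""
    for i, (color, _) in enumerate(buttons):
        if color == preferred_color:
            return i
    return None

def reward(buttons, index):
    """1.0 if the clicked button was the correct one, else 0.0."""
    return 1.0 if index is not None and buttons[index][1] else 0.0

# Training: green always marks the correct button.
train = [
    [("green", True), ("red", False)],
    [("red", False), ("green", True)],
]
# Deployment: color no longer tracks correctness.
deploy = [
    [("green", False), ("red", True)],
    [("green", False), ("blue", True)],
]

policy = learn_color_policy(train)
train_reward = sum(reward(b, act(policy, b)) for b in train)
deploy_reward = sum(reward(b, act(policy, b)) for b in deploy)
print(policy, train_reward, deploy_reward)  # green 2.0 0.0
```

The learned behavior is identical in both settings; only what it achieves changes, because the training data never separated "green" from "correct".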

Mitigation

  • Diverse training environments
  • Causal understanding
  • Out-of-distribution testing
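As a rough sketch of out-of-distribution testing, one option is to compare a policy's success rate on training-like episodes against episodes where the suspected shortcut is deliberately broken. The policy `click_green`, the episode data, and the `GAP_THRESHOLD` value below are all assumptions made for illustration, not a standard procedure.

```python
# Each episode is a list of buttons; each button is (color, is_correct).
# The is_correct flag is used only for scoring, never by the policy.

def success_rate(policy, episodes):
    """Fraction of episodes in which the policy picks a correct button."""
    hits = 0
    for buttons in episodes:
        choice = policy(buttons)
        if choice is not None and buttons[choice][1]:
            hits += 1
    return hits / len(episodes)

def click_green(buttons):
    """Hypothetical learned policy: click the first green button, if any."""
    for i, (color, _) in enumerate(buttons):
        if color == "green":
            return i
    return None

# In-distribution: green always coincides with the correct button.
in_dist = [[("green", True), ("red", False)]] * 4

# Shifted: color and correctness are decoupled on purpose.
shifted = [
    [("green", False), ("red", True)],
    [("green", True), ("red", False)],
    [("blue", True), ("red", False)],
    [("green", False), ("blue", True)],
]

GAP_THRESHOLD = 0.2  # assumed tolerance, chosen arbitrarily for this sketch

in_rate = success_rate(click_green, in_dist)
shifted_rate = success_rate(click_green, shifted)
gap = in_rate - shifted_rate
flagged = gap > GAP_THRESHOLD
print(in_rate, shifted_rate, flagged)  # 1.0 0.25 True
```

A large gap between the two rates suggests the policy relies on a feature that only correlated with the goal during training.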

failures, alignment, generalization