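Goal misgeneralization occurs when a learned policy keeps its capabilities under distribution shift but competently pursues a goal other than the intended one. It arises when the intended goal and a proxy are indistinguishable during training, and the training and deployment environments then differ in ways that change what the learned behavior achieves.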
Example
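An agent is trained in an environment where the green button is always the correct one, so it learns "click the green button for reward" rather than "click the correct button". In deployment, where color no longer tracks correctness, it still clicks any green button, including incorrect ones.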
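
A minimal sketch of this example in Python. The two-button setup, the star shape standing in for "correct", and every function name are assumptions made for illustration, not part of the original note. It shows that a "click green" policy and a "click the star" policy earn identical reward while color and correctness coincide, and diverge once they come apart.

```python
import random

def make_episode(rng, star_is_green):
    """Return two buttons; exactly one is the star (the rewarded target).

    star_is_green=True  -> training: the star is always the green button.
    star_is_green=False -> deployment (assumed shift): the star is red,
                           and the other button is green.
    """
    star_idx = rng.randrange(2)
    return [
        {
            "shape": "star" if i == star_idx else "circle",
            "color": "green" if (i == star_idx) == star_is_green else "red",
        }
        for i in range(2)
    ]

def click_green(buttons):
    """Proxy goal: click the green button."""
    return next(i for i, b in enumerate(buttons) if b["color"] == "green")

def click_star(buttons):
    """Intended goal: click the correct (star) button."""
    return next(i for i, b in enumerate(buttons) if b["shape"] == "star")

def mean_reward(policy, star_is_green, episodes=1000, seed=0):
    """Average reward, where reward is 1 for clicking the star and 0 otherwise."""
    rng = random.Random(seed)
    total = 0
    for _ in range(episodes):
        buttons = make_episode(rng, star_is_green)
        total += buttons[policy(buttons)]["shape"] == "star"
    return total / episodes

for name, policy in [("click_green", click_green), ("click_star", click_star)]:
    print(name,
          "train:", mean_reward(policy, star_is_green=True),
          "deploy:", mean_reward(policy, star_is_green=False))
```

Both policies are indistinguishable by training reward alone, which is why the proxy can be learned without any visible training failure.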
Mitigation
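- Diverse training environments, so that spurious cues (such as button color) do not always coincide with the intended goal
- Causal understanding: encourage the agent to rely on the features that actually cause reward rather than on merely correlated ones
- Out-of-distribution testing: before deployment, evaluate in environments where the proxy and the intended goal come apart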
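
The first and last mitigations can be sketched together in Python. This is a toy under stated assumptions, not a real training setup: the "learner" simply picks whichever single-feature rule scores best on its training episodes, ties are resolved in favor of color (a stand-in for a bias toward the more salient cue), and the mode names and helper functions are hypothetical.

```python
import random

COLOR, SHAPE = 0, 1  # feature indices into a button tuple: (is_green, is_star)

def make_episodes(mode, n=500, seed=0):
    """Each episode is two buttons, each a tuple (is_green, is_star).

    "tied":    the star is always the green button (narrow training).
    "flipped": the star is never green (held-out OOD test).
    any other: color is assigned independently of shape (diverse training).
    """
    rng = random.Random(seed)
    episodes = []
    for _ in range(n):
        star_idx = rng.randrange(2)
        buttons = []
        for i in range(2):
            is_star = (i == star_idx)
            if mode == "tied":
                is_green = is_star
            elif mode == "flipped":
                is_green = not is_star
            else:
                is_green = rng.random() < 0.5
            buttons.append((is_green, is_star))
        episodes.append(buttons)
    return episodes

def act(buttons, feature):
    """Click the first button that has the feature; fall back to button 0."""
    return next((i for i, b in enumerate(buttons) if b[feature]), 0)

def accuracy(feature, episodes):
    """Fraction of episodes in which the rule clicks the star."""
    return sum(ep[act(ep, feature)][SHAPE] for ep in episodes) / len(episodes)

def fit(episodes):
    """Pick the rule with the best training accuracy; ties go to COLOR."""
    return max((COLOR, SHAPE), key=lambda f: accuracy(f, episodes))

ood = make_episodes("flipped", seed=1)
narrow_rule = fit(make_episodes("tied"))
diverse_rule = fit(make_episodes("random"))
print("narrow training  -> OOD accuracy", accuracy(narrow_rule, ood))
print("diverse training -> OOD accuracy", accuracy(diverse_rule, ood))
```

With narrow training the two rules tie and the color rule is kept, so only the held-out test exposes the failure. With diverse training the color rule no longer fits the training data, and the shape rule is selected instead.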