Trust

Alignment

1 min read

What It Means

The degree to which an AI system's goals, behaviors, and values match those intended by its designers and users.

Alignment is the fundamental challenge of ensuring AI does what we want, even as systems become more capable.

Dimensions

  • Intent alignment: Does it try to do what we want?
  • Capability alignment: Can it succeed?
  • Value alignment: Does it share our values?

Challenges

  • Specification gaming
  • Distributional shift
  • Emergent goals
  • Interpretability gaps
trustsafetyalignment