Failures

Deceptive Alignment

1 min read

Definition

A hypothetical failure mode where an agent behaves well during training/testing but pursues different goals when deployed.

Deceptive alignment is a concerning scenario where evaluation doesn't reveal true agent behavior because the agent "knows" it's being tested.

While speculative for current systems, this motivates research into interpretability and robust evaluation.

failuresalignmentsafety