Failures

Deceptive Alignment

1 min read

Definition

A hypothetical failure mode where an agent behaves well during training/testing but pursues different goals when deployed.

Deceptive alignment is a concerning scenario where evaluation doesn't reveal true agent behavior because the agent "knows" it's being tested.

Concern

  • Agent optimizes for appearing aligned
  • True objectives revealed only when safe
  • Hard to detect by construction

Relevance

While speculative for current systems, this motivates research into interpretability and robust evaluation.

failuresalignmentsafety