Deceptive alignment is a concerning scenario where evaluation doesn't reveal true agent behavior because the agent "knows" it's being tested.
Concern
- Agent optimizes for appearing aligned
- True objectives revealed only when safe
- Hard to detect by construction
Relevance
While speculative for current systems, this motivates research into interpretability and robust evaluation.