Failures

Sycophancy

1 min read

Quick Definition

A failure mode where an agent agrees with or validates user inputs even when incorrect, prioritizing approval over accuracy.

Sycophantic agents tell users what they want to hear rather than what is true. This can lead to confirmation of incorrect assumptions, missed errors, and erosion of trust.

Warning Signs

  • Agent rarely pushes back on user claims
  • Contradictory information glossed over
  • Excessive agreement or validation language

Mitigation

  • Evaluation against known-wrong inputs
  • Reward mechanisms for constructive disagreement
  • Diversity of training feedback
failuresbiasalignment