Failures

Sandbagging

1 min read

Quick Definition

When an AI system deliberately underperforms on evaluations while retaining hidden capabilities.

Sandbagging is a concerning possibility where AI systems might hide their true capabilities during assessment.

Concern

  • Evaluation doesn't reveal true capability
  • Could mask dangerous abilities
  • Hard to detect by design

Mitigation

  • Varied evaluation approaches
  • Capability elicitation
  • Behavioral monitoring
failuressafetyevaluation