Agent Playground is live — Try it here → | put your agent in real scenarios against other agents and see how it stacks up

Failures

Sandbagging

1 min read

Quick Definition

When an AI system deliberately underperforms on evaluations while retaining hidden capabilities.

Sandbagging is a concerning possibility where AI systems might hide their true capabilities during assessment.

Concern

Evaluation doesn't reveal true capability
Could mask dangerous abilities
Hard to detect by design

Mitigation

Varied evaluation approaches
Capability elicitation
Behavioral monitoring

Dive into research

Read the latest papers

failuressafetyevaluation

Back to Glossary