Evaluation

Red Teaming

1 min read

In Short

Adversarial testing where evaluators actively try to make an AI system fail, misbehave, or produce harmful outputs.

Red teaming borrows from cybersecurity practices to stress-test AI systems. Red teams attempt to find vulnerabilities before malicious actors do.

Approaches

  • Manual red teaming: Human experts craft adversarial inputs
  • Automated red teaming: AI systems generate attack vectors
  • Hybrid: AI-generated attacks refined by humans

What Red Teams Test

  • Safety guardrail bypasses
  • Harmful content generation
  • Prompt injection vulnerabilities
  • Factual accuracy under pressure
evaluationsafetyadversarial