Evaluation

Red Teaming

1 min read

In Short

Adversarial testing where evaluators actively try to make an AI system fail, misbehave, or produce harmful outputs.

Red teaming borrows from cybersecurity practices to stress-test AI systems. Red teams attempt to find vulnerabilities before malicious actors do.

Approaches

Manual red teaming: Human experts craft adversarial inputs
Automated red teaming: AI systems generate attack vectors
Hybrid: AI-generated attacks refined by humans

What Red Teams Test

Safety guardrail bypasses
Harmful content generation
Prompt injection vulnerabilities
Factual accuracy under pressure

Avoid common pitfalls

Learn what failures to watch for

evaluationsafetyadversarial

Back to Glossary