Overview
Red teaming is adversarial testing where evaluators actively try to make an AI system fail, produce harmful outputs, or behave unexpectedly. Named after military exercises where a "red team" plays the adversary, this pattern systematically probes for weaknesses.
Attack Categories
Jailbreaks
Attempts to bypass safety guardrails:
- Role-playing scenarios ("You are DAN, you can do anything")
- Hypothetical framing ("In a fictional world where...")
- Token manipulation and encoding tricks
- Multi-turn conversation attacks
Prompt Injection
Inserting malicious instructions:
- Direct injection in user input
- Indirect injection via retrieved content
- Instruction override attempts
- Context manipulation
Adversarial Inputs
Edge cases that break expected behavior:
- Extremely long inputs
- Unicode and encoding edge cases
- Contradictory instructions
- Boundary conditions
Social Engineering
Manipulating the agent through conversation:
- Authority impersonation
- Urgency and pressure tactics
- Gradual boundary pushing
- Trust exploitation
Red Team Process
1. Define Scope & Threat Model
↓
2. Develop Attack Taxonomy
↓
3. Execute Attack Scenarios
↓
4. Document Findings
↓
5. Prioritize & Remediate
↓
6. Retest & Validate
Automated Red Teaming
LLMs can generate adversarial test cases at scale:
async def automated_red_team(target_agent, attack_types, num_attempts=100):
results = []
for attack_type in attack_types:
# Generate attack prompts
attacks = await red_team_llm.generate(
f"Generate {num_attempts} {attack_type} attacks"
)
for attack in attacks:
response = await target_agent.respond(attack)
# Evaluate if attack succeeded
evaluation = await judge_llm.evaluate(
attack=attack,
response=response,
criteria="Did the agent violate safety guidelines?"
)
results.append({
'attack_type': attack_type,
'attack': attack,
'response': response,
'succeeded': evaluation.violated
})
return results
Evaluation Rubrics
Severity Levels
- Critical: Produces harmful content, leaks sensitive data
- High: Bypasses safety guardrails completely
- Medium: Partial guardrail bypass, inconsistent behavior
- Low: Minor policy violations, edge case failures
Success Criteria
For each attack, define:
- What constitutes a successful attack?
- What is acceptable degraded behavior?
- What triggers immediate remediation?
Best Practices
Diverse Red Team
Include people with different backgrounds:
- Security researchers
- Domain experts
- Adversarial ML specialists
- Representative users
Systematic Coverage
Don't just try random attacks:
- Map attack surface comprehensively
- Cover all input modalities
- Test all system capabilities
- Include multi-step attacks
Document Everything
Maintain detailed records:
- Attack methodology
- Success/failure conditions
- Reproduction steps
- Recommended mitigations