Jailbreaks attempt to make AI systems produce content they were designed to refuse, exposing the limitations of their safety measures.
Techniques
- Role-playing scenarios: asking the model to adopt a persona framed as exempt from its rules
- Hypothetical framing: couching a refused request as fiction or a thought experiment
- Token manipulation: obfuscating trigger words with spacing, misspellings, or encodings to slip past filters
- Multi-step persuasion: escalating gradually across turns so that no single message looks like a violation
Implications
- No prompt-level safety measure is foolproof on its own
- Defense in depth is required: layered checks rather than a single guardrail (see the sketch after this list)
- Attackers and defenders are locked in an ongoing cat-and-mouse cycle
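To make the defense-in-depth point concrete, here is a minimal sketch of layering independent checks around a model call. The keyword filters and the `call_model` stub are illustrative assumptions, not any particular vendor's API; real systems would use trained classifiers at each layer.

```python
# Minimal defense-in-depth sketch: a jailbreak has to defeat every layer,
# not just the model's own refusal training. All logic here is illustrative.

BLOCKED_TERMS = {"ignore previous instructions", "disable your safety"}  # placeholder list

def input_check(prompt: str) -> bool:
    """Layer 1: screen the incoming prompt before it reaches the model."""
    return not any(term in prompt.lower() for term in BLOCKED_TERMS)

def call_model(prompt: str) -> str:
    """Layer 2 (stand-in): the model itself, which applies its refusal training."""
    return f"Model response to: {prompt}"

def output_check(response: str) -> bool:
    """Layer 3: screen the model's output before returning it to the user."""
    return not any(term in response.lower() for term in BLOCKED_TERMS)

def guarded_generate(prompt: str) -> str:
    """Run every layer in sequence; any layer can stop the request."""
    if not input_check(prompt):
        return "Request declined by input filter."
    response = call_model(prompt)
    if not output_check(response):
        return "Response withheld by output filter."
    return response

if __name__ == "__main__":
    print(guarded_generate("Summarize defense-in-depth strategies."))
```

The design point is that the layers are independent: a prompt that slips past the input filter still faces the model's own training and an output check, so a single bypass technique is less likely to succeed end to end.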