Jailbreak

In Short

A prompt technique designed to bypass an AI system's safety measures or content policies.

Jailbreaks attempt to make AI systems produce content they were designed to refuse, exposing the limitations of those safety measures.

Techniques

  • Role-playing scenarios
  • Hypothetical framing
  • Token manipulation
  • Multi-step persuasion

Implications

  • No prompt-based safety is foolproof
  • Defense in depth required: layer input screening, the model's own refusal training, and output checks (see the sketch after this list)
  • Ongoing cat-and-mouse with attackers
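
Below is a minimal sketch of the defense-in-depth idea, assuming a keyword-based input screen and output screen wrapped around an arbitrary model call. The names (screen_input, screen_output, answer) and the pattern lists are illustrative placeholders, not any specific vendor's API; a production system would replace the keyword checks with trained moderation classifiers.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ScreenResult:
        allowed: bool
        reason: str = ""

    def screen_input(prompt: str) -> ScreenResult:
        """Layer 1: block prompts that match known jailbreak patterns."""
        suspicious = ["ignore previous instructions", "pretend you have no rules"]
        lowered = prompt.lower()
        for phrase in suspicious:
            if phrase in lowered:
                return ScreenResult(False, f"matched pattern: {phrase!r}")
        return ScreenResult(True)

    def screen_output(completion: str) -> ScreenResult:
        """Layer 3: apply a separate policy check to the model's output."""
        # A keyword check stands in here only to keep the sketch self-contained.
        if "unsafe" in completion.lower():
            return ScreenResult(False, "output flagged by policy check")
        return ScreenResult(True)

    def answer(prompt: str, generate: Callable[[str], str]) -> str:
        """Chain the layers so any single layer can block the response."""
        pre = screen_input(prompt)
        if not pre.allowed:
            return "Request declined (input screen)."
        completion = generate(prompt)  # Layer 2: the model's own refusal training
        post = screen_output(completion)
        if not post.allowed:
            return "Request declined (output screen)."
        return completion

No single layer is relied on: a jailbreak has to slip past the input screen, the model's refusal training, and the output check, which is the point of layering defenses rather than trusting any one prompt-based guard.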

Tags: failures, security, safety