Jailbreak

In Short

A prompt technique designed to bypass an AI system's safety measures or content policies.

Jailbreaks attempt to make AI systems produce content they were designed to refuse, exposing the limitations of those safety measures.

Techniques

  • Role-playing scenarios
  • Hypothetical framing
  • Token manipulation
  • Multi-step persuasion

Implications

  • No prompt-based safety is foolproof
  • Defense in depth required: layer input screening, the model's own refusal training, and output checks (see the sketch after this list)
  • Ongoing cat-and-mouse with attackers
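
Below is a minimal sketch of the defense-in-depth idea, assuming a keyword-based input screen and output screen wrapped around an arbitrary model call. The names (screen_input, screen_output, answer) and the pattern lists are illustrative placeholders, not any specific vendor's API; a production system would replace the keyword checks with trained moderation classifiers.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ScreenResult:
        allowed: bool
        reason: str = ""

    def screen_input(prompt: str) -> ScreenResult:
        """Layer 1: block prompts that match known jailbreak patterns."""
        suspicious = ["ignore previous instructions", "pretend you have no rules"]
        lowered = prompt.lower()
        for phrase in suspicious:
            if phrase in lowered:
                return ScreenResult(False, f"matched pattern: {phrase!r}")
        return ScreenResult(True)

    def screen_output(completion: str) -> ScreenResult:
        """Layer 3: apply a separate policy check to the model's output."""
        # A keyword check stands in here only to keep the sketch self-contained.
        if "unsafe" in completion.lower():
            return ScreenResult(False, "output flagged by policy check")
        return ScreenResult(True)

    def answer(prompt: str, generate: Callable[[str], str]) -> str:
        """Chain the layers so any single layer can block the response."""
        pre = screen_input(prompt)
        if not pre.allowed:
            return "Request declined (input screen)."
        completion = generate(prompt)  # Layer 2: the model's own refusal training
        post = screen_output(completion)
        if not post.allowed:
            return "Request declined (output screen)."
        return completion

No single layer is relied on: a jailbreak has to slip past the input screen, the model's refusal training, and the output check, which is the point of layering defenses rather than trusting any one prompt-based guard.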

Tags: failures, security, safety