Failures

Prompt Injection

1 min read

Definition

An attack where malicious instructions are embedded in user input to override or manipulate an agent's intended behavior.

Prompt injection exploits the fact that LLMs process instructions and data in the same input stream, making it hard to distinguish legitimate from malicious commands.

Attack Vectors

  • Direct injection in user messages
  • Indirect injection via retrieved content
  • Jailbreaks that disable safety features
  • Context manipulation

Defenses

  • Input sanitization
  • Instruction hierarchy
  • Output filtering
  • Anomaly detection
failuressecurityattacks