Governance

Guardrails


Quick Definition

Safety constraints that prevent agents from taking harmful or unauthorized actions, even if instructed to do so.

Guardrails are defensive mechanisms that bound agent behavior within acceptable limits. They act as safety nets independent of the agent's decision-making.

Types

  • Input guardrails: Filter or reject harmful prompts
  • Output guardrails: Block or modify unsafe responses
  • Action guardrails: Prevent unauthorized tool use
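The three types above can be sketched as separate checks placed around an agent. This is a minimal illustration, not a reference implementation: the names `check_input`, `check_action`, `check_output`, `GuardrailViolation`, and the policy values (blocked phrase, tool allowlist, secret marker) are all hypothetical and chosen only for the example.

```python
# Hypothetical policy values, chosen only for illustration.
BLOCKED_INPUT_PHRASES = ("ignore previous instructions",)
ALLOWED_TOOLS = {"search", "calculator"}
SECRET_MARKER = "API_KEY="
WITHHELD = "[response withheld by output guardrail]"


class GuardrailViolation(Exception):
    """Raised when a guardrail rejects a request outright."""


def check_input(prompt: str) -> str:
    """Input guardrail: reject prompts that contain a blocked phrase."""
    lowered = prompt.lower()
    for phrase in BLOCKED_INPUT_PHRASES:
        if phrase in lowered:
            raise GuardrailViolation(f"blocked input phrase: {phrase!r}")
    return prompt


def check_action(tool_name: str) -> str:
    """Action guardrail: allow only tools on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        raise GuardrailViolation(f"tool not permitted: {tool_name!r}")
    return tool_name


def check_output(response: str) -> str:
    """Output guardrail: replace responses that would leak a secret."""
    if SECRET_MARKER in response:
        return WITHHELD
    return response
```

Each check runs outside the agent's own reasoning, so it still applies when the agent has been instructed (or tricked) into misbehaving. The input and action checks reject, while the output check modifies, mirroring the "block or modify" distinction in the list.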

Implementation

Guardrails can be rule-based (filters, blocklists) or model-based (classifier models that detect unsafe content).
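A rough sketch of the two approaches, under stated assumptions: the blocklist patterns, the 0.5 threshold, and `toy_classifier` are hypothetical. In particular, `toy_classifier` is only a stand-in that scores text by the fraction of flagged words; a real model-based guardrail would call a trained classifier in its place.

```python
import re
from typing import Callable

# Hypothetical blocklist, for illustration only.
BLOCKLIST_PATTERNS = [
    re.compile(r"\brm\s+-rf\b"),
    re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
]


def rule_based_guardrail(text: str) -> bool:
    """Return True if the text passes, i.e. no blocklist pattern matches."""
    return not any(pattern.search(text) for pattern in BLOCKLIST_PATTERNS)


def model_based_guardrail(
    text: str,
    classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """Return True if the classifier's risk score is below the threshold."""
    return classifier(text) < threshold


def toy_classifier(text: str) -> float:
    """Stand-in for a trained model: fraction of words that are flagged."""
    flagged = {"attack", "exploit"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in flagged for word in words) / len(words)
```

Rule-based checks are cheap and predictable but only catch what the rules anticipate. Model-based checks can generalize to unseen phrasing, at the cost of a tunable threshold and possible misclassification.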

Tags: safety, governance, security