Governance

Guardrails

Quick Definition

Safety constraints that prevent agents from taking harmful or unauthorized actions, even if instructed to do so.

Guardrails are defensive mechanisms that bound agent behavior within acceptable limits. They act as safety nets independent of the agent's decision-making.

Types

Input guardrails: Filter or reject harmful prompts
Output guardrails: Block or modify unsafe responses
Action guardrails: Prevent unauthorized tool use
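
As a rough illustration of the three types above, the sketch below places one check at each point: before the prompt reaches the agent, before a tool call runs, and before a response is returned. The patterns, the tool allowlist, and the names `check_input`, `check_action`, and `filter_output` are all hypothetical, chosen only for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical policy values, chosen only for illustration.
BLOCKED_INPUT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
ALLOWED_TOOLS = {"search", "calculator"}
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{8,}")


@dataclass
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(prompt: str) -> Verdict:
    """Input guardrail: reject prompts that match a blocked pattern."""
    for pattern in BLOCKED_INPUT_PATTERNS:
        if pattern.search(prompt):
            return Verdict(False, "blocked input pattern")
    return Verdict(True)


def check_action(tool_name: str) -> Verdict:
    """Action guardrail: allow only tools on the allowlist."""
    if tool_name not in ALLOWED_TOOLS:
        return Verdict(False, f"tool '{tool_name}' is not authorized")
    return Verdict(True)


def filter_output(response: str) -> str:
    """Output guardrail: redact anything that looks like a secret key."""
    return SECRET_PATTERN.sub("[REDACTED]", response)
```

Each check runs outside the agent's own reasoning, so a blocked tool call stays blocked even when the agent has been persuaded to attempt it.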

Implementation

Guardrails can be rule-based (filters, blocklists) or model-based (classifier models that detect unsafe content).
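
The two approaches can be contrasted with a minimal sketch. The blocklist phrases, the threshold, and `toy_classifier` are assumptions made for illustration; `toy_classifier` is only a stand-in for a real classifier model and returns a made-up risk score.

```python
from typing import Callable

BLOCKLIST = {"rm -rf", "drop table"}  # hypothetical blocked phrases


def rule_based_guardrail(text: str) -> bool:
    """Return True if the text passes, False if it contains a blocked phrase."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)


def model_based_guardrail(
    text: str,
    classifier: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """Return True if the classifier's risk score is below the threshold."""
    return classifier(text) < threshold


def toy_classifier(text: str) -> float:
    """Stand-in for a real classifier model; returns a made-up risk score."""
    risky_words = {"exploit", "bypass", "steal"}
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in risky_words)
    return min(1.0, 3 * hits / len(words))
```

Rule-based checks are cheap and predictable but miss rephrasings; a model-based check can generalize further, at the cost of depending on a score and a threshold that need tuning.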

Tags: safety, governance, security
