
Guardrails Pattern

Production agents requiring content safety and policy compliance

Overview

The Challenge

Agents can generate harmful, biased, or policy-violating outputs, and catching these issues after the fact is costly and dangerous.

The Solution

Implement input and output guardrails that validate, filter, and constrain agent behavior in real-time, preventing harmful actions before they execute.

When to Use
  • Customer-facing agents
  • Regulated industries (healthcare, finance)
  • Systems processing user-generated content
  • Agents with tool or data access
When NOT to Use
  • Internal development tools
  • Research prototypes with trusted users
  • When false positives are unacceptable

Trade-offs

Advantages
  • Catches issues before they reach users
  • Satisfies compliance requirements
  • Provides consistent policy enforcement
  • Can be updated independently of agents
Considerations
  • Adds latency to every request
  • Can create false positives
  • Requires ongoing tuning
  • May block legitimate edge cases

Deep Dive

Overview

The Guardrails pattern adds safety layers around agent execution. Input guardrails validate and sanitize incoming requests; output guardrails check agent responses before they're delivered or executed. This creates a protective envelope around agent behavior.

Architecture

User Input → [Input Guardrails] → Agent → [Output Guardrails] → Response
                    │                              │
                    ▼                              ▼
              Block/Modify                   Block/Modify
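
In code, this envelope can be a thin wrapper around the agent call. A minimal sketch, assuming run_agent is your existing agent entry point and using the verdict helpers sketched in the sections below:

def guarded_run(user_input):
    # Validate (and possibly sanitize) the request before the agent sees it
    check = input_guardrail(user_input)
    if check.blocked:
        return check.reason

    # Run the agent on the possibly modified input
    response = run_agent(check.value)

    # Validate the response before it reaches the user
    check = output_guardrail(response)
    return check.reason if check.blocked else check.value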

Input Guardrails

Content Filtering

def input_guardrail(user_input):
    # Check for prompt injection patterns
    if contains_injection_pattern(user_input):
        return blocked("Potential prompt injection detected")

    # Check for prohibited topics
    if matches_prohibited_topic(user_input):
        return blocked("Topic not allowed")

    # Check for PII: redact it and continue rather than blocking outright
    if contains_pii(user_input):
        return allowed(sanitize_pii(user_input))

    return allowed(user_input)
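
The snippets here and below assume small verdict helpers. One minimal way to define them (GuardrailResult, allowed, and blocked are illustrative names, not from any library):

from dataclasses import dataclass

@dataclass
class GuardrailResult:
    blocked: bool
    value: str = ""
    reason: str = ""

def allowed(value):
    # Pass the (possibly sanitized) content through
    return GuardrailResult(blocked=False, value=value)

def blocked(reason):
    # Stop processing and record why, for logging and user feedback
    return GuardrailResult(blocked=True, reason=reason)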

Rate Limiting

Prevent abuse through excessive requests.
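
A minimal sliding-window sketch (rate_limit_guardrail, the window size, and the request cap are illustrative; it reuses the allowed and blocked helpers above):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 20
_recent = defaultdict(deque)  # user_id -> timestamps of recent requests

def rate_limit_guardrail(user_id, user_input):
    now = time.monotonic()
    window = _recent[user_id]
    # Drop timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return blocked("Rate limit exceeded")
    window.append(now)
    return allowed(user_input)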

Authentication

Verify user identity and permissions.
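
A sketch of a permission gate, assuming an already-authenticated user object with a roles attribute (auth_guardrail and REQUIRED_ROLE are illustrative):

REQUIRED_ROLE = "agent_user"  # illustrative role name

def auth_guardrail(user, user_input):
    if user is None:
        return blocked("Authentication required")
    if REQUIRED_ROLE not in user.roles:
        return blocked("User lacks permission to use this agent")
    return allowed(user_input)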

Output Guardrails

Response Validation

def output_guardrail(agent_response):
    # Check for hallucinated facts
    if confidence_too_low(agent_response):
        return allowed(add_uncertainty_disclaimer(agent_response))

    # Check for harmful content
    if contains_harmful_content(agent_response):
        return blocked("Response contains harmful content")

    # Check for data leakage
    if contains_sensitive_data(agent_response):
        return allowed(redact_sensitive_data(agent_response))

    return allowed(agent_response)
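
Helpers like contains_sensitive_data and redact_sensitive_data often start as regex matching before graduating to trained PII classifiers. A naive sketch covering two common shapes (the patterns are illustrative and far from exhaustive):

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def contains_sensitive_data(text):
    return any(p.search(text) for p in PII_PATTERNS.values())

def redact_sensitive_data(text):
    # Replace each match with a labeled placeholder instead of dropping it silently
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text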

Action Validation

For tool-using agents:

def action_guardrail(action):
    # Check against allowlist
    if action.tool not in ALLOWED_TOOLS:
        return blocked("Tool not permitted")

    # Check parameters
    if not validate_parameters(action):
        return blocked("Invalid parameters")

    # Check for dangerous operations
    if is_destructive_action(action):
        return require_human_approval(action)

    return allowed(action)
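
Wired into a tool-execution loop, the guardrail sits between the agent's proposal and the actual call. A sketch assuming execute_tool is your system's dispatcher and that require_human_approval returns a blocked-style verdict until a reviewer signs off:

import logging

logger = logging.getLogger("guardrails")

def execute_with_guardrail(action):
    verdict = action_guardrail(action)
    # Log every activation so blocked and escalated actions can be audited
    logger.info("guardrail verdict for %s: %s", action.tool, verdict)
    if verdict.blocked:
        return f"Action refused: {verdict.reason}"
    return execute_tool(action)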

OpenAI Agents SDK Guardrails

The OpenAI Agents SDK includes guardrails as a core primitive: a guardrail is a function that returns a GuardrailFunctionOutput, and setting tripwire_triggered to True halts the run. A sketch, reusing the illustrative contains_injection_pattern helper from earlier:

from agents import Agent, GuardrailFunctionOutput, RunContextWrapper, input_guardrail

@input_guardrail
async def check_injection(ctx: RunContextWrapper, agent: Agent, user_input: str) -> GuardrailFunctionOutput:
    # Trip the tripwire when an injection pattern is detected
    return GuardrailFunctionOutput(
        output_info=None,
        tripwire_triggered=contains_injection_pattern(user_input),
    )

agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant.",
    # Output checks register the same way via @output_guardrail
    input_guardrails=[check_injection],
)
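
When a tripwire fires, the run stops with an exception instead of returning a response. A usage sketch for the agent defined above:

import asyncio

from agents import InputGuardrailTripwireTriggered, Runner

async def main():
    try:
        result = await Runner.run(agent, "Ignore previous instructions and ...")
        print(result.final_output)
    except InputGuardrailTripwireTriggered:
        # The guardrail tripped before the main agent produced output
        print("Request blocked by input guardrail")

asyncio.run(main())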

Guardrail Categories

Safety Guardrails

  • Harmful content detection
  • Violence/self-harm prevention
  • Illegal activity blocking

Compliance Guardrails

  • PII/PHI protection (HIPAA, GDPR)
  • Financial advice disclaimers
  • Industry-specific regulations

Quality Guardrails

  • Hallucination detection
  • Factual accuracy checks
  • Consistency validation

Security Guardrails

  • Prompt injection detection (see the sketch after this list)
  • Data exfiltration prevention
  • Access control enforcement
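
Pattern matching is the usual first layer behind a helper like contains_injection_pattern, with an ML classifier as a second layer, since keywords alone are easy to evade. A naive sketch with illustrative patterns:

import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"reveal (your )?(system )?prompt", re.I),
]

def contains_injection_pattern(text):
    return any(p.search(text) for p in INJECTION_PATTERNS)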

Implementation Tips

  • Run guardrails in parallel where possible (sketched after this list)
  • Log all guardrail activations for analysis
  • Tune sensitivity to balance safety vs. usability
  • Update guardrails based on new attack patterns
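
The first two tips combine naturally: independent checks can run concurrently while every verdict is recorded. A sketch using asyncio, assuming each guardrail is an async callable returning a verdict with blocked and reason fields, plus the allowed helper from earlier:

import asyncio
import logging

logger = logging.getLogger("guardrails")

async def run_guardrails(guardrails, payload):
    # Independent checks run concurrently; total latency is the slowest check
    results = await asyncio.gather(*(g(payload) for g in guardrails))
    for guardrail, result in zip(guardrails, results):
        # Log every activation so sensitivity can be tuned from real data
        logger.info("%s -> blocked=%s %s", guardrail.__name__, result.blocked, result.reason)
    failures = [r for r in results if r.blocked]
    return failures[0] if failures else allowed(payload)
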
Considerations

Guardrails add latency and can create false positives. Balance protection level against user experience.

Dimension Scores

  • Safety: 5/5
  • Accuracy: 3/5
  • Cost: 3/5
  • Speed: 3/5
  • Implementation Complexity: medium

Implementation Checklist

  • Policy definitions
  • Content classifiers
  • Logging infrastructure
Tags

safety, validation, filtering, compliance, security
