
Guardrails Pattern

Production agents requiring content safety and policy compliance

Overview

The Challenge

Agents can generate harmful, biased, or policy-violating outputs, and catching these issues after the fact is costly and dangerous.

The Solution

Implement input and output guardrails that validate, filter, and constrain agent behavior in real-time, preventing harmful actions before they execute.

When to Use
  • Customer-facing agents
  • Regulated industries (healthcare, finance)
  • Systems processing user-generated content
  • Agents with tool or data access
When NOT to Use
  • Internal development tools
  • Research prototypes with trusted users
  • When false positives are unacceptable

Trade-offs

Advantages
  • Catches issues before they reach users
  • Satisfies compliance requirements
  • Provides consistent policy enforcement
  • Can be updated independently of agents
Considerations
  • Adds latency to every request
  • Can create false positives
  • Requires ongoing tuning
  • May block legitimate edge cases

Deep Dive

Overview

The Guardrails pattern adds safety layers around agent execution. Input guardrails validate and sanitize incoming requests; output guardrails check agent responses before they're delivered or executed. This creates a protective envelope around agent behavior.

Architecture

User Input → [Input Guardrails] → Agent → [Output Guardrails] → Response
                    │                              │
                    ▼                              ▼
              Block/Modify                   Block/Modify
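
In code, this envelope can be a thin wrapper around the agent call. A minimal sketch, assuming run_agent is your existing agent entry point and using the verdict helpers sketched in the sections below:

def guarded_run(user_input):
    # Validate (and possibly sanitize) the request before the agent sees it
    check = input_guardrail(user_input)
    if check.blocked:
        return check.reason

    # Run the agent on the possibly modified input
    response = run_agent(check.value)

    # Validate the response before it reaches the user
    check = output_guardrail(response)
    return check.reason if check.blocked else check.value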

Input Guardrails

Content Filtering

def input_guardrail(user_input):
    # Check for prompt injection patterns
    if contains_injection_pattern(user_input):
        return blocked("Potential prompt injection detected")

    # Check for prohibited topics
    if matches_prohibited_topic(user_input):
        return blocked("Topic not allowed")

    # Check for PII: redact it and continue rather than blocking outright
    if contains_pii(user_input):
        return allowed(sanitize_pii(user_input))

    return allowed(user_input)
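
The snippets here and below assume small verdict helpers. One minimal way to define them (GuardrailResult, allowed, and blocked are illustrative names, not from any library):

from dataclasses import dataclass

@dataclass
class GuardrailResult:
    blocked: bool
    value: str = ""
    reason: str = ""

def allowed(value):
    # Pass the (possibly sanitized) content through
    return GuardrailResult(blocked=False, value=value)

def blocked(reason):
    # Stop processing and record why, for logging and user feedback
    return GuardrailResult(blocked=True, reason=reason)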

Rate Limiting

Prevent abuse through excessive requests.
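
A minimal sliding-window sketch (rate_limit_guardrail, the window size, and the request cap are illustrative; it reuses the allowed and blocked helpers above):

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 20
_recent = defaultdict(deque)  # user_id -> timestamps of recent requests

def rate_limit_guardrail(user_id, user_input):
    now = time.monotonic()
    window = _recent[user_id]
    # Drop timestamps that have aged out of the window
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        return blocked("Rate limit exceeded")
    window.append(now)
    return allowed(user_input)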

Authentication

Verify user identity and permissions.
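
A sketch of a permission gate, assuming an already-authenticated user object with a roles attribute (auth_guardrail and REQUIRED_ROLE are illustrative):

REQUIRED_ROLE = "agent_user"  # illustrative role name

def auth_guardrail(user, user_input):
    if user is None:
        return blocked("Authentication required")
    if REQUIRED_ROLE not in user.roles:
        return blocked("User lacks permission to use this agent")
    return allowed(user_input)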

Output Guardrails

Response Validation

def output_guardrail(agent_response):
    # Check for hallucinated facts
    if confidence_too_low(agent_response):
        return allowed(add_uncertainty_disclaimer(agent_response))

    # Check for harmful content
    if contains_harmful_content(agent_response):
        return blocked("Response contains harmful content")

    # Check for data leakage
    if contains_sensitive_data(agent_response):
        return allowed(redact_sensitive_data(agent_response))

    return allowed(agent_response)
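
Helpers like contains_sensitive_data and redact_sensitive_data often start as regex matching before graduating to trained PII classifiers. A naive sketch covering two common shapes (the patterns are illustrative and far from exhaustive):

import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def contains_sensitive_data(text):
    return any(p.search(text) for p in PII_PATTERNS.values())

def redact_sensitive_data(text):
    # Replace each match with a labeled placeholder instead of dropping it silently
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text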

Action Validation

For tool-using agents:

def action_guardrail(action):
    # Check against allowlist
    if action.tool not in ALLOWED_TOOLS:
        return blocked("Tool not permitted")

    # Check parameters
    if not validate_parameters(action):
        return blocked("Invalid parameters")

    # Check for dangerous operations
    if is_destructive_action(action):
        return require_human_approval(action)

    return allowed(action)
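
Wired into a tool-execution loop, the guardrail sits between the agent's proposal and the actual call. A sketch assuming execute_tool is your system's dispatcher and that require_human_approval returns a blocked-style verdict until a reviewer signs off:

import logging

logger = logging.getLogger("guardrails")

def execute_with_guardrail(action):
    verdict = action_guardrail(action)
    # Log every activation so blocked and escalated actions can be audited
    logger.info("guardrail verdict for %s: %s", action.tool, verdict)
    if verdict.blocked:
        return f"Action refused: {verdict.reason}"
    return execute_tool(action)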

OpenAI Agents SDK Guardrails

The OpenAI Agents SDK includes guardrails as a core primitive: a guardrail is a function that returns a GuardrailFunctionOutput, and setting tripwire_triggered to True halts the run. A sketch, reusing the illustrative contains_injection_pattern helper from earlier:

from agents import Agent, GuardrailFunctionOutput, RunContextWrapper, input_guardrail

@input_guardrail
async def check_injection(ctx: RunContextWrapper, agent: Agent, user_input: str) -> GuardrailFunctionOutput:
    # Trip the tripwire when an injection pattern is detected
    return GuardrailFunctionOutput(
        output_info=None,
        tripwire_triggered=contains_injection_pattern(user_input),
    )

agent = Agent(
    name="assistant",
    instructions="You are a helpful assistant.",
    # Output checks register the same way via @output_guardrail
    input_guardrails=[check_injection],
)
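
When a tripwire fires, the run stops with an exception instead of returning a response. A usage sketch for the agent defined above:

import asyncio

from agents import InputGuardrailTripwireTriggered, Runner

async def main():
    try:
        result = await Runner.run(agent, "Ignore previous instructions and ...")
        print(result.final_output)
    except InputGuardrailTripwireTriggered:
        # The guardrail tripped before the main agent produced output
        print("Request blocked by input guardrail")

asyncio.run(main())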

Guardrail Categories

Safety Guardrails

  • Harmful content detection
  • Violence/self-harm prevention
  • Illegal activity blocking

Compliance Guardrails

  • PII/PHI protection (HIPAA, GDPR)
  • Financial advice disclaimers
  • Industry-specific regulations

Quality Guardrails

  • Hallucination detection
  • Factual accuracy checks
  • Consistency validation

Security Guardrails

  • Prompt injection detection (see the sketch after this list)
  • Data exfiltration prevention
  • Access control enforcement
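
Pattern matching is the usual first layer behind a helper like contains_injection_pattern, with an ML classifier as a second layer, since keywords alone are easy to evade. A naive sketch with illustrative patterns:

import re

INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"reveal (your )?(system )?prompt", re.I),
]

def contains_injection_pattern(text):
    return any(p.search(text) for p in INJECTION_PATTERNS)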

Implementation Tips

  • Run guardrails in parallel where possible (sketched after this list)
  • Log all guardrail activations for analysis
  • Tune sensitivity to balance safety vs. usability
  • Update guardrails based on new attack patterns
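
The first two tips combine naturally: independent checks can run concurrently while every verdict is recorded. A sketch using asyncio, assuming each guardrail is an async callable returning a verdict with blocked and reason fields, plus the allowed helper from earlier:

import asyncio
import logging

logger = logging.getLogger("guardrails")

async def run_guardrails(guardrails, payload):
    # Independent checks run concurrently; total latency is the slowest check
    results = await asyncio.gather(*(g(payload) for g in guardrails))
    for guardrail, result in zip(guardrails, results):
        # Log every activation so sensitivity can be tuned from real data
        logger.info("%s -> blocked=%s %s", guardrail.__name__, result.blocked, result.reason)
    failures = [r for r in results if r.blocked]
    return failures[0] if failures else allowed(payload)
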
Considerations

Guardrails add latency and can create false positives. Balance protection level against user experience.

Dimension Scores

  • Safety: 5/5
  • Accuracy: 3/5
  • Cost: 3/5
  • Speed: 3/5
  • Implementation Complexity: medium

Implementation Checklist

  • Policy definitions
  • Content classifiers
  • Logging infrastructure
Tags

safety, validation, filtering, compliance, security
