
Output Guardrails: Ensuring Safe Agent Responses

Learn how to implement output guardrails in the OpenAI Agents SDK to inspect, validate, and block unsafe agent responses before they reach end users — including PII detection and compliance filtering.

Why Output Guardrails Exist

Input guardrails protect against bad requests. Output guardrails protect against bad responses. Even with a well-designed prompt and input validation, an LLM can produce outputs that violate your policies — leaking internal data, generating PII, returning hallucinated legal advice, or producing content that does not meet compliance standards.

Output guardrails in the OpenAI Agents SDK run after the agent completes its response but before that response is returned to the caller. They give you a final checkpoint to inspect, validate, and potentially block the agent's output.

The pattern mirrors input guardrails: you define a guardrail function, it returns a GuardrailFunctionOutput, and if the tripwire is triggered, the SDK raises an exception. The key difference is that output guardrails receive the agent's generated output rather than the user's input.

Basic Output Guardrail Structure

An output guardrail function receives the agent's output and evaluates it against your policies.

from agents import Agent, Runner, OutputGuardrail, GuardrailFunctionOutput
from pydantic import BaseModel
import asyncio
import re

async def pii_guardrail(ctx, agent, output) -> GuardrailFunctionOutput:
    """Check agent output for personally identifiable information."""
    output_text = str(output)

    # Check for common PII patterns
    ssn_pattern = r"\d{3}-\d{2}-\d{4}"
    email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    phone_pattern = r"\d{3}[-.]?\d{3}[-.]?\d{4}"

    findings = {
        "ssn_found": bool(re.search(ssn_pattern, output_text)),
        "email_found": bool(re.search(email_pattern, output_text)),
        "phone_found": bool(re.search(phone_pattern, output_text)),
    }

    has_pii = any(findings.values())

    return GuardrailFunctionOutput(
        output_info=findings,
        tripwire_triggered=has_pii,
    )

support_agent = Agent(
    name="SupportAgent",
    instructions="""You are a customer support agent. Help users with
    their account issues. NEVER include SSNs, full email addresses,
    or phone numbers in your responses — use masked versions instead.""",
    model="gpt-4o",
    output_guardrails=[
        OutputGuardrail(guardrail_function=pii_guardrail),
    ],
)

Even though the instructions tell the agent to avoid PII, you cannot rely on prompt instructions alone. The output guardrail acts as an enforced policy — it catches what the model misses.

Catching OutputGuardrailTripwireTriggered

When an output guardrail trips, the SDK raises OutputGuardrailTripwireTriggered. This is your signal to suppress the response and return a safe alternative.

from datetime import datetime, timezone

from agents.exceptions import OutputGuardrailTripwireTriggered

async def handle_user_message(user_input: str) -> str:
    try:
        result = await Runner.run(support_agent, user_input)
        return result.final_output
    except OutputGuardrailTripwireTriggered as e:
        guardrail_info = e.guardrail_result.output_info
        # Log for compliance audit; log_pii_violation is your own
        # audit-logging helper, not part of the SDK.
        log_pii_violation(
            user_input=user_input,
            guardrail_findings=guardrail_info,
            timestamp=datetime.now(timezone.utc),
        )
        )

        return (
            "I apologize, but I am unable to share that information "
            "in this format. Please contact our support team directly "
            "for assistance with sensitive account details."
        )

The critical point: the unsafe output is never returned to the user. The result.final_output that contained PII is discarded when the exception fires. Your application returns a safe, generic message instead.
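The suppress-and-replace flow can be exercised without the SDK by stubbing the tripwire exception. Everything here (FakeTripwire, fake_run) is a hypothetical stand-in, not SDK API — it only illustrates the control flow:

```python
import asyncio

class FakeTripwire(Exception):
    """Stand-in for OutputGuardrailTripwireTriggered."""
    def __init__(self, findings: dict):
        self.findings = findings

async def fake_run(user_input: str) -> str:
    # Pretend the agent produced output that tripped a PII guardrail.
    raise FakeTripwire({"ssn_found": True})

SAFE_MESSAGE = "I am unable to share that information in this format."

async def handle(user_input: str) -> str:
    try:
        return await fake_run(user_input)
    except FakeTripwire as e:
        # The unsafe output never reaches the caller; e.findings would
        # go to your audit log here.
        return SAFE_MESSAGE

print(asyncio.run(handle("show my account")))  # prints the safe fallback
```

The caller only ever sees the safe fallback string; the exception is the mechanism that discards the unsafe output.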


LLM-Based Output Guardrails

Regex patterns catch structured PII, but many compliance requirements are semantic. For example, detecting whether a response contains medical advice, financial recommendations, or legal opinions requires an LLM to understand context.

class ComplianceCheckOutput(BaseModel):
    is_compliant: bool
    violation_type: str | None = None
    explanation: str

compliance_checker = Agent(
    name="ComplianceChecker",
    instructions="""Evaluate the given text for compliance violations.
    Flag if the text contains:
    - Medical diagnoses or treatment recommendations
    - Specific financial or investment advice
    - Legal opinions presented as fact
    - Promises or guarantees about outcomes
    Return is_compliant=True if none of these are present.""",
    model="gpt-4o-mini",
    output_type=ComplianceCheckOutput,
)

async def compliance_guardrail(ctx, agent, output) -> GuardrailFunctionOutput:
    result = await Runner.run(compliance_checker, str(output), context=ctx.context)
    return GuardrailFunctionOutput(
        output_info=result.final_output.model_dump(),
        tripwire_triggered=not result.final_output.is_compliant,
    )

This pattern uses a small, fast model (gpt-4o-mini) as the compliance checker. It evaluates the main agent's full response and flags violations that no regex could catch — like "You should definitely invest in index funds right now" being flagged as financial advice.

PII Detection: A Complete Example

Here is a production-grade PII detection guardrail that combines regex patterns with an LLM-based check for contextual PII (names, addresses, and other information that is PII only in context).

import re
from agents import Agent, Runner, OutputGuardrail, GuardrailFunctionOutput
from pydantic import BaseModel

class PIIAnalysis(BaseModel):
    contains_pii: bool
    pii_types: list[str]
    confidence: float

pii_detector = Agent(
    name="PIIDetector",
    instructions="""Analyze the text for personally identifiable
    information. Check for: full names paired with account details,
    physical addresses, dates of birth in context, medical record
    numbers, any data that could identify a specific individual.
    Report confidence as a float between 0 and 1.""",
    model="gpt-4o-mini",
    output_type=PIIAnalysis,
)

REGEX_PATTERNS = {
    "ssn": r"\d{3}-\d{2}-\d{4}",
    "credit_card": r"\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}",
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "phone_us": r"(?:\+1[-\s]?)?\(?\d{3}\)?[-\s.]?\d{3}[-\s.]?\d{4}",
    "ip_address": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
}

async def comprehensive_pii_guardrail(ctx, agent, output) -> GuardrailFunctionOutput:
    output_text = str(output)

    # Layer 1: Fast regex scan
    regex_findings = {}
    for pii_type, pattern in REGEX_PATTERNS.items():
        matches = re.findall(pattern, output_text)
        if matches:
            regex_findings[pii_type] = len(matches)

    # If regex finds PII, trip immediately — no need for LLM check
    if regex_findings:
        return GuardrailFunctionOutput(
            output_info={"method": "regex", "findings": regex_findings},
            tripwire_triggered=True,
        )

    # Layer 2: LLM-based contextual PII check
    result = await Runner.run(pii_detector, output_text, context=ctx.context)
    analysis = result.final_output

    return GuardrailFunctionOutput(
        output_info={
            "method": "llm",
            "contains_pii": analysis.contains_pii,
            "pii_types": analysis.pii_types,
            "confidence": analysis.confidence,
        },
        tripwire_triggered=analysis.contains_pii and analysis.confidence > 0.7,
    )

This two-layer approach is both fast and thorough. Regex catches structured PII instantly without any LLM cost. The LLM layer only runs when regex finds nothing, catching contextual PII that patterns miss.
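The Layer-1 regex scan is plain Python and can be sanity-checked on its own, outside the SDK. A minimal sketch with a subset of the patterns above (scan_for_pii is an illustrative helper name, not SDK API):

```python
import re

# Subset of the REGEX_PATTERNS table above.
REGEX_PATTERNS = {
    "ssn": r"\d{3}-\d{2}-\d{4}",
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
}

def scan_for_pii(text: str) -> dict[str, int]:
    """Return {pii_type: match_count} for every pattern that fires."""
    findings = {}
    for pii_type, pattern in REGEX_PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            findings[pii_type] = len(matches)
    return findings

print(scan_for_pii("SSN 123-45-6789, contact bob@example.com"))
# {'ssn': 1, 'email': 1}
```

An empty dict means the cheap layer found nothing and the LLM layer should run.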

Output Guardrails with Structured Output

When your agent uses output_type to return structured data (a Pydantic model), the output guardrail receives the parsed object, not raw text. This makes validation even more precise.

class CustomerResponse(BaseModel):
    message: str
    suggested_actions: list[str]
    internal_notes: str | None = None

async def no_internal_notes_guardrail(ctx, agent, output) -> GuardrailFunctionOutput:
    """Ensure internal notes are never populated in the response."""
    if isinstance(output, CustomerResponse) and output.internal_notes:
        return GuardrailFunctionOutput(
            output_info={"violation": "internal_notes_populated"},
            tripwire_triggered=True,
        )
    return GuardrailFunctionOutput(
        output_info={"status": "clean"},
        tripwire_triggered=False,
    )

Combining Input and Output Guardrails

A defense-in-depth strategy uses both guardrail types. Input guardrails block bad requests early. Output guardrails catch any issues that slip through the agent's processing.

production_agent = Agent(
    name="ProductionAgent",
    instructions="You are a helpful assistant for Acme Corp customers.",
    model="gpt-4o",
    input_guardrails=[
        InputGuardrail(guardrail_function=topic_guardrail),
        InputGuardrail(guardrail_function=injection_guardrail),
    ],
    output_guardrails=[
        OutputGuardrail(guardrail_function=comprehensive_pii_guardrail),
        OutputGuardrail(guardrail_function=compliance_guardrail),
        OutputGuardrail(guardrail_function=no_internal_notes_guardrail),
    ],
)

Input guardrails save money by rejecting bad input before the main agent runs. Output guardrails save reputation by catching bad output before the user sees it. Both are necessary. Neither alone is sufficient.

Performance Considerations

Output guardrails add latency to every successful response. The user waits for both the agent and the guardrail to finish. To minimize impact:

- Use regex and heuristic checks first; only call an LLM-based guardrail when the cheap checks pass.
- Keep guardrail agents on fast models like gpt-4o-mini.
- Run multiple output guardrails in parallel when they are independent.
- Measure: track the p50, p95, and p99 latency that guardrails add so you can make informed trade-offs between safety and speed.
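Independent guardrails can be fanned out concurrently with asyncio.gather; a minimal sketch with stand-in checks (regex_check and llm_check are hypothetical placeholders, not SDK API):

```python
import asyncio

async def regex_check(text: str) -> bool:
    # Cheap synchronous-style check; trips on a marker token.
    return "SSN" in text

async def llm_check(text: str) -> bool:
    # Simulate a slower LLM-backed compliance check.
    await asyncio.sleep(0.01)
    return "diagnosis" in text

async def any_tripwire(text: str) -> bool:
    # Running independent checks concurrently means total guardrail
    # latency is the slowest single check, not the sum of all of them.
    results = await asyncio.gather(regex_check(text), llm_check(text))
    return any(results)

print(asyncio.run(any_tripwire("Here is your diagnosis")))  # True
```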

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
