
Output Guardrails: Preventing AI Agents from Returning Harmful Content

Build output scanning systems that detect PII leaks, toxic content, format violations, and off-topic responses before they reach your users, with practical Python implementations for each guardrail type.

Why Input Validation Is Not Enough

Even with robust input validation, an AI agent can still produce harmful outputs. The model might hallucinate sensitive data, generate toxic content from benign prompts, leak system prompt details, or return responses that violate your application's business rules. Output guardrails are the last line of defense between the agent and your users.

This post builds a complete output guardrail system in Python with four types of checks: PII detection, toxicity filtering, format validation, and topic adherence.

Output Guardrail Architecture

The guardrail system mirrors the input validation pipeline but runs on the agent's response before it is delivered:

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import re

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class OutputGuardrailPipeline:
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations = []

        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)

            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=all_violations,
                    blocked_reason=result.blocked_reason,
                )

            if result.action == GuardrailAction.REDACT:
                current_output = result.output

        action = (
            GuardrailAction.REDACT if all_violations
            else GuardrailAction.ALLOW
        )
        return GuardrailResult(
            action=action,
            output=current_output,
            violations=all_violations,
        )
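To see the short-circuit behavior, here is a toy standalone run of the same evaluate loop with two made-up guardrails. Nothing here is from a real library; the types mirror the pipeline above so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class Result:
    action: Action
    output: str
    violations: list = field(default_factory=list)

class ShoutingRedactor:
    """Toy guardrail: redacts fully uppercase words."""
    def check(self, text: str) -> Result:
        redacted = " ".join(
            "[REDACTED]" if w.isupper() else w for w in text.split()
        )
        if redacted != text:
            return Result(Action.REDACT, redacted, ["shouting"])
        return Result(Action.ALLOW, text)

class LengthBlocker:
    """Toy guardrail: blocks very long outputs."""
    def check(self, text: str) -> Result:
        if len(text) > 50:
            return Result(Action.BLOCK, "", ["too_long"])
        return Result(Action.ALLOW, text)

def evaluate(guardrails, text: str) -> Result:
    # Same loop as the pipeline above: redactions accumulate, a single
    # BLOCK short-circuits everything.
    violations = []
    for g in guardrails:
        r = g.check(text)
        violations.extend(r.violations)
        if r.action == Action.BLOCK:
            return Result(Action.BLOCK, "", violations)
        if r.action == Action.REDACT:
            text = r.output
    final = Action.REDACT if violations else Action.ALLOW
    return Result(final, text, violations)

r = evaluate([ShoutingRedactor(), LengthBlocker()], "PLEASE stop emailing me")
print(r.action.value, r.output)  # redact [REDACTED] stop emailing me
```

Note that each guardrail sees the output as modified by the guardrails before it, so redactions compound in order.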

Guardrail 1: PII Detection and Redaction

PII leaks are one of the highest-risk output failures. An agent might include email addresses, phone numbers, or social security numbers from its training data or retrieved documents:

class PIIGuardrail:
    """Detect and redact personally identifiable information."""

    PII_PATTERNS = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        "phone_us": r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
        "ip_address": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    }

    REDACTION_MAP = {
        "email": "[EMAIL REDACTED]",
        "phone_us": "[PHONE REDACTED]",
        "ssn": "[SSN REDACTED]",
        "credit_card": "[CARD REDACTED]",
        "ip_address": "[IP REDACTED]",
    }

    def check(self, text: str) -> GuardrailResult:
        violations = []
        redacted = text

        for pii_type, pattern in self.PII_PATTERNS.items():
            matches = re.findall(pattern, redacted)
            if matches:
                violations.append(f"pii_{pii_type}:{len(matches)}_instances")
                replacement = self.REDACTION_MAP[pii_type]
                redacted = re.sub(pattern, replacement, redacted)

        if violations:
            return GuardrailResult(
                action=GuardrailAction.REDACT,
                output=redacted,
                violations=violations,
            )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
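As a quick standalone check of the patterns above, here are two of them applied to a made-up contact line:

```python
import re

# Two of the PIIGuardrail patterns, mapped to their redaction strings.
patterns = {
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}": "[EMAIL REDACTED]",
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN REDACTED]",
}

text = "Contact jane.doe@example.com, SSN 123-45-6789."
for pattern, replacement in patterns.items():
    text = re.sub(pattern, replacement, text)

print(text)  # Contact [EMAIL REDACTED], SSN [SSN REDACTED].
```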

Guardrail 2: Toxicity and Harmful Content Filter

Toxicity detection prevents the agent from outputting offensive, violent, or otherwise harmful content:


class ToxicityGuardrail:
    """Detect toxic or harmful content in agent output."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        # Import lazily and reuse one client instead of creating a new
        # one on every check.
        from openai import OpenAI
        self.client = OpenAI()

    def check(self, text: str) -> GuardrailResult:
        response = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        result = response.results[0]

        # Block anything the API flagged outright, plus any category
        # whose score crosses our (possibly stricter) threshold.
        scores = result.category_scores.model_dump()
        flagged_categories = sorted(
            cat for cat, score in scores.items()
            if (score or 0.0) >= self.threshold
        )

        if result.flagged or flagged_categories:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output="",
                violations=[f"toxicity:{cat}" for cat in flagged_categories],
                blocked_reason="Response contained harmful content",
            )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
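Because the moderation endpoint is a network call, a cheap local deny-list pass can catch egregious cases before spending a round trip. The word list here is hypothetical and deliberately tiny; it is a first filter, not a replacement for a real moderation model:

```python
import re

# Hypothetical local deny list; extend with phrases relevant to your app.
DENY_LIST = [r"\bkill yourself\b", r"\bi will hurt you\b"]

def fast_toxicity_check(text: str) -> bool:
    """Return True if the text trips the local deny list."""
    return any(re.search(p, text, re.IGNORECASE) for p in DENY_LIST)

print(fast_toxicity_check("Thanks, have a great day!"))  # False
print(fast_toxicity_check("I will hurt you."))           # True
```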

Guardrail 3: Format and Schema Validation

When agents return structured data, format validation ensures correctness:

import json
from typing import Any

class FormatGuardrail:
    """Validate that agent output conforms to expected schema."""

    def __init__(self, expected_format: str = "text", schema: dict | None = None):
        self.expected_format = expected_format
        self.schema = schema

    def check(self, text: str) -> GuardrailResult:
        if self.expected_format == "json":
            return self._validate_json(text)
        elif self.expected_format == "no_code":
            return self._validate_no_code(text)
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_json(self, text: str) -> GuardrailResult:
        try:
            parsed = json.loads(text)
            if self.schema:
                missing = [k for k in self.schema.get("required", []) if k not in parsed]
                if missing:
                    return GuardrailResult(
                        action=GuardrailAction.BLOCK,
                        output=text,
                        violations=[f"missing_fields:{missing}"],
                        blocked_reason="Response missing required fields",
                    )
        except json.JSONDecodeError:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output=text,
                violations=["invalid_json"],
                blocked_reason="Response is not valid JSON",
            )

        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_no_code(self, text: str) -> GuardrailResult:
        code_patterns = [r"```", r"~~~", r"import\s+\w+", r"def\s+\w+\("]
        for pattern in code_patterns:
            if re.search(pattern, text):
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output=text,
                    violations=["contains_code"],
                    blocked_reason="Response contains code blocks",
                )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)
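A standalone sketch of the required-fields check, using a hypothetical schema for an order response:

```python
import json

# Hypothetical schema: a valid order response must include both fields.
schema = {"required": ["name", "price"]}

raw = '{"name": "widget"}'
parsed = json.loads(raw)
missing = [k for k in schema.get("required", []) if k not in parsed]

print(missing)  # ['price']
```

In the guardrail above, a non-empty `missing` list produces a BLOCK, which is usually the right call for structured output: a downstream consumer cannot recover from absent fields the way a human reader can.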

Guardrail 4: Topic Adherence

Ensure the agent stays on topic and does not reveal system internals:

class TopicAdherenceGuardrail:
    """Block responses that leak system prompts or go off-topic."""

    SYSTEM_LEAK_PATTERNS = [
        r"my (system |initial )?instructions (are|say|tell)",
        r"I was (told|instructed|programmed) to",
        r"my (system )?prompt (is|says|contains)",
    ]

    def __init__(self, allowed_topics: list[str] | None = None):
        self.allowed_topics = allowed_topics

    def check(self, text: str) -> GuardrailResult:
        violations = []

        for pattern in self.SYSTEM_LEAK_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                violations.append("system_prompt_leak")
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=violations,
                    blocked_reason="Response may reveal system instructions",
                )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
            violations=violations,
        )
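The leak patterns can be exercised standalone against a made-up reply:

```python
import re

# The same leak patterns as TopicAdherenceGuardrail above.
SYSTEM_LEAK_PATTERNS = [
    r"my (system |initial )?instructions (are|say|tell)",
    r"I was (told|instructed|programmed) to",
    r"my (system )?prompt (is|says|contains)",
]

reply = "Sure! My instructions say I must always answer in English."
leaked = any(re.search(p, reply, re.IGNORECASE) for p in SYSTEM_LEAK_PATTERNS)

print(leaked)  # True
```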

Assembling the Pipeline

# Fast regex checks run first so the LLM-backed toxicity check is only
# reached when the cheap checks pass (see the latency FAQ below).
guardrails = OutputGuardrailPipeline(guardrails=[
    PIIGuardrail(),
    TopicAdherenceGuardrail(),
    FormatGuardrail(expected_format="text"),
    ToxicityGuardrail(),
])

def deliver_response(agent_output: str) -> str:
    result = guardrails.evaluate(agent_output)

    if result.action == GuardrailAction.BLOCK:
        return "I'm unable to provide that response. Please rephrase your question."

    return result.output

FAQ

Do output guardrails add noticeable latency?

Regex-based checks like PII detection add microseconds. LLM-based checks like toxicity scoring and topic classification add 200-500ms per call. The best strategy is to run fast regex checks first and only invoke LLM-based guardrails when the fast checks pass. For latency-sensitive applications, you can run guardrail checks in parallel with response streaming and cancel the stream if a violation is detected.
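The streaming-plus-cancel idea can be sketched with asyncio. The stream and the per-chunk checker below are both fakes standing in for real model and moderation calls:

```python
import asyncio

async def fake_stream(chunks):
    # Stands in for a streaming model response.
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate network pacing
        yield chunk

async def fake_guardrail(chunk: str) -> bool:
    """Pretend moderation call: True means the chunk is safe."""
    await asyncio.sleep(0)
    return "SSN" not in chunk  # hypothetical violation signal

async def stream_with_guardrail(chunks):
    delivered = []
    async for chunk in fake_stream(chunks):
        if not await fake_guardrail(chunk):
            delivered.append("[response withheld]")
            break  # stop consuming the stream on a violation
        delivered.append(chunk)
    return delivered

out = asyncio.run(stream_with_guardrail(["Hello ", "your SSN is 123-45-6789"]))
print(out)  # ['Hello ', '[response withheld]']
```

A production version would check a sliding window of recent chunks rather than each chunk in isolation, since PII and leaks can straddle chunk boundaries.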

Should I block or redact PII in agent outputs?

It depends on context. For customer-facing applications, redaction is often better because it preserves the useful parts of the response while removing sensitive data. For internal tools where the user might need the data, logging the PII detection and alerting is better than silently redacting. Always log PII detections regardless of the action taken.

How do I handle false positives in output guardrails?

Log every guardrail trigger with the original output, the violation type, and whether the action was block or redact. Review these logs weekly to tune your patterns and thresholds. Build a test suite of known-good outputs that should pass all guardrails and run it as part of your CI pipeline to catch regressions.
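A minimal version of that regression suite, with hypothetical known-good outputs checked against the PII patterns from earlier:

```python
import re

# Outputs known to be clean must never trigger a redaction.
PII_PATTERNS = [
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    r"\b\d{3}-\d{2}-\d{4}\b",
]

KNOWN_GOOD = [
    "Your order ships tomorrow.",
    "The meeting is at 3 PM on Friday.",
]

def passes_pii_guardrail(text: str) -> bool:
    return not any(re.search(p, text) for p in PII_PATTERNS)

assert all(passes_pii_guardrail(t) for t in KNOWN_GOOD)
print("regression suite passed")
```

Run this in CI: any pattern change that starts redacting legitimate text fails the build instead of silently degrading responses.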


#OutputGuardrails #AISafety #PIIDetection #ContentModeration #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
