
Output Guardrails: Preventing AI Agents from Returning Harmful Content

Build output scanning systems that detect PII leaks, toxic content, format violations, and off-topic responses before they reach your users, with practical Python implementations for each guardrail type.

Why Input Validation Is Not Enough

Even with robust input validation, an AI agent can still produce harmful outputs. The model might hallucinate sensitive data, generate toxic content from benign prompts, leak system prompt details, or return responses that violate your application's business rules. Output guardrails are the last line of defense between the agent and your users.

This post builds a complete output guardrail system in Python with four types of checks: PII detection, toxicity filtering, format validation, and topic adherence.

Output Guardrail Architecture

The guardrail system mirrors the input validation pipeline but runs on the agent's response before it is delivered:

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import re

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class OutputGuardrailPipeline:
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations = []

        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)

            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=all_violations,
                    blocked_reason=result.blocked_reason,
                )

            if result.action == GuardrailAction.REDACT:
                current_output = result.output

        action = (
            GuardrailAction.REDACT if all_violations
            else GuardrailAction.ALLOW
        )
        return GuardrailResult(
            action=action,
            output=current_output,
            violations=all_violations,
        )
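To see the short-circuit behavior, here is a toy standalone run of the same evaluate loop with two made-up guardrails. Nothing here is from a real library; the types mirror the pipeline above so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class Result:
    action: Action
    output: str
    violations: list = field(default_factory=list)

class ShoutingRedactor:
    """Toy guardrail: redacts fully uppercase words."""
    def check(self, text: str) -> Result:
        redacted = " ".join(
            "[REDACTED]" if w.isupper() else w for w in text.split()
        )
        if redacted != text:
            return Result(Action.REDACT, redacted, ["shouting"])
        return Result(Action.ALLOW, text)

class LengthBlocker:
    """Toy guardrail: blocks very long outputs."""
    def check(self, text: str) -> Result:
        if len(text) > 50:
            return Result(Action.BLOCK, "", ["too_long"])
        return Result(Action.ALLOW, text)

def evaluate(guardrails, text: str) -> Result:
    # Same loop as the pipeline above: redactions accumulate, a single
    # BLOCK short-circuits everything.
    violations = []
    for g in guardrails:
        r = g.check(text)
        violations.extend(r.violations)
        if r.action == Action.BLOCK:
            return Result(Action.BLOCK, "", violations)
        if r.action == Action.REDACT:
            text = r.output
    final = Action.REDACT if violations else Action.ALLOW
    return Result(final, text, violations)

r = evaluate([ShoutingRedactor(), LengthBlocker()], "PLEASE stop emailing me")
print(r.action.value, r.output)  # redact [REDACTED] stop emailing me
```

Note that each guardrail sees the output as modified by the guardrails before it, so redactions compound in order.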

Guardrail 1: PII Detection and Redaction

PII leaks are one of the highest-risk output failures. An agent might include email addresses, phone numbers, or social security numbers from its training data or retrieved documents:

class PIIGuardrail:
    """Detect and redact personally identifiable information."""

    PII_PATTERNS = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        "phone_us": r"\b(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
        "ip_address": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    }

    REDACTION_MAP = {
        "email": "[EMAIL REDACTED]",
        "phone_us": "[PHONE REDACTED]",
        "ssn": "[SSN REDACTED]",
        "credit_card": "[CARD REDACTED]",
        "ip_address": "[IP REDACTED]",
    }

    def check(self, text: str) -> GuardrailResult:
        violations = []
        redacted = text

        for pii_type, pattern in self.PII_PATTERNS.items():
            matches = re.findall(pattern, redacted)
            if matches:
                violations.append(f"pii_{pii_type}:{len(matches)}_instances")
                replacement = self.REDACTION_MAP[pii_type]
                redacted = re.sub(pattern, replacement, redacted)

        if violations:
            return GuardrailResult(
                action=GuardrailAction.REDACT,
                output=redacted,
                violations=violations,
            )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
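As a quick standalone check of the patterns above, here are two of them applied to a made-up contact line:

```python
import re

# Two of the PIIGuardrail patterns, mapped to their redaction strings.
patterns = {
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}": "[EMAIL REDACTED]",
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN REDACTED]",
}

text = "Contact jane.doe@example.com, SSN 123-45-6789."
for pattern, replacement in patterns.items():
    text = re.sub(pattern, replacement, text)

print(text)  # Contact [EMAIL REDACTED], SSN [SSN REDACTED].
```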

Guardrail 2: Toxicity and Harmful Content Filter

Toxicity detection prevents the agent from outputting offensive, violent, or otherwise harmful content:


class ToxicityGuardrail:
    """Detect toxic or harmful content in agent output."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        # Import lazily and reuse one client instead of creating a new
        # one on every check.
        from openai import OpenAI
        self.client = OpenAI()

    def check(self, text: str) -> GuardrailResult:
        response = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        result = response.results[0]

        # Block anything the API flagged outright, plus any category
        # whose score crosses our (possibly stricter) threshold.
        scores = result.category_scores.model_dump()
        flagged_categories = sorted(
            cat for cat, score in scores.items()
            if (score or 0.0) >= self.threshold
        )

        if result.flagged or flagged_categories:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output="",
                violations=[f"toxicity:{cat}" for cat in flagged_categories],
                blocked_reason="Response contained harmful content",
            )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
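Because the moderation endpoint is a network call, a cheap local deny-list pass can catch egregious cases before spending a round trip. The word list here is hypothetical and deliberately tiny; it is a first filter, not a replacement for a real moderation model:

```python
import re

# Hypothetical local deny list; extend with phrases relevant to your app.
DENY_LIST = [r"\bkill yourself\b", r"\bi will hurt you\b"]

def fast_toxicity_check(text: str) -> bool:
    """Return True if the text trips the local deny list."""
    return any(re.search(p, text, re.IGNORECASE) for p in DENY_LIST)

print(fast_toxicity_check("Thanks, have a great day!"))  # False
print(fast_toxicity_check("I will hurt you."))           # True
```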

Guardrail 3: Format and Schema Validation

When agents return structured data, format validation ensures correctness:

import json
from typing import Any

class FormatGuardrail:
    """Validate that agent output conforms to expected schema."""

    def __init__(self, expected_format: str = "text", schema: dict | None = None):
        self.expected_format = expected_format
        self.schema = schema

    def check(self, text: str) -> GuardrailResult:
        if self.expected_format == "json":
            return self._validate_json(text)
        elif self.expected_format == "no_code":
            return self._validate_no_code(text)
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_json(self, text: str) -> GuardrailResult:
        try:
            parsed = json.loads(text)
            if self.schema:
                missing = [k for k in self.schema.get("required", []) if k not in parsed]
                if missing:
                    return GuardrailResult(
                        action=GuardrailAction.BLOCK,
                        output=text,
                        violations=[f"missing_fields:{missing}"],
                        blocked_reason="Response missing required fields",
                    )
        except json.JSONDecodeError:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output=text,
                violations=["invalid_json"],
                blocked_reason="Response is not valid JSON",
            )

        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_no_code(self, text: str) -> GuardrailResult:
        code_patterns = [r"```", r"~~~", r"import\s+\w+", r"def\s+\w+\("]
        for pattern in code_patterns:
            if re.search(pattern, text):
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output=text,
                    violations=["contains_code"],
                    blocked_reason="Response contains code blocks",
                )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)
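A standalone sketch of the required-fields check, using a hypothetical schema for an order response:

```python
import json

# Hypothetical schema: a valid order response must include both fields.
schema = {"required": ["name", "price"]}

raw = '{"name": "widget"}'
parsed = json.loads(raw)
missing = [k for k in schema.get("required", []) if k not in parsed]

print(missing)  # ['price']
```

In the guardrail above, a non-empty `missing` list produces a BLOCK, which is usually the right call for structured output: a downstream consumer cannot recover from absent fields the way a human reader can.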

Guardrail 4: Topic Adherence

Ensure the agent stays on topic and does not reveal system internals:

class TopicAdherenceGuardrail:
    """Block responses that leak system prompts or go off-topic."""

    SYSTEM_LEAK_PATTERNS = [
        r"my (system |initial )?instructions (are|say|tell)",
        r"I was (told|instructed|programmed) to",
        r"my (system )?prompt (is|says|contains)",
    ]

    def __init__(self, allowed_topics: list[str] | None = None):
        self.allowed_topics = allowed_topics

    def check(self, text: str) -> GuardrailResult:
        violations = []

        for pattern in self.SYSTEM_LEAK_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                violations.append("system_prompt_leak")
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=violations,
                    blocked_reason="Response may reveal system instructions",
                )

        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
            violations=violations,
        )
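The leak patterns can be exercised standalone against a made-up reply:

```python
import re

# The same leak patterns as TopicAdherenceGuardrail above.
SYSTEM_LEAK_PATTERNS = [
    r"my (system |initial )?instructions (are|say|tell)",
    r"I was (told|instructed|programmed) to",
    r"my (system )?prompt (is|says|contains)",
]

reply = "Sure! My instructions say I must always answer in English."
leaked = any(re.search(p, reply, re.IGNORECASE) for p in SYSTEM_LEAK_PATTERNS)

print(leaked)  # True
```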

Assembling the Pipeline

# Fast regex checks run first so the LLM-backed toxicity check is only
# reached when the cheap checks pass (see the latency FAQ below).
guardrails = OutputGuardrailPipeline(guardrails=[
    PIIGuardrail(),
    TopicAdherenceGuardrail(),
    FormatGuardrail(expected_format="text"),
    ToxicityGuardrail(),
])

def deliver_response(agent_output: str) -> str:
    result = guardrails.evaluate(agent_output)

    if result.action == GuardrailAction.BLOCK:
        return "I'm unable to provide that response. Please rephrase your question."

    return result.output

FAQ

Do output guardrails add noticeable latency?

Regex-based checks like PII detection add microseconds. LLM-based checks like toxicity scoring and topic classification add 200-500ms per call. The best strategy is to run fast regex checks first and only invoke LLM-based guardrails when the fast checks pass. For latency-sensitive applications, you can run guardrail checks in parallel with response streaming and cancel the stream if a violation is detected.
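The streaming-plus-cancel idea can be sketched with asyncio. The stream and the per-chunk checker below are both fakes standing in for real model and moderation calls:

```python
import asyncio

async def fake_stream(chunks):
    # Stands in for a streaming model response.
    for chunk in chunks:
        await asyncio.sleep(0)  # simulate network pacing
        yield chunk

async def fake_guardrail(chunk: str) -> bool:
    """Pretend moderation call: True means the chunk is safe."""
    await asyncio.sleep(0)
    return "SSN" not in chunk  # hypothetical violation signal

async def stream_with_guardrail(chunks):
    delivered = []
    async for chunk in fake_stream(chunks):
        if not await fake_guardrail(chunk):
            delivered.append("[response withheld]")
            break  # stop consuming the stream on a violation
        delivered.append(chunk)
    return delivered

out = asyncio.run(stream_with_guardrail(["Hello ", "your SSN is 123-45-6789"]))
print(out)  # ['Hello ', '[response withheld]']
```

A production version would check a sliding window of recent chunks rather than each chunk in isolation, since PII and leaks can straddle chunk boundaries.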

Should I block or redact PII in agent outputs?

It depends on context. For customer-facing applications, redaction is often better because it preserves the useful parts of the response while removing sensitive data. For internal tools where the user might need the data, logging the PII detection and alerting is better than silently redacting. Always log PII detections regardless of the action taken.

How do I handle false positives in output guardrails?

Log every guardrail trigger with the original output, the violation type, and whether the action was block or redact. Review these logs weekly to tune your patterns and thresholds. Build a test suite of known-good outputs that should pass all guardrails and run it as part of your CI pipeline to catch regressions.
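A minimal version of that regression suite, with hypothetical known-good outputs checked against the PII patterns from earlier:

```python
import re

# Outputs known to be clean must never trigger a redaction.
PII_PATTERNS = [
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    r"\b\d{3}-\d{2}-\d{4}\b",
]

KNOWN_GOOD = [
    "Your order ships tomorrow.",
    "The meeting is at 3 PM on Friday.",
]

def passes_pii_guardrail(text: str) -> bool:
    return not any(re.search(p, text) for p in PII_PATTERNS)

assert all(passes_pii_guardrail(t) for t in KNOWN_GOOD)
print("regression suite passed")
```

Run this in CI: any pattern change that starts redacting legitimate text fails the build instead of silently degrading responses.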


#OutputGuardrails #AISafety #PIIDetection #ContentModeration #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
