
Content Security Policies for AI Agents: Preventing Malicious Output Generation

Build robust output filtering systems for AI agents using allowlists, blocklists, regex patterns, ML classifiers, and structured output validation to prevent harmful, toxic, or policy-violating content from reaching end users.

Why Output Filtering Is Non-Negotiable

An AI agent can generate any text the underlying LLM is capable of producing. Without output filtering, agents can leak private data, generate harmful instructions, produce policy-violating content, or output executable code that acts as a cross-site scripting payload when rendered in a browser.

Content security for AI agents operates on a different model than traditional web content security policies. Instead of restricting which resources a browser can load, agent content security restricts what the agent can say. The enforcement point sits between the LLM's raw output and the delivery layer that sends responses to users.

Layered Filtering Architecture

Build your content security as a pipeline of filters that each response must pass through. If any filter rejects the response, it is blocked or sanitized before delivery:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class FilterVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    SANITIZE = "sanitize"


@dataclass
class FilterResult:
    verdict: FilterVerdict
    filter_name: str
    reason: str
    sanitized_content: str | None = None


class ContentFilter(ABC):
    """Base class for content security filters."""

    @abstractmethod
    def evaluate(self, content: str, context: dict) -> FilterResult:
        ...


class ContentSecurityPipeline:
    """Runs agent output through a chain of content filters."""

    def __init__(self):
        self.filters: list[ContentFilter] = []

    def add_filter(self, f: ContentFilter) -> None:
        self.filters.append(f)

    def process(self, content: str, context: dict | None = None) -> tuple[str, list[FilterResult]]:
        """Process content through all filters.
        Returns (final_content, filter_results)."""
        ctx = context or {}
        results = []
        current_content = content

        for f in self.filters:
            result = f.evaluate(current_content, ctx)
            results.append(result)

            if result.verdict == FilterVerdict.BLOCK:
                return (
                    "I cannot provide that information.",
                    results,
                )

            if result.verdict == FilterVerdict.SANITIZE and result.sanitized_content:
                current_content = result.sanitized_content

        return current_content, results

Pattern-Based Filtering

Use regex patterns to catch common dangerous outputs like PII, credentials, and code injection attempts:

import re


class PatternFilter(ContentFilter):
    """Blocks or sanitizes content matching dangerous patterns."""

    PATTERNS = {
        "ssn": {
            "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SSN REDACTED]",
            "reason": "Social Security Number detected",
        },
        "credit_card": {
            "pattern": r"\b(?:\d{4}[- ]?){3}\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[CARD REDACTED]",
            "reason": "Credit card number detected",
        },
        "api_key": {
            "pattern": r"\b(sk-[a-zA-Z0-9]{32,}|AKIA[0-9A-Z]{16})\b",
            "action": FilterVerdict.BLOCK,
            "replacement": "",
            "reason": "API key or credential detected",
        },
        "script_injection": {
            "pattern": r"<script[^>]*>.*?</script>",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SCRIPT REMOVED]",
            "reason": "Script injection detected",
        },
    }

    def evaluate(self, content: str, context: dict) -> FilterResult:
        sanitized = content
        triggered: list[str] = []

        for name, config in self.PATTERNS.items():
            if not re.search(config["pattern"], sanitized, re.IGNORECASE | re.DOTALL):
                continue

            # Block verdicts take priority and stop evaluation immediately
            if config["action"] == FilterVerdict.BLOCK:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name=f"pattern:{name}",
                    reason=config["reason"],
                )

            # Apply every sanitizing pattern, not just the first match,
            # so content containing both an SSN and a card number is fully redacted
            sanitized = re.sub(
                config["pattern"],
                config["replacement"],
                sanitized,
                flags=re.IGNORECASE | re.DOTALL,
            )
            triggered.append(config["reason"])

        if triggered:
            return FilterResult(
                verdict=FilterVerdict.SANITIZE,
                filter_name="pattern",
                reason="; ".join(triggered),
                sanitized_content=sanitized,
            )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="pattern",
            reason="No dangerous patterns detected",
        )
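The sanitizing rules above are plain `re.sub` calls at heart, so they can be exercised directly. For example, the SSN pattern:

```python
import re

SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

raw = "Customer SSN is 123-45-6789, card on file ends in 4242."
redacted = re.sub(SSN_PATTERN, "[SSN REDACTED]", raw)
print(redacted)
# Customer SSN is [SSN REDACTED], card on file ends in 4242.
```

Note that the trailing "4242" survives: the `\b\d{3}-\d{2}-\d{4}\b` word boundaries keep the pattern from firing on partial digit runs.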

Allowlist-Based Output Control

For high-security environments, define exactly what the agent is allowed to output rather than trying to block everything dangerous:


class TopicAllowlistFilter(ContentFilter):
    """Restricts agent output to pre-approved topics."""

    def __init__(self, allowed_topics: list[str], classifier_fn=None):
        self.allowed_topics = set(allowed_topics)
        self.classifier_fn = classifier_fn or self._default_classifier

    def _default_classifier(self, content: str) -> list[str]:
        """Simple keyword-based topic classification."""
        topic_keywords = {
            "product_info": ["product", "feature", "pricing", "plan"],
            "support": ["help", "issue", "error", "troubleshoot"],
            "billing": ["invoice", "payment", "subscription", "charge"],
        }
        detected = []
        content_lower = content.lower()
        for topic, keywords in topic_keywords.items():
            if any(kw in content_lower for kw in keywords):
                detected.append(topic)
        return detected if detected else ["unknown"]

    def evaluate(self, content: str, context: dict) -> FilterResult:
        detected_topics = self.classifier_fn(content)

        for topic in detected_topics:
            if topic not in self.allowed_topics:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name="topic_allowlist",
                    reason=f"Topic '{topic}' not in allowlist",
                )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="topic_allowlist",
            reason="All topics within allowed set",
        )

Structured Output Validation

Enforce output schemas that make it structurally impossible for the agent to produce certain types of content:

import re

from pydantic import BaseModel, field_validator


class SafeAgentResponse(BaseModel):
    """Validated agent response that prevents dangerous outputs."""
    message: str
    sources: list[str]
    confidence: float

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        # Reject responses containing HTML tags
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")

        # Reject responses exceeding length limit
        if len(v) > 5000:
            raise ValueError("Response exceeds maximum length")

        return v

    @field_validator("confidence")
    @classmethod
    def validate_confidence(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v


# Usage in pipeline
pipeline = ContentSecurityPipeline()
pipeline.add_filter(PatternFilter())
pipeline.add_filter(TopicAllowlistFilter(
    allowed_topics=["product_info", "support", "billing"]
))

raw_output = "Your API key is sk-abcdef1234567890abcdef1234567890. Your next bill is $49."
safe_output, results = pipeline.process(raw_output)
# safe_output is the generic refusal message: the API key pattern blocks delivery
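To see how these validators surface failures at runtime, a trimmed restatement of the model (so the snippet runs standalone) can be exercised like this:

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class SafeMessage(BaseModel):
    """Trimmed restatement of SafeAgentResponse's message validation."""
    message: str

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")
        return v


# Valid content passes through unchanged
ok = SafeMessage(message="Your next bill is $49.")

# HTML-bearing content is rejected with a ValidationError
rejected = False
try:
    SafeMessage(message="<script>alert(1)</script>")
except ValidationError:
    rejected = True
```

Because validation happens at model construction, a response that fails the schema never becomes a `SafeMessage` object at all, which is what makes certain outputs structurally impossible rather than merely filtered.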

FAQ

How do I handle false positives in pattern-based filtering?

Track your false positive rate by logging all filter verdicts and reviewing blocked responses. Tune your patterns to be more specific — for example, use a Luhn check for credit card numbers rather than just matching digit patterns. Implement a review queue where blocked responses can be manually approved, and feed those approvals back into pattern refinement.
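The Luhn check mentioned above is short enough to sketch inline; `luhn_valid` is an illustrative helper, not part of the pipeline code:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4242 4242 4242 4242"))  # True: a well-known test card number
print(luhn_valid("1234 5678 9012 3456"))  # False: digits alone are not enough
```

Gating the `credit_card` pattern on this check turns most digit-sequence false positives (order numbers, timestamps) into passes while still redacting real card numbers.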

Should I filter tool call outputs or only final responses?

Filter both. Tool call outputs can contain injected content that influences the agent's subsequent reasoning. Final responses are what users see. Apply the full security pipeline to tool outputs as they are ingested, and apply it again to the agent's final response before delivery.

How does output filtering interact with streaming responses?

Streaming complicates content security because you cannot analyze the full response before sending tokens to the user. Buffer a configurable amount of text (for example, sentence boundaries) and run filters on each buffer before flushing to the client. For pattern-based filters, maintain state across buffers to detect patterns that span chunk boundaries.
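A minimal sketch of that buffering strategy, with illustrative names (`stream_filtered`, `redact_ssn`): it flushes at sentence boundaries, and because the buffer accumulates tokens, it catches an SSN even when the number is split across two chunks:

```python
import re


def stream_filtered(tokens, filter_fn):
    """Buffer a token stream and yield filtered text at sentence boundaries."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush each complete sentence (punctuation followed by whitespace)
        while True:
            m = re.search(r"[.!?]\s", buffer)
            if not m:
                break
            sentence, buffer = buffer[: m.end()], buffer[m.end():]
            yield filter_fn(sentence)
    if buffer:  # flush whatever remains when the stream ends
        yield filter_fn(buffer)


def redact_ssn(text: str) -> str:
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN REDACTED]", text)


# The SSN spans two tokens, but the buffer reassembles it before filtering
tokens = ["My SSN is 123-", "45-6789. ", "Your bill", " is $49."]
chunks = list(stream_filtered(tokens, redact_ssn))
```

The trade-off is latency: nothing reaches the client until a sentence completes, so shorter flush units stream faster but give each filter less context to work with.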


#ContentSecurity #OutputFiltering #AISafety #ContentModeration #AgentGuardrails #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

