
Content Security Policies for AI Agents: Preventing Malicious Output Generation

Build robust output filtering systems for AI agents using allowlists, blocklists, regex patterns, ML classifiers, and structured output validation to prevent harmful, toxic, or policy-violating content from reaching end users.

Why Output Filtering Is Non-Negotiable

An AI agent can generate any text the underlying LLM is capable of producing. Without output filtering, agents can leak private data, generate harmful instructions, produce policy-violating content, or output executable code that acts as a cross-site scripting payload when rendered in a browser.

Content security for AI agents operates on a different model than traditional web content security policies. Instead of restricting which resources a browser can load, agent content security restricts what the agent can say. The enforcement point sits between the LLM's raw output and the delivery layer that sends responses to users.

Layered Filtering Architecture

Build your content security as a pipeline of filters that each response must pass through. If any filter rejects the response, it is blocked or sanitized before delivery:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class FilterVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    SANITIZE = "sanitize"


@dataclass
class FilterResult:
    verdict: FilterVerdict
    filter_name: str
    reason: str
    sanitized_content: str | None = None


class ContentFilter(ABC):
    """Base class for content security filters."""

    @abstractmethod
    def evaluate(self, content: str, context: dict) -> FilterResult:
        ...


class ContentSecurityPipeline:
    """Runs agent output through a chain of content filters."""

    def __init__(self):
        self.filters: list[ContentFilter] = []

    def add_filter(self, f: ContentFilter) -> None:
        self.filters.append(f)

    def process(self, content: str, context: dict | None = None) -> tuple[str, list[FilterResult]]:
        """Process content through all filters.
        Returns (final_content, filter_results)."""
        ctx = context or {}
        results = []
        current_content = content

        for f in self.filters:
            result = f.evaluate(current_content, ctx)
            results.append(result)

            if result.verdict == FilterVerdict.BLOCK:
                return (
                    "I cannot provide that information.",
                    results,
                )

            if result.verdict == FilterVerdict.SANITIZE and result.sanitized_content:
                current_content = result.sanitized_content

        return current_content, results

Pattern-Based Filtering

Use regex patterns to catch common dangerous outputs like PII, credentials, and code injection attempts:

import re


class PatternFilter(ContentFilter):
    """Blocks or sanitizes content matching dangerous patterns."""

    PATTERNS = {
        "ssn": {
            "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SSN REDACTED]",
            "reason": "Social Security Number detected",
        },
        "credit_card": {
            "pattern": r"\b(?:\d{4}[- ]?){3}\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[CARD REDACTED]",
            "reason": "Credit card number detected",
        },
        "api_key": {
            "pattern": r"\b(sk-[a-zA-Z0-9]{32,}|AKIA[0-9A-Z]{16})\b",
            "action": FilterVerdict.BLOCK,
            "replacement": "",
            "reason": "API key or credential detected",
        },
        "script_injection": {
            "pattern": r"<script[^>]*>.*?</script>",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SCRIPT REMOVED]",
            "reason": "Script injection detected",
        },
    }

    def evaluate(self, content: str, context: dict) -> FilterResult:
        sanitized = content
        triggered: list[str] = []

        for name, config in self.PATTERNS.items():
            if not re.search(config["pattern"], sanitized, re.IGNORECASE | re.DOTALL):
                continue

            # Block verdicts take priority and stop evaluation immediately
            if config["action"] == FilterVerdict.BLOCK:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name=f"pattern:{name}",
                    reason=config["reason"],
                )

            # Apply every sanitizing pattern, not just the first match,
            # so content containing both an SSN and a card number is fully redacted
            sanitized = re.sub(
                config["pattern"],
                config["replacement"],
                sanitized,
                flags=re.IGNORECASE | re.DOTALL,
            )
            triggered.append(config["reason"])

        if triggered:
            return FilterResult(
                verdict=FilterVerdict.SANITIZE,
                filter_name="pattern",
                reason="; ".join(triggered),
                sanitized_content=sanitized,
            )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="pattern",
            reason="No dangerous patterns detected",
        )
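The sanitizing rules above are plain `re.sub` calls at heart, so they can be exercised directly. For example, the SSN pattern:

```python
import re

SSN_PATTERN = r"\b\d{3}-\d{2}-\d{4}\b"

raw = "Customer SSN is 123-45-6789, card on file ends in 4242."
redacted = re.sub(SSN_PATTERN, "[SSN REDACTED]", raw)
print(redacted)
# Customer SSN is [SSN REDACTED], card on file ends in 4242.
```

Note that the trailing "4242" survives: the `\b\d{3}-\d{2}-\d{4}\b` word boundaries keep the pattern from firing on partial digit runs.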

Allowlist-Based Output Control

For high-security environments, define exactly what the agent is allowed to output rather than trying to block everything dangerous:


class TopicAllowlistFilter(ContentFilter):
    """Restricts agent output to pre-approved topics."""

    def __init__(self, allowed_topics: list[str], classifier_fn=None):
        self.allowed_topics = set(allowed_topics)
        self.classifier_fn = classifier_fn or self._default_classifier

    def _default_classifier(self, content: str) -> list[str]:
        """Simple keyword-based topic classification."""
        topic_keywords = {
            "product_info": ["product", "feature", "pricing", "plan"],
            "support": ["help", "issue", "error", "troubleshoot"],
            "billing": ["invoice", "payment", "subscription", "charge"],
        }
        detected = []
        content_lower = content.lower()
        for topic, keywords in topic_keywords.items():
            if any(kw in content_lower for kw in keywords):
                detected.append(topic)
        return detected if detected else ["unknown"]

    def evaluate(self, content: str, context: dict) -> FilterResult:
        detected_topics = self.classifier_fn(content)

        for topic in detected_topics:
            if topic not in self.allowed_topics:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name="topic_allowlist",
                    reason=f"Topic '{topic}' not in allowlist",
                )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="topic_allowlist",
            reason="All topics within allowed set",
        )

Structured Output Validation

Enforce output schemas that make it structurally impossible for the agent to produce certain types of content:

import re

from pydantic import BaseModel, field_validator


class SafeAgentResponse(BaseModel):
    """Validated agent response that prevents dangerous outputs."""
    message: str
    sources: list[str]
    confidence: float

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        # Reject responses containing HTML tags
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")

        # Reject responses exceeding length limit
        if len(v) > 5000:
            raise ValueError("Response exceeds maximum length")

        return v

    @field_validator("confidence")
    @classmethod
    def validate_confidence(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v


# Usage in pipeline
pipeline = ContentSecurityPipeline()
pipeline.add_filter(PatternFilter())
pipeline.add_filter(TopicAllowlistFilter(
    allowed_topics=["product_info", "support", "billing"]
))

raw_output = "Your API key is sk-abcdef1234567890abcdef1234567890. Your next bill is $49."
safe_output, results = pipeline.process(raw_output)
# safe_output is the generic refusal message: the API key pattern blocks delivery
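To see how these validators surface failures at runtime, a trimmed restatement of the model (so the snippet runs standalone) can be exercised like this:

```python
import re

from pydantic import BaseModel, ValidationError, field_validator


class SafeMessage(BaseModel):
    """Trimmed restatement of SafeAgentResponse's message validation."""
    message: str

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")
        return v


# Valid content passes through unchanged
ok = SafeMessage(message="Your next bill is $49.")

# HTML-bearing content is rejected with a ValidationError
rejected = False
try:
    SafeMessage(message="<script>alert(1)</script>")
except ValidationError:
    rejected = True
```

Because validation happens at model construction, a response that fails the schema never becomes a `SafeMessage` object at all, which is what makes certain outputs structurally impossible rather than merely filtered.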

FAQ

How do I handle false positives in pattern-based filtering?

Track your false positive rate by logging all filter verdicts and reviewing blocked responses. Tune your patterns to be more specific — for example, use a Luhn check for credit card numbers rather than just matching digit patterns. Implement a review queue where blocked responses can be manually approved, and feed those approvals back into pattern refinement.
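The Luhn check mentioned above is short enough to sketch inline; `luhn_valid` is an illustrative helper, not part of the pipeline code:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4242 4242 4242 4242"))  # True: a well-known test card number
print(luhn_valid("1234 5678 9012 3456"))  # False: digits alone are not enough
```

Gating the `credit_card` pattern on this check turns most digit-sequence false positives (order numbers, timestamps) into passes while still redacting real card numbers.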

Should I filter tool call outputs or only final responses?

Filter both. Tool call outputs can contain injected content that influences the agent's subsequent reasoning. Final responses are what users see. Apply the full security pipeline to tool outputs as they are ingested, and apply it again to the agent's final response before delivery.

How does output filtering interact with streaming responses?

Streaming complicates content security because you cannot analyze the full response before sending tokens to the user. Buffer a configurable amount of text (for example, sentence boundaries) and run filters on each buffer before flushing to the client. For pattern-based filters, maintain state across buffers to detect patterns that span chunk boundaries.
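A minimal sketch of that buffering strategy, with illustrative names (`stream_filtered`, `redact_ssn`): it flushes at sentence boundaries, and because the buffer accumulates tokens, it catches an SSN even when the number is split across two chunks:

```python
import re


def stream_filtered(tokens, filter_fn):
    """Buffer a token stream and yield filtered text at sentence boundaries."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush each complete sentence (punctuation followed by whitespace)
        while True:
            m = re.search(r"[.!?]\s", buffer)
            if not m:
                break
            sentence, buffer = buffer[: m.end()], buffer[m.end():]
            yield filter_fn(sentence)
    if buffer:  # flush whatever remains when the stream ends
        yield filter_fn(buffer)


def redact_ssn(text: str) -> str:
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN REDACTED]", text)


# The SSN spans two tokens, but the buffer reassembles it before filtering
tokens = ["My SSN is 123-", "45-6789. ", "Your bill", " is $49."]
chunks = list(stream_filtered(tokens, redact_ssn))
```

The trade-off is latency: nothing reaches the client until a sentence completes, so shorter flush units stream faster but give each filter less context to work with.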


#ContentSecurity #OutputFiltering #AISafety #ContentModeration #AgentGuardrails #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

