
Constitutional AI Prompting: Building Self-Governing Language Model Behavior

Learn how Constitutional AI prompting uses explicit principles and critique-revision loops to make LLMs self-correct harmful or low-quality outputs without human feedback.

From External Guardrails to Internal Principles

Traditional content moderation works by filtering model outputs after generation — a classifier checks the response and blocks it if it violates a rule. This is reactive and brittle. The model does not understand why a response is problematic, so it cannot improve on its own.

Constitutional AI (CAI), introduced by Anthropic, takes a different approach. Instead of relying on external filters, you give the model a set of principles — a "constitution" — and have it critique and revise its own outputs against those principles. Because the critique explains why a response falls short, each revision can target the specific problem rather than starting over.

As a prompt engineering technique, CAI does not require fine-tuning. You can implement critique-revision loops purely through prompting, using any capable LLM.

Defining a Constitution

A constitution is a set of explicit principles that guide model behavior. Each principle should be specific enough to evaluate against but general enough to apply across situations:

CONSTITUTION = [
    {
        "name": "Helpfulness",
        "principle": (
            "The response should directly address the user's question "
            "with accurate, actionable information. Avoid vague or "
            "evasive answers."
        ),
    },
    {
        "name": "Honesty",
        "principle": (
            "The response should not present speculation as fact. "
            "When uncertain, the response should explicitly state the "
            "level of confidence. Claims should be verifiable."
        ),
    },
    {
        "name": "Harmlessness",
        "principle": (
            "The response should not provide instructions that could "
            "cause physical, financial, or emotional harm. When a "
            "request has harmful potential, the response should "
            "address the legitimate need while refusing the harmful aspect."
        ),
    },
    {
        "name": "Fairness",
        "principle": (
            "The response should not reinforce stereotypes or make "
            "assumptions based on demographics. When discussing groups "
            "of people, use balanced and evidence-based language."
        ),
    },
]
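
The same principles can also be embedded directly in the generation system prompt, so the model sees them before it drafts anything. A minimal formatting helper (hypothetical, not part of any library) might look like:

```python
def format_constitution(principles: list[dict]) -> str:
    """Render constitutional principles as a numbered block for a system prompt."""
    lines = ["Follow these principles in every response:"]
    for i, p in enumerate(principles, start=1):
        lines.append(f"{i}. {p['name']}: {p['principle']}")
    return "\n".join(lines)
```

Passing this string as a system message reduces how often the critique loop has to fire, since the model is steered toward the constitution from the start.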

The Critique-Revision Loop

The core CAI pattern is a two-step loop: critique the current response against each principle, then revise to address the critique:


import openai

client = openai.OpenAI()

def critique_response(
    question: str,
    response: str,
    principles: list[dict],
) -> list[dict]:
    """Critique a response against constitutional principles."""
    critiques = []
    for principle in principles:
        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a constitutional reviewer. Evaluate the "
                    "response against the given principle. Identify "
                    "specific violations, if any. Be concise and precise."
                )},
                {"role": "user", "content": (
                    f"Principle ({principle['name']}): "
                    f"{principle['principle']}\n\n"
                    f"User question: {question}\n\n"
                    f"Response to evaluate: {response}\n\n"
                    "Does this response violate the principle? If yes, "
                    "explain specifically how. If no, say 'No violation.'"
                )},
            ],
            temperature=0,
        )
        critique = result.choices[0].message.content
        critiques.append({
            "principle": principle["name"],
            "critique": critique,
            # Simple heuristic: treat anything other than an explicit
            # "no violation" verdict as a violation.
            "has_violation": "no violation" not in critique.lower(),
        })
    return critiques
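
The substring check above breaks if the reviewer phrases a clean verdict differently ("No violations found.", "The response does not violate the principle."). A slightly more tolerant parser is sketched below; the accepted phrasings are assumptions you should tune to match your reviewer prompt:

```python
import re

def parse_violation(critique: str) -> bool:
    """Return True if the critique text appears to report a violation."""
    verdict = critique.strip().lower()
    # Only inspect the first line, where the reviewer states its verdict.
    first_line = verdict.splitlines()[0] if verdict else ""
    clean_patterns = [
        r"^no violations?\b",
        r"^none\b",
        r"\bdoes not violate\b",
    ]
    return not any(re.search(p, first_line) for p in clean_patterns)
```

A more robust option still is to ask the reviewer for structured output (e.g. a JSON verdict field) and parse that instead of free text.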

def revise_response(
    question: str,
    response: str,
    critiques: list[dict],
) -> str:
    """Revise the response to address constitutional critiques."""
    violations = [c for c in critiques if c["has_violation"]]
    if not violations:
        return response

    critique_text = "\n".join(
        f"- {v['principle']}: {v['critique']}" for v in violations
    )

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Revise the response to address all constitutional "
                "critiques while maintaining helpfulness. Keep the "
                "useful content and fix only the identified issues."
            )},
            {"role": "user", "content": (
                f"Original question: {question}\n\n"
                f"Current response: {response}\n\n"
                f"Critiques to address:\n{critique_text}\n\n"
                "Provide the revised response:"
            )},
        ],
        temperature=0,
    )
    return result.choices[0].message.content

Running the Full Constitutional Loop

Putting it together into an iterative refinement pipeline:

def constitutional_generate(
    question: str,
    max_revisions: int = 3,
) -> dict:
    """Generate a response with constitutional self-governance."""
    # Initial generation
    initial = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    response = initial.choices[0].message.content
    history = [{"version": 0, "response": response, "critiques": []}]

    for i in range(max_revisions):
        critiques = critique_response(question, response, CONSTITUTION)
        if not any(c["has_violation"] for c in critiques):
            break

        response = revise_response(question, response, critiques)
        history.append({
            "version": i + 1,
            "response": response,
            "critiques": critiques,
        })

    return {
        "final_response": response,
        # Number of revision passes actually performed.
        "revision_count": len(history) - 1,
        "history": history,
    }

Red-Team Prompting with CAI

CAI principles are especially powerful for red-team testing. You can proactively test your system by generating adversarial prompts and checking whether the constitutional loop catches them:

def red_team_test(
    adversarial_queries: list[str],
    max_revisions: int = 3,
) -> list[dict]:
    """Test the constitutional loop against adversarial inputs."""
    results = []
    for query in adversarial_queries:
        result = constitutional_generate(query, max_revisions=max_revisions)
        results.append({
            "query": query,
            "revision_count": result["revision_count"],
            # "Passed" means the loop converged before exhausting
            # its revision budget.
            "passed": result["revision_count"] < max_revisions,
            "final_response": result["final_response"][:200],
        })
    return results

This gives you a systematic way to validate that your constitution catches the failure modes you care about before deploying to production.
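
To turn raw red-team results into a go/no-go signal, a small aggregation helper (hypothetical, assuming the result dicts produced above) can compute the pass rate and surface the failing queries:

```python
def summarize_red_team(results: list[dict]) -> dict:
    """Aggregate red-team results into a pass rate plus failing queries."""
    total = len(results)
    failures = [r["query"] for r in results if not r["passed"]]
    return {
        "total": total,
        "pass_rate": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
    }
```

Tracking the pass rate across constitution revisions gives you a simple regression metric: if a principle change drops the rate, the change weakened your coverage.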

FAQ

How many principles should a constitution have?

Start with 3 to 5 core principles. More principles mean more critique calls per response, increasing latency and cost. Prioritize the principles that address your highest-risk failure modes. You can always expand the constitution as you discover new failure patterns in production.
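
That cost scales predictably: each critique pass makes one API call per principle, and each revision adds one more. A quick worst-case estimate, assuming the per-principle critique loop shown earlier:

```python
def worst_case_calls(num_principles: int, max_revisions: int) -> int:
    """Upper bound on API calls for one constitutional generation run.

    1 initial generation, plus per iteration: num_principles critique
    calls and 1 revision call.
    """
    return 1 + max_revisions * (num_principles + 1)

# e.g. 4 principles with up to 3 revisions: 1 + 3 * (4 + 1) = 16 calls
```

In practice most responses converge after zero or one revision, so average cost sits well below this bound; budget for the worst case anyway.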

Does the critique-revision loop guarantee safe outputs?

No. Constitutional AI significantly reduces harmful outputs, but it is not a guarantee. The model might fail to identify subtle violations during critique, or the revision might introduce new issues. CAI works best as one layer in a defense-in-depth strategy that includes output filtering, monitoring, and human review for high-stakes applications.

Can I use CAI with smaller open-source models?

The technique requires a model capable enough to meaningfully critique its own outputs. Models under 13B parameters often struggle with nuanced critique. A practical alternative is to use a larger model for the critique step and a smaller model for generation, keeping inference costs manageable.
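
One way to wire that split is a small role-to-model map, so the generation and critique helpers each look up the model assigned to their role. The model names below are placeholders; substitute whatever you actually deploy:

```python
MODEL_ROLES = {
    "generate": "llama-3.1-8b-instruct",  # cheap, fast drafting
    "critique": "gpt-4o",                 # stronger judge for nuanced review
    "revise": "gpt-4o",
}

def model_for(role: str) -> str:
    """Look up the model assigned to a pipeline role."""
    try:
        return MODEL_ROLES[role]
    except KeyError:
        raise ValueError(f"Unknown pipeline role: {role!r}")
```

The helpers above would then call `model_for("critique")` instead of hardcoding a model name, which keeps the cost/quality trade-off in one place.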


#PromptEngineering #ConstitutionalAI #Safety #Alignment #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

