
Constitutional AI Prompting: Building Self-Governing Language Model Behavior

Learn how Constitutional AI prompting uses explicit principles and critique-revision loops to make LLMs self-correct harmful or low-quality outputs without human feedback.

From External Guardrails to Internal Principles

Traditional content moderation works by filtering model outputs after generation — a classifier checks the response and blocks it if it violates a rule. This is reactive and brittle. The model does not understand why a response is problematic, so it cannot improve on its own.

Constitutional AI (CAI), introduced by Anthropic, takes a different approach. Instead of relying on external filters, you give the model a set of principles — a "constitution" — and have it critique and revise its own outputs against those principles. Because the critique explains why a response falls short, each revision can target the specific problem rather than starting over.

As a prompt engineering technique, CAI does not require fine-tuning. You can implement critique-revision loops purely through prompting, using any capable LLM.

Defining a Constitution

A constitution is a set of explicit principles that guide model behavior. Each principle should be specific enough to evaluate against but general enough to apply across situations:

CONSTITUTION = [
    {
        "name": "Helpfulness",
        "principle": (
            "The response should directly address the user's question "
            "with accurate, actionable information. Avoid vague or "
            "evasive answers."
        ),
    },
    {
        "name": "Honesty",
        "principle": (
            "The response should not present speculation as fact. "
            "When uncertain, the response should explicitly state the "
            "level of confidence. Claims should be verifiable."
        ),
    },
    {
        "name": "Harmlessness",
        "principle": (
            "The response should not provide instructions that could "
            "cause physical, financial, or emotional harm. When a "
            "request has harmful potential, the response should "
            "address the legitimate need while refusing the harmful aspect."
        ),
    },
    {
        "name": "Fairness",
        "principle": (
            "The response should not reinforce stereotypes or make "
            "assumptions based on demographics. When discussing groups "
            "of people, use balanced and evidence-based language."
        ),
    },
]
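
The same principles can also be embedded directly in the generation system prompt, so the model sees them before it drafts anything. A minimal formatting helper (hypothetical, not part of any library) might look like:

```python
def format_constitution(principles: list[dict]) -> str:
    """Render constitutional principles as a numbered block for a system prompt."""
    lines = ["Follow these principles in every response:"]
    for i, p in enumerate(principles, start=1):
        lines.append(f"{i}. {p['name']}: {p['principle']}")
    return "\n".join(lines)
```

Passing this string as a system message reduces how often the critique loop has to fire, since the model is steered toward the constitution from the start.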

The Critique-Revision Loop

The core CAI pattern is a two-step loop: critique the current response against each principle, then revise to address the critique:


import openai

client = openai.OpenAI()

def critique_response(
    question: str,
    response: str,
    principles: list[dict],
) -> list[dict]:
    """Critique a response against constitutional principles."""
    critiques = []
    for principle in principles:
        result = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a constitutional reviewer. Evaluate the "
                    "response against the given principle. Identify "
                    "specific violations, if any. Be concise and precise."
                )},
                {"role": "user", "content": (
                    f"Principle ({principle['name']}): "
                    f"{principle['principle']}\n\n"
                    f"User question: {question}\n\n"
                    f"Response to evaluate: {response}\n\n"
                    "Does this response violate the principle? If yes, "
                    "explain specifically how. If no, say 'No violation.'"
                )},
            ],
            temperature=0,
        )
        critique = result.choices[0].message.content
        critiques.append({
            "principle": principle["name"],
            "critique": critique,
            # Simple heuristic: treat anything other than an explicit
            # "no violation" verdict as a violation.
            "has_violation": "no violation" not in critique.lower(),
        })
    return critiques
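
The substring check above breaks if the reviewer phrases a clean verdict differently ("No violations found.", "The response does not violate the principle."). A slightly more tolerant parser is sketched below; the accepted phrasings are assumptions you should tune to match your reviewer prompt:

```python
import re

def parse_violation(critique: str) -> bool:
    """Return True if the critique text appears to report a violation."""
    verdict = critique.strip().lower()
    # Only inspect the first line, where the reviewer states its verdict.
    first_line = verdict.splitlines()[0] if verdict else ""
    clean_patterns = [
        r"^no violations?\b",
        r"^none\b",
        r"\bdoes not violate\b",
    ]
    return not any(re.search(p, first_line) for p in clean_patterns)
```

A more robust option still is to ask the reviewer for structured output (e.g. a JSON verdict field) and parse that instead of free text.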

def revise_response(
    question: str,
    response: str,
    critiques: list[dict],
) -> str:
    """Revise the response to address constitutional critiques."""
    violations = [c for c in critiques if c["has_violation"]]
    if not violations:
        return response

    critique_text = "\n".join(
        f"- {v['principle']}: {v['critique']}" for v in violations
    )

    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Revise the response to address all constitutional "
                "critiques while maintaining helpfulness. Keep the "
                "useful content and fix only the identified issues."
            )},
            {"role": "user", "content": (
                f"Original question: {question}\n\n"
                f"Current response: {response}\n\n"
                f"Critiques to address:\n{critique_text}\n\n"
                "Provide the revised response:"
            )},
        ],
        temperature=0,
    )
    return result.choices[0].message.content

Running the Full Constitutional Loop

Putting it together into an iterative refinement pipeline:

def constitutional_generate(
    question: str,
    max_revisions: int = 3,
) -> dict:
    """Generate a response with constitutional self-governance."""
    # Initial generation
    initial = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    response = initial.choices[0].message.content
    history = [{"version": 0, "response": response, "critiques": []}]

    for i in range(max_revisions):
        critiques = critique_response(question, response, CONSTITUTION)
        if not any(c["has_violation"] for c in critiques):
            break

        response = revise_response(question, response, critiques)
        history.append({
            "version": i + 1,
            "response": response,
            "critiques": critiques,
        })

    return {
        "final_response": response,
        # Number of revision passes actually performed.
        "revision_count": len(history) - 1,
        "history": history,
    }

Red-Team Prompting with CAI

CAI principles are especially powerful for red-team testing. You can proactively test your system by generating adversarial prompts and checking whether the constitutional loop catches them:

def red_team_test(
    adversarial_queries: list[str],
    max_revisions: int = 3,
) -> list[dict]:
    """Test the constitutional loop against adversarial inputs."""
    results = []
    for query in adversarial_queries:
        result = constitutional_generate(query, max_revisions=max_revisions)
        results.append({
            "query": query,
            "revision_count": result["revision_count"],
            # "Passed" means the loop converged before exhausting
            # its revision budget.
            "passed": result["revision_count"] < max_revisions,
            "final_response": result["final_response"][:200],
        })
    return results

This gives you a systematic way to validate that your constitution catches the failure modes you care about before deploying to production.
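
To turn raw red-team results into a go/no-go signal, a small aggregation helper (hypothetical, assuming the result dicts produced above) can compute the pass rate and surface the failing queries:

```python
def summarize_red_team(results: list[dict]) -> dict:
    """Aggregate red-team results into a pass rate plus failing queries."""
    total = len(results)
    failures = [r["query"] for r in results if not r["passed"]]
    return {
        "total": total,
        "pass_rate": (total - len(failures)) / total if total else 0.0,
        "failures": failures,
    }
```

Tracking the pass rate across constitution revisions gives you a simple regression metric: if a principle change drops the rate, the change weakened your coverage.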

FAQ

How many principles should a constitution have?

Start with 3 to 5 core principles. More principles mean more critique calls per response, increasing latency and cost. Prioritize the principles that address your highest-risk failure modes. You can always expand the constitution as you discover new failure patterns in production.
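
That cost scales predictably: each critique pass makes one API call per principle, and each revision adds one more. A quick worst-case estimate, assuming the per-principle critique loop shown earlier:

```python
def worst_case_calls(num_principles: int, max_revisions: int) -> int:
    """Upper bound on API calls for one constitutional generation run.

    1 initial generation, plus per iteration: num_principles critique
    calls and 1 revision call.
    """
    return 1 + max_revisions * (num_principles + 1)

# e.g. 4 principles with up to 3 revisions: 1 + 3 * (4 + 1) = 16 calls
```

In practice most responses converge after zero or one revision, so average cost sits well below this bound; budget for the worst case anyway.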

Does the critique-revision loop guarantee safe outputs?

No. Constitutional AI significantly reduces harmful outputs, but it is not a guarantee. The model might fail to identify subtle violations during critique, or the revision might introduce new issues. CAI works best as one layer in a defense-in-depth strategy that includes output filtering, monitoring, and human review for high-stakes applications.

Can I use CAI with smaller open-source models?

The technique requires a model capable enough to meaningfully critique its own outputs. Models under 13B parameters often struggle with nuanced critique. A practical alternative is to use a larger model for the critique step and a smaller model for generation, keeping inference costs manageable.
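
One way to wire that split is a small role-to-model map, so the generation and critique helpers each look up the model assigned to their role. The model names below are placeholders; substitute whatever you actually deploy:

```python
MODEL_ROLES = {
    "generate": "llama-3.1-8b-instruct",  # cheap, fast drafting
    "critique": "gpt-4o",                 # stronger judge for nuanced review
    "revise": "gpt-4o",
}

def model_for(role: str) -> str:
    """Look up the model assigned to a pipeline role."""
    try:
        return MODEL_ROLES[role]
    except KeyError:
        raise ValueError(f"Unknown pipeline role: {role!r}")
```

The helpers above would then call `model_for("critique")` instead of hardcoding a model name, which keeps the cost/quality trade-off in one place.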


#PromptEngineering #ConstitutionalAI #Safety #Alignment #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

