
Reflection Agents: Building AI Systems That Critique and Improve Their Own Output

Learn how to build reflection agents that evaluate their own outputs, assign quality scores, and iteratively refine results through multi-round self-improvement loops using the Reflexion pattern.

What Are Reflection Agents?

A reflection agent is an AI system that generates an initial output, then turns around and critiques that output against explicit quality criteria. Based on the critique, it produces an improved version. This loop repeats until the output meets a quality threshold or a maximum number of iterations is reached.

The concept draws from the Reflexion paper (Shinn et al., 2023), which demonstrated that LLM agents equipped with verbal self-reflection significantly outperform single-pass agents on coding, reasoning, and decision-making benchmarks.

The Core Loop: Generate, Evaluate, Refine

Every reflection agent follows the same three-step cycle:

  1. Generate — produce an initial response to the task
  2. Evaluate — score the response against defined criteria
  3. Refine — use the evaluation feedback to produce a better version

Here is a minimal implementation:

from openai import OpenAI

client = OpenAI()

def reflect_and_improve(task: str, max_rounds: int = 3, threshold: float = 8.0):
    """Generate, evaluate, and refine output over multiple rounds."""

    # Round 1: initial generation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
    )
    current_output = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Evaluate the current output
        eval_prompt = f"""Rate the following output on a scale of 1-10 for:
- Accuracy (factual correctness)
- Completeness (covers all aspects)
- Clarity (easy to understand)

Output to evaluate:
{current_output}

Original task: {task}

Return JSON: {{"score": <average>, "weaknesses": ["..."], "suggestions": ["..."]}}"""

        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": eval_prompt}],
            response_format={"type": "json_object"},
        )
        import json  # stdlib; in a real module this import belongs at the top of the file
        eval_data = json.loads(eval_response.choices[0].message.content)

        print(f"Round {round_num + 1}: Score = {eval_data['score']}")

        if float(eval_data["score"]) >= threshold:
            print("Quality threshold met.")
            return current_output

        # Refine based on feedback
        refine_prompt = f"""Improve this output based on the feedback below.

Original task: {task}
Current output: {current_output}
Weaknesses: {eval_data['weaknesses']}
Suggestions: {eval_data['suggestions']}

Write an improved version that addresses every weakness."""

        refined = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": refine_prompt}],
        )
        current_output = refined.choices[0].message.content

    return current_output

Separating the Critic from the Generator

A stronger pattern uses two separate system prompts — one optimized for generation, another for harsh evaluation. This prevents the self-serving bias where a model rates its own output too generously.

GENERATOR_SYSTEM = """You are an expert technical writer.
Produce detailed, accurate, well-structured content."""

CRITIC_SYSTEM = """You are a demanding editor and fact-checker.
Find every flaw, gap, and inaccuracy. Be harsh but constructive.
Never rate above 7 unless the output is genuinely excellent."""

By giving the critic a deliberately strict persona, you force the generator to actually earn a passing score rather than coasting through on inflated self-assessments.
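The wiring between the two personas can be sketched with the LLM calls abstracted behind callables, which keeps the control flow visible and testable. Note that `critique_loop`, `generate_fn`, and `critique_fn` are illustrative names for this sketch, not an established API:

```python
def critique_loop(task, generate_fn, critique_fn, max_rounds=3, threshold=8.0):
    """Alternate generator and critic until the critic's score meets the threshold.

    generate_fn(task, feedback) -> draft text; feedback is None on the first pass.
    critique_fn(task, draft) -> dict like {"score": 6.5, "notes": [...]}.
    """
    draft = generate_fn(task, feedback=None)
    for _ in range(max_rounds):
        review = critique_fn(task, draft)
        if review["score"] >= threshold:
            break  # the critic is satisfied; stop refining
        draft = generate_fn(task, feedback=review)
    return draft
```

In practice, `generate_fn` would call the LLM with GENERATOR_SYSTEM and `critique_fn` with CRITIC_SYSTEM; separating the loop from the API calls also lets you unit-test the control flow with stub functions.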


Multi-Dimensional Scoring

Production reflection agents score across multiple dimensions rather than using a single number. A scoring rubric might look like this:

RUBRIC = {
    "accuracy": "Are all facts, numbers, and claims correct?",
    "completeness": "Does the output address every part of the task?",
    "clarity": "Is the writing clear and free of jargon?",
    "actionability": "Can the reader act on this immediately?",
    "structure": "Is the output well-organized with logical flow?",
}

def multi_dim_evaluate(output: str, task: str) -> dict:
    """Score the output on each rubric dimension (reuses `client` from above)."""
    import json  # stdlib
    prompt = "Score this output 1-10 on each dimension:\n"
    for dim, question in RUBRIC.items():
        prompt += f"- {dim}: {question}\n"
    prompt += f"\nOutput: {output}\nTask: {task}"
    prompt += "\nReturn JSON with each dimension's score and an overall average."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
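Once you have per-dimension scores, a weighted average lets you emphasize the dimensions that matter most for your use case instead of treating them all equally. The weights below are purely illustrative:

```python
# Illustrative weights -- tune these to your own quality priorities.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "actionability": 0.15,
    "structure": 0.10,
}

def weighted_overall(scores: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-dimension scores (1-10) into a single weighted number."""
    return sum(scores[dim] * w for dim, w in weights.items())
```

Compare the result against your threshold exactly as with the single-number score; just make sure the weights sum to 1.0 so the combined score stays on the 1-10 scale.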

When Reflection Helps (and When It Hurts)

Reflection adds latency and cost — each round means additional LLM calls. Use it when:

  • High stakes: the output will be shown to customers or used in decisions
  • Complex tasks: multi-step reasoning where errors compound
  • Code generation: where the critic can actually run tests to verify correctness

Skip it when the task is simple, latency-sensitive, or when the first-pass quality is already high enough for your use case.

FAQ

How many reflection rounds are typically needed?

Most tasks converge after 2-3 rounds. Research shows diminishing returns beyond 3 rounds for text generation tasks, though coding tasks can benefit from up to 5 rounds when combined with test execution.

Should the generator and critic use the same model?

Not necessarily. A common pattern uses a stronger model (GPT-4o) as the critic and a faster model (GPT-4o-mini) as the generator. This keeps costs down while maintaining evaluation quality. Some teams even use a smaller fine-tuned model for the critic.
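One way to wire that up is a small role table, so each call picks its model and system prompt from a single place. The table and helper below are a sketch (`ROLES` and `build_messages` are hypothetical names, and the prompts are abbreviated versions of the ones defined earlier):

```python
# System prompts from the earlier section, abbreviated here for self-containment.
GENERATOR_SYSTEM = "You are an expert technical writer."
CRITIC_SYSTEM = "You are a demanding editor and fact-checker."

# Cheaper model drafts, stronger model judges.
ROLES = {
    "generator": {"model": "gpt-4o-mini", "system": GENERATOR_SYSTEM},
    "critic": {"model": "gpt-4o", "system": CRITIC_SYSTEM},
}

def build_messages(role: str, user_content: str) -> list:
    """Assemble the chat messages for a given role."""
    return [
        {"role": "system", "content": ROLES[role]["system"]},
        {"role": "user", "content": user_content},
    ]

# Each call then picks its model and messages from the same table:
#   client.chat.completions.create(model=ROLES["critic"]["model"],
#                                  messages=build_messages("critic", draft))
```

Centralizing the role configuration makes it trivial to swap models later, e.g. to A/B test whether the cheaper generator actually hurts final quality.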

How do you prevent infinite loops where the critic is never satisfied?

Always set a max_rounds limit. Additionally, track scores across rounds — if the score plateaus or decreases for two consecutive rounds, break early. The agent should recognize when further refinement is not productive.
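That plateau check can be a small pure function over the history of scores. The name and defaults below are illustrative:

```python
def should_stop(scores: list, patience: int = 2, epsilon: float = 0.1) -> bool:
    """Return True when the last `patience` rounds failed to beat the earlier best."""
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-patience])
    # Stop if none of the recent rounds improved on the earlier best by more than epsilon.
    return all(s <= best_before + epsilon for s in scores[-patience:])
```

Call it after each evaluation round; together with max_rounds it bounds both cost and wasted refinement.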




Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
