
Reflection Agents: Building AI Systems That Critique and Improve Their Own Output

Learn how to build reflection agents that evaluate their own outputs, assign quality scores, and iteratively refine results through multi-round self-improvement loops using the Reflexion pattern.

What Are Reflection Agents?

A reflection agent is an AI system that generates an initial output, then turns around and critiques that output against explicit quality criteria. Based on the critique, it produces an improved version. This loop repeats until the output meets a quality threshold or a maximum number of iterations is reached.

The concept draws from the Reflexion paper (Shinn et al., 2023), which demonstrated that LLM agents equipped with verbal self-reflection significantly outperform single-pass agents on coding, reasoning, and decision-making benchmarks.

The Core Loop: Generate, Evaluate, Refine

Every reflection agent follows the same three-step cycle:

  1. Generate — produce an initial response to the task
  2. Evaluate — score the response against defined criteria
  3. Refine — use the evaluation feedback to produce a better version

Here is a minimal implementation:

from openai import OpenAI

client = OpenAI()

def reflect_and_improve(task: str, max_rounds: int = 3, threshold: float = 8.0):
    """Generate, evaluate, and refine output over multiple rounds."""

    # Round 1: initial generation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
    )
    current_output = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Evaluate the current output
        eval_prompt = f"""Rate the following output on a scale of 1-10 for:
- Accuracy (factual correctness)
- Completeness (covers all aspects)
- Clarity (easy to understand)

Output to evaluate:
{current_output}

Original task: {task}

Return JSON: {{"score": <average>, "weaknesses": ["..."], "suggestions": ["..."]}}"""

        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": eval_prompt}],
            response_format={"type": "json_object"},
        )
        import json  # stdlib; in a real module this import belongs at the top of the file
        eval_data = json.loads(eval_response.choices[0].message.content)

        print(f"Round {round_num + 1}: Score = {eval_data['score']}")

        if float(eval_data["score"]) >= threshold:
            print("Quality threshold met.")
            return current_output

        # Refine based on feedback
        refine_prompt = f"""Improve this output based on the feedback below.

Original task: {task}
Current output: {current_output}
Weaknesses: {eval_data['weaknesses']}
Suggestions: {eval_data['suggestions']}

Write an improved version that addresses every weakness."""

        refined = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": refine_prompt}],
        )
        current_output = refined.choices[0].message.content

    return current_output

Separating the Critic from the Generator

A stronger pattern uses two separate system prompts — one optimized for generation, another for harsh evaluation. This prevents the self-serving bias where a model rates its own output too generously.

GENERATOR_SYSTEM = """You are an expert technical writer.
Produce detailed, accurate, well-structured content."""

CRITIC_SYSTEM = """You are a demanding editor and fact-checker.
Find every flaw, gap, and inaccuracy. Be harsh but constructive.
Never rate above 7 unless the output is genuinely excellent."""

By giving the critic a deliberately strict persona, you force the generator to actually earn a passing score rather than coasting through on inflated self-assessments.
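The wiring between the two personas can be sketched with the LLM calls abstracted behind callables, which keeps the control flow visible and testable. Note that `critique_loop`, `generate_fn`, and `critique_fn` are illustrative names for this sketch, not an established API:

```python
def critique_loop(task, generate_fn, critique_fn, max_rounds=3, threshold=8.0):
    """Alternate generator and critic until the critic's score meets the threshold.

    generate_fn(task, feedback) -> draft text; feedback is None on the first pass.
    critique_fn(task, draft) -> dict like {"score": 6.5, "notes": [...]}.
    """
    draft = generate_fn(task, feedback=None)
    for _ in range(max_rounds):
        review = critique_fn(task, draft)
        if review["score"] >= threshold:
            break  # the critic is satisfied; stop refining
        draft = generate_fn(task, feedback=review)
    return draft
```

In practice, `generate_fn` would call the LLM with GENERATOR_SYSTEM and `critique_fn` with CRITIC_SYSTEM; separating the loop from the API calls also lets you unit-test the control flow with stub functions.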


Multi-Dimensional Scoring

Production reflection agents score across multiple dimensions rather than using a single number. A scoring rubric might look like this:

RUBRIC = {
    "accuracy": "Are all facts, numbers, and claims correct?",
    "completeness": "Does the output address every part of the task?",
    "clarity": "Is the writing clear and free of jargon?",
    "actionability": "Can the reader act on this immediately?",
    "structure": "Is the output well-organized with logical flow?",
}

def multi_dim_evaluate(output: str, task: str) -> dict:
    """Score the output on each rubric dimension (reuses `client` from above)."""
    import json  # stdlib
    prompt = "Score this output 1-10 on each dimension:\n"
    for dim, question in RUBRIC.items():
        prompt += f"- {dim}: {question}\n"
    prompt += f"\nOutput: {output}\nTask: {task}"
    prompt += "\nReturn JSON with each dimension's score and an overall average."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
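Once you have per-dimension scores, a weighted average lets you emphasize the dimensions that matter most for your use case instead of treating them all equally. The weights below are purely illustrative:

```python
# Illustrative weights -- tune these to your own quality priorities.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "actionability": 0.15,
    "structure": 0.10,
}

def weighted_overall(scores: dict, weights: dict = WEIGHTS) -> float:
    """Combine per-dimension scores (1-10) into a single weighted number."""
    return sum(scores[dim] * w for dim, w in weights.items())
```

Compare the result against your threshold exactly as with the single-number score; just make sure the weights sum to 1.0 so the combined score stays on the 1-10 scale.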

When Reflection Helps (and When It Hurts)

Reflection adds latency and cost — each round means additional LLM calls. Use it when:

  • High stakes: the output will be shown to customers or used in decisions
  • Complex tasks: multi-step reasoning where errors compound
  • Code generation: where the critic can actually run tests to verify correctness

Skip it when the task is simple, latency-sensitive, or when the first-pass quality is already high enough for your use case.

FAQ

How many reflection rounds are typically needed?

Most tasks converge after 2-3 rounds. Research shows diminishing returns beyond 3 rounds for text generation tasks, though coding tasks can benefit from up to 5 rounds when combined with test execution.

Should the generator and critic use the same model?

Not necessarily. A common pattern uses a stronger model (GPT-4o) as the critic and a faster model (GPT-4o-mini) as the generator. This keeps costs down while maintaining evaluation quality. Some teams even use a smaller fine-tuned model for the critic.
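One way to wire that up is a small role table, so each call picks its model and system prompt from a single place. The table and helper below are a sketch (`ROLES` and `build_messages` are hypothetical names, and the prompts are abbreviated versions of the ones defined earlier):

```python
# System prompts from the earlier section, abbreviated here for self-containment.
GENERATOR_SYSTEM = "You are an expert technical writer."
CRITIC_SYSTEM = "You are a demanding editor and fact-checker."

# Cheaper model drafts, stronger model judges.
ROLES = {
    "generator": {"model": "gpt-4o-mini", "system": GENERATOR_SYSTEM},
    "critic": {"model": "gpt-4o", "system": CRITIC_SYSTEM},
}

def build_messages(role: str, user_content: str) -> list:
    """Assemble the chat messages for a given role."""
    return [
        {"role": "system", "content": ROLES[role]["system"]},
        {"role": "user", "content": user_content},
    ]

# Each call then picks its model and messages from the same table:
#   client.chat.completions.create(model=ROLES["critic"]["model"],
#                                  messages=build_messages("critic", draft))
```

Centralizing the role configuration makes it trivial to swap models later, e.g. to A/B test whether the cheaper generator actually hurts final quality.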

How do you prevent infinite loops where the critic is never satisfied?

Always set a max_rounds limit. Additionally, track scores across rounds — if the score plateaus or decreases for two consecutive rounds, break early. The agent should recognize when further refinement is not productive.
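That plateau check can be a small pure function over the history of scores. The name and defaults below are illustrative:

```python
def should_stop(scores: list, patience: int = 2, epsilon: float = 0.1) -> bool:
    """Return True when the last `patience` rounds failed to beat the earlier best."""
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-patience])
    # Stop if none of the recent rounds improved on the earlier best by more than epsilon.
    return all(s <= best_before + epsilon for s in scores[-patience:])
```

Call it after each evaluation round; together with max_rounds it bounds both cost and wasted refinement.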




Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
