---
title: "Reflection Agents: Building AI Systems That Critique and Improve Their Own Output"
description: "Learn how to build reflection agents that evaluate their own outputs, assign quality scores, and iteratively refine results through multi-round self-improvement loops using the Reflexion pattern."
canonical: https://callsphere.ai/blog/reflection-agents-ai-self-critique-improve-output
category: "Learn Agentic AI"
tags: ["Reflection Agents", "Reflexion Pattern", "Self-Evaluation", "AI Architecture", "Python"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-06T01:02:46.041Z
---

# Reflection Agents: Building AI Systems That Critique and Improve Their Own Output

> Learn how to build reflection agents that evaluate their own outputs, assign quality scores, and iteratively refine results through multi-round self-improvement loops using the Reflexion pattern.

## What Are Reflection Agents?

A reflection agent is an AI system that generates an initial output, then turns around and critiques that output against explicit quality criteria. Based on the critique, it produces an improved version. This loop repeats until the output meets a quality threshold or a maximum number of iterations is reached.

The concept draws from the **Reflexion** paper (Shinn et al., 2023), which demonstrated that LLM agents equipped with verbal self-reflection significantly outperform single-pass agents on coding, reasoning, and decision-making benchmarks.

## The Core Loop: Generate, Evaluate, Refine

Every reflection agent follows the same three-step cycle:

```mermaid
flowchart LR
    TASK(["Task"])
    GEN["Generate
initial response"]
    EVAL["Evaluate
against criteria"]
    GATE{"Threshold met or
max rounds reached?"}
    REFINE["Refine
using critique"]
    DONE(["Return output"])
    TASK --> GEN --> EVAL --> GATE
    GATE -->|Yes| DONE
    GATE -->|No| REFINE
    REFINE --> EVAL
    style GEN fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style REFINE fill:#dc2626,stroke:#b91c1c,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
```

1. **Generate** — produce an initial response to the task
2. **Evaluate** — score the response against defined criteria
3. **Refine** — use the evaluation feedback to produce a better version

Here is a minimal implementation:

```python
import json

from openai import OpenAI

client = OpenAI()

def reflect_and_improve(task: str, max_rounds: int = 3, threshold: float = 8.0):
    """Generate, evaluate, and refine output over multiple rounds."""

    # Round 1: initial generation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
    )
    current_output = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Evaluate the current output
        eval_prompt = f"""Rate the following output on a scale of 1-10 for:
- Accuracy (factual correctness)
- Completeness (covers all aspects)
- Clarity (easy to understand)

Output to evaluate:
{current_output}

Original task: {task}

Return JSON: {{"score": , "weaknesses": ["..."], "suggestions": ["..."]}}"""

        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": eval_prompt}],
            response_format={"type": "json_object"},
        )
        evaluation = eval_response.choices[0].message.content
        eval_data = json.loads(evaluation)

        print(f"Round {round_num + 1}: Score = {eval_data['score']}")

        if eval_data["score"] >= threshold:
            print("Quality threshold met.")
            return current_output

        # Refine based on feedback
        refine_prompt = f"""Improve this output based on the feedback below.

Original task: {task}
Current output: {current_output}
Weaknesses: {eval_data['weaknesses']}
Suggestions: {eval_data['suggestions']}

Write an improved version that addresses every weakness."""

        refined = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": refine_prompt}],
        )
        current_output = refined.choices[0].message.content

    return current_output
```
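A quick usage check, assuming an `OPENAI_API_KEY` is set in the environment (the task string is just an illustration):

```python
if __name__ == "__main__":
    # Any task that benefits from revision works here
    task = "Write a short guide to choosing between SQL and NoSQL databases."
    final_output = reflect_and_improve(task, max_rounds=3, threshold=8.0)
    print(final_output)
```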

## Separating the Critic from the Generator

A stronger pattern uses **two separate system prompts**: one optimized for generation, another for harsh evaluation. This mitigates the self-serving bias where a model rates its own output too generously.

```python
GENERATOR_SYSTEM = """You are an expert technical writer.
Produce detailed, accurate, well-structured content."""

CRITIC_SYSTEM = """You are a demanding editor and fact-checker.
Find every flaw, gap, and inaccuracy. Be harsh but constructive.
Never rate above 7 unless the output is genuinely excellent."""
```

By giving the critic a deliberately strict persona, you force the generator to actually earn a passing score rather than coasting through on inflated self-assessments.
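
Here is one way those two personas can slot into the loop. The sketch reuses the JSON-evaluation approach from the earlier example and, purely as an illustration, pairs a cheaper generator model with a stronger critic model:

```python
import json

from openai import OpenAI

client = OpenAI()

def generate(task: str, feedback: str | None = None) -> str:
    """Generator call: generation-focused persona, cheaper model."""
    content = task if feedback is None else f"{task}\n\nRevise using this feedback:\n{feedback}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": GENERATOR_SYSTEM},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content

def critique(task: str, output: str) -> dict:
    """Critic call: deliberately strict persona, stronger model."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CRITIC_SYSTEM},
            {
                "role": "user",
                "content": (
                    f"Task: {task}\n\nOutput:\n{output}\n\n"
                    'Return JSON: {"score": <1-10>, "weaknesses": ["..."], "suggestions": ["..."]}'
                ),
            },
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```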

## Multi-Dimensional Scoring

Production reflection agents score across multiple dimensions rather than using a single number. A scoring rubric might look like this:

```python
RUBRIC = {
    "accuracy": "Are all facts, numbers, and claims correct?",
    "completeness": "Does the output address every part of the task?",
    "clarity": "Is the writing clear and free of jargon?",
    "actionability": "Can the reader act on this immediately?",
    "structure": "Is the output well-organized with logical flow?",
}

def multi_dim_evaluate(output: str, task: str) -> dict:
    """Return per-dimension scores plus an overall average as a dict."""
    prompt = "Score this output 1-10 on each dimension:\n"
    for dim, question in RUBRIC.items():
        prompt += f"- {dim}: {question}\n"
    prompt += f"\nOutput: {output}\nTask: {task}"
    prompt += "\nReturn JSON with each dimension score and an 'overall' average."
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
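
The per-dimension breakdown is what makes refinement targeted: instead of a generic "do better," the refine prompt can call out the weakest dimension by name. A brief usage sketch, assuming the JSON comes back with one score per rubric key plus the requested `overall` average:

```python
scores = multi_dim_evaluate(current_output, task)

# Identify the dimension that needs the most work
weakest = min(RUBRIC, key=lambda dim: scores[dim])

if scores["overall"] < 8.0:
    print(f"Refining with focus on '{weakest}' (scored {scores[weakest]})")
```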

## When Reflection Helps (and When It Hurts)

Reflection adds latency and cost — each round means additional LLM calls. Use it when:

- **High stakes**: the output will be shown to customers or used in decisions
- **Complex tasks**: multi-step reasoning where errors compound
- **Code generation**: where the critic can actually run tests to verify correctness (see the test-running sketch below)

Skip it when the task is simple, latency-sensitive, or when the first-pass quality is already high enough for your use case.
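
For that last case, the critic does not have to be an LLM at all. Here is a minimal sketch of a test-running critic, with the module and test paths purely hypothetical:

```python
import subprocess
from pathlib import Path

def run_tests_as_critic(generated_code: str) -> dict:
    """Write the generated code to disk and let the test suite act as the critic."""
    Path("generated_module.py").write_text(generated_code)  # hypothetical target path

    result = subprocess.run(
        ["pytest", "tests/test_generated.py", "-q"],  # hypothetical test file
        capture_output=True,
        text=True,
    )
    return {
        "passed": result.returncode == 0,
        # Failing test output becomes the feedback fed to the next refinement round
        "feedback": result.stdout + result.stderr,
    }
```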

## FAQ

### How many reflection rounds are typically needed?

Most tasks converge after 2-3 rounds. Research shows diminishing returns beyond 3 rounds for text generation tasks, though coding tasks can benefit from up to 5 rounds when combined with test execution.

### Should the generator and critic use the same model?

Not necessarily. A common pattern uses a stronger model (GPT-4o) as the critic and a faster model (GPT-4o-mini) as the generator. This keeps costs down while maintaining evaluation quality. Some teams even use a smaller fine-tuned model for the critic.

### How do you prevent infinite loops where the critic is never satisfied?

Always set a `max_rounds` limit. Additionally, track scores across rounds — if the score plateaus or decreases for two consecutive rounds, break early. The agent should recognize when further refinement is not productive.
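
A sketch of that early-exit logic, assuming `evaluate` and `refine` wrap the critic and generator calls shown earlier:

```python
score_history: list[float] = []

for round_num in range(max_rounds):
    score = evaluate(current_output)  # numeric score from the critic
    score_history.append(score)

    if score >= threshold:
        break  # quality target met

    # Stop if the score has failed to improve for two consecutive rounds
    if len(score_history) >= 3 and score <= score_history[-2] <= score_history[-3]:
        break

    current_output = refine(current_output)
```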

---

#ReflectionAgents #Reflexion #SelfImprovement #AgenticAI #LLMArchitecture #AIEngineering #PythonAI #PromptEngineering

---

Source: https://callsphere.ai/blog/reflection-agents-ai-self-critique-improve-output
