---
title: "Self-Reflection in AI Agents: Building Systems That Learn from Mistakes"
description: "Explore how self-reflection transforms AI agents from one-shot executors into iterative improvers — covering critique loops, retry-with-feedback, score-and-improve patterns, and practical Python implementations."
canonical: https://callsphere.ai/blog/self-reflection-in-ai-agents-building-systems-that-learn
category: "Learn Agentic AI"
tags: ["Self-Reflection", "AI Agents", "Critique Loops", "Quality Assurance", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.266Z
---

# Self-Reflection in AI Agents: Building Systems That Learn from Mistakes

> Explore how self-reflection transforms AI agents from one-shot executors into iterative improvers — covering critique loops, retry-with-feedback, score-and-improve patterns, and practical Python implementations.

## The Problem with One-Shot Execution

Most AI agents generate a response and move on. If the output is wrong, incomplete, or poorly formatted, the user has to notice the problem and ask for a correction. This is fragile. Humans miss errors, and the feedback loop is slow.

Self-reflection changes this by adding an internal quality check. Before returning a result to the user, the agent evaluates its own output, identifies weaknesses, and improves it — all within the same execution loop. The result is higher quality output with fewer round trips.

## The Basic Critique Loop

The simplest self-reflection pattern uses two LLM calls: one to generate, one to critique.

```mermaid
flowchart LR
    TASK(["User task"])
    GEN["Generate draft"]
    CRIT["Critique draft"]
    OK{"Approved?"}
    REV["Revise with feedback"]
    OUT(["Return output"])
    TASK --> GEN --> CRIT --> OK
    OK -->|Yes| OUT
    OK -->|No| REV --> CRIT
    style GEN fill:#4f46e5,stroke:#4338ca,color:#fff
    style OK fill:#f59e0b,stroke:#d97706,color:#1f2937
    style REV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from openai import OpenAI

client = OpenAI()

def generate_with_reflection(task: str, max_reflections: int = 3) -> str:
    # Step 1: Generate initial output
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a technical writer."},
            {"role": "user", "content": task},
        ],
    ).choices[0].message.content

    for i in range(max_reflections):
        # Step 2: Critique the output
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a critical reviewer. Evaluate the following output for:"
                    "\n1. Factual accuracy"
                    "\n2. Completeness (does it address all aspects of the task?)"
                    "\n3. Clarity and structure"
                    "\n4. Any errors or inconsistencies"
                    "\nIf the output is satisfactory, respond with exactly: APPROVED"
                    "\nOtherwise, list specific improvements needed."
                )},
                {"role": "user", "content": f"Task: {task}\n\nOutput:\n{draft}"},
            ],
        ).choices[0].message.content

        # If approved, return the draft (check the start of the reply so a critique
        # that merely mentions the word "approved" does not end the loop early)
        if critique.strip().upper().startswith("APPROVED"):
            return draft

        # Step 3: Improve based on critique
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a technical writer. "
                 "Revise your output based on the feedback provided."},
                {"role": "user", "content": (
                    f"Original task: {task}\n\n"
                    f"Your previous draft:\n{draft}\n\n"
                    f"Reviewer feedback:\n{critique}\n\n"
                    "Please produce an improved version addressing all feedback."
                )},
            ],
        ).choices[0].message.content

    return draft  # Return best attempt after max reflections
```

Each critique names specific issues for the revision to address, so each iteration tends to improve the output. In practice, most outputs reach "APPROVED" quality within one or two reflection cycles.
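A minimal way to exercise the loop, assuming the function above and a configured `OPENAI_API_KEY` (the task string is just an illustration):

```python
if __name__ == "__main__":
    result = generate_with_reflection(
        "Write a short runbook for rotating API keys in a production service.",
        max_reflections=2,
    )
    print(result)
```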

## Score-and-Improve Pattern

For more structured reflection, assign numerical scores to specific quality dimensions. This gives you quantifiable improvement tracking and clearer termination criteria.

```python
import json

def score_and_improve(task: str, output: str, threshold: float = 8.0) -> dict:
    """Score output on multiple dimensions, improve if below threshold."""

    # Score the output
    scoring_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Score the following output on a scale of 1-10 for each dimension. "
                "Return JSON with scores and brief justifications.\n"
                "Dimensions: accuracy, completeness, clarity, actionability"
            )},
            {"role": "user", "content": f"Task: {task}\nOutput: {output}"},
        ],
        response_format={"type": "json_object"},
    )

    scores = json.loads(scoring_response.choices[0].message.content)

    # Calculate average score
    dimensions = ["accuracy", "completeness", "clarity", "actionability"]
    avg_score = sum(scores.get(d, {}).get("score", 0) for d in dimensions) / len(dimensions)

    if avg_score >= threshold:
        return {"output": output, "scores": scores, "improved": False}

    # Identify weak dimensions for targeted improvement
    weak_dims = [d for d in dimensions if scores.get(d, {}).get("score", 0) < threshold]

    # Revise the output, focusing on the dimensions that scored below threshold
    improved = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Revise the output to strengthen these weak dimensions: "
                + ", ".join(weak_dims)
                + ". Keep everything that already works."
            )},
            {"role": "user", "content": (
                f"Task: {task}\n\nOutput:\n{output}\n\n"
                f"Scores and justifications:\n{json.dumps(scores, indent=2)}"
            )},
        ],
    ).choices[0].message.content

    return {"output": improved, "scores": scores, "improved": True}
```

## Retry with Feedback in Agent Loops

Self-reflection also fits inside a tool-using agent loop. Here the agent drafts a final answer, runs a self-check on it, and, if the check finds gaps, feeds that review back into the conversation and retries before anything reaches the user.

```python
def agent_with_self_check(goal: str, tools: list, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a careful agent. After completing a task, evaluate "
            "your own work before presenting it to the user. If your output "
            "has gaps or errors, fix them before responding."
        )},
        {"role": "user", "content": goal},
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            # Before returning, add a self-check
            check = client.chat.completions.create(
                model="gpt-4o",
                messages=messages + [{
                    "role": "user",
                    "content": (
                        "Review your response. Is it complete, accurate, and "
                        "fully addresses the original goal? If yes, say FINAL. "
                        "If not, explain what needs fixing."
                    ),
                }],
            ).choices[0].message.content

            if "FINAL" in check.upper():
                return msg.content

            # Continue improving
            messages.append({"role": "user", "content": f"Self-review: {check}. Please improve."})
            continue

        # Execute tool calls
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

    # Out of steps: the last entry may be an SDK message object rather than a dict
    last = messages[-1]
    return last.get("content", "Task incomplete.") if isinstance(last, dict) else (last.content or "Task incomplete.")
```
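The loop above assumes a `tools` schema and an `execute_tool` dispatcher that are not defined in the post. A minimal sketch of what they could look like, with a hypothetical `get_order_status` tool standing in for your real tools:

```python
def get_order_status(order_id: str) -> dict:
    # Hypothetical tool implementation; replace with a real lookup.
    return {"order_id": order_id, "status": "shipped"}

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def execute_tool(name: str, args: dict) -> dict:
    # Dispatch a tool call from the model to the matching Python function.
    registry = {"get_order_status": get_order_status}
    return registry[name](**args)
```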

## FAQ

### Does self-reflection double the cost of every agent call?

Not quite double, because critique prompts are typically shorter than generation prompts. Expect 40-70% additional token cost per reflection cycle. The tradeoff is worth it for high-stakes outputs (reports, code, customer communications) where quality matters more than cost. Skip reflection for low-stakes tasks like simple lookups.
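As a rough illustration of that estimate (the token counts below are assumptions, not measurements):

```python
draft_tokens = 1_500       # assumed cost of the initial generation
overhead_per_cycle = 0.55  # midpoint of the 40-70% range above
cycles = 2

total = draft_tokens * (1 + overhead_per_cycle * cycles)
print(total)  # 3150.0 tokens, roughly 2.1x the one-shot cost
```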

### Can the same model effectively critique its own output?

Yes, with caveats. The same model can catch structural issues, missing information, and formatting problems reliably. It is less effective at catching its own factual hallucinations because the same knowledge gaps that caused the error also affect the critique. For critical accuracy requirements, use a separate verification step with tool-based fact checking.
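One way to sketch that separate verification step is a second pass that extracts the draft's factual claims and checks each against retrieved evidence. The `search_web` helper below is hypothetical; substitute whatever retrieval or fact-checking tool you actually have.

```python
def verify_claims(text: str) -> list[dict]:
    # Ask the model to extract discrete factual claims from the draft.
    claims = json.loads(client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": 'Extract factual claims as JSON: {"claims": ["..."]}'},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
    ).choices[0].message.content).get("claims", [])

    findings = []
    for claim in claims:
        evidence = search_web(claim)  # hypothetical retrieval helper
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer SUPPORTED, CONTRADICTED, or UNCLEAR."},
                {"role": "user", "content": f"Claim: {claim}\nEvidence: {evidence}"},
            ],
        ).choices[0].message.content
        findings.append({"claim": claim, "verdict": verdict})
    return findings
```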

### How do I prevent reflection loops that never converge?

Set a strict maximum on reflection cycles (2-3 is usually sufficient). Use the score-and-improve pattern with a numerical threshold so you have an objective stopping criterion. If scores are not improving between iterations, break the loop — further reflection is unlikely to help, and the issue may require a fundamentally different approach.
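A minimal convergence guard along those lines, building on the `score_and_improve` function from earlier (the `min_gain` cutoff is an assumed tuning value):

```python
def reflect_until_converged(task: str, output: str, max_cycles: int = 3,
                            threshold: float = 8.0, min_gain: float = 0.25) -> str:
    """Stop when scores clear the threshold, stop improving, or cycles run out."""
    dims = ["accuracy", "completeness", "clarity", "actionability"]
    prev_avg = float("-inf")
    for _ in range(max_cycles):
        result = score_and_improve(task, output, threshold=threshold)
        avg = sum(result["scores"].get(d, {}).get("score", 0) for d in dims) / len(dims)
        if not result["improved"] or avg - prev_avg < min_gain:
            break  # already good enough, or gains have stalled
        prev_avg, output = avg, result["output"]
    return output
```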

---

#SelfReflection #AIAgents #CritiqueLoops #QualityAssurance #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/self-reflection-in-ai-agents-building-systems-that-learn
