Agent Evaluation Loops: Building Self-Correcting Workflows
Build iterative agent workflows where a task agent produces output and a feedback agent evaluates it, creating self-correcting loops with convergence criteria in the OpenAI Agents SDK.
The Problem with Single-Pass Agent Output
A single LLM call produces output that is often "good enough" but rarely excellent. The model generates a response, you return it to the user, and you hope for the best. In production systems, "hope" is not a quality strategy.
Human experts do not work in single passes. A writer drafts, reviews, revises, and reviews again. A software engineer writes code, runs tests, fixes failures, and re-tests. The revision loop is what separates professional output from first drafts.
Agent evaluation loops bring this same discipline to AI systems. A task agent produces output. A feedback agent evaluates that output against defined criteria. If the output does not meet the bar, the task agent receives the feedback and tries again. This loop continues until the output passes evaluation or a maximum iteration count is reached.
Architecture of an Evaluation Loop
The evaluation loop has four components:
- Task Agent — produces the work product (a draft, code, analysis, or any structured output)
- Feedback Agent — evaluates the work product against quality criteria and produces actionable feedback
- Convergence Criteria — defines when the loop should stop (quality threshold met, maximum iterations reached, or no new feedback to give)
- Loop Controller — orchestrates the iteration, passing feedback back to the task agent and checking convergence
from pydantic import BaseModel
from agents import Agent

class ArticleDraft(BaseModel):
    title: str
    content: str
    word_count: int

class FeedbackResult(BaseModel):
    score: float  # 0.0 to 1.0
    passes: bool
    strengths: list[str]
    issues: list[str]
    specific_suggestions: list[str]

task_agent = Agent(
    name="ArticleWriter",
    instructions="""You are a technical writer. Write or revise an article
    based on the given topic and any feedback provided. Produce clear,
    accurate, well-structured technical content. Target 800-1000 words.
    If feedback from a previous iteration is provided, address every
    specific issue mentioned. Do not simply rephrase — make substantive
    improvements based on the feedback.""",
    model="gpt-4o",
    output_type=ArticleDraft,
)

feedback_agent = Agent(
    name="ArticleReviewer",
    instructions="""You are a senior technical editor. Evaluate the article
    against these criteria:
    1. Technical accuracy — are all claims correct and well-supported?
    2. Clarity — can a mid-level developer follow the content?
    3. Structure — does the article flow logically with clear transitions?
    4. Completeness — are there gaps or unexplored angles?
    5. Conciseness — is there unnecessary repetition or filler?
    Score from 0.0 to 1.0. Set passes=True only if score >= 0.8.
    Provide specific, actionable suggestions — not vague feedback like
    "improve clarity" but concrete fixes like "the explanation of X in
    paragraph 3 assumes knowledge of Y, add a brief definition".""",
    model="gpt-4o",
    output_type=FeedbackResult,
)
The Loop Controller
The loop controller is the orchestration logic that ties the task and feedback agents together. It manages iteration state, formats the feedback for re-input, and enforces convergence criteria.
import asyncio
from dataclasses import dataclass, field
from typing import Any

from agents import Runner

@dataclass
class LoopState:
    iterations: list[dict] = field(default_factory=list)
    final_output: Any = None
    converged: bool = False
    total_iterations: int = 0

async def run_evaluation_loop(
    task_agent: Agent,
    feedback_agent: Agent,
    initial_input: str,
    max_iterations: int = 3,
    quality_threshold: float = 0.8,
) -> LoopState:
    """Run a task-feedback evaluation loop until convergence."""
    state = LoopState()
    current_input = initial_input

    for iteration in range(max_iterations):
        state.total_iterations = iteration + 1
        print(f"Iteration {iteration + 1}/{max_iterations}")

        # Step 1: Task agent produces output
        task_result = await Runner.run(task_agent, input=current_input)
        draft = task_result.final_output

        # Step 2: Feedback agent evaluates output
        evaluation_input = (
            f"Evaluate this article:\n\n"
            f"Title: {draft.title}\n"
            f"Content: {draft.content}\n"
            f"Word count: {draft.word_count}"
        )
        feedback_result = await Runner.run(feedback_agent, input=evaluation_input)
        feedback = feedback_result.final_output

        # Record this iteration
        state.iterations.append({
            "iteration": iteration + 1,
            "draft_title": draft.title,
            "word_count": draft.word_count,
            "score": feedback.score,
            "passes": feedback.passes,
            "issues_count": len(feedback.issues),
        })
        print(f" Score: {feedback.score:.2f} | "
              f"Issues: {len(feedback.issues)} | "
              f"Passes: {feedback.passes}")

        # Step 3: Check convergence
        if feedback.passes and feedback.score >= quality_threshold:
            state.final_output = draft
            state.converged = True
            print(f" Converged after {iteration + 1} iterations")
            return state

        # Step 4: Format feedback for next iteration
        feedback_text = "\n".join(
            f"- {suggestion}" for suggestion in feedback.specific_suggestions
        )
        issues_text = "\n".join(f"- {issue}" for issue in feedback.issues)
        current_input = (
            f"Revise the following article based on reviewer feedback.\n\n"
            f"ORIGINAL TOPIC: {initial_input}\n\n"
            f"YOUR PREVIOUS DRAFT:\n{draft.content}\n\n"
            f"REVIEWER SCORE: {feedback.score:.2f}/1.0\n\n"
            f"ISSUES TO FIX:\n{issues_text}\n\n"
            f"SPECIFIC SUGGESTIONS:\n{feedback_text}"
        )

    # Max iterations reached without convergence
    state.final_output = draft
    print(f" Max iterations reached. Final score: {feedback.score:.2f}")
    return state
Convergence Criteria Design
The quality threshold and maximum iteration count are the two primary convergence parameters, but production systems often need more nuanced criteria.
Diminishing returns detection. If the score improves by less than 0.05 between iterations, further loops are unlikely to help. The task agent is making minimal changes and the feedback agent is giving the same suggestions. Detect this and exit early.
def should_stop_early(iterations: list[dict], min_improvement: float = 0.05) -> bool:
    """Detect when the loop has stopped making meaningful progress."""
    if len(iterations) < 2:
        return False
    recent_scores = [it["score"] for it in iterations[-2:]]
    improvement = recent_scores[-1] - recent_scores[-2]
    if improvement < min_improvement:
        print(f" Early stop: improvement {improvement:.3f} below threshold {min_improvement}")
        return True
    return False
Issue count plateau. If the feedback agent identifies the same number of issues across two consecutive iterations, the task agent may be fixing some issues while introducing others. This is a signal to exit the loop and return the best iteration rather than the latest one.
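A minimal plateau check in the same style as `should_stop_early`, operating on the iteration records kept by the loop controller (the function name is illustrative):

```python
def issue_count_plateaued(iterations: list[dict]) -> bool:
    """Return True when the issue count has not decreased across the
    last two iterations -- a sign the task agent is trading one problem
    for another rather than converging."""
    if len(iterations) < 2:
        return False
    prev, last = iterations[-2], iterations[-1]
    return last["issues_count"] >= prev["issues_count"]
```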
Score regression. Sometimes a revision makes things worse. Track the best-scoring iteration and return it even if the final iteration scores lower.
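Best-iteration tracking can be a one-liner over the same iteration records. This sketch assumes each record is extended to also store its draft (for example under a "draft" key), so the caller can return that draft instead of the latest one:

```python
def best_iteration(iterations: list[dict]) -> dict:
    """Return the highest-scoring iteration record, so the loop can fall
    back to it when a later revision regresses."""
    return max(iterations, key=lambda it: it["score"])
```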
Writing Effective Feedback Agents
The feedback agent is the most critical component in the loop. Vague feedback produces vague revisions. Here are patterns that produce useful feedback.
Criterion-specific scoring. Instead of a single overall score, have the feedback agent score each criterion independently. This tells the task agent exactly which dimensions need work.
class DetailedFeedback(BaseModel):
    accuracy_score: float
    clarity_score: float
    structure_score: float
    completeness_score: float
    overall_score: float
    passes: bool
    issues: list[str]
    specific_suggestions: list[str]
Quote-and-fix format. The most actionable feedback pattern is: quote the problematic text, explain why it is an issue, and suggest a specific fix. Instruct the feedback agent to use this format.
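One way to enforce the quote-and-fix format is through the structured output type itself, so the model cannot return free-form commentary. A sketch -- the model and field names here are illustrative, not part of the SDK:

```python
from pydantic import BaseModel

class QuoteAndFix(BaseModel):
    quoted_text: str     # the exact problematic passage, copied verbatim
    problem: str         # why this passage is an issue
    suggested_fix: str   # a concrete replacement or change

class QuoteAndFixFeedback(BaseModel):
    score: float
    passes: bool
    fixes: list[QuoteAndFix]
```

Setting this as the feedback agent's output_type makes every piece of feedback traceable to a specific passage in the draft.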
Separation of concerns. Just as you would not ask one human reviewer to simultaneously check grammar, technical accuracy, and strategic coherence, consider using multiple feedback agents that each focus on a single quality dimension. Their individual scores are aggregated by the loop controller.
Preventing Infinite Loops and Oscillation
Evaluation loops can enter pathological states. Two common failure modes are oscillation (the task agent alternates between two approaches based on contradictory feedback) and regression loops (each revision fixes one thing but breaks another).
Mitigation strategies include accumulating all previous feedback rather than only the latest round, instructing the task agent to preserve strengths identified in previous reviews, and setting a hard maximum iteration count that is never exceeded regardless of score.
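The feedback-accumulation strategy can be sketched as a prompt builder that includes every prior round, not just the latest (function name and prompt wording are illustrative):

```python
def build_revision_input(topic: str, draft: str,
                         feedback_history: list[list[str]]) -> str:
    """Build the next task-agent prompt from ALL previous feedback
    rounds, so the agent cannot oscillate by only reacting to the
    most recent review."""
    sections = []
    for round_num, suggestions in enumerate(feedback_history, start=1):
        bullet_list = "\n".join(f"- {s}" for s in suggestions)
        sections.append(f"ROUND {round_num} FEEDBACK:\n{bullet_list}")
    history = "\n\n".join(sections)
    return (
        f"Revise the article on: {topic}\n\n"
        f"CURRENT DRAFT:\n{draft}\n\n"
        f"ALL FEEDBACK SO FAR (address new issues without undoing "
        f"earlier fixes):\n\n{history}"
    )
```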
Practical Considerations
Cost. Each iteration costs at least two LLM calls (task + feedback). A three-iteration loop on gpt-4o with 1000-token outputs costs roughly 6x a single call. Budget accordingly and use gpt-4o-mini for the feedback agent when the evaluation criteria are straightforward.
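The call-count arithmetic behind that 6x figure is worth making explicit (this counts LLM calls only; real cost also depends on token prices and draft length):

```python
def loop_llm_calls(iterations: int, agents_per_iteration: int = 2) -> int:
    """Total LLM calls for an evaluation loop: each iteration runs the
    task agent once and the feedback agent once."""
    return iterations * agents_per_iteration
```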
Latency. Evaluation loops multiply latency linearly with iteration count. For user-facing applications, limit to 2-3 iterations maximum. For background batch processing, you can afford more iterations.
Logging. Log every iteration with the full draft, feedback, and scores. This data is invaluable for improving your agents over time. You will discover patterns — which quality dimensions consistently fail, which feedback suggestions the task agent struggles to implement, and which topics require more iterations than others.
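A minimal JSONL logger for the iteration records is one way to capture this data; this is a sketch, not tied to any particular observability stack:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_iteration(log_path: Path, record: dict) -> None:
    """Append one iteration record as a JSON line. JSONL keeps each
    loop run appendable and easy to analyze with standard tools."""
    record = {**record, "logged_at": datetime.now(timezone.utc).isoformat()}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```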
The evaluation loop pattern transforms AI output from "whatever the model produces" to "output that meets defined quality standards." It is one of the most practical patterns for production agentic systems.
Written by
CallSphere Team