Agent Evaluation Loops: Building Self-Correcting Workflows
Build iterative agent workflows where a task agent produces output and a feedback agent evaluates it, creating self-correcting loops with convergence criteria in the OpenAI Agents SDK.
The Problem with Single-Pass Agent Output
A single LLM call produces output that is often "good enough" but rarely excellent. The model generates a response, you return it to the user, and you hope for the best. In production systems, "hope" is not a quality strategy.
Human experts do not work in single passes. A writer drafts, reviews, revises, and reviews again. A software engineer writes code, runs tests, fixes failures, and re-tests. The revision loop is what separates professional output from first drafts.
Agent evaluation loops bring this same discipline to AI systems. A task agent produces output. A feedback agent evaluates that output against defined criteria. If the output does not meet the bar, the task agent receives the feedback and tries again. This loop continues until the output passes evaluation or a maximum iteration count is reached.
Architecture of an Evaluation Loop
The evaluation loop has four components:
- Task Agent — produces the work product (a draft, code, analysis, or any structured output)
- Feedback Agent — evaluates the work product against quality criteria and produces actionable feedback
- Convergence Criteria — defines when the loop should stop (quality threshold met, maximum iterations reached, or no new feedback to give)
- Loop Controller — orchestrates the iteration, passing feedback back to the task agent and checking convergence
from pydantic import BaseModel
from agents import Agent

class ArticleDraft(BaseModel):
    title: str
    content: str
    word_count: int

class FeedbackResult(BaseModel):
    score: float  # 0.0 to 1.0
    passes: bool
    strengths: list[str]
    issues: list[str]
    specific_suggestions: list[str]

task_agent = Agent(
    name="ArticleWriter",
    instructions="""You are a technical writer. Write or revise an article
    based on the given topic and any feedback provided. Produce clear,
    accurate, well-structured technical content. Target 800-1000 words.
    If feedback from a previous iteration is provided, address every
    specific issue mentioned. Do not simply rephrase — make substantive
    improvements based on the feedback.""",
    model="gpt-4o",
    output_type=ArticleDraft,
)

feedback_agent = Agent(
    name="ArticleReviewer",
    instructions="""You are a senior technical editor. Evaluate the article
    against these criteria:
    1. Technical accuracy — are all claims correct and well-supported?
    2. Clarity — can a mid-level developer follow the content?
    3. Structure — does the article flow logically with clear transitions?
    4. Completeness — are there gaps or unexplored angles?
    5. Conciseness — is there unnecessary repetition or filler?
    Score from 0.0 to 1.0. Set passes=True only if score >= 0.8.
    Provide specific, actionable suggestions — not vague feedback like
    "improve clarity" but concrete fixes like "the explanation of X in
    paragraph 3 assumes knowledge of Y, add a brief definition".""",
    model="gpt-4o",
    output_type=FeedbackResult,
)
The Loop Controller
The loop controller is the orchestration logic that ties the task and feedback agents together. It manages iteration state, formats the feedback for re-input, and enforces convergence criteria.
import asyncio
from dataclasses import dataclass, field
from typing import Any

from agents import Runner

@dataclass
class LoopState:
    iterations: list[dict] = field(default_factory=list)
    final_output: Any = None
    converged: bool = False
    total_iterations: int = 0

async def run_evaluation_loop(
    task_agent: Agent,
    feedback_agent: Agent,
    initial_input: str,
    max_iterations: int = 3,
    quality_threshold: float = 0.8,
) -> LoopState:
    """Run a task-feedback evaluation loop until convergence."""
    state = LoopState()
    current_input = initial_input

    for iteration in range(max_iterations):
        state.total_iterations = iteration + 1
        print(f"Iteration {iteration + 1}/{max_iterations}")

        # Step 1: Task agent produces output
        task_result = await Runner.run(task_agent, input=current_input)
        draft = task_result.final_output

        # Step 2: Feedback agent evaluates output
        evaluation_input = (
            f"Evaluate this article:\n\n"
            f"Title: {draft.title}\n"
            f"Content: {draft.content}\n"
            f"Word count: {draft.word_count}"
        )
        feedback_result = await Runner.run(feedback_agent, input=evaluation_input)
        feedback = feedback_result.final_output

        # Record this iteration
        state.iterations.append({
            "iteration": iteration + 1,
            "draft_title": draft.title,
            "word_count": draft.word_count,
            "score": feedback.score,
            "passes": feedback.passes,
            "issues_count": len(feedback.issues),
        })
        print(f" Score: {feedback.score:.2f} | "
              f"Issues: {len(feedback.issues)} | "
              f"Passes: {feedback.passes}")

        # Step 3: Check convergence
        if feedback.passes and feedback.score >= quality_threshold:
            state.final_output = draft
            state.converged = True
            print(f" Converged after {iteration + 1} iterations")
            return state

        # Step 4: Format feedback for next iteration
        feedback_text = "\n".join(
            f"- {suggestion}" for suggestion in feedback.specific_suggestions
        )
        issues_text = "\n".join(f"- {issue}" for issue in feedback.issues)
        current_input = (
            f"Revise the following article based on reviewer feedback.\n\n"
            f"ORIGINAL TOPIC: {initial_input}\n\n"
            f"YOUR PREVIOUS DRAFT:\n{draft.content}\n\n"
            f"REVIEWER SCORE: {feedback.score:.2f}/1.0\n\n"
            f"ISSUES TO FIX:\n{issues_text}\n\n"
            f"SPECIFIC SUGGESTIONS:\n{feedback_text}"
        )

    # Max iterations reached without convergence
    state.final_output = draft
    print(f" Max iterations reached. Final score: {feedback.score:.2f}")
    return state
Convergence Criteria Design
The quality threshold and maximum iteration count are the two primary convergence parameters, but production systems often need more nuanced criteria.
Diminishing returns detection. If the score improves by less than 0.05 between iterations, further loops are unlikely to help. The task agent is making minimal changes and the feedback agent is giving the same suggestions. Detect this and exit early.
def should_stop_early(iterations: list[dict], min_improvement: float = 0.05) -> bool:
    """Detect when the loop has stopped making meaningful progress."""
    if len(iterations) < 2:
        return False
    recent_scores = [it["score"] for it in iterations[-2:]]
    improvement = recent_scores[-1] - recent_scores[-2]
    if improvement < min_improvement:
        print(f" Early stop: improvement {improvement:.3f} below threshold {min_improvement}")
        return True
    return False
Issue count plateau. If the feedback agent identifies the same number of issues across two consecutive iterations, the task agent may be fixing some issues while introducing others. This is a signal to exit the loop and return the best iteration rather than the latest one.
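A minimal plateau check in the same style as `should_stop_early`, operating on the iteration records kept by the loop controller (the function name is illustrative):

```python
def issue_count_plateaued(iterations: list[dict]) -> bool:
    """Return True when the issue count has not decreased across the
    last two iterations -- a sign the task agent is trading one problem
    for another rather than converging."""
    if len(iterations) < 2:
        return False
    prev, last = iterations[-2], iterations[-1]
    return last["issues_count"] >= prev["issues_count"]
```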
Score regression. Sometimes a revision makes things worse. Track the best-scoring iteration and return it even if the final iteration scores lower.
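Best-iteration tracking can be a one-liner over the same iteration records. This sketch assumes each record is extended to also store its draft (for example under a "draft" key), so the caller can return that draft instead of the latest one:

```python
def best_iteration(iterations: list[dict]) -> dict:
    """Return the highest-scoring iteration record, so the loop can fall
    back to it when a later revision regresses."""
    return max(iterations, key=lambda it: it["score"])
```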
Writing Effective Feedback Agents
The feedback agent is the most critical component in the loop. Vague feedback produces vague revisions. Here are patterns that produce useful feedback.
Criterion-specific scoring. Instead of a single overall score, have the feedback agent score each criterion independently. This tells the task agent exactly which dimensions need work.
class DetailedFeedback(BaseModel):
    accuracy_score: float
    clarity_score: float
    structure_score: float
    completeness_score: float
    overall_score: float
    passes: bool
    issues: list[str]
    specific_suggestions: list[str]
Quote-and-fix format. The most actionable feedback pattern is: quote the problematic text, explain why it is an issue, and suggest a specific fix. Instruct the feedback agent to use this format.
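One way to enforce the quote-and-fix format is through the structured output type itself, so the model cannot return free-form commentary. A sketch -- the model and field names here are illustrative, not part of the SDK:

```python
from pydantic import BaseModel

class QuoteAndFix(BaseModel):
    quoted_text: str     # the exact problematic passage, copied verbatim
    problem: str         # why this passage is an issue
    suggested_fix: str   # a concrete replacement or change

class QuoteAndFixFeedback(BaseModel):
    score: float
    passes: bool
    fixes: list[QuoteAndFix]
```

Setting this as the feedback agent's output_type makes every piece of feedback traceable to a specific passage in the draft.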
Separation of concerns. Just as you would not ask one human reviewer to simultaneously check grammar, technical accuracy, and strategic coherence, consider using multiple feedback agents that each focus on a single quality dimension. Their individual scores are aggregated by the loop controller.
Preventing Infinite Loops and Oscillation
Evaluation loops can enter pathological states. Two common failure modes are oscillation (the task agent alternates between two approaches based on contradictory feedback) and regression loops (each revision fixes one thing but breaks another).
Mitigation strategies include accumulating all previous feedback rather than only the latest round, instructing the task agent to preserve strengths identified in previous reviews, and setting a hard maximum iteration count that is never exceeded regardless of score.
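The feedback-accumulation strategy can be sketched as a prompt builder that includes every prior round, not just the latest (function name and prompt wording are illustrative):

```python
def build_revision_input(topic: str, draft: str,
                         feedback_history: list[list[str]]) -> str:
    """Build the next task-agent prompt from ALL previous feedback
    rounds, so the agent cannot oscillate by only reacting to the
    most recent review."""
    sections = []
    for round_num, suggestions in enumerate(feedback_history, start=1):
        bullet_list = "\n".join(f"- {s}" for s in suggestions)
        sections.append(f"ROUND {round_num} FEEDBACK:\n{bullet_list}")
    history = "\n\n".join(sections)
    return (
        f"Revise the article on: {topic}\n\n"
        f"CURRENT DRAFT:\n{draft}\n\n"
        f"ALL FEEDBACK SO FAR (address new issues without undoing "
        f"earlier fixes):\n\n{history}"
    )
```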
Practical Considerations
Cost. Each iteration costs at least two LLM calls (task + feedback). A three-iteration loop on gpt-4o with 1000-token outputs costs roughly 6x a single call. Budget accordingly and use gpt-4o-mini for the feedback agent when the evaluation criteria are straightforward.
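The call-count arithmetic behind that 6x figure is worth making explicit (this counts LLM calls only; real cost also depends on token prices and draft length):

```python
def loop_llm_calls(iterations: int, agents_per_iteration: int = 2) -> int:
    """Total LLM calls for an evaluation loop: each iteration runs the
    task agent once and the feedback agent once."""
    return iterations * agents_per_iteration
```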
Latency. Evaluation loops multiply latency linearly with iteration count. For user-facing applications, limit to 2-3 iterations maximum. For background batch processing, you can afford more iterations.
Logging. Log every iteration with the full draft, feedback, and scores. This data is invaluable for improving your agents over time. You will discover patterns — which quality dimensions consistently fail, which feedback suggestions the task agent struggles to implement, and which topics require more iterations than others.
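A minimal JSONL logger for the iteration records is one way to capture this data; this is a sketch, not tied to any particular observability stack:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_iteration(log_path: Path, record: dict) -> None:
    """Append one iteration record as a JSON line. JSONL keeps each
    loop run appendable and easy to analyze with standard tools."""
    record = {**record, "logged_at": datetime.now(timezone.utc).isoformat()}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```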
The evaluation loop pattern transforms AI output from "whatever the model produces" to "output that meets defined quality standards." It is one of the most practical patterns for production agentic systems.
Written by
CallSphere Team