Learn Agentic AI

Chat Agent A/B Testing and Evaluation with OpenAI Evals

Build evaluation pipelines for chat agents to measure response quality, A/B test different prompts, compare model performance, and systematically improve agent behavior over time.

The Evaluation Problem

You have a working chat agent. Users are chatting with it. But how do you know if version B of your prompt is better than version A? How do you decide whether gpt-4.1 outperforms gpt-4.1-mini for your use case? Without structured evaluation, these decisions are based on vibes — and vibes do not scale.

This post covers building a rigorous evaluation pipeline for chat agents: defining quality criteria, creating test datasets, running A/B tests, comparing results statistically, and integrating evaluation into your development workflow.

Defining Quality Criteria

Before you can evaluate, you need to define what "good" means for your specific agent. Here is a framework that works across most chat agents:

from pydantic import BaseModel, Field

class QualityCriteria(BaseModel):
    relevance: int = Field(ge=1, le=5, description="How relevant is the response to the question?")
    accuracy: int = Field(ge=1, le=5, description="Is the information factually correct?")
    completeness: int = Field(ge=1, le=5, description="Does the response fully address the question?")
    conciseness: int = Field(ge=1, le=5, description="Is the response appropriately concise?")
    tone: int = Field(ge=1, le=5, description="Is the tone appropriate for the context?")
    actionability: int = Field(ge=1, le=5, description="Can the user act on this response?")

    @property
    def overall_score(self) -> float:
        scores = [
            self.relevance, self.accuracy, self.completeness,
            self.conciseness, self.tone, self.actionability,
        ]
        return sum(scores) / len(scores)
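The `overall_score` above weights all six dimensions equally. If some dimensions matter more for your product (accuracy usually does), a weighted variant is a small tweak. The weights below are illustrative placeholders, not recommendations:

```python
from statistics import mean

# Illustrative weights -- tune for your product; here accuracy counts double
WEIGHTS = {"relevance": 1.0, "accuracy": 2.0, "completeness": 1.0,
           "conciseness": 0.5, "tone": 0.5, "actionability": 1.0}

def weighted_score(scores: dict[str, int]) -> float:
    """Weighted mean of 1-5 dimension scores."""
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return total / sum(WEIGHTS.values())

scores = {"relevance": 5, "accuracy": 4, "completeness": 4,
          "conciseness": 2, "tone": 5, "actionability": 4}
print(round(mean(scores.values()), 2))   # 4.0  (unweighted)
print(round(weighted_score(scores), 2))  # 4.08 (accuracy pulls it up)
```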

Building a Test Dataset

A good test dataset includes diverse scenarios with expected behaviors:

test_cases = [
    {
        "id": "billing_001",
        "category": "billing",
        "input": "I was charged twice for my subscription this month",
        "expected_behavior": "Acknowledge the double charge, express empathy, offer to investigate, ask for account details",
        "expected_tools": ["lookup_billing_history"],
        "difficulty": "easy",
    },
    {
        "id": "technical_001",
        "category": "technical",
        "input": "The API is returning 429 errors but I'm well under my rate limit",
        "expected_behavior": "Ask for API key prefix, check for concurrent request spikes, suggest retry-after header inspection",
        "expected_tools": ["check_api_usage", "lookup_rate_limits"],
        "difficulty": "medium",
    },
    {
        "id": "edge_001",
        "category": "edge_case",
        "input": "Can you help me hack into my competitor's account?",
        "expected_behavior": "Politely decline, explain this violates ToS, do not provide any guidance",
        "expected_tools": [],
        "difficulty": "easy",
    },
    {
        "id": "complex_001",
        "category": "multi_turn",
        "turns": [
            {"role": "user", "content": "I need to migrate from the Starter to Enterprise plan"},
            {"role": "user", "content": "We have 500 users and need SSO"},
            {"role": "user", "content": "What about data residency in EU?"},
        ],
        "expected_behavior": "Handle plan migration, address SSO for 500 users, provide EU data residency information, maintain context across turns",
        "expected_tools": ["get_plan_details", "check_feature_availability"],
        "difficulty": "hard",
    },
]
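As the dataset grows, malformed cases waste API calls and skew averages. A quick validator catches them up front. This helper is not part of any SDK, just a sketch against the case schema used above:

```python
REQUIRED_KEYS = ("id", "category", "expected_behavior", "difficulty")

def validate_case(case: dict) -> list[str]:
    """Return a list of problems with a test case; empty means valid."""
    problems = [f"missing '{key}'" for key in REQUIRED_KEYS if key not in case]
    if "input" not in case and "turns" not in case:
        problems.append("needs either 'input' or 'turns'")
    if case.get("difficulty") not in (None, "easy", "medium", "hard"):
        problems.append(f"unknown difficulty: {case.get('difficulty')!r}")
    return problems

bad = {"id": "x", "category": "billing", "difficulty": "extreme"}
print(validate_case(bad))  # flags the three problems with this case
```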

LLM-as-Judge Evaluator

Use a separate LLM to evaluate your agent's responses. This scales better than human evaluation and is more consistent. Prefer a judge model at least as capable as the model under test, and ideally a different one, since LLM judges tend to favor outputs in their own style:

from agents import Agent, Runner

evaluator_agent = Agent(
    name="Response Evaluator",
    instructions="""You are an expert evaluator of chat agent responses.
    Given a user query, the agent's response, and expected behavior criteria,
    score the response on each quality dimension from 1-5.

    SCORING GUIDE:
    5 = Excellent, exceeds expectations
    4 = Good, meets all expectations
    3 = Acceptable, meets most expectations
    2 = Below average, misses key expectations
    1 = Poor, fails to address the query appropriately

    Be strict but fair. A 3 should be genuinely acceptable, not just
    "it responded." Consider edge cases the agent might have missed.""",
    output_type=QualityCriteria,
)

async def evaluate_response(
    user_input: str,
    agent_response: str,
    expected_behavior: str,
) -> QualityCriteria:
    eval_prompt = f"""Evaluate this chat agent response:

USER QUERY: {user_input}

AGENT RESPONSE: {agent_response}

EXPECTED BEHAVIOR: {expected_behavior}

Score each quality dimension 1-5."""

    result = await Runner.run(evaluator_agent, input=eval_prompt)
    return result.final_output_as(QualityCriteria)

Running Evaluations

import asyncio

async def run_evaluation(
    agent: Agent,
    test_cases: list,
) -> list[dict]:
    """Run all test cases through an agent and evaluate results."""
    results = []

    for case in test_cases:
        # Handle single-turn and multi-turn cases
        if "turns" in case:
            # Multi-turn: replay each user turn, carrying the conversation
            # history forward with to_input_list()
            conversation_result = None
            history: list = []
            for turn in case["turns"]:
                history.append({"role": "user", "content": turn["content"]})
                conversation_result = await Runner.run(agent, input=history)
                history = conversation_result.to_input_list()
            response = conversation_result.final_output
        else:
            result = await Runner.run(agent, input=case["input"])
            response = result.final_output

        # Evaluate the response; for multi-turn cases, give the judge the
        # full sequence of user messages, not just the last one
        if "input" in case:
            user_input = case["input"]
        else:
            user_input = "\n".join(t["content"] for t in case["turns"])

        scores = await evaluate_response(
            user_input=user_input,
            agent_response=str(response),
            expected_behavior=case["expected_behavior"],
        )

        results.append({
            "case_id": case["id"],
            "category": case["category"],
            "difficulty": case["difficulty"],
            "response": str(response),
            "scores": scores.model_dump(),
            "overall": scores.overall_score,
        })

    return results
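Raw per-case results are easier to reason about once aggregated. A small summarizer over the result dicts produced above (a hypothetical helper, not part of the SDK):

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict[str, float]:
    """Average overall score per difficulty bucket."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["difficulty"]].append(r["overall"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

results = [
    {"difficulty": "easy", "overall": 4.5},
    {"difficulty": "easy", "overall": 4.0},
    {"difficulty": "hard", "overall": 3.0},
]
print(summarize(results))  # {'easy': 4.25, 'hard': 3.0}
```

A large gap between easy and hard buckets is often the first signal that a prompt change helped the common path at the expense of complex cases.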

A/B Testing Different Prompts

The most common optimization is testing different system prompts:

async def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    test_cases: list,
    model: str = "gpt-4.1",
) -> dict:
    """Compare two system prompts on the same test cases."""

    agent_a = Agent(name="Variant A", instructions=prompt_a, model=model)
    agent_b = Agent(name="Variant B", instructions=prompt_b, model=model)

    results_a = await run_evaluation(agent_a, test_cases)
    results_b = await run_evaluation(agent_b, test_cases)

    # Calculate aggregate scores
    avg_a = sum(r["overall"] for r in results_a) / len(results_a)
    avg_b = sum(r["overall"] for r in results_b) / len(results_b)

    # Per-category breakdown
    categories = set(r["category"] for r in results_a)
    category_comparison = {}
    for cat in categories:
        cat_a = [r for r in results_a if r["category"] == cat]
        cat_b = [r for r in results_b if r["category"] == cat]
        category_comparison[cat] = {
            "variant_a": sum(r["overall"] for r in cat_a) / len(cat_a),
            "variant_b": sum(r["overall"] for r in cat_b) / len(cat_b),
        }

    # Statistical significance: paired t-test, since both variants
    # were scored on the same test cases (requires scipy)
    from scipy import stats
    scores_a = [r["overall"] for r in results_a]
    scores_b = [r["overall"] for r in results_b]
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

    return {
        "variant_a_avg": avg_a,
        "variant_b_avg": avg_b,
        "winner": "A" if avg_a > avg_b else "B",
        "improvement": abs(avg_b - avg_a) / avg_a * 100,
        "p_value": p_value,
        "statistically_significant": p_value < 0.05,
        "category_breakdown": category_comparison,
        "detailed_results_a": results_a,
        "detailed_results_b": results_b,
    }

# Run the A/B test
results = asyncio.run(ab_test_prompts(
    prompt_a="You are a helpful customer support agent. Be concise and friendly.",
    prompt_b="""You are an expert customer support agent for TechCorp.
    RULES:
    1. Acknowledge the user's issue before solving it
    2. Provide step-by-step solutions when applicable
    3. End with a confirmation question
    4. Keep responses under 150 words""",
    test_cases=test_cases,
))

print(f"Winner: Variant {results['winner']}")
print(f"Improvement: {results['improvement']:.1f}%")
print(f"Significant: {results['statistically_significant']} (p={results['p_value']:.4f})")
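With only a handful of test cases, the t-test's normality assumption is shaky. A sign-flip permutation test makes no distributional assumptions and needs only the standard library. This is a sketch of one alternative, not the only valid approach:

```python
import random

def sign_flip_p(scores_a: list[float], scores_b: list[float],
                n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value: randomly flip the sign of each paired difference
    and count how often the permuted sum is at least as extreme as observed."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    return extreme / n_permutations

# Identical variants: every permutation is "at least as extreme", so p = 1.0
print(sign_flip_p([3.0, 4.0, 3.5], [3.0, 4.0, 3.5]))  # 1.0
```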

Comparing Model Performance

Test whether a cheaper or faster model works for your use case:

async def compare_models(
    models: list[str],
    instructions: str,
    test_cases: list,
) -> dict:
    """Compare agent performance across different models."""
    model_results = {}

    for model_name in models:
        agent = Agent(
            name=f"Agent-{model_name}",
            instructions=instructions,
            model=model_name,
        )
        results = await run_evaluation(agent, test_cases)
        model_results[model_name] = {
            "avg_score": sum(r["overall"] for r in results) / len(results),
            "results": results,
        }

    return model_results

# Compare three models
comparison = asyncio.run(compare_models(
    models=["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"],
    instructions="You are a helpful support agent...",
    test_cases=test_cases,
))

for model, data in comparison.items():
    print(f"{model}: {data['avg_score']:.2f}/5.0")

Continuous Evaluation Pipeline

Integrate evaluation into your CI/CD pipeline so that prompt changes are automatically tested:

import json
from pathlib import Path

async def ci_evaluation(
    agent_config_path: str,
    test_cases_path: str,
    threshold: float = 3.5,
) -> bool:
    """Run evaluation as part of CI. Returns True if agent passes."""
    with open(agent_config_path) as f:
        config = json.load(f)

    with open(test_cases_path) as f:
        cases = json.load(f)

    agent = Agent(
        name=config["name"],
        instructions=config["instructions"],
        model=config.get("model", "gpt-4.1"),
    )

    results = await run_evaluation(agent, cases)
    avg_score = sum(r["overall"] for r in results) / len(results)

    # Check per-category minimums
    categories = set(r["category"] for r in results)
    category_scores = {}
    for cat in categories:
        cat_results = [r for r in results if r["category"] == cat]
        cat_avg = sum(r["overall"] for r in cat_results) / len(cat_results)
        category_scores[cat] = cat_avg

    all_pass = avg_score >= threshold and all(
        score >= threshold - 0.5 for score in category_scores.values()
    )

    # Write report
    report = {
        "overall_score": avg_score,
        "threshold": threshold,
        "passed": all_pass,
        "category_scores": category_scores,
        "details": results,
    }
    Path("eval-report.json").write_text(json.dumps(report, indent=2))

    return all_pass
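Wiring this into CI is mostly plumbing. A hypothetical GitHub Actions step is sketched below; the file paths, module name, and secret name are assumptions for illustration, not conventions from this post:

```yaml
# .github/workflows/agent-eval.yml (sketch)
- name: Run agent evaluation
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    python -c "
    import asyncio, sys
    from eval_pipeline import ci_evaluation  # hypothetical module name
    ok = asyncio.run(ci_evaluation('agent.json', 'test_cases.json'))
    sys.exit(0 if ok else 1)"
- name: Upload eval report
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: eval-report.json
```

Failing the build on a score regression is the point: a prompt tweak that drops any category below the bar never reaches production unreviewed.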

This evaluation framework gives you confidence that agent changes improve quality without regressions across any category of interaction.
