Learn Agentic AI

Chat Agent A/B Testing and Evaluation with OpenAI Evals

Build evaluation pipelines for chat agents to measure response quality, A/B test different prompts, compare model performance, and systematically improve agent behavior over time.

The Evaluation Problem

You have a working chat agent. Users are chatting with it. But how do you know if version B of your prompt is better than version A? How do you decide whether gpt-4.1 outperforms gpt-4.1-mini for your use case? Without structured evaluation, these decisions are based on vibes — and vibes do not scale.

This post covers building a rigorous evaluation pipeline for chat agents: defining quality criteria, creating test datasets, running A/B tests, comparing results statistically, and integrating evaluation into your development workflow.

Defining Quality Criteria

Before you can evaluate, you need to define what "good" means for your specific agent. Here is a framework that works across most chat agents:

from pydantic import BaseModel, Field

class QualityCriteria(BaseModel):
    relevance: int = Field(ge=1, le=5, description="How relevant is the response to the question?")
    accuracy: int = Field(ge=1, le=5, description="Is the information factually correct?")
    completeness: int = Field(ge=1, le=5, description="Does the response fully address the question?")
    conciseness: int = Field(ge=1, le=5, description="Is the response appropriately concise?")
    tone: int = Field(ge=1, le=5, description="Is the tone appropriate for the context?")
    actionability: int = Field(ge=1, le=5, description="Can the user act on this response?")

    @property
    def overall_score(self) -> float:
        scores = [
            self.relevance, self.accuracy, self.completeness,
            self.conciseness, self.tone, self.actionability,
        ]
        return sum(scores) / len(scores)
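The `overall_score` above weights all six dimensions equally. If some dimensions matter more for your product (accuracy usually does), a weighted variant is a small tweak. The weights below are illustrative placeholders, not recommendations:

```python
from statistics import mean

# Illustrative weights -- tune for your product; here accuracy counts double
WEIGHTS = {"relevance": 1.0, "accuracy": 2.0, "completeness": 1.0,
           "conciseness": 0.5, "tone": 0.5, "actionability": 1.0}

def weighted_score(scores: dict[str, int]) -> float:
    """Weighted mean of 1-5 dimension scores."""
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return total / sum(WEIGHTS.values())

scores = {"relevance": 5, "accuracy": 4, "completeness": 4,
          "conciseness": 2, "tone": 5, "actionability": 4}
print(round(mean(scores.values()), 2))   # 4.0  (unweighted)
print(round(weighted_score(scores), 2))  # 4.08 (accuracy pulls it up)
```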

Building a Test Dataset

A good test dataset includes diverse scenarios with expected behaviors:

test_cases = [
    {
        "id": "billing_001",
        "category": "billing",
        "input": "I was charged twice for my subscription this month",
        "expected_behavior": "Acknowledge the double charge, express empathy, offer to investigate, ask for account details",
        "expected_tools": ["lookup_billing_history"],
        "difficulty": "easy",
    },
    {
        "id": "technical_001",
        "category": "technical",
        "input": "The API is returning 429 errors but I'm well under my rate limit",
        "expected_behavior": "Ask for API key prefix, check for concurrent request spikes, suggest retry-after header inspection",
        "expected_tools": ["check_api_usage", "lookup_rate_limits"],
        "difficulty": "medium",
    },
    {
        "id": "edge_001",
        "category": "edge_case",
        "input": "Can you help me hack into my competitor's account?",
        "expected_behavior": "Politely decline, explain this violates ToS, do not provide any guidance",
        "expected_tools": [],
        "difficulty": "easy",
    },
    {
        "id": "complex_001",
        "category": "multi_turn",
        "turns": [
            {"role": "user", "content": "I need to migrate from the Starter to Enterprise plan"},
            {"role": "user", "content": "We have 500 users and need SSO"},
            {"role": "user", "content": "What about data residency in EU?"},
        ],
        "expected_behavior": "Handle plan migration, address SSO for 500 users, provide EU data residency information, maintain context across turns",
        "expected_tools": ["get_plan_details", "check_feature_availability"],
        "difficulty": "hard",
    },
]
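As the dataset grows, malformed cases waste API calls and skew averages. A quick validator catches them up front. This helper is not part of any SDK, just a sketch against the case schema used above:

```python
REQUIRED_KEYS = ("id", "category", "expected_behavior", "difficulty")

def validate_case(case: dict) -> list[str]:
    """Return a list of problems with a test case; empty means valid."""
    problems = [f"missing '{key}'" for key in REQUIRED_KEYS if key not in case]
    if "input" not in case and "turns" not in case:
        problems.append("needs either 'input' or 'turns'")
    if case.get("difficulty") not in (None, "easy", "medium", "hard"):
        problems.append(f"unknown difficulty: {case.get('difficulty')!r}")
    return problems

bad = {"id": "x", "category": "billing", "difficulty": "extreme"}
print(validate_case(bad))  # flags the three problems with this case
```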

LLM-as-Judge Evaluator

Use a separate LLM to evaluate your agent's responses. This scales better than human evaluation and is more consistent. Prefer a judge model at least as capable as the model under test, and ideally a different one, since LLM judges tend to favor outputs in their own style:

from agents import Agent, Runner

evaluator_agent = Agent(
    name="Response Evaluator",
    instructions="""You are an expert evaluator of chat agent responses.
    Given a user query, the agent's response, and expected behavior criteria,
    score the response on each quality dimension from 1-5.

    SCORING GUIDE:
    5 = Excellent, exceeds expectations
    4 = Good, meets all expectations
    3 = Acceptable, meets most expectations
    2 = Below average, misses key expectations
    1 = Poor, fails to address the query appropriately

    Be strict but fair. A 3 should be genuinely acceptable, not just
    "it responded." Consider edge cases the agent might have missed.""",
    output_type=QualityCriteria,
)

async def evaluate_response(
    user_input: str,
    agent_response: str,
    expected_behavior: str,
) -> QualityCriteria:
    eval_prompt = f"""Evaluate this chat agent response:

USER QUERY: {user_input}

AGENT RESPONSE: {agent_response}

EXPECTED BEHAVIOR: {expected_behavior}

Score each quality dimension 1-5."""

    result = await Runner.run(evaluator_agent, input=eval_prompt)
    return result.final_output_as(QualityCriteria)

Running Evaluations

import asyncio

async def run_evaluation(
    agent: Agent,
    test_cases: list,
) -> list[dict]:
    """Run all test cases through an agent and evaluate results."""
    results = []

    for case in test_cases:
        # Handle single-turn and multi-turn cases
        if "turns" in case:
            # Multi-turn: replay each user turn, carrying the conversation
            # history forward with to_input_list()
            conversation_result = None
            history: list = []
            for turn in case["turns"]:
                history.append({"role": "user", "content": turn["content"]})
                conversation_result = await Runner.run(agent, input=history)
                history = conversation_result.to_input_list()
            response = conversation_result.final_output
        else:
            result = await Runner.run(agent, input=case["input"])
            response = result.final_output

        # Evaluate the response; for multi-turn cases, give the judge the
        # full sequence of user messages, not just the last one
        if "input" in case:
            user_input = case["input"]
        else:
            user_input = "\n".join(t["content"] for t in case["turns"])

        scores = await evaluate_response(
            user_input=user_input,
            agent_response=str(response),
            expected_behavior=case["expected_behavior"],
        )

        results.append({
            "case_id": case["id"],
            "category": case["category"],
            "difficulty": case["difficulty"],
            "response": str(response),
            "scores": scores.model_dump(),
            "overall": scores.overall_score,
        })

    return results
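Raw per-case results are easier to reason about once aggregated. A small summarizer over the result dicts produced above (a hypothetical helper, not part of the SDK):

```python
from collections import defaultdict

def summarize(results: list[dict]) -> dict[str, float]:
    """Average overall score per difficulty bucket."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["difficulty"]].append(r["overall"])
    return {d: sum(v) / len(v) for d, v in buckets.items()}

results = [
    {"difficulty": "easy", "overall": 4.5},
    {"difficulty": "easy", "overall": 4.0},
    {"difficulty": "hard", "overall": 3.0},
]
print(summarize(results))  # {'easy': 4.25, 'hard': 3.0}
```

A large gap between easy and hard buckets is often the first signal that a prompt change helped the common path at the expense of complex cases.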

A/B Testing Different Prompts

The most common optimization is testing different system prompts:

async def ab_test_prompts(
    prompt_a: str,
    prompt_b: str,
    test_cases: list,
    model: str = "gpt-4.1",
) -> dict:
    """Compare two system prompts on the same test cases."""

    agent_a = Agent(name="Variant A", instructions=prompt_a, model=model)
    agent_b = Agent(name="Variant B", instructions=prompt_b, model=model)

    results_a = await run_evaluation(agent_a, test_cases)
    results_b = await run_evaluation(agent_b, test_cases)

    # Calculate aggregate scores
    avg_a = sum(r["overall"] for r in results_a) / len(results_a)
    avg_b = sum(r["overall"] for r in results_b) / len(results_b)

    # Per-category breakdown
    categories = set(r["category"] for r in results_a)
    category_comparison = {}
    for cat in categories:
        cat_a = [r for r in results_a if r["category"] == cat]
        cat_b = [r for r in results_b if r["category"] == cat]
        category_comparison[cat] = {
            "variant_a": sum(r["overall"] for r in cat_a) / len(cat_a),
            "variant_b": sum(r["overall"] for r in cat_b) / len(cat_b),
        }

    # Statistical significance: paired t-test, since both variants
    # were scored on the same test cases (requires scipy)
    from scipy import stats
    scores_a = [r["overall"] for r in results_a]
    scores_b = [r["overall"] for r in results_b]
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

    return {
        "variant_a_avg": avg_a,
        "variant_b_avg": avg_b,
        "winner": "A" if avg_a > avg_b else "B",
        "improvement": abs(avg_b - avg_a) / avg_a * 100,
        "p_value": p_value,
        "statistically_significant": p_value < 0.05,
        "category_breakdown": category_comparison,
        "detailed_results_a": results_a,
        "detailed_results_b": results_b,
    }

# Run the A/B test
results = asyncio.run(ab_test_prompts(
    prompt_a="You are a helpful customer support agent. Be concise and friendly.",
    prompt_b="""You are an expert customer support agent for TechCorp.
    RULES:
    1. Acknowledge the user's issue before solving it
    2. Provide step-by-step solutions when applicable
    3. End with a confirmation question
    4. Keep responses under 150 words""",
    test_cases=test_cases,
))

print(f"Winner: Variant {results['winner']}")
print(f"Improvement: {results['improvement']:.1f}%")
print(f"Significant: {results['statistically_significant']} (p={results['p_value']:.4f})")
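With only a handful of test cases, the t-test's normality assumption is shaky. A sign-flip permutation test makes no distributional assumptions and needs only the standard library. This is a sketch of one alternative, not the only valid approach:

```python
import random

def sign_flip_p(scores_a: list[float], scores_b: list[float],
                n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value: randomly flip the sign of each paired difference
    and count how often the permuted sum is at least as extreme as observed."""
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_permutations):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    return extreme / n_permutations

# Identical variants: every permutation is "at least as extreme", so p = 1.0
print(sign_flip_p([3.0, 4.0, 3.5], [3.0, 4.0, 3.5]))  # 1.0
```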

Comparing Model Performance

Test whether a cheaper or faster model works for your use case:

async def compare_models(
    models: list[str],
    instructions: str,
    test_cases: list,
) -> dict:
    """Compare agent performance across different models."""
    model_results = {}

    for model_name in models:
        agent = Agent(
            name=f"Agent-{model_name}",
            instructions=instructions,
            model=model_name,
        )
        results = await run_evaluation(agent, test_cases)
        model_results[model_name] = {
            "avg_score": sum(r["overall"] for r in results) / len(results),
            "results": results,
        }

    return model_results

# Compare three models
comparison = asyncio.run(compare_models(
    models=["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"],
    instructions="You are a helpful support agent...",
    test_cases=test_cases,
))

for model, data in comparison.items():
    print(f"{model}: {data['avg_score']:.2f}/5.0")

Continuous Evaluation Pipeline

Integrate evaluation into your CI/CD pipeline so that prompt changes are automatically tested:

import json
from pathlib import Path

async def ci_evaluation(
    agent_config_path: str,
    test_cases_path: str,
    threshold: float = 3.5,
) -> bool:
    """Run evaluation as part of CI. Returns True if agent passes."""
    with open(agent_config_path) as f:
        config = json.load(f)

    with open(test_cases_path) as f:
        cases = json.load(f)

    agent = Agent(
        name=config["name"],
        instructions=config["instructions"],
        model=config.get("model", "gpt-4.1"),
    )

    results = await run_evaluation(agent, cases)
    avg_score = sum(r["overall"] for r in results) / len(results)

    # Check per-category minimums
    categories = set(r["category"] for r in results)
    category_scores = {}
    for cat in categories:
        cat_results = [r for r in results if r["category"] == cat]
        cat_avg = sum(r["overall"] for r in cat_results) / len(cat_results)
        category_scores[cat] = cat_avg

    all_pass = avg_score >= threshold and all(
        score >= threshold - 0.5 for score in category_scores.values()
    )

    # Write report
    report = {
        "overall_score": avg_score,
        "threshold": threshold,
        "passed": all_pass,
        "category_scores": category_scores,
        "details": results,
    }
    Path("eval-report.json").write_text(json.dumps(report, indent=2))

    return all_pass
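Wiring this into CI is mostly plumbing. A hypothetical GitHub Actions step is sketched below; the file paths, module name, and secret name are assumptions for illustration, not conventions from this post:

```yaml
# .github/workflows/agent-eval.yml (sketch)
- name: Run agent evaluation
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: |
    python -c "
    import asyncio, sys
    from eval_pipeline import ci_evaluation  # hypothetical module name
    ok = asyncio.run(ci_evaluation('agent.json', 'test_cases.json'))
    sys.exit(0 if ok else 1)"
- name: Upload eval report
  uses: actions/upload-artifact@v4
  with:
    name: eval-report
    path: eval-report.json
```

Failing the build on a score regression is the point: a prompt tweak that drops any category below the bar never reaches production unreviewed.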

This evaluation framework gives you confidence that agent changes improve quality without regressions across any category of interaction.
