
Building an Agent Evaluation Framework: Metrics, Datasets, and Automated Scoring

Learn how to design a comprehensive evaluation framework for AI agents covering metric selection, dataset creation, and automated scoring pipelines that scale across dozens of agent capabilities.

Why You Need a Structured Evaluation Framework

Deploying an AI agent without structured evaluation is like shipping software without tests. The agent might work perfectly in a demo, then fail spectacularly on the first edge case a real user throws at it. An evaluation framework gives you repeatable, quantitative measurements that tell you exactly where your agent excels and where it breaks.

A good framework has three pillars: metrics that capture what matters, datasets that represent real usage, and scoring pipelines that run automatically. Let's build each one from scratch.

Designing Your Metric Taxonomy

Metrics for AI agents fall into four categories. Each captures a different dimension of quality.

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"
    EFFICIENCY = "efficiency"
    SAFETY = "safety"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0
    description: str = ""

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric):
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict[str, float]:
        results = {}
        for metric in self.metrics:
            try:
                results[metric.name] = metric.scorer(sample)
            except Exception as e:
                results[metric.name] = 0.0
                results[f"{metric.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict[str, float]) -> float:
        total_weight = sum(m.weight for m in self.metrics)
        weighted_sum = sum(
            results.get(m.name, 0.0) * m.weight
            for m in self.metrics
        )
        return weighted_sum / total_weight if total_weight > 0 else 0.0

This gives you a registry where each metric is a named scorer function. The framework runs all metrics against a sample and returns individual scores plus a weighted aggregate.
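To make the registry concrete, here is a small usage sketch. The two metrics and their weights are invented for illustration, and condensed copies of the classes above are included so the snippet runs on its own:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

# Condensed copies of the classes above, so this snippet is self-contained.
class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric):
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict[str, float]:
        results = {}
        for m in self.metrics:
            try:
                results[m.name] = m.scorer(sample)
            except Exception as e:
                results[m.name] = 0.0
                results[f"{m.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict) -> float:
        total = sum(m.weight for m in self.metrics)
        return sum(results.get(m.name, 0.0) * m.weight for m in self.metrics) / total

framework = EvalFramework()

# An invented task metric: did the agent produce any answer at all?
framework.register(EvalMetric(
    name="answered_question",
    category=MetricCategory.TASK,
    scorer=lambda s: 1.0 if s["output"].strip() else 0.0,
    weight=2.0,  # task completion counts double in the aggregate
))

# An invented quality metric: is the answer reasonably concise?
framework.register(EvalMetric(
    name="under_length_limit",
    category=MetricCategory.QUALITY,
    scorer=lambda s: 1.0 if len(s["output"]) <= 200 else 0.0,
))

scores = framework.score({"output": "Paris is the capital of France."})
aggregate = framework.weighted_aggregate(scores)
print(scores, aggregate)  # both metrics pass, so the aggregate is 1.0
```

Note that a failing scorer never crashes the run: it records 0.0 plus an `_error` key, so one bad metric cannot poison a whole evaluation.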

Creating Evaluation Datasets

Your dataset should mirror production traffic. Each sample includes the user input, the expected behavior, and any context the agent had access to.

import json
import hashlib
from datetime import datetime, timezone

@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    expected_tool_calls: list[dict] = field(default_factory=list)
    context: dict = field(default_factory=dict)
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"

class EvalDataset:
    def __init__(self, name: str, version: str = "1.0"):
        self.name = name
        self.version = version
        self.samples: list[EvalSample] = []
        self.created_at = datetime.now(timezone.utc).isoformat()

    def add_sample(self, sample: EvalSample):
        self.samples.append(sample)

    def fingerprint(self) -> str:
        content = json.dumps(
            [s.__dict__ for s in self.samples],
            sort_keys=True
        )
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

    def save(self, path: str):
        data = {
            "name": self.name,
            "version": self.version,
            "fingerprint": self.fingerprint(),
            "created_at": self.created_at,
            "samples": [s.__dict__ for s in self.samples],
        }
        with open(path, "w") as f:
            json.dump(data, f, indent=2)

The fingerprint ensures you always know exactly which dataset version produced a given set of results. Tag-based filtering lets you slice results by capability, difficulty, or domain.
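As a usage sketch, the tiny dataset below exercises the fingerprint and tag filtering. The sample content is invented, and condensed copies of the classes above are included so it runs standalone:

```python
import hashlib
import json
from dataclasses import dataclass, field

# Condensed copies of EvalSample / EvalDataset above, for a standalone demo.
@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"

class EvalDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[EvalSample] = []

    def add_sample(self, sample: EvalSample):
        self.samples.append(sample)

    def fingerprint(self) -> str:
        # hash of the sorted sample contents: same samples, same fingerprint
        content = json.dumps([s.__dict__ for s in self.samples], sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

ds = EvalDataset("refund-requests")
ds.add_sample(EvalSample("s1", "I want my money back", "escalate_refund",
                         tags=["refunds", "edge-case"]))
ds.add_sample(EvalSample("s2", "What are your hours?", "answer_hours",
                         tags=["faq"]))

print(ds.fingerprint())  # deterministic: reruns on the same samples match
print([s.sample_id for s in ds.filter_by_tag("refunds")])  # ['s1']
```

Because the fingerprint hashes sample content rather than creation time, two runs against byte-identical samples always report the same dataset version.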

Building the Automated Scoring Pipeline

The pipeline connects your agent, dataset, and metrics into a single runnable evaluation.

import asyncio
from typing import Protocol

class AgentRunner(Protocol):
    async def run(self, user_input: str, context: dict) -> dict:
        ...

async def run_evaluation(
    agent: AgentRunner,
    dataset: EvalDataset,
    framework: EvalFramework,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_sample(sample: EvalSample):
        async with semaphore:
            agent_output = await agent.run(
                sample.user_input, sample.context
            )
            scored = framework.score({
                "sample": sample.__dict__,
                "output": agent_output,
            })
            scored["sample_id"] = sample.sample_id
            scored["aggregate"] = framework.weighted_aggregate(scored)
            return scored

    tasks = [evaluate_sample(s) for s in dataset.samples]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    return [
        r if isinstance(r, dict) else {"error": str(r)}
        for r in results
    ]

The semaphore controls concurrency so you do not overwhelm your agent or your LLM provider. Results come back with per-sample scores and an aggregate, ready for analysis.
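To sanity-check the pipeline shape without a real model, here is a compact stand-in: a stub `EchoAgent` plus a trimmed `run_eval` that mirrors `run_evaluation` above but takes plain dicts and a single scorer function (all names here are illustrative):

```python
import asyncio

# A stub agent with the same run() shape as the AgentRunner protocol above.
class EchoAgent:
    async def run(self, user_input: str, context: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate model latency
        return {"text": f"echo: {user_input}"}

async def run_eval(agent, samples: list[dict], scorer,
                   concurrency: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate(sample: dict) -> dict:
        async with semaphore:  # at most `concurrency` samples in flight
            output = await agent.run(sample["user_input"],
                                     sample.get("context", {}))
            return {"sample_id": sample["sample_id"],
                    "score": scorer(sample, output)}

    gathered = await asyncio.gather(*[evaluate(s) for s in samples],
                                    return_exceptions=True)
    # one sample crashing must not lose the rest of the run
    return [r if isinstance(r, dict) else {"error": str(r)} for r in gathered]

samples = [{"sample_id": f"s{i}", "user_input": f"question {i}"}
           for i in range(3)]
echo_scorer = lambda sample, output: 1.0 if output["text"].startswith("echo:") else 0.0

results = asyncio.run(run_eval(EchoAgent(), samples, echo_scorer))
print(results)  # three entries, each scoring 1.0
```

Swapping `EchoAgent` for a real agent is the only change needed to evaluate production behavior, which makes this a cheap smoke test for the pipeline itself.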

Interpreting Results

After running the pipeline, aggregate results by category and tag to find patterns.

from collections import defaultdict

def summarize_results(
    results: list[dict], framework: EvalFramework
) -> dict:
    category_scores = defaultdict(list)
    for result in results:
        for metric in framework.metrics:
            if metric.name in result:
                category_scores[metric.category.value].append(
                    result[metric.name]
                )

    summary = {}
    for category, scores in category_scores.items():
        summary[category] = {
            "mean": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "count": len(scores),
        }
    return summary

Look for categories where the minimum score is significantly below the mean — those represent your worst failure modes and should be your top priority for improvement.
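That triage can be automated. The sketch below assumes a summary shaped like the one `summarize_results` returns; the 0.3 gap threshold is an arbitrary starting point you should tune:

```python
# Flag categories whose worst case falls well below their average score:
# a large mean-to-min gap usually marks a cluster of hard failures.
def flag_failure_modes(summary: dict, gap: float = 0.3) -> list[str]:
    return [
        category
        for category, stats in summary.items()
        if stats["mean"] - stats["min"] > gap
    ]

# Invented summary data: safety has one badly failing sample.
summary = {
    "task_completion": {"mean": 0.92, "min": 0.85, "max": 1.0, "count": 40},
    "safety": {"mean": 0.88, "min": 0.20, "max": 1.0, "count": 40},
}
print(flag_failure_modes(summary))  # ['safety']
```

Here the safety mean looks healthy on its own; only the mean-to-min gap reveals the failure hiding inside the average.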

FAQ

How many evaluation samples do I need for reliable results?

Start with at least 50 to 100 samples per capability you want to measure. For statistical significance when comparing two agent versions, you typically need 200 or more samples. The key is coverage across edge cases, not raw volume. Ten diverse samples beat a hundred repetitive ones.
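For the version-comparison case, a two-proportion z-test is one simple way to check whether a pass-rate difference is real. This sketch uses the normal approximation, which is reasonable at the 200-sample scale mentioned above but too coarse for small runs; the pass counts are invented:

```python
import math

# Two-sided two-proportion z-test on pass rates from two agent versions
# evaluated on the same dataset (normal approximation).
def compare_pass_rates(passes_a: int, n_a: int,
                       passes_b: int, n_b: int) -> float:
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 75% pass rate vs 86% pass rate on 200 samples each
p = compare_pass_rates(passes_a=150, n_a=200, passes_b=172, n_b=200)
print(f"p-value: {p:.4f}")  # small p suggests the improvement is real
```

The same comparison on 20 samples per side would rarely reach significance, which is why the 200-sample guideline matters when choosing between versions.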

Should I use LLM-as-judge or deterministic scoring?

Use deterministic scoring wherever possible — exact match for tool calls, regex for structured outputs, keyword checks for required information. Reserve LLM-as-judge for subjective quality dimensions like helpfulness or coherence. Deterministic metrics are faster, cheaper, and reproducible.
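Here are three deterministic scorers of that kind, written against the `{"sample": ..., "output": ...}` dict that `EvalFramework.score` passes to each scorer. The field names inside `output` are assumptions about your agent's response shape:

```python
import re

# Exact match on the expected tool-call sequence.
def tool_calls_exact_match(data: dict) -> float:
    expected = data["sample"].get("expected_tool_calls", [])
    actual = data["output"].get("tool_calls", [])
    return 1.0 if expected == actual else 0.0

# Regex check that a structured field holds an ISO-style date.
def output_is_valid_date(data: dict) -> float:
    date = data["output"].get("date", "")
    return 1.0 if re.fullmatch(r"\d{4}-\d{2}-\d{2}", date) else 0.0

# Fraction of required keywords present in the response text.
def mentions_required_keywords(data: dict) -> float:
    text = data["output"].get("text", "").lower()
    required = data["sample"].get("required_keywords", [])
    if not required:
        return 1.0
    return sum(1 for kw in required if kw.lower() in text) / len(required)

data = {
    "sample": {
        "expected_tool_calls": [{"name": "lookup_order"}],
        "required_keywords": ["refund", "7 days"],
    },
    "output": {
        "tool_calls": [{"name": "lookup_order"}],
        "date": "2026-01-15",
        "text": "Your refund arrives within 7 days.",
    },
}
print(tool_calls_exact_match(data),
      output_is_valid_date(data),
      mentions_required_keywords(data))
```

All three run in microseconds and return the same score every time, which is exactly what makes regression comparisons trustworthy.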

How often should I re-run evaluations?

Run the full evaluation suite on every model change, prompt update, or tool modification. Set up nightly runs against your production configuration to catch regressions from upstream model updates. Store every result with the dataset fingerprint and agent version so you can track trends over time.
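A minimal sketch of that bookkeeping, assuming plain JSON files are enough for your scale (the fingerprint and version strings below are placeholders):

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Persist one evaluation run with the identifiers needed to trace it:
# which dataset fingerprint and which agent version produced these scores.
def save_run(results: list[dict], dataset_fingerprint: str,
             agent_version: str, path: str) -> dict:
    record = {
        "dataset_fingerprint": dataset_fingerprint,
        "agent_version": agent_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

path = os.path.join(tempfile.gettempdir(), "eval_run.json")
record = save_run(
    results=[{"sample_id": "s1", "aggregate": 0.9}],
    dataset_fingerprint="a1b2c3d4e5f6",  # placeholder fingerprint
    agent_version="prompt-v12",          # placeholder version tag
    path=path,
)
print(record["run_at"])
```

With fingerprint and version stored on every record, a score change can always be attributed to either a dataset change or an agent change, never ambiguity between the two.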


#AgentEvaluation #Benchmarking #Python #MLOps #Testing #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

