
The Agent Evaluation Stack in 2026: From Trace to Eval Score

How the modern agent eval stack actually flows: instrument, trace, dataset, evaluator, score, CI gate. The full pipeline that keeps agents from regressing.

TL;DR

The agent evaluation stack in 2026 is a six-stage pipeline: instrument → trace → dataset → evaluator → score → CI gate. Skip a stage and you ship regressions. I've watched teams burn entire quarters chasing eval theater — colorful dashboards, no signal — because they treated evaluation like a one-time vibe check instead of an always-on production loop. The reference implementation most teams converge on uses LangSmith for tracing, datasets, and evaluators, with pairwise LLM-as-judge wired into pull-request CI. This post walks the entire flow, including code you can paste, two mermaid diagrams of the data path, and the honest tradeoffs between online and offline eval. If you only build one part of this stack first, build the dataset of curated traces — everything else is plumbing around that asset.

Why "Eval" Means Something Different for Agents

When people first encounter LLM evaluation, they think of MMLU, HumanEval, GSM8K — academic benchmarks where there is a known answer and you compute accuracy. Agent evaluation is almost the opposite. The "input" isn't a prompt; it's a user goal plus tool environment. The "output" isn't a token; it's a trajectory — a sequence of model decisions, tool calls, retrieval hits, and final responses, often spanning 10-30 LLM calls per session. There is no single ground truth, latency matters as much as correctness, and the same input can legitimately produce three different acceptable outputs.

That changes what you measure. Traditional NLP metrics (BLEU, ROUGE, exact-match) collapse on agents. You need trajectory-aware evaluators — graders that look at the whole trace, not just the last message. You need reference-free evaluators for the long tail where ground truth doesn't exist. And you need a continuous loop: production traces flow back into the dataset, the dataset is rerun against new agent versions, and the experiment results gate deploys. The end state is a stack, not a script.

I'll define the stack first, then walk every stage with code.

The Six-Stage Stack, At A Glance

flowchart LR
  A[Production Agent] -->|emit spans| B[Tracing Layer]
  B --> C[(Trace Store)]
  C -->|curate examples| D[(Dataset)]
  D --> E[Experiment Runner]
  F[Candidate Agent vN+1] --> E
  E --> G[Evaluators]
  G --> H[(Eval Scores)]
  H --> I{CI Gate}
  I -->|pass| J[Deploy]
  I -->|fail| K[Block PR]
  A -->|online evals| G
  G -->|annotation queue| L[Human Reviewers]
  L --> D

The arrows that matter most are the two feedback loops: production traces flowing back into the dataset, and human annotations refining what the evaluator considers "good." Without those loops, your dataset goes stale in roughly six weeks and your evaluators drift away from real user behavior. Build the loops on day one.

Stage 1: Instrument — Spans, Not Print Statements

You cannot evaluate what you cannot see. The first thing to ship is span-level tracing wrapping every LLM call, every tool call, and every retrieval. The OpenTelemetry-flavored model that LangSmith, Arize, and Langfuse all converge on uses runs (or "traces") composed of nested spans, each tagged with inputs, outputs, latency, token counts, and arbitrary metadata.

Here is the smallest possible LangSmith instrumentation that gives you a usable trace tree:

import os
from langsmith import traceable
from openai import OpenAI

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "callsphere-agent-prod"

client = OpenAI()

@traceable(run_type="tool", name="lookup_account")
def lookup_account(account_id: str) -> dict:
    # ... real DB call ...
    return {"id": account_id, "tier": "growth", "minutes_used": 4823}

@traceable(run_type="llm", name="reasoner")
def reason(messages: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=[{"type": "function", "function": {"name": "lookup_account"}}],
    )
    return resp.choices[0].message.content

@traceable(run_type="chain", name="support_agent")
def support_agent(user_query: str, account_id: str) -> str:
    account = lookup_account(account_id)
    return reason([
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": f"Account: {account}. Query: {user_query}"},
    ])

Three things to notice. First, @traceable nests automatically — the support_agent run becomes the parent, lookup_account and reason become children, and you get a tree view in the LangSmith UI for free. Second, every span carries inputs/outputs you'll later use as evaluator input. Third, run_type matters: it's how filters in datasets and online evals select which spans to score. Tag aggressively — tool, llm, chain, retriever, parser — because you'll thank yourself the first time you need to evaluate just the retrieval step in isolation.


Stage 2: Trace — Capture the Full Trajectory

A trace is more than a log. It's a structured object you'll later replay, score, edit, and clone. Best practice is to capture five things on every parent run: the user-facing input, the final output, the full message history (every intermediate LLM call), every tool I/O, and metadata like user_id, session_id, model_version, and feature flags. The metadata is what lets you slice the dataset later — e.g., "show me all traces from users on the new prompt where tool_calls > 3 and final latency > 4s."

For agent eval specifically, you want trajectory replay-ability. That means deterministic seeds where possible, hashed prompts so you can detect when a system prompt mutated mid-session, and tool stubs so a unit-test rerun doesn't actually charge a customer's credit card. Most teams underinvest here and pay for it later when they can't reproduce a failure.
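
Both habits fit in a few lines. Below is a minimal sketch: run metadata attached at call time via the langsmith_extra hook on traceable functions, a hashed system prompt, and a tool stub behind an environment flag. The charge_customer tool, the AGENT_EVAL_STUB_TOOLS flag, and the metadata keys are illustrative rather than SDK features; support_agent is the function from Stage 1.

import hashlib
import os

from langsmith import traceable

# Illustrative flag: flip it in CI so replays never touch real side-effecting systems
STUB_TOOLS = os.getenv("AGENT_EVAL_STUB_TOOLS") == "1"

SYSTEM_PROMPT = "You are a support agent."
PROMPT_HASH = hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest()[:12]

@traceable(run_type="tool", name="charge_customer")
def charge_customer(account_id: str, amount_cents: int) -> dict:
    if STUB_TOOLS:
        # Replay mode: canned response instead of a real billing call
        return {"status": "stubbed", "account_id": account_id, "amount_cents": amount_cents}
    raise NotImplementedError("wire up the real billing client here")

# Attach slicing metadata to the parent run at call time
answer = support_agent(
    "Why was I billed twice?",
    "acct_881",
    langsmith_extra={
        "metadata": {
            "user_id": "u_42",
            "session_id": "sess_9d1",
            "model_version": "gpt-4o-2024-08-06",
            "prompt_hash": PROMPT_HASH,
        }
    },
)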

Stage 3: Dataset — Curate, Don't Hoard

Production traffic is a fire hose. A dataset is a curated subset that represents the distribution you actually care about. The mistake I see most often is teams dumping 50,000 random traces into a "dataset" and calling it done. That's not a dataset, that's a backup. A real eval dataset is balanced, labeled, and small enough to rerun in under 10 minutes. Aim for 200-800 examples on launch, growing to 2-5k for mature systems.

How to build it:

| Source | What it gives you | Watch out for |
| --- | --- | --- |
| Curated production traces | Real user distribution | Privacy/PII leakage |
| Hand-written edge cases | Coverage of rare failure modes | Drift from real usage |
| Synthetic generation | Cheap volume | Generator bias |
| Adversarial / red-team | Safety + jailbreak coverage | Over-indexing on theater |
| Human annotations | Ground truth labels | Annotator disagreement |

In LangSmith, a Dataset is a first-class object you can grow over time. The pattern that works: a daily cron pulls last-24h traces, samples by stratified slice (intent type, user tier, latency bucket), routes the sample to an annotation queue for human labeling, and merges approved examples into the dataset. The Datasets and Annotation Queues primitives in LangSmith are designed for exactly this loop.

import { Client } from "langsmith";

const ls = new Client();

// Create the canonical dataset
const dataset = await ls.createDataset("support-agent-eval-v3", {
  description: "Curated support traces, stratified by intent",
});

// Add examples sourced from production traces
await ls.createExamples({
  inputs: [
    { user_query: "Why was my call dropped at minute 7?", account_id: "acct_881" },
    { user_query: "How do I export last month's transcripts?", account_id: "acct_204" },
  ],
  outputs: [
    { expected_intent: "diagnose_dropped_call", must_call_tool: "fetch_call_log" },
    { expected_intent: "export_transcripts", must_call_tool: "create_export_job" },
  ],
  datasetId: dataset.id,
});

Notice the outputs aren't full reference answers — they're partial constraints: which intent the agent must classify, which tool it must call. This is a key trick for agent eval. Full reference answers don't exist for most agent outputs, but you can almost always state structural constraints. Evaluators score against the constraints, not against a golden string.
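
Closing the loop from the table above, the daily curation job that feeds this dataset can be sketched as follows. It assumes the annotation-queue helpers in the LangSmith Python SDK (create_annotation_queue, add_runs_to_annotation_queue) and an intent key written into run metadata at trace time; the project name and sample sizes are illustrative.

import random
from collections import defaultdict
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

# Pull the last 24h of parent runs from the production project
runs = list(client.list_runs(
    project_name="callsphere-agent-prod",
    run_type="chain",
    start_time=datetime.now(timezone.utc) - timedelta(hours=24),
))

# Stratify by intent (assumes intent was written into run metadata at trace time)
by_intent = defaultdict(list)
for run in runs:
    intent = (run.extra or {}).get("metadata", {}).get("intent", "unknown")
    by_intent[intent].append(run)

# Sample up to 10 runs per intent so rare intents aren't drowned out
sampled = [run for group in by_intent.values() for run in random.sample(group, min(10, len(group)))]

# Route the sample to an annotation queue for human labeling;
# approved examples get merged into the dataset from the LangSmith UI
queue = client.create_annotation_queue(name=f"daily-triage-{datetime.now():%Y-%m-%d}")
client.add_runs_to_annotation_queue(queue.id, run_ids=[run.id for run in sampled])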

Stage 4: Evaluator — Pick the Right Tool for the Score

Evaluators in 2026 fall into four families. You will use all four; the question is in what ratio.

  1. Heuristic / rule-based — regex, JSON-schema validation, "did the agent call tool X." Fast, cheap, brittle. Best for hard structural checks.
  2. Reference-based — compare to a golden answer (exact match, embedding similarity, ROUGE). Useful only where the answer space is narrow.
  3. LLM-as-judge (single) — ask GPT-4o or Claude to score a single output on a rubric. Cheap to run at scale, but absolute scores are noisy. Calibration drifts between model versions.
  4. LLM-as-judge (pairwise) — show the judge both A and B, ask which is better. Vastly more reliable than absolute scoring (covered in depth in the pairwise post).

A production evaluator suite looks like: 30% heuristic (cheap gates), 10% reference-based (only where applicable), 40% pairwise LLM-as-judge (the workhorse), 20% human review (for the long tail and as judge-calibration ground truth).

Here's a runnable LangSmith evaluator that combines a heuristic check with an LLM-as-judge:

import json

from langsmith.evaluation import evaluate
from openai import OpenAI

oai = OpenAI()

def heuristic_called_required_tool(run, example) -> dict:
    """Did the agent invoke the tool we expected?"""
    required = example.outputs.get("must_call_tool")
    tool_calls = [
        s.name for s in run.child_runs or [] if s.run_type == "tool"
    ]
    return {
        "key": "called_required_tool",
        "score": 1 if required in tool_calls else 0,
    }

def llm_judge_helpfulness(run, example) -> dict:
    """LLM-as-judge: rate helpfulness 1-5."""
    # evaluate() stores a non-dict return value from the target under the "output" key
    prompt = f"""Rate the agent reply on helpfulness (1-5).
User asked: {example.inputs['user_query']}
Agent replied: {run.outputs['output']}
Return JSON: {{"score": int, "reason": str}}"""
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    parsed = json.loads(resp.choices[0].message.content)
    return {"key": "helpfulness", "score": parsed["score"] / 5.0, "comment": parsed["reason"]}

# Run the experiment
results = evaluate(
    lambda inputs: support_agent(inputs["user_query"], inputs["account_id"]),
    data="support-agent-eval-v3",
    evaluators=[heuristic_called_required_tool, llm_judge_helpfulness],
    experiment_prefix="prompt-v17",
    max_concurrency=8,
)

The evaluate function is doing real work: it pulls every example from the dataset, runs your agent function, runs each evaluator against the resulting run, and posts everything as a LangSmith Experiment you can diff against the previous experiment in the UI. That diff view — side-by-side scores between v16 and v17 — is the unit of progress for an agent team.

Stage 5: Score — Aggregate, Slice, and Don't Lie to Yourself

Raw scores are not insight. You need at minimum: per-evaluator means with confidence intervals, score distributions (not just averages; a bimodal distribution hiding behind a mean of 0.7 is a real failure mode), and slice analysis (scores broken down by intent, model version, user tier, etc.). LangSmith experiments give you most of this out of the box, but the discipline is human: do not declare victory on a 2-point average improvement when your confidence interval is wider than the delta.


A simple rule I use: a candidate agent must beat the incumbent by at least 2x the standard error on the primary metric, on the dataset slice that matters most, before I ship. Anything less is noise.
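
That rule is easy to automate. A sketch in plain Python, with hypothetical per-intent score lists standing in for two experiments' results:

from statistics import mean, stdev

def slice_summary(scores_by_slice: dict[str, list[float]]) -> dict[str, tuple[float, float]]:
    """Per-slice mean and standard error of the mean."""
    return {
        name: (mean(scores), stdev(scores) / len(scores) ** 0.5)
        for name, scores in scores_by_slice.items()
        if len(scores) > 1
    }

def should_ship(candidate: list[float], incumbent: list[float]) -> bool:
    """Ship only if the candidate beats the incumbent by at least 2x the standard error."""
    se = stdev(candidate) / len(candidate) ** 0.5
    return mean(candidate) - mean(incumbent) >= 2 * se

# Hypothetical helpfulness scores per intent slice, from two experiments
incumbent = {"diagnose_dropped_call": [0.72, 0.68, 0.75, 0.70], "export_transcripts": [0.81, 0.79, 0.84, 0.80]}
candidate = {"diagnose_dropped_call": [0.78, 0.74, 0.80, 0.77], "export_transcripts": [0.82, 0.80, 0.85, 0.81]}

for intent in incumbent:
    print(intent, slice_summary(candidate)[intent], "ship:", should_ship(candidate[intent], incumbent[intent]))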

Stage 6: CI Gate — Eval as Code Review

The final stage is what turns evaluation from a research activity into a deploy gate. The pattern: every PR that touches prompts, tools, or model selection runs the eval suite automatically; the experiment is posted as a PR comment with deltas; merging is blocked if any regression-blocking metric drops below the prior baseline.

graph TD
  A[Developer opens PR] --> B[CI runs evaluate]
  B --> C[Experiment posted to LangSmith]
  C --> D{Compare to baseline}
  D -->|primary metric down >2σ| E[Block merge]
  D -->|secondary down| F[Warn, require approval]
  D -->|all green| G[Auto-allow merge]
  E --> H[Developer iterates]
  H --> A
  G --> I[Deploy to canary]
  I --> J[Online evals on live traffic]
  J --> K{Drift detected?}
  K -->|yes| L[Auto-rollback]
  K -->|no| M[Promote to prod]

The handoff between offline (CI) and online (production) eval is critical. Offline eval is fast, deterministic, and small-scale. Online eval runs lighter-weight evaluators — usually heuristics + a sampled LLM judge — on live production traces, catching distribution shift the offline dataset can't. LangSmith's online eval feature lets you attach evaluators directly to a project so every production run gets scored without redeploying. That live score stream is what feeds rollback automation.
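
Here is a sketch of the offline half as a single script a CI job can run, reusing the Stage 4 target and evaluators. The support-agent-smoke dataset, the eval/baseline.json file, and the results.to_pandas() call (available in recent SDK versions) are assumptions to adapt to your setup:

import json
import sys

from langsmith.evaluation import evaluate

# Run the smoke suite against the candidate agent, using the Stage 4 evaluators
results = evaluate(
    lambda inputs: support_agent(inputs["user_query"], inputs["account_id"]),
    data="support-agent-smoke",  # hypothetical 50-example smoke dataset
    evaluators=[heuristic_called_required_tool, llm_judge_helpfulness],
    experiment_prefix="ci",
    max_concurrency=8,
)

df = results.to_pandas()  # assumes the to_pandas() helper on experiment results
scores = df["feedback.helpfulness"].dropna()  # feedback columns are prefixed "feedback."
candidate_mean = scores.mean()
se = scores.std(ddof=1) / len(scores) ** 0.5

with open("eval/baseline.json") as f:  # hypothetical checked-in baseline from the last green run
    baseline = json.load(f)["helpfulness_mean"]

if candidate_mean < baseline - 2 * se:
    print(f"BLOCK: helpfulness {candidate_mean:.3f} regressed vs baseline {baseline:.3f}")
    sys.exit(1)  # non-zero exit blocks the merge in CI
print(f"PASS: helpfulness {candidate_mean:.3f} vs baseline {baseline:.3f}")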

Tradeoffs You Should Make Consciously

Every team gets hit by the same three tradeoff axes:

  • Cost vs coverage. A pairwise LLM-as-judge run on 800 examples with GPT-4o costs roughly $4-8 per experiment. Run 20 experiments a week and you're at ~$500/month — trivial. Run a 5k-example dataset with multi-judge ensembles and you're at $5k/month — material. Right-size by deciding which evaluators run on every PR vs nightly vs weekly.
  • Latency vs depth. Agent traces are slow to replay. A 5-step agent on 800 examples can take 25-40 minutes wall-clock. CI patience runs out around 10 minutes. Solution: a small "smoke" dataset (50 examples) for every PR, the full suite nightly.
  • Human vs LLM judging. Humans are the source of truth, LLMs are the scaling mechanism. Calibrate the LLM judge against ~100 human-labeled examples and recheck the agreement rate (Cohen's kappa) monthly. If kappa drops below 0.6, the LLM judge is drifting and needs prompt refinement or a model swap.
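
A quick way to run that monthly check, in plain Python over matched label lists (the example labels are illustrative):

from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between human and LLM-judge labels, corrected for chance agreement."""
    assert len(human) == len(judge)
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(
        (human_freq[label] / n) * (judge_freq[label] / n)
        for label in set(human) | set(judge)
    )
    return (observed - expected) / (1 - expected)

# ~100 calibration examples labeled by both humans and the LLM judge (illustrative data)
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
if kappa < 0.6:
    print(f"Judge drifting (kappa={kappa:.2f}): refine the rubric prompt or swap the judge model")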

How CallSphere Uses This Stack

This isn't theoretical. Across our voice and chat agents — healthcare intake, real-estate qualification, after-hours escalation, IT helpdesk — we run the exact six-stage stack described above. Every production call emits a LangSmith trace. Each vertical has its own curated dataset of 400-1,200 examples. Pull requests touching agent prompts gate on a 12-evaluator suite, with pairwise LLM-as-judge as the primary metric. Online evals run on 100% of production calls for safety-critical evaluators (PII leakage, escalation correctness) and on a 5% sample for everything else. The result: weeks-long regressions that used to ship undetected now get caught in CI before merge.

FAQ

Q: Do I need LangSmith specifically, or will Langfuse / Arize / Braintrust work? All four implement the same conceptual stack. LangSmith has the deepest integration with LangChain/LangGraph and the most mature pairwise eval UX. Langfuse is open-source and self-hostable. Arize Phoenix is strong on production drift detection. Braintrust has the slickest experiment-diff UI. Pick based on your stack and self-host requirements; the six stages don't change.

Q: How big should my eval dataset be? Start at 200, target 800 within 90 days, cap at 2-5k unless you're at GPT-4-class scale. Beyond 5k, signal stops improving and you're paying compute for redundancy. Quality of curation beats quantity every time.

Q: How often should I rerun the full eval suite? Smoke suite on every PR (50-100 examples, under 5 minutes). Full suite nightly (full dataset, 20-40 minutes). Cross-model bake-off weekly. Human re-annotation of a 100-example calibration set monthly.

Q: What's the single biggest mistake teams make? Optimizing for the average eval score instead of the worst-case slice. A 0.85 mean with a 0.40 floor on safety-critical intents is a worse system than a 0.78 mean with a 0.74 floor. Always look at the slice distribution.

Q: How do I evaluate agents without ground truth? This is exactly what pairwise LLM-as-judge solves — you don't need a golden answer, you need two candidate outputs and a rubric. See the companion post on pairwise vs reference-based scoring.

What To Build First

If you have nothing today, build in this order: (1) tracing, (2) a 200-example dataset from real traces, (3) one heuristic + one LLM-judge evaluator, (4) the experiment-diff view, (5) the CI gate, (6) online evals. Skip steps and you'll backfill them anyway, more painfully. The full stack is mechanical engineering once you accept that evaluation is a product surface, not a research activity.
