By Sagar Shankaran, Founder of CallSphere
How the modern agent eval stack actually flows: instrument, trace, dataset, evaluator, score, CI gate. The full pipeline that keeps agents from regressing.
Key takeaways
The agent evaluation stack in 2026 is a six-stage pipeline: instrument → trace → dataset → evaluator → score → CI gate. Skip a stage and you ship regressions. I've watched teams burn entire quarters chasing eval theater — colorful dashboards, no signal — because they treated evaluation like a one-time vibe check instead of an always-on production loop. The reference implementation most teams converge on uses LangSmith for tracing, datasets, and evaluators, with pairwise LLM-as-judge wired into pull-request CI. This post walks the entire flow, including code you can paste, two mermaid diagrams of the data path, and the honest tradeoffs between online and offline eval. If you only build one part of this stack first, build the dataset of curated traces — everything else is plumbing around that asset.
When people first encounter LLM evaluation, they think of MMLU, HumanEval, GSM8K — academic benchmarks where there is a known answer and you compute accuracy. Agent evaluation is almost the opposite. The "input" isn't a prompt; it's a user goal plus tool environment. The "output" isn't a token; it's a trajectory — a sequence of model decisions, tool calls, retrieval hits, and final responses, often spanning 10-30 LLM calls per session. There is no single ground truth, latency matters as much as correctness, and the same input can legitimately produce three different acceptable outputs.
That changes what you measure. Traditional NLP metrics (BLEU, ROUGE, exact-match) collapse on agents. You need trajectory-aware evaluators — graders that look at the whole trace, not just the last message. You need reference-free evaluators for the long tail where ground truth doesn't exist. And you need a continuous loop: production traces flow back into the dataset, the dataset is rerun against new agent versions, and the experiment results gate deploys. The end state is a stack, not a script.
I'll define the stack first, then walk every stage with code.
```mermaid
flowchart LR
    A[Production Agent] -->|emit spans| B[Tracing Layer]
    B --> C[(Trace Store)]
    C -->|curate examples| D[(Dataset)]
    D --> E[Experiment Runner]
    F[Candidate Agent vN+1] --> E
    E --> G[Evaluators]
    G --> H[(Eval Scores)]
    H --> I{CI Gate}
    I -->|pass| J[Deploy]
    I -->|fail| K[Block PR]
    A -->|online evals| G
    G -->|annotation queue| L[Human Reviewers]
    L --> D
```
The arrows that matter most are the two feedback loops: production traces flowing back into the dataset, and human annotations refining what the evaluator considers "good." Without those loops, your dataset goes stale in roughly six weeks and your evaluators drift away from real user behavior. Build the loops on day one.
You cannot evaluate what you cannot see. The first thing to ship is span-level tracing wrapping every LLM call, every tool call, and every retrieval. The OpenTelemetry-flavored model that LangSmith, Arize, and Langfuse all converge on uses runs (or "traces") composed of nested spans, each tagged with inputs, outputs, latency, token counts, and arbitrary metadata.
Here is the smallest possible LangSmith instrumentation that gives you a usable trace tree:
```python
import os

from langsmith import traceable, Client
from openai import OpenAI

# Tracing is configured through environment variables.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "callsphere-agent-prod"

client = OpenAI()
ls = Client()


@traceable(run_type="tool", name="lookup_account")
def lookup_account(account_id: str) -> dict:
    # ... real DB call ...
    return {"id": account_id, "tier": "growth", "minutes_used": 4823}


@traceable(run_type="llm", name="reasoner")
def reason(messages: list[dict]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=[{"type": "function", "function": {"name": "lookup_account"}}],
    )
    # content is None when the model answers with a tool call instead of text.
    return resp.choices[0].message.content or ""


@traceable(run_type="chain", name="support_agent")
def support_agent(user_query: str, account_id: str) -> str:
    account = lookup_account(account_id)
    return reason([
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": f"Account: {account}. Query: {user_query}"},
    ])
```
Three things to notice. First, @traceable nests automatically — the support_agent run becomes the parent, lookup_account and reason become children, and you get a tree view in the LangSmith UI for free. Second, every span carries inputs/outputs you'll later use as evaluator input. Third, run_type matters: it's how filters in datasets and online evals select which spans to score. Tag aggressively — tool, llm, chain, retriever, parser — because you'll thank yourself the first time you need to evaluate just the retrieval step in isolation.
A trace is more than a log. It's a structured object you'll later replay, score, edit, and clone. Best practice is to capture five things on every parent run: the user-facing input, the final output, the full message history (every intermediate LLM call), every tool I/O, and metadata like user_id, session_id, model_version, and feature flags. The metadata is what lets you slice the dataset later — e.g., "show me all traces from users on the new prompt where tool_calls > 3 and final latency > 4s."
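To make the slicing query above concrete, here is a minimal sketch in plain Python. It assumes you have already exported traces into a flat record; `TraceRecord`, `slice_traces`, and the `prompt_version` metadata key are hypothetical names for illustration, not LangSmith types or fields.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """Hypothetical flattened view of one parent run (not a LangSmith type)."""
    trace_id: str
    latency_s: float
    tool_calls: int
    metadata: dict = field(default_factory=dict)


def slice_traces(traces, *, prompt_version, min_tool_calls, min_latency_s):
    """Keep traces on one prompt version that exceed both thresholds."""
    return [
        t for t in traces
        if t.metadata.get("prompt_version") == prompt_version
        and t.tool_calls > min_tool_calls
        and t.latency_s > min_latency_s
    ]


traces = [
    TraceRecord("t1", 5.2, 4, {"prompt_version": "v17", "user_id": "u1"}),
    TraceRecord("t2", 2.1, 5, {"prompt_version": "v17", "user_id": "u2"}),
    TraceRecord("t3", 6.0, 4, {"prompt_version": "v16", "user_id": "u3"}),
    TraceRecord("t4", 4.5, 2, {"prompt_version": "v17", "user_id": "u4"}),
]

# "New prompt, tool_calls > 3, latency > 4s"
slow_tool_heavy = slice_traces(
    traces, prompt_version="v17", min_tool_calls=3, min_latency_s=4.0
)
print([t.trace_id for t in slow_tool_heavy])
```

The point of the sketch is that the query is only possible if the metadata was written at trace time; nothing here can be reconstructed after the fact.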
For agent eval specifically, you want trajectory replay-ability. That means deterministic seeds where possible, hashed prompts so you can detect when a system prompt mutated mid-session, and tool stubs so a unit-test rerun doesn't actually charge a customer's credit card. Most teams underinvest here and pay for it later when they can't reproduce a failure.
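Two of those replay ingredients, hashed prompts and tool stubs, fit in a few lines. This is a sketch under my own assumptions rather than a library feature: `prompt_fingerprint`, `prompt_mutated`, and `StubToolbox` are hypothetical helpers, and the stub assumes tool arguments are hashable keyword values.

```python
import hashlib


def prompt_fingerprint(system_prompt: str) -> str:
    """Short stable hash of a system prompt, meant to be stored in span metadata."""
    normalized = system_prompt.strip().encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()[:12]


def prompt_mutated(span_fingerprints: list[str]) -> bool:
    """True if more than one distinct prompt hash appears within a session."""
    return len(set(span_fingerprints)) > 1


class StubToolbox:
    """Replays recorded tool outputs instead of calling real services."""

    def __init__(self, recorded: dict):
        # Keys look like (tool_name, sorted tuple of (arg_name, arg_value)).
        self._recorded = recorded
        self.calls = []

    def call(self, tool_name: str, **kwargs):
        key = (tool_name, tuple(sorted(kwargs.items())))
        self.calls.append(key)
        if key not in self._recorded:
            # Fail loudly: an unrecorded call means the trajectory diverged.
            raise KeyError(f"no recorded output for {tool_name} with {kwargs}")
        return self._recorded[key]


recorded = {
    ("lookup_account", (("account_id", "acct_881"),)): {"id": "acct_881", "tier": "growth"},
}
tools = StubToolbox(recorded)
result = tools.call("lookup_account", account_id="acct_881")
print(result, prompt_fingerprint("You are a support agent."))
```

Raising on an unrecorded call is a deliberate choice in this sketch: a rerun that reaches for a tool the original trace never used is itself a signal worth surfacing.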
Production traffic is a fire hose. A dataset is a curated subset that represents the distribution you actually care about. The mistake I see most often is teams dumping 50,000 random traces into a "dataset" and calling it done. That's not a dataset, that's a backup. A real eval dataset is balanced, labeled, and small enough to rerun in under 10 minutes. Aim for 200-800 examples on launch, growing to 2-5k for mature systems.
How to build it:
| Source | What it gives you | Watch out for |
|---|---|---|
| Curated production traces | Real user distribution | Privacy/PII leakage |
| Hand-written edge cases | Coverage of rare failure modes | Drift from real usage |
| Synthetic generation | Cheap volume | Generator bias |
| Adversarial / red-team | Safety + jailbreak coverage | Over-indexing on theater |
| Human annotations | Ground truth labels | Annotator disagreement |
In LangSmith, a Dataset is a first-class object you can grow over time. The pattern that works: a daily cron pulls last-24h traces, samples by stratified slice (intent type, user tier, latency bucket), routes the sample to an annotation queue for human label, and merges approved examples into the dataset. The Datasets and Annotation Queues primitives in LangSmith are designed for exactly this loop.
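The stratified-sampling step of that daily job can be sketched without any SDK. The function names, the latency thresholds, and the trace dictionary shape below are assumptions for illustration; fetching traces and pushing the sample to an annotation queue are left out.

```python
import random
from collections import defaultdict


def latency_bucket(latency_s: float) -> str:
    """Hypothetical thresholds; tune them to your own latency profile."""
    if latency_s < 2.0:
        return "fast"
    if latency_s < 4.0:
        return "medium"
    return "slow"


def stratum_key(trace: dict) -> tuple:
    return (trace["intent"], trace["tier"], latency_bucket(trace["latency_s"]))


def stratified_sample(traces, key_fn, per_stratum, seed=0):
    """Sample up to per_stratum traces from each stratum defined by key_fn."""
    rng = random.Random(seed)  # fixed seed keeps the daily job reproducible
    buckets = defaultdict(list)
    for trace in traces:
        buckets[key_fn(trace)].append(trace)
    sample = []
    for key in sorted(buckets):  # sorted so output order is deterministic
        bucket = buckets[key]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


# Skewed toy traffic: mostly billing, a few dropped calls, one export.
traces = [
    {"id": i, "intent": intent, "tier": tier, "latency_s": 1.0 + (i % 5)}
    for i, (intent, tier) in enumerate(
        [("billing", "growth")] * 6
        + [("dropped_call", "growth")] * 3
        + [("export", "starter")] * 1
    )
]

queue_candidates = stratified_sample(traces, stratum_key, per_stratum=2, seed=7)
print(len(traces), "traces ->", len(queue_candidates), "annotation candidates")
```

Capping each stratum is what keeps the rare intents from being drowned out by the dominant one, which is the failure mode of uniform random sampling.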
```typescript
import { Client } from "langsmith";

const ls = new Client();

// Create the canonical dataset (if it already exists, fetch it with readDataset instead)
const dataset = await ls.createDataset("support-agent-eval-v3", {
  description: "Curated support traces, stratified by intent",
});

// Add examples sourced from production traces
await ls.createExamples({
  inputs: [
    { user_query: "Why was my call dropped at minute 7?", account_id: "acct_881" },
    { user_query: "How do I export last month's transcripts?", account_id: "acct_204" },
  ],
  outputs: [
    { expected_intent: "diagnose_dropped_call", must_call_tool: "fetch_call_log" },
    { expected_intent: "export_transcripts", must_call_tool: "create_export_job" },
  ],
  datasetId: dataset.id,
});
```
Notice the outputs aren't full reference answers — they're partial constraints: which intent the agent must classify, which tool it must call. This is a key trick for agent eval. Full reference answers don't exist for most agent outputs, but you can almost always state structural constraints. Evaluators score against the constraints, not against a golden string.
Evaluators in 2026 fall into four families: heuristic checks, reference-based metrics, LLM-as-judge (usually pairwise), and human review. You will use all four; the question is in what ratio.
A production evaluator suite looks like: 30% heuristic (cheap gates), 10% reference-based (only where applicable), 40% pairwise LLM-as-judge (the workhorse), 20% human review (for the long tail and as judge-calibration ground truth).
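Since pairwise LLM-as-judge carries the largest share of that mix, here is a minimal sketch of the comparison logic. It assumes the judge is any callable that returns "first", "second", or "tie"; the two toy judges stand in for a real LLM call, and asking twice with the order swapped is one common way to dampen position bias, not a LangSmith feature.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer)
# and returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(question: str, baseline: str, candidate: str, judge: Judge) -> float:
    """Score the candidate against the baseline: 1.0 win, 0.5 tie, 0.0 loss.

    The judge is asked twice with the answer order swapped. A win or loss
    counts only if both orderings agree; anything else is treated as a tie.
    """
    first_pass = judge(question, baseline, candidate)   # candidate shown second
    second_pass = judge(question, candidate, baseline)  # candidate shown first
    if first_pass == "second" and second_pass == "first":
        return 1.0
    if first_pass == "first" and second_pass == "second":
        return 0.0
    return 0.5


def win_rate(pairs, judge: Judge) -> float:
    """Mean pairwise score over (question, baseline, candidate) triples."""
    scores = [pairwise_verdict(q, b, c, judge) for q, b, c in pairs]
    return sum(scores) / len(scores)


def longer_answer_judge(question: str, first: str, second: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def always_first_judge(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a longer baseline answer", "brief"),
    ("q3", "same", "same"),
]
print(win_rate(pairs, longer_answer_judge), win_rate(pairs, always_first_judge))
```

The position-biased judge lands on exactly 0.5 because every one of its verdicts is inconsistent across orderings, which is the behavior you want from the swap: bias collapses to "no signal" instead of a fake win.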
Here's a runnable LangSmith evaluator that combines a heuristic check with an LLM-as-judge:
```python
import json

from langsmith import Client
from langsmith.evaluation import evaluate
from openai import OpenAI

client = Client()
oai = OpenAI()


def _tool_names(run) -> list[str]:
    """Collect tool-span names from the whole run tree.

    The tool span can sit below the root run (root -> support_agent -> tool),
    so direct children alone are not enough. Assumes child_runs is populated.
    """
    names = []
    for child in run.child_runs or []:
        if child.run_type == "tool":
            names.append(child.name)
        names.extend(_tool_names(child))
    return names


def heuristic_called_required_tool(run, example) -> dict:
    """Did the agent invoke the tool we expected?"""
    required = example.outputs.get("must_call_tool")
    return {
        "key": "called_required_tool",
        "score": 1 if required in _tool_names(run) else 0,
    }


def llm_judge_helpfulness(run, example) -> dict:
    """LLM-as-judge: rate helpfulness 1-5."""
    prompt = f"""Rate the agent reply on helpfulness (1-5).
User asked: {example.inputs['user_query']}
Agent replied: {run.outputs['final_answer']}
Return JSON: {{"score": int, "reason": str}}"""
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    parsed = json.loads(resp.choices[0].message.content)
    # Normalize the 1-5 rating to 0-1 so it is comparable with other scores.
    return {"key": "helpfulness", "score": parsed["score"] / 5.0, "comment": parsed["reason"]}


def target(inputs: dict) -> dict:
    # support_agent is the traced function from the instrumentation example.
    # Returning a dict keyed "final_answer" is what the judge reads from run.outputs.
    answer = support_agent(inputs["user_query"], inputs["account_id"])
    return {"final_answer": answer}


# Run the experiment
results = evaluate(
    target,
    data="support-agent-eval-v3",
    evaluators=[heuristic_called_required_tool, llm_judge_helpfulness],
    experiment_prefix="prompt-v17",
    max_concurrency=8,
)
```
The evaluate function is doing real work: it pulls every example from the dataset, runs your agent function, runs each evaluator against the resulting run, and posts everything as a LangSmith Experiment you can diff against the previous experiment in the UI. That diff view — side-by-side scores between v16 and v17 — is the unit of progress for an agent team.
Raw scores are not insight. You need at minimum: per-evaluator means with confidence intervals, score distributions (not just averages, because a bimodal distribution hiding behind a mean of 0.7 is a real failure mode), and slice analysis (scores broken down by intent, model version, user tier, etc.). LangSmith experiments give you most of this out of the box, but the discipline is human: do not declare victory on a 2-point average improvement when your confidence interval is wider than the delta.
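If you want to see what that looks like outside the UI, here is a rough sketch of a per-slice summary. It assumes scores have been exported as rows with a slice label; `slice_summary` is a hypothetical helper, and the 95% interval uses a normal approximation, which is only a rough guide on small slices.

```python
import math
import statistics
from collections import defaultdict


def slice_summary(rows, slice_key):
    """Per-slice n, mean, standard error, approximate 95% interval, and floor."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[slice_key]].append(row["score"])
    summary = {}
    for name, scores in groups.items():
        n = len(scores)
        mean = statistics.fmean(scores)
        # Standard error needs at least two points; report infinity otherwise.
        se = statistics.stdev(scores) / math.sqrt(n) if n > 1 else math.inf
        summary[name] = {
            "n": n,
            "mean": mean,
            "se": se,
            "ci95": (mean - 1.96 * se, mean + 1.96 * se),
            "min": min(scores),
        }
    return summary


rows = [
    {"intent": "billing", "score": 0.8},
    {"intent": "billing", "score": 0.9},
    {"intent": "billing", "score": 1.0},
    {"intent": "billing", "score": 0.9},
    # Bimodal slice: half the runs fail badly, half succeed.
    {"intent": "escalation", "score": 0.2},
    {"intent": "escalation", "score": 0.9},
    {"intent": "escalation", "score": 0.3},
    {"intent": "escalation", "score": 1.0},
]

summary = slice_summary(rows, "intent")
for name, stats in summary.items():
    low, high = stats["ci95"]
    print(f"{name}: mean={stats['mean']:.2f} ci95=({low:.2f}, {high:.2f}) min={stats['min']:.2f}")
```

Reporting the minimum alongside the mean is a cheap way to spot the bimodal case: the escalation slice here has a tolerable mean and a floor that should stop a release.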
A simple rule I use: a candidate agent must beat the incumbent by at least 2x the standard error on the primary metric, on the dataset slice that matters most, before I ship. Anything less is noise.
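One way to turn that rule into code, as a sketch: because both agents run on the same dataset, I am assuming here that "standard error" means the standard error of the paired per-example differences. That interpretation, and the helper name `beats_incumbent`, are assumptions, not something the tooling prescribes.

```python
import math
import statistics


def beats_incumbent(incumbent_scores, candidate_scores, k=2.0):
    """True if the mean paired improvement exceeds k standard errors."""
    if len(incumbent_scores) != len(candidate_scores):
        raise ValueError("scores must be paired per dataset example")
    if len(incumbent_scores) < 2:
        raise ValueError("need at least two paired examples")
    diffs = [c - i for i, c in zip(incumbent_scores, candidate_scores)]
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff > k * se


incumbent = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
steady_gain = [0.7, 0.8, 0.65, 0.9, 0.7, 0.8, 0.6, 0.75]  # small, consistent lift
noisy_gain = [0.9, 0.4, 0.8, 0.5, 0.9, 0.4, 0.8, 0.5]     # higher mean, huge variance

print(beats_incumbent(incumbent, steady_gain))
print(beats_incumbent(incumbent, noisy_gain))
```

The noisy candidate has a higher average than the incumbent and still fails the check, which is the whole reason to compare against the standard error instead of the raw delta.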
The final stage is what turns evaluation from a research activity into a deploy gate. The pattern: every PR that touches prompts, tools, or model selection runs the eval suite automatically; the experiment is posted as a PR comment with deltas; merging is blocked if any regression-blocking metric drops below the prior baseline.
```mermaid
graph TD
    A[Developer opens PR] --> B[CI runs evaluate]
    B --> C[Experiment posted to LangSmith]
    C --> D{Compare to baseline}
    D -->|primary metric down >2σ| E[Block merge]
    D -->|secondary down| F[Warn, require approval]
    D -->|all green| G[Auto-allow merge]
    E --> H[Developer iterates]
    H --> A
    G --> I[Deploy to canary]
    I --> J[Online evals on live traffic]
    J --> K{Drift detected?}
    K -->|yes| L[Auto-rollback]
    K -->|no| M[Promote to prod]
```
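The compare-to-baseline step can be sketched as a small function that a CI job calls after the experiment finishes. Everything here is illustrative: `gate`, the metric names, and the idea of passing aggregated metrics as plain dictionaries are my assumptions, and this version uses a simple tolerance instead of the 2σ threshold shown in the diagram.

```python
def gate(baseline: dict, candidate: dict, blocking: set, tolerance: float = 0.0):
    """Compare candidate metrics to the baseline.

    Returns (exit_code, report_lines). A regression on a blocking metric
    sets exit_code to 1; a regression on any other metric only warns.
    """
    exit_code = 0
    report = []
    for metric, base_value in sorted(baseline.items()):
        cand_value = candidate.get(metric)
        if cand_value is None:
            # A metric that disappeared is treated like a regression.
            regressed, detail = True, f"{metric}: {base_value:.3f} -> missing"
        else:
            delta = cand_value - base_value
            regressed = delta < -tolerance
            detail = f"{metric}: {base_value:.3f} -> {cand_value:.3f} ({delta:+.3f})"
        if regressed and metric in blocking:
            status = "BLOCK"
            exit_code = 1
        elif regressed:
            status = "WARN"
        else:
            status = "OK"
        report.append(f"{detail} {status}")
    return exit_code, report


baseline = {"helpfulness": 0.82, "called_required_tool": 0.95, "latency_ok": 0.90}
candidate = {"helpfulness": 0.84, "called_required_tool": 0.91, "latency_ok": 0.88}

exit_code, report = gate(baseline, candidate, blocking={"called_required_tool"})
print("\n".join(report))
print("exit code:", exit_code)
```

In a real pipeline the script would pass `exit_code` to `sys.exit` so the CI provider marks the check as failed, and the report lines would become the PR comment.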
The handoff between offline (CI) and online (production) eval is critical. Offline eval is fast, deterministic, and small-scale. Online eval runs lighter-weight evaluators — usually heuristics + a sampled LLM judge — on live production traces, catching distribution shift the offline dataset can't. LangSmith's online eval feature lets you attach evaluators directly to a project so every production run gets scored without redeploying. That live score stream is what feeds rollback automation.
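What "feeds rollback automation" might look like in its simplest form: a rolling mean over the live score stream compared against the offline baseline. This is a sketch with assumed names and thresholds (`DriftMonitor`, `max_drop`, the window sizes); a production version would likely track several metrics and use a proper statistical test.

```python
from collections import deque


class DriftMonitor:
    """Rolling-window mean of online eval scores versus an offline baseline."""

    def __init__(self, baseline_mean: float, max_drop: float, window: int = 200, min_samples: int = 50):
        self.baseline_mean = baseline_mean
        self.max_drop = max_drop
        self.min_samples = min_samples
        self._scores = deque(maxlen=window)  # old scores fall off automatically

    def observe(self, score: float) -> bool:
        """Record one score; return True when a rollback should be triggered."""
        self._scores.append(score)
        if len(self._scores) < self.min_samples:
            return False  # not enough evidence yet
        rolling_mean = sum(self._scores) / len(self._scores)
        return rolling_mean < self.baseline_mean - self.max_drop


monitor = DriftMonitor(baseline_mean=0.85, max_drop=0.10, window=20, min_samples=10)

# Healthy canary traffic, then a sudden quality drop.
healthy = [monitor.observe(0.86) for _ in range(10)]
degraded = [monitor.observe(0.40) for _ in range(4)]
print(healthy, degraded)
```

The `min_samples` guard matters as much as the threshold: without it, a single bad call right after deploy would roll back a perfectly good release.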
Every team gets hit by the same three tradeoff axes:
This isn't theoretical. Across our voice and chat agents — healthcare intake, real-estate qualification, after-hours escalation, IT helpdesk — we run the exact six-stage stack described above. Every production call emits a LangSmith trace. Each vertical has its own curated dataset of 400-1,200 examples. Pull requests touching agent prompts gate on a 12-evaluator suite, with pairwise LLM-as-judge as the primary metric. Online evals run on 100% of production calls for safety-critical evaluators (PII leakage, escalation correctness) and on a 5% sample for everything else. The result: weeks-long regressions that used to ship undetected now get caught in CI before merge.
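The 100% versus 5% split can be sketched with deterministic hash-based sampling, so the same trace always gets the same decision across retries and services. The evaluator names and helper functions below are illustrative assumptions, not our production code or a LangSmith API.

```python
import hashlib

# Evaluators that run on every production trace (names are illustrative).
SAFETY_CRITICAL = {"pii_leakage", "escalation_correctness"}
ALL_EVALUATORS = ["pii_leakage", "escalation_correctness", "helpfulness", "tone"]


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically map a trace ID into [0, 1) and compare to the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def evaluators_for(trace_id: str, all_evaluators, sample_rate: float = 0.05):
    """Safety-critical evaluators always run; the rest run on a sample."""
    sampled = in_sample(trace_id, sample_rate)
    return [name for name in all_evaluators if name in SAFETY_CRITICAL or sampled]


trace_ids = [f"trace-{i}" for i in range(10_000)]
observed_rate = sum(in_sample(t, 0.05) for t in trace_ids) / len(trace_ids)
print(f"observed sample rate: {observed_rate:.3f}")
print(evaluators_for("trace-1", ALL_EVALUATORS))
```

Hashing the trace ID instead of calling a random number generator is the design choice worth copying: it makes the sampling decision reproducible when you later ask why a given call was or was not scored.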
Q: Do I need LangSmith specifically, or will Langfuse / Arize / Braintrust work? All four implement the same conceptual stack. LangSmith has the deepest integration with LangChain/LangGraph and the most mature pairwise eval UX. Langfuse is open-source and self-hostable. Arize Phoenix is strong on production drift detection. Braintrust has the slickest experiment-diff UI. Pick based on your stack and self-host requirements; the six stages don't change.
Q: How big should my eval dataset be? Start at 200, target 800 within 90 days, cap at 2-5k unless you're at GPT-4-class scale. Beyond 5k, signal stops improving and you're paying compute for redundancy. Quality of curation beats quantity every time.
Q: How often should I rerun the full eval suite? Smoke suite on every PR (50-100 examples, under 5 minutes). Full suite nightly (full dataset, 20-40 minutes). Cross-model bake-off weekly. Human re-annotation of a 100-example calibration set monthly.
Q: What's the single biggest mistake teams make? Optimizing for the average eval score instead of the worst-case slice. A 0.85 mean with a 0.40 floor on safety-critical intents is a worse system than a 0.78 mean with a 0.74 floor. Always look at the slice distribution.
Q: How do I evaluate agents without ground truth? This is exactly what pairwise LLM-as-judge solves — you don't need a golden answer, you need two candidate outputs and a rubric. See the companion post on pairwise vs reference-based scoring.
If you have nothing today, build in this order: (1) tracing, (2) a 200-example dataset from real traces, (3) one heuristic + one LLM-judge evaluator, (4) the experiment-diff view, (5) the CI gate, (6) online evals. Skip steps and you'll backfill them anyway, more painfully. The full stack is mechanical engineering once you accept that evaluation is a product surface, not a research activity.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.