By Sagar Shankaran, Founder of CallSphere
Offline evals catch regressions before deploy on a fixed dataset. Online evals catch real-world drift on live traffic. You need both — here is how we run them.
Key takeaways
Offline and online evaluation are not redundant — they answer different questions. Offline evals run a frozen dataset through a candidate agent before you ship, gating deploys against regressions you can reproduce. Online evals sample live production traffic after you ship, catching drift, edge cases, and quality decay that no curated dataset will ever contain. Skip offline and you regress silently on the next prompt change. Skip online and you discover the regression from a Twitter screenshot. We run both, and we wire them into the same LangSmith project so the same evaluator code grades a pre-deploy run and a post-deploy live trace.
The most common mistake I see in agent eval setups: teams pick one of offline or online, declare victory, and ship. Either choice on its own leaves a gap that production will eventually find for you.
The mental model that works is the pre-deploy / post-deploy split: offline owns the gate before the change goes live, online owns the lens after it does. Both are continuous. Neither is optional.
In offline evaluation, you curate a dataset of inputs, sometimes with reference outputs and sometimes with reference behaviors, and run your agent against every row. An evaluator scores each output (correctness, faithfulness, tool-call accuracy, latency, cost). The result is an "experiment" you can compare against the previous experiment.
Offline is deterministic in setup, non-deterministic in output. You control the inputs; the agent's stochasticity controls the outputs. You re-run when something changes — prompt, model, tools, retrieval index, anything.
Typical signals:
In online evaluation, you attach evaluator rules to a sampled stream of production traces. As real conversations land, evaluators (LLM-as-judge, deterministic checks, or human review queues) score them in near real time and write feedback back onto the trace.
Online is deterministic in nothing. You don't control the inputs (real users), you don't control the outputs (real agent), and you usually can't replay the exact moment in time.
Typical signals:
```mermaid
flowchart LR
    A[Prompt / model / tool change] --> B[Offline experiment on frozen dataset]
    B --> C{Regression vs prev?}
    C -- No --> D[Promote to canary]
    C -- Yes --> E[Reject / iterate]
    D --> F[Live traffic with online evaluators]
    F --> G{Online metric drop?}
    G -- No --> H[Full rollout]
    G -- Yes --> I[Rollback + capture failing traces]
    I --> J[Add traces to dataset]
    J --> B
    H --> F
    style C fill:#fef3c7
    style G fill:#fee2e2
    style J fill:#dbeafe
```
Figure 1 — Offline gates the deploy; online gates the rollout; failing online traces get harvested back into the offline dataset. The loop closes.
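The two decision points in Figure 1 can be sketched as plain functions. This is a minimal illustration, not CallSphere's actual gate: the `Metrics` fields, the 0.02 tolerance, and the function names `offline_gate` and `online_gate` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Aggregate scores for one experiment or one window of live traffic."""
    correctness: float   # mean score in [0, 1]
    faithfulness: float  # mean score in [0, 1]


def regressed(candidate: Metrics, baseline: Metrics, tolerance: float = 0.02) -> bool:
    """True if any metric dropped by more than `tolerance` (an assumed threshold)."""
    return (
        baseline.correctness - candidate.correctness > tolerance
        or baseline.faithfulness - candidate.faithfulness > tolerance
    )


def offline_gate(candidate: Metrics, previous: Metrics) -> str:
    """Pre-deploy decision: compare the new experiment with the previous one."""
    return "reject" if regressed(candidate, previous) else "promote_to_canary"


def online_gate(live: Metrics, baseline: Metrics) -> str:
    """Post-deploy decision: compare canary traffic with the chosen baseline."""
    return "rollback_and_harvest" if regressed(live, baseline) else "full_rollout"


previous = Metrics(correctness=0.91, faithfulness=0.95)
candidate = Metrics(correctness=0.92, faithfulness=0.94)
print(offline_gate(candidate, previous))
```

A small tolerance keeps ordinary run-to-run noise from blocking every deploy; how large it should be depends on how noisy your evaluators are.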
A dataset frozen on day 0 is already stale on day 30. Real user inputs drift — new product launches, seasonal language, regional dialects, scams that didn't exist last quarter. If your only evaluator is an offline run against the day-0 dataset, every new failure mode in production is invisible to your CI.
Online evaluation is how that drift becomes visible. The pattern:
That last step, feeding failing traces back into the offline dataset, is the part most teams skip. It's also the only step that makes offline evals improve over time. Without it, your golden set ages out and your online evals become permanent firefighting.
Here is the canonical offline pattern. Pull a dataset, run your agent, score each output, persist the experiment.
```python
from langsmith import Client, evaluate

client = Client()

# 1. Define the target: your candidate agent.
def my_agent(inputs: dict) -> dict:
    from my_app import run_agent  # your application's own entry point
    return {"output": run_agent(inputs["question"])}

# 2. Define evaluators. Mix LLM-as-judge with deterministic checks.
def correctness_evaluator(run, example):
    pred = run.outputs["output"]
    ref = example.outputs["expected"]
    # Deterministic exact-match check; swap in an LLM-as-judge call for fuzzier grading.
    return {"key": "correctness", "score": int(pred.strip() == ref.strip())}

def latency_evaluator(run, example):
    ms = (run.end_time - run.start_time).total_seconds() * 1000
    return {"key": "latency_ms", "score": ms, "value": ms}

# 3. Run the experiment and persist it to LangSmith.
results = evaluate(
    my_agent,
    data="agent-golden-v3",  # dataset name in LangSmith
    evaluators=[correctness_evaluator, latency_evaluator],
    experiment_prefix="prompt-v1.7.0",
    max_concurrency=8,
)

print(f"Experiment: {results.experiment_name}")
# Use the comparison view in the LangSmith UI to diff against prompt-v1.6.4.
```
A few production lessons baked into that snippet:
- Set experiment_prefix to the version you're testing. Future you needs to know which run came from which prompt SHA.
- Keep max_concurrency honest. Real production rate limits apply during evals, so don't tune your eval parallelism to numbers that won't survive on canary.
- When moving from agent-golden to agent-golden-v2, append rows or fork to a new dataset.

Online evaluators are different in shape. Instead of running your agent, they attach to traces as they land. The simplest pattern is to write feedback directly:
```typescript
import { Client } from "langsmith";

// AgentTrace, llmJudge, and enqueueHumanReview are application-specific
// and assumed to be defined elsewhere in your codebase.
const client = new Client();

// Called from a background worker that pulls sampled production runs.
async function scoreLiveTrace(runId: string, trace: AgentTrace) {
  // 1. Run your evaluator. Could be an LLM judge, a regex check, anything.
  const judgement = await llmJudge({
    input: trace.input,
    output: trace.output,
    rubric: "Did the agent answer the user's actual question without fabricating policy?",
  });

  // 2. Write feedback back onto the run. Visible in the LangSmith UI.
  await client.createFeedback(runId, "faithfulness", {
    score: judgement.score, // 0..1
    value: judgement.label, // "faithful" | "fabricated"
    comment: judgement.reasoning,
    feedbackSourceType: "model",
  });

  // 3. If the score is bad, fan out to the review queue.
  if (judgement.score < 0.5) {
    await enqueueHumanReview({
      runId,
      reason: "low-faithfulness",
      trace,
    });
  }
}
```
In practice you'll wire this behind one of three triggers:
escalated=true): catches tail risk without blowing the budget.

LangSmith's hosted online evaluators handle the sampling and dispatch for you, but the feedback shape is the same regardless.
LLM-as-judge online evaluation is not free. A 50-cent voice call evaluated by a GPT-4o judge with a 2,000-token rubric costs another ~3 cents. At 1M traces/month, judging every trace comes to $30,000/month in eval cost alone.
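The arithmetic is simple enough to keep in a helper so you can see how the bill moves with the sample rate. This is a rough sketch using the figures quoted above; the function name `monthly_eval_cost` and the 10% sample rate in the second call are illustrative assumptions.

```python
def monthly_eval_cost(
    traces_per_month: int,
    cost_per_eval_usd: float,
    sample_rate: float = 1.0,
) -> float:
    """Expected monthly judge spend when a fraction `sample_rate` of traces is scored."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return traces_per_month * sample_rate * cost_per_eval_usd


full = monthly_eval_cost(1_000_000, 0.03)            # every trace judged
sampled = monthly_eval_cost(1_000_000, 0.03, 0.10)   # hypothetical 10% sample
print(f"100% sampling: ${full:,.0f}/month")
print(f"10% sampling:  ${sampled:,.0f}/month")
```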
Sampling tiers we actually use in production:
| Trace category | Online sample rate | Why |
|---|---|---|
| Routine inbound | 10% | High volume, low variance — drift is what we care about |
| Tool-call traces | 30% | Tool errors are silent; we need denser coverage |
| Escalations / human handoffs | 100% | These are by definition the failure cases |
| New flow, first 30 days | 50% | Burn-in period, dataset is small, online is our only signal |
| Multilingual / non-English | 25% | Underrepresented in the offline dataset |
The combined effective rate sits around 15–18% of traces sampled, which keeps eval cost under 5% of inference cost — the rule of thumb I keep coming back to.
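One way to apply tiers like these is to hash the trace ID, so the sampling decision is stable across retries and replays. The sketch below is an assumption-heavy illustration: the category keys are hypothetical names for the table rows, and the traffic mix is invented purely to show how a blended rate is computed. It is not CallSphere's real traffic.

```python
import hashlib

# Sample rates from the table above; the category keys are hypothetical names.
SAMPLE_RATES = {
    "routine_inbound": 0.10,
    "tool_call": 0.30,
    "escalation": 1.00,
    "new_flow": 0.50,
    "multilingual": 0.25,
}


def should_evaluate(trace_id: str, category: str) -> bool:
    """Deterministically decide whether a trace is sent to online evaluation.

    The first 8 bytes of a SHA-256 hash give a stable integer in [0, 2**64),
    so the same trace ID always produces the same decision.
    """
    rate = SAMPLE_RATES[category]
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64


def effective_rate(traffic_mix: dict[str, float]) -> float:
    """Blended sample rate for a traffic mix whose shares sum to 1."""
    return sum(share * SAMPLE_RATES[cat] for cat, share in traffic_mix.items())


# Assumed traffic mix, for illustration only.
mix = {
    "routine_inbound": 0.80,
    "tool_call": 0.10,
    "escalation": 0.03,
    "new_flow": 0.02,
    "multilingual": 0.05,
}
rate = effective_rate(mix)
# Cost ratio uses the article's figures: ~3 cents of judging per 50-cent call.
eval_share_of_inference = rate * 0.03 / 0.50
print(f"effective sample rate: {rate:.1%}")
print(f"eval cost / inference cost: {eval_share_of_inference:.2%}")
```

Hash-based sampling also means you can re-derive which traces were evaluated after the fact, which a random draw at request time does not give you.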
| Dimension | Offline Evaluation | Online Evaluation |
|---|---|---|
| When it runs | Pre-deploy, on demand | Post-deploy, continuously |
| Inputs | Curated, frozen dataset | Live production traffic |
| Reproducibility | High (same dataset, same eval) | Low (real-time, can't replay perfectly) |
| Coverage | Only what's in the dataset | Everything users actually do |
| Catches regressions | Yes, before users see them | Yes, after users see them |
| Catches drift | No (dataset is stale) | Yes (it's the whole point) |
| Cost profile | One-time per change | Continuous, sampled |
| Gates deploys | Yes | Usually no (alerts instead) |
| Failure mode | Stale dataset hides regressions | Reactive, no prevention |
The right answer is to run both, link them in the same project, and feed online failures back into the offline dataset on a weekly cadence.
Voice agents add two complications: (1) every trace is multi-turn (the unit of evaluation is a conversation, not a single LLM call) and (2) failures often look fine to a transcript-only evaluator but feel awful to the human on the line — long pauses, talk-over, wrong tone.
The hybrid eval setup we run on CallSphere's products:
escalated or csat<3. Plus a deterministic check for hallucinated commitments (price quotes, appointment times, refund amounts) on every single call; that one is too cheap not to run at 100%.

Yes, but you can scale them down. The minimum viable version: a 50-row offline dataset that runs on every prompt change, and an online evaluator at 5% sampling that just flags low-confidence outputs. Total setup time, two days. Total monthly cost, under $50.
No. The offline dataset can only contain things you've already thought of. Online evaluation is how you discover the things you haven't. A 50,000-row offline set is still blind to whatever your users do for the first time tomorrow.
Two patterns. (1) Run each input N times (typically 3–5) and report mean/variance — flag any item with high variance for review. (2) Use semantic equivalence evaluators (LLM-as-judge with a "do these mean the same thing?" rubric) instead of strict string match. We do both.
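The first pattern can be sketched in a few lines. This is a minimal example under stated assumptions: `run_and_score` is a hypothetical callable that runs the agent on one input and returns an evaluator score in [0, 1], and the 0.05 variance threshold is arbitrary.

```python
import random
import statistics
from typing import Callable


def score_with_repeats(
    run_and_score: Callable[[str], float],
    question: str,
    n: int = 5,
    max_variance: float = 0.05,
) -> dict:
    """Run one dataset input `n` times and summarise the evaluator scores."""
    scores = [run_and_score(question) for _ in range(n)]
    variance = statistics.pvariance(scores)
    return {
        "question": question,
        "mean": statistics.mean(scores),
        "variance": variance,
        "needs_review": variance > max_variance,  # assumed threshold
    }


# Stand-ins for a real agent plus evaluator.
rng = random.Random(0)


def flaky(question: str) -> float:
    """Passes roughly 60% of the time, to mimic a non-deterministic agent."""
    return 1.0 if rng.random() < 0.6 else 0.0


def stable(question: str) -> float:
    """Always passes."""
    return 1.0


print(score_with_repeats(flaky, "What is your refund policy?"))
print(score_with_repeats(stable, "What are your opening hours?"))
```

Rows flagged with `needs_review` are the ones worth a human look: a high mean with high variance usually means the agent is right only some of the time.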
Almost never. Online evals run async — they read the trace after it lands and write feedback. Putting an LLM judge in the synchronous path adds 1–3 seconds and a second failure mode. Use them for monitoring and post-hoc routing, not for gating.
Schedule the loop. Weekly: pull the 50 lowest-scoring online traces from the past 7 days, human-review them, merge confirmed failures into the offline set. Quarterly: prune redundant rows and rebalance per-tag coverage. The dataset is a living artifact; treat it like one.
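The weekly step might look roughly like the following. It is a storage-agnostic sketch: `ScoredTrace`, `lowest_scoring`, and `merge_confirmed` are hypothetical names, the dataset is modelled as a plain dict, and the human review is simulated by setting a flag. In a real setup the reviewer would also supply the reference output for each new row.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScoredTrace:
    """A production trace with its online evaluator score (hypothetical shape)."""
    trace_id: str
    question: str
    score: float                     # online evaluator score in [0, 1]
    scored_at: datetime
    confirmed_failure: bool = False  # set by a human reviewer


def lowest_scoring(
    traces: list[ScoredTrace], now: datetime, days: int = 7, limit: int = 50
) -> list[ScoredTrace]:
    """Return up to `limit` of the worst-scoring traces from the last `days` days."""
    cutoff = now - timedelta(days=days)
    recent = [t for t in traces if t.scored_at >= cutoff]
    return sorted(recent, key=lambda t: t.score)[:limit]


def merge_confirmed(dataset: dict[str, str], reviewed: list[ScoredTrace]) -> int:
    """Add human-confirmed failures to the offline dataset, skipping duplicates.

    `dataset` maps trace_id -> question. Returns the number of rows added.
    """
    added = 0
    for trace in reviewed:
        if trace.confirmed_failure and trace.trace_id not in dataset:
            dataset[trace.trace_id] = trace.question
            added += 1
    return added


now = datetime(2026, 1, 15, tzinfo=timezone.utc)
traces = [
    ScoredTrace("t1", "Can I get a refund after 90 days?", 0.2, now - timedelta(days=2)),
    ScoredTrace("t2", "What are your hours?", 0.9, now - timedelta(days=1)),
    ScoredTrace("t3", "Do you price match?", 0.1, now - timedelta(days=20)),  # too old
]

review_queue = lowest_scoring(traces, now)
review_queue[0].confirmed_failure = True  # stand-in for the human review step

golden: dict[str, str] = {}
print(merge_confirmed(golden, review_queue), "row(s) added")
```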
Offline evaluation tells you what your agent does on the inputs you've thought of. Online evaluation tells you what it does on the inputs you haven't. The pre-deploy / post-deploy split is real, the work is doable in a week with LangSmith's evaluate API and online evaluators, and the loop between them is the difference between a ship-it culture and a fight-the-fires culture.
Run both. Gate on offline. Monitor with online. Feed failures back. The teams I see succeed in production are the ones who treat the dataset as the living artifact and the eval pipeline as the production system it really is.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.