Online vs Offline Agent Evaluation: The Pre-Deploy / Post-Deploy Split
Offline evals catch regressions on a fixed dataset before you deploy. Online evals catch real-world drift on live traffic after you deploy. You need both; here is how we run them.
TL;DR
Offline and online evaluation are not redundant — they answer different questions. Offline evals run a frozen dataset through a candidate agent before you ship, gating deploys against regressions you can reproduce. Online evals sample live production traffic after you ship, catching drift, edge cases, and quality decay that no curated dataset will ever contain. Skip offline and you regress silently on the next prompt change. Skip online and you discover the regression from a Twitter screenshot. We run both, and we wire them into the same LangSmith project so the same evaluator code grades a pre-deploy run and a post-deploy live trace.
The Two Worlds Are Not Interchangeable
The most common mistake I see in agent eval setups: teams pick one of offline or online, declare victory, and ship. Either choice on its own leaves a gap that production will eventually find for you.
- Offline-only teams ship confidently against a 200-row golden dataset, then discover that real users ask things the dataset never anticipated. The eval suite is green; the inbox is on fire.
- Online-only teams instrument every production trace, watch dashboards, and fix things reactively. They have no way to prevent a regression — only detect one after users have hit it.
The mental model that works is the pre-deploy / post-deploy split: offline owns the gate before the change goes live, online owns the lens after it does. Both are continuous. Neither is optional.
What Each Actually Means
Offline evaluation (pre-deploy, dataset-driven)
You curate a dataset of inputs — sometimes with reference outputs, sometimes with reference behaviors — and run your agent against every row. An evaluator scores each output (correctness, faithfulness, tool-call accuracy, latency, cost). The result is an "experiment" you can compare against the previous experiment.
Offline is deterministic in setup, non-deterministic in output. You control the inputs; the agent's stochasticity controls the outputs. You re-run when something changes — prompt, model, tools, retrieval index, anything.
Typical signals:
- Pass rate on the golden set (e.g., 92.4% correctness on v1.7.0 vs. 91.1% on v1.6.4 — you're up).
- Per-tag drilldown (booking flow at 96%, escalation flow at 78% — fix escalation).
- Cost / latency budgets (P95 latency on the new prompt is 1.4s, was 0.9s — investigate).
Online evaluation (post-deploy, live-traffic)
You attach evaluator rules to a sampled stream of production traces. As real conversations land, evaluators (LLM-as-judge, deterministic checks, or human review queues) score them in near-real-time and write feedback back onto the trace.
Online is deterministic in nothing. You don't control the inputs (real users), you don't control the outputs (real agent), and you usually can't replay the exact moment in time.
Typical signals:
- Drift (correctness average drops from 0.91 to 0.82 over a week — what changed in user behavior?).
- Tail risk (find the 0.3% of traces with hallucination=true and route them to a queue).
- Cohort effects (Spanish-language traces have 0.74 satisfaction; English has 0.89).
The Lifecycle
flowchart LR
A[Prompt / model / tool change] --> B[Offline experiment on frozen dataset]
B --> C{Regression vs prev?}
C -- No --> D[Promote to canary]
C -- Yes --> E[Reject / iterate]
D --> F[Live traffic with online evaluators]
F --> G{Online metric drop?}
G -- No --> H[Full rollout]
G -- Yes --> I[Rollback + capture failing traces]
I --> J[Add traces to dataset]
J --> B
H --> F
style C fill:#fef3c7
style G fill:#fee2e2
style J fill:#dbeafe
Figure 1 — Offline gates the deploy; online gates the rollout; failing online traces get harvested back into the offline dataset. The loop closes.
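In CI, the "Regression vs prev?" gate usually reduces to a short script. Here is a minimal sketch under stated assumptions: the current and baseline summary scores live in hypothetical JSON files (experiment_summary.json, baseline.json) that you export yourself; LangSmith stores the experiments, but the pass/fail threshold is your own logic.

```python
import json
import sys

# Hypothetical files: summary scores exported from the new offline experiment
# and the baseline stored from the last accepted run.
with open("experiment_summary.json") as f:
    current = json.load(f)        # e.g. {"correctness": 0.924}
with open("baseline.json") as f:
    baseline = json.load(f)       # e.g. {"correctness": 0.911}

MAX_REGRESSION = 0.015  # reject if correctness drops by more than 1.5 points

if current["correctness"] < baseline["correctness"] - MAX_REGRESSION:
    print(
        f"Regression: correctness {current['correctness']:.3f} "
        f"vs baseline {baseline['correctness']:.3f}. Rejecting deploy."
    )
    sys.exit(1)

print("Gate passed: promote to canary.")
```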
The Frozen Dataset Problem (and Why Online Solves It)
A dataset frozen on day 0 is already stale on day 30. Real user inputs drift — new product launches, seasonal language, regional dialects, scams that didn't exist last quarter. If your only evaluator is an offline run against the day-0 dataset, every new failure mode in production is invisible to your CI.
Online evaluation is how that drift becomes visible. The pattern:
- Sample 10–20% of production traces (more for low-volume use cases, less for high-volume).
- Run an LLM-as-judge or deterministic evaluator on each sampled trace.
- Write feedback back onto the trace via the SDK.
- Filter for low scores, route to a human review queue.
- Promote reviewed traces into the offline dataset so the next pre-deploy run will catch the regression you just discovered.
That last step is the part most teams skip. It's also the only step that makes offline evals improve over time. Without it, your golden set ages out and your online evals become permanent firefighting.
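That promotion step is usually a short scheduled script. A sketch with the LangSmith Python client; the project name, dataset name, and feedback key are placeholders for your own, and the run-filter string follows LangSmith's filter syntax but is worth verifying against your SDK version.

```python
from langsmith import Client

client = Client()

# Pull sampled production runs whose online "correctness" feedback scored below 0.5.
# The filter string uses LangSmith's run-filter syntax; adjust key and threshold.
low_scoring = client.list_runs(
    project_name="agent-production",   # placeholder project name
    filter='and(eq(feedback_key, "correctness"), lt(feedback_score, 0.5))',
    is_root=True,
    limit=50,
)

dataset = client.read_dataset(dataset_name="agent-golden-v3")

for run in low_scoring:
    # In the real pipeline these go through human review first; this promotes directly.
    client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,   # the reviewed / corrected output becomes the reference
        dataset_id=dataset.id,
    )
```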
Implementing Offline Evals (LangSmith)
Here is the canonical offline pattern. Pull a dataset, run your agent, score each output, persist the experiment.
from langsmith import Client, evaluate

client = Client()

# 1. Define the target: your candidate agent.
def my_agent(inputs: dict) -> dict:
    from my_app import run_agent
    return {"output": run_agent(inputs["question"])}

# 2. Define evaluators. Mix LLM-as-judge with deterministic checks.
def correctness_evaluator(run, example):
    pred = run.outputs["output"]
    ref = example.outputs["expected"]
    # Deterministic exact-match check; swap in an LLM-as-judge for semantic grading.
    return {"key": "correctness", "score": int(pred.strip() == ref.strip())}

def latency_evaluator(run, example):
    ms = (run.end_time - run.start_time).total_seconds() * 1000
    return {"key": "latency_ms", "score": ms, "value": ms}

# 3. Run the experiment. Results persist so you can diff them against prior runs.
results = evaluate(
    my_agent,
    data="agent-golden-v3",  # dataset name in LangSmith
    evaluators=[correctness_evaluator, latency_evaluator],
    experiment_prefix="prompt-v1.7.0",
    max_concurrency=8,
    client=client,
)
print(f"Experiment: {results.experiment_name}")
# Use the comparison view in the LangSmith UI to diff against prompt-v1.6.4.
A few production lessons baked into that snippet:
- Always pin an `experiment_prefix` to the version you're testing. Future you needs to know which run came from which prompt SHA.
- Mix evaluator types. A pure LLM-as-judge will rubber-stamp some failures; a pure deterministic check will miss semantic regressions. Stack them (sketch below).
- Keep `max_concurrency` honest. Real production rate limits apply during evals — don't tune your eval parallelism to numbers that won't survive on canary.
- Dataset names are forever. Treat them like database tables. Don't rename `agent-golden` to `agent-golden-v2` — append rows or fork to a new dataset.
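To make "mix evaluator types" concrete, here is a minimal sketch of an LLM-as-judge evaluator you could stack next to the exact-match check above. It assumes the openai Python SDK and a deliberately simple PASS/FAIL rubric; the prompt and parsing are illustrative, not a LangSmith built-in.

```python
from openai import OpenAI

judge = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def semantic_correctness_evaluator(run, example):
    """LLM-as-judge: grades meaning rather than exact wording."""
    prompt = (
        "Reference answer:\n" + example.outputs["expected"] + "\n\n"
        "Candidate answer:\n" + run.outputs["output"] + "\n\n"
        "Do these mean the same thing? Reply with only PASS or FAIL."
    )
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return {"key": "semantic_correctness", "score": int(verdict.startswith("PASS"))}
```

Pass both evaluators to evaluate(...) and treat an item as green only when the deterministic check and the judge agree.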
Implementing Online Evals (LangSmith)
Online evaluators are different in shape. Instead of running your agent, they attach to traces as they land. The simplest pattern is to write feedback directly:
import { Client } from "langsmith";

const client = new Client();

// `AgentTrace`, `llmJudge`, and `enqueueHumanReview` are application-specific pieces
// (your trace shape, your judge call, your review queue), not LangSmith APIs.
// Called from a background worker that pulls sampled production runs.
async function scoreLiveTrace(runId: string, trace: AgentTrace) {
  // 1. Run your evaluator. Could be an LLM judge, a regex check, anything.
  const judgement = await llmJudge({
    input: trace.input,
    output: trace.output,
    rubric: "Did the agent answer the user's actual question without fabricating policy?",
  });

  // 2. Write feedback back onto the run. Visible in the LangSmith UI.
  await client.createFeedback(runId, "faithfulness", {
    score: judgement.score, // 0..1
    value: judgement.label, // "faithful" | "fabricated"
    comment: judgement.reasoning,
    feedbackSourceType: "model",
  });

  // 3. If the score is bad, fan out to the review queue.
  if (judgement.score < 0.5) {
    await enqueueHumanReview({
      runId,
      reason: "low-faithfulness",
      trace,
    });
  }
}
In practice you'll wire this behind one of three triggers:
- Sampling rule (e.g., 15% of all traces) — the default for stable populations.
- Stratified sampling (e.g., 5% of normal + 100% of any trace with `escalated=true`) — catches tail risk without blowing the budget.
- Outlier triggers (e.g., any trace where latency > P99 or token usage > 4x median) — catches new failure modes you didn't predict.
LangSmith's hosted online evaluators handle the sampling and dispatch for you, but the feedback shape is the same regardless.
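If you run your own dispatch worker instead, the trigger logic is only a few lines. A Python sketch of the stratified and outlier rules above; the trace fields and thresholds are illustrative, not a fixed schema.

```python
import random

def should_evaluate(trace: dict) -> bool:
    """Decide whether a production trace gets an online evaluator run.

    Field names (escalated, latency_ms, tokens, tool_calls) are illustrative.
    """
    # Stratified: always evaluate escalations and human handoffs.
    if trace.get("escalated"):
        return True
    # Outlier triggers: latency past an assumed P99, or tokens above 4x an assumed median.
    if trace.get("latency_ms", 0) > 8_000 or trace.get("tokens", 0) > 4 * 1_200:
        return True
    # Tool-call traces get denser coverage than routine inbound.
    if trace.get("tool_calls"):
        return random.random() < 0.30
    return random.random() < 0.10
```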
Cost: Why You Cannot Run 100% Online Evals
LLM-as-judge online evaluation is not free. A 50-cent voice call evaluated by a GPT-4o judge with a 2,000-token rubric costs another ~3 cents. At 1M traces/month that is $30,000/month in eval cost alone.
Sampling tiers we actually use in production:
| Trace category | Online sample rate | Why |
|---|---|---|
| Routine inbound | 10% | High volume, low variance — drift is what we care about |
| Tool-call traces | 30% | Tool errors are silent; we need denser coverage |
| Escalations / human handoffs | 100% | These are by definition the failure cases |
| New flow, first 30 days | 50% | Burn-in period, dataset is small, online is our only signal |
| Multilingual / non-English | 25% | Underrepresented in the offline dataset |
The combined effective rate sits around 15–18% of traces sampled, which keeps eval cost under 5% of inference cost — the rule of thumb I keep coming back to.
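The blended rate is straightforward arithmetic over the tiers. A quick sketch with illustrative traffic shares (the shares are assumptions for the example, not our actual mix):

```python
# (share of traffic, online sample rate) per tier; the shares are illustrative.
tiers = {
    "routine_inbound": (0.70, 0.10),
    "tool_calls":      (0.15, 0.30),
    "escalations":     (0.02, 1.00),
    "new_flow":        (0.05, 0.50),
    "multilingual":    (0.08, 0.25),
}

blended_rate = sum(share * rate for share, rate in tiers.values())
print(f"Blended online sample rate: {blended_rate:.1%}")  # ~18% with these shares

# Eval cost as a share of inference cost, using the numbers above:
# ~$0.03 per judged trace against ~$0.50 of inference per call.
eval_share = blended_rate * 0.03 / 0.50
print(f"Eval cost as a share of inference cost: {eval_share:.1%}")  # ~1.1%
```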
Comparing the Two
| Dimension | Offline Evaluation | Online Evaluation |
|---|---|---|
| When it runs | Pre-deploy, on demand | Post-deploy, continuously |
| Inputs | Curated, frozen dataset | Live production traffic |
| Reproducibility | High (same dataset, same eval) | Low (real-time, can't replay perfectly) |
| Coverage | Only what's in the dataset | Everything users actually do |
| Catches regressions | Yes, before users see them | Yes, after users see them |
| Catches drift | No (dataset is stale) | Yes (it's the whole point) |
| Cost profile | One-time per change | Continuous, sampled |
| Gates deploys | Yes | Usually no (alerts instead) |
| Failure mode | Stale dataset hides regressions | Reactive, no prevention |
The right answer is to run both, link them in the same project, and feed online failures back into the offline dataset on a weekly cadence.
How CallSphere Wires This for Voice
Voice agents add two complications: (1) every trace is multi-turn (the unit of evaluation is a conversation, not a single LLM call) and (2) failures often look fine to a transcript-only evaluator but feel awful to the human on the line — long pauses, talk-over, wrong tone.
The hybrid eval setup we run on CallSphere's products:
- Offline: 1,200-row golden dataset of full conversations across healthcare, real estate, sales, salon, IT helpdesk, after-hours. Each row is the full multi-turn input plus an expected outcome (booking made, lead qualified, ticket created). We re-run on every prompt change, every voice model swap, every retrieval-index rebuild. Gate is "no regression > 1.5% on any vertical."
- Online: 15% sampled live calls, evaluated by a GPT-4o judge with a vertical-specific rubric. Plus 100% of any call flagged `escalated` or `csat<3`. Plus a deterministic check for hallucinated commitments (price quotes, appointment times, refund amounts) on every single call — that one is too cheap not to run at 100% (sketch below).
- Loop closure: weekly job that pulls the lowest-scoring 50 online traces, runs them through human review, and merges the confirmed failures into the offline dataset. The dataset grows by 30–80 rows a week.
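The hallucinated-commitments check is the kind of deterministic evaluator that is cheap enough to run on every call: pull concrete commitments (prices, clock times) out of the agent's turns and require each one to be backed by a tool result on the same trace. A rough sketch; the trace shape and regexes are illustrative, not our production schema.

```python
import re

# Concrete commitments an agent can fabricate: dollar amounts and clock times.
MONEY = re.compile(r"\$\d+(?:\.\d{2})?")
TIME = re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE)

def hallucinated_commitments(trace: dict) -> dict:
    """Flag prices/times the agent stated that no tool result on the trace backs up.

    `trace` is an illustrative shape: {"agent_turns": [str, ...], "tool_outputs": [str, ...]}.
    """
    said = " ".join(trace.get("agent_turns", []))
    grounded = " ".join(trace.get("tool_outputs", [])).lower()
    commitments = MONEY.findall(said) + TIME.findall(said)
    unbacked = [c for c in commitments if c.lower() not in grounded]
    return {
        "key": "hallucinated_commitment",
        "score": int(not unbacked),  # 1 = clean, 0 = at least one unbacked commitment
        "value": unbacked,
    }
```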
What Breaks When You Skip One
- Skip offline: every prompt tweak is a coin flip. Your model upgrade from GPT-4o to GPT-4.1 silently breaks the booking flow because the new model interprets your function schema differently. Users discover it. You roll back.
- Skip online: your offline pass rate stays at 94% for six months while your real-world correctness slides to 78% because users started asking about a product feature that didn't exist when the dataset was frozen.
- Skip the loop back from online to offline: you keep catching the same class of regression three releases in a row because the failing traces never made it into your gating dataset.
FAQ
Do I need both offline and online evaluation if I'm a small team?
Yes, but you can scale them down. The minimum viable version: a 50-row offline dataset that runs on every prompt change, and an online evaluator at 5% sampling that just flags low-confidence outputs. Total setup time, two days. Total monthly cost, under $50.
Can my offline dataset replace online evals if it's big enough?
No. The offline dataset can only contain things you've already thought of. Online evaluation is how you discover the things you haven't. A 50,000-row offline set is still blind to whatever your users do for the first time tomorrow.
How do I handle non-determinism in offline evals?
Two patterns. (1) Run each input N times (typically 3–5) and report mean/variance — flag any item with high variance for review. (2) Use semantic equivalence evaluators (LLM-as-judge with a "do these mean the same thing?" rubric) instead of strict string match. We do both.
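Pattern (1) maps directly onto the same evaluate call. A sketch reusing my_agent and correctness_evaluator from the offline snippet, assuming a recent langsmith SDK where evaluate accepts num_repetitions and yields per-example result rows; check the shapes against your version.

```python
from statistics import pvariance

from langsmith import evaluate

results = evaluate(
    my_agent,
    data="agent-golden-v3",
    evaluators=[correctness_evaluator],
    experiment_prefix="prompt-v1.7.0-stability",
    num_repetitions=5,  # run every dataset row 5 times
)

# Group correctness scores by example and flag high-variance items for review.
scores_by_example: dict = {}
for row in results:
    example_id = str(row["example"].id)
    for er in row["evaluation_results"]["results"]:
        if er.key == "correctness":
            scores_by_example.setdefault(example_id, []).append(er.score)

for example_id, scores in scores_by_example.items():
    if pvariance(scores) > 0.2:  # threshold is a judgment call
        print(f"Unstable item {example_id}: scores={scores}")
```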
Should online evaluators block production traffic?
Almost never. Online evals run async — they read the trace after it lands and write feedback. Putting an LLM judge in the synchronous path adds 1–3 seconds and a second failure mode. Use them for monitoring and post-hoc routing, not for gating.
How do I prevent my offline dataset from going stale?
Schedule the loop. Weekly: pull the 50 lowest-scoring online traces from the past 7 days, human-review them, merge confirmed failures into the offline set. Quarterly: prune redundant rows and rebalance per-tag coverage. The dataset is a living artifact; treat it like one.
The Bottom Line
Offline evaluation tells you what your agent does on the inputs you've thought of. Online evaluation tells you what it does on the inputs you haven't. The pre-deploy / post-deploy split is real, the work is doable in a week with LangSmith's evaluate API and online evaluators, and the loop between them is the difference between a ship-it culture and a fight-the-fires culture.
Run both. Gate on offline. Monitor with online. Feed failures back. The teams I see succeed in production are the ones who treat the dataset as the living artifact and the eval pipeline as the production system it really is.