By Sagar Shankaran, Founder of CallSphere
Non-deterministic agents break silently when prompts, models, or tools change. Build a regression pipeline with frozen datasets, semantic diffing, and gate thresholds.
Key takeaways
Regression testing in classical software is straightforward: same input, same expected output, fail if they don't match. Agents break that contract — same input produces different outputs every time, and most of those outputs are equally correct. Yet the underlying need is identical: when you change a prompt, model, tool, or retrieval index, you must know whether quality just dropped before users find out for you. The pattern that works is frozen datasets + semantic diffing + statistical gating: you compare experiment B against a baseline experiment A on the same inputs, score the differences semantically, and block the deploy if the regression rate or magnitude exceeds a threshold you set in advance.
In a deterministic codebase, a regression test is a snapshot. add(2, 3) == 5. Either the assertion holds or it doesn't.
For agents, the same input produces a distribution of outputs. The same prompt, the same temperature, the same model, run twice — different tokens, different word order, sometimes different reasoning paths entirely. Strict equality is meaningless. What matters is whether the behavior on input X is still acceptable after the change.
The reframe I keep returning to: a regression in agent-land is a statistical claim, not a binary one. You're not asking "did the output change?" (it always did). You're asking "did the quality distribution over a representative input set get worse?"
That changes the whole shape of the test pipeline:
The reason I lead with this framing: silent breakage is the dominant failure mode in agent systems, and it's specifically the one that traditional CI does not catch.
A representative incident from a real deployment:
That entire incident was preventable by a regression suite that scored "did the agent confirm before booking?" as a binary check on a 200-row dataset. The check would have flipped from 99% to 71% on the new prompt. The deploy would have been blocked. Instead it shipped, ran for 5 days, and the recovery cost was 11% of a week's revenue.
This is the shape of every silent regression I have ever seen in production. The metric you forgot to test moves silently while the metrics you did test all look fine.
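To make the incident arithmetic concrete, here is a minimal sketch of the binary-check comparison. It assumes the 200-row dataset and the 99% and 71% pass rates quoted above (198 and 142 passing rows). The function name `pass_rate_drop`, the 2-point gate, and the two-proportion z-test are illustrative choices, not a description of the pipeline that was actually running.

```python
import math


def pass_rate_drop(base_pass: int, cand_pass: int, n: int) -> dict:
    """Compare a binary check's pass rate between two runs over the same n rows."""
    p_base = base_pass / n
    p_cand = cand_pass / n
    # Pooled two-proportion z-test: how surprising is the drop if nothing changed?
    pooled = (base_pass + cand_pass) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_base - p_cand) / se if se > 0 else 0.0
    return {"baseline": p_base, "candidate": p_cand, "drop": p_base - p_cand, "z": z}


# Numbers from the incident: 99% -> 71% on a 200-row dataset.
result = pass_rate_drop(base_pass=198, cand_pass=142, n=200)
blocked = result["drop"] > 0.02  # hypothetical gate: block on a drop above 2 points
```

A 28-point drop on 200 rows is far outside run-to-run noise, so even a crude gate would have blocked it. Because both runs score the same rows, a paired test such as McNemar's would be the stricter choice; the unpaired version is shown only for brevity.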
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant CI as CI
    participant LS as LangSmith
    participant Gate as Gate Logic
    participant Prod as Production
    Dev->>CI: Open PR (prompt / model / tool change)
    CI->>LS: Run experiment B on frozen dataset
    LS->>LS: Score with deterministic + LLM-judge evaluators
    LS->>Gate: Compare experiment B vs baseline A
    Gate->>Gate: Apply thresholds (per-eval, per-tag, per-row)
    alt Within thresholds
        Gate->>CI: PASS
        CI->>Dev: Merge allowed
        Dev->>Prod: Promote to canary
    else Regression detected
        Gate->>CI: FAIL with diff report
        CI->>Dev: Block merge, link to flipped rows
        Dev->>Dev: Investigate or override with sign-off
    end
```
Figure 1 — Each PR runs a full experiment, the gate compares to the last known-good baseline, and the comparison view is the artifact a human reviews when the gate fires.
Everything starts with the dataset. Without one, you have no regression suite, only vibes.
Three buckets, roughly equal weight:
Each row has:
```typescript
interface RegressionRow {
  id: string;
  input: { question: string; context?: object; user_locale?: string };
  expected: {
    // Reference output, optional. For semantic eval.
    answer?: string;
    // Behavioral expectations. Almost always more useful than answer text.
    must_call_tool?: string[];   // e.g., ["check_inventory"]
    must_not_say?: string[];     // e.g., ["I'm an AI"]
    must_confirm?: boolean;      // critical for booking flows
    expected_outcome?: "booked" | "qualified" | "escalated" | "deflected";
  };
  tags: string[];                // ["booking", "spanish", "edge-case"]
  baseline_score?: number;       // last known-good score
  added_after_incident?: string; // INC-2026-04-12, etc.
}
```
Behavioral expectations matter more than reference answers. "Did the agent call check_inventory before quoting a price?" is testable. "Did the agent produce exactly this paragraph?" is unhelpful — the paragraph will change every run, and that change is rarely the regression you care about.
A useful regression suite stacks three kinds of evaluators. Any one of them alone misses regressions; together they catch the long tail.
Deterministic checks. Cheap, fast, exact. Run on every row.
These catch the dumb regressions that LLM-judges sometimes excuse.
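As a rough illustration of the behavioral expectations in the row schema, a deterministic check can be a few lines of plain Python. This is a sketch under assumptions: the function name `deterministic_checks`, the flat `tool_calls` list, and the `confirm_with_user` tool name are all hypothetical, and a real agent trace will need its own extraction step.

```python
def deterministic_checks(expected: dict, tool_calls: list[str], reply: str) -> dict[str, bool]:
    """Return one pass/fail result per behavioral expectation present on the row."""
    results: dict[str, bool] = {}
    if "must_call_tool" in expected:
        # Every required tool must appear somewhere in the trace.
        results["must_call_tool"] = all(t in tool_calls for t in expected["must_call_tool"])
    if "must_not_say" in expected:
        lowered = reply.lower()
        results["must_not_say"] = not any(p.lower() in lowered for p in expected["must_not_say"])
    if expected.get("must_confirm"):
        # Assumption: a confirmation step shows up as a "confirm_with_user" tool call.
        results["must_confirm"] = "confirm_with_user" in tool_calls
    return results


expected = {
    "must_call_tool": ["check_inventory"],
    "must_not_say": ["I'm an AI"],
    "must_confirm": True,
}
checks = deterministic_checks(
    expected,
    tool_calls=["check_inventory", "book_slot"],
    reply="You're booked for 3pm.",
)
```

In this example the agent called the right tool and avoided the banned phrase, but it booked without confirming, so `must_confirm` fails while the other two checks pass.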
Semantic equivalence. LLM-as-judge with an "is the new answer equivalent in meaning to the reference?" rubric. Catches the case where the wording changed but the meaning is still right. Also catches the inverse: the wording is similar but a key fact got dropped.
Pairwise comparison. This is the regression-specific evaluator. Instead of scoring B in isolation, you show the judge A's output and B's output side by side and ask "which is better, or are they equivalent?" This is the most sensitive regression detector I have. It catches subtle quality drops that absolute scoring smooths over.
```python
import asyncio

from langsmith import Client

client = Client()

# Assumed to be defined elsewhere in the project: llm_judge_pairwise,
# target_agent, deterministic_tool_check, schema_validator, commit_sha,
# baseline_experiment_id.


# Pairwise evaluator: compares candidate run against baseline run on same input.
async def pairwise_better(run, example):
    # Pull the baseline output for this example, stored in its metadata
    # by a prior experiment. Metadata may be missing, so default to {}.
    metadata = client.read_example(example.id).metadata or {}
    baseline = metadata.get("baseline_output")
    candidate = run.outputs["output"]

    judgement = await llm_judge_pairwise(
        input=example.inputs,
        a=baseline,
        b=candidate,
        rubric=(
            "Compare A and B as responses to the user's input. "
            "Score B relative to A: -1 if B is worse, 0 if equivalent, 1 if better. "
            "A regression is any score of -1."
        ),
    )

    # Score: 1.0 = no regression, 0.0 = regression
    return {
        "key": "pairwise_no_regression",
        "score": 0.0 if judgement.score == -1 else 1.0,
        "value": judgement.score,
        "comment": judgement.reasoning,
    }


async def main():
    # Run the candidate experiment. `await` needs an async context,
    # so the call lives inside main() rather than at module level.
    return await client.aevaluate(
        target_agent,
        data="agent-regression-v4",
        evaluators=[pairwise_better, deterministic_tool_check, schema_validator],
        experiment_prefix=f"candidate-{commit_sha}",
        max_concurrency=10,
        metadata={"baseline_experiment": baseline_experiment_id},
    )


if __name__ == "__main__":
    results = asyncio.run(main())
```
A few things buried in there worth pulling out:
aevaluate is the async version. For datasets over 100 rows it cuts wall-clock time roughly 5x with concurrent dispatch.

This is the part most teams underbuild. The gate isn't "did the average score drop?"; it's a stack of thresholds, any one of which can block the merge.
```python
def evaluate_gate(candidate, baseline, dataset_tags):
    failures = []

    # 1. No eval can drop more than 2% on average.
    for eval_key in candidate.evals:
        delta = candidate.evals[eval_key].mean - baseline.evals[eval_key].mean
        if delta < -0.02:
            failures.append(
                f"{eval_key}: dropped {-delta:.1%} "
                f"({baseline.evals[eval_key].mean:.3f} -> {candidate.evals[eval_key].mean:.3f})"
            )

    # 2. No tag can drop more than 5% (catches per-vertical regressions).
    for tag in dataset_tags:
        for eval_key in ["correctness", "tool_use", "pairwise_no_regression"]:
            cand_tag = candidate.evals_by_tag[tag][eval_key].mean
            base_tag = baseline.evals_by_tag[tag][eval_key].mean
            if (cand_tag - base_tag) < -0.05:
                failures.append(f"tag={tag} {eval_key}: dropped {base_tag - cand_tag:.1%}")

    # 3. No individual row can flip from pass->fail without explicit review.
    flipped = [
        ex for ex in candidate.examples
        if baseline.score(ex.id) >= 0.5 and candidate.score(ex.id) < 0.5
    ]
    if flipped and not candidate.metadata.get("flipped_rows_reviewed"):
        failures.append(
            f"{len(flipped)} rows flipped pass->fail. "
            f"Review them in the LangSmith comparison view, "
            f"then re-run with metadata.flipped_rows_reviewed=true."
        )

    # 4. Latency P95 ceiling.
    if candidate.latency_p95 > baseline.latency_p95 * 1.20:
        failures.append(
            f"latency p95 regressed {candidate.latency_p95:.0f}ms "
            f"vs baseline {baseline.latency_p95:.0f}ms (>20%)"
        )

    return failures  # empty list = gate passes
```
The "explicit review" carve-out on flipped rows is the part that took me longest to get right. Sometimes a row should flip — the old behavior was wrong, the new behavior is right, the eval is just lagging. Forcing a human to look at the LangSmith comparison view, eyeball the flips, and stamp flipped_rows_reviewed: true is the right escape hatch. It's slow enough to be a real check and fast enough not to be a process tax.
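The flipped-row rule can be tried in isolation with plain dictionaries of per-row scores. The following is a sketch only: `find_flips` and `flips_block_merge` are hypothetical helper names, and the 0.5 pass cut-off is taken from the gate code above. It also reports rows that flipped fail to pass, which is often what a reviewer wants to see next to the regressions.

```python
PASS_THRESHOLD = 0.5  # same pass/fail cut-off the gate uses


def find_flips(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, list[str]]:
    """List rows whose pass/fail status changed between baseline and candidate."""
    regressed, fixed = [], []
    for row_id in sorted(baseline.keys() & candidate.keys()):
        was_pass = baseline[row_id] >= PASS_THRESHOLD
        is_pass = candidate[row_id] >= PASS_THRESHOLD
        if was_pass and not is_pass:
            regressed.append(row_id)
        elif is_pass and not was_pass:
            fixed.append(row_id)
    return {"pass_to_fail": regressed, "fail_to_pass": fixed}


def flips_block_merge(flips: dict[str, list[str]], reviewed: bool) -> bool:
    """Pass->fail flips block the merge until a human has reviewed them."""
    return bool(flips["pass_to_fail"]) and not reviewed


baseline_scores = {"r1": 0.9, "r2": 0.4, "r3": 0.8}
candidate_scores = {"r1": 0.3, "r2": 0.7, "r3": 0.8}
flips = find_flips(baseline_scores, candidate_scores)
```

Here `r1` regressed and `r2` was fixed, so the merge stays blocked until the reviewer sets the reviewed flag.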
| Stage | Trigger | Action | Gate |
|---|---|---|---|
| PR open | Push to feature branch | Smoke test on 25-row mini-dataset | Pass rate > 90% |
| PR ready for review | Label ready-for-eval | Full regression suite (1,200 rows) | All thresholds in gate logic |
| Merged to main | Merge | Re-run regression suite, update baseline if green | Promote to baseline |
| Promote to canary | Manual | Online evaluators on 5% live traffic | Online metrics within band for 24h |
| Full rollout | Manual after canary | Online evaluators on 100% with rollback armed | Continuous monitoring |
Two things to call out:
ready-for-eval label.

The regression suite for CallSphere's voice and chat agents runs on every prompt change, model swap, and retrieval-index rebuild. Concrete numbers:
ready-for-eval label, parallel via aevaluate.

The whole pipeline is documented in our glossary entry on regression testing and runs on LangSmith's experiments and comparison views.
Start at 50 rows; aim for 500 within three months; settle around 1,000–2,000 long-term. Past 2,000 you mostly buy diminishing returns — the marginal row catches fewer regressions than it costs to run.
Three patterns, used together. (1) Run each row N=3 times and use the median score. (2) Use semantic equivalence and pairwise evaluators instead of strict-match. (3) Set per-row variance thresholds — a row whose own score varies by more than X across N runs is unreliable; flag it for dataset cleanup, don't gate on it.
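Patterns (1) and (3) can be sketched with the standard library alone. The text leaves the variance threshold as X, so the `max_spread=0.2` default below is a placeholder to tune, not a recommendation. The helper names are hypothetical, and population standard deviation is just one reasonable way to measure spread.

```python
from statistics import median, pstdev


def aggregate_row(scores: list[float], max_spread: float = 0.2) -> dict:
    """Collapse N repeated scores for one row into a median plus a stability flag."""
    spread = pstdev(scores)
    return {"score": median(scores), "spread": spread, "unstable": spread > max_spread}


def split_stable(runs: dict[str, list[float]], max_spread: float = 0.2):
    """Return (scores to gate on, row ids flagged for dataset cleanup)."""
    gated: dict[str, float] = {}
    flagged: list[str] = []
    for row_id, scores in runs.items():
        agg = aggregate_row(scores, max_spread)
        if agg["unstable"]:
            flagged.append(row_id)  # too noisy to gate on
        else:
            gated[row_id] = agg["score"]
    return gated, flagged


# Each row was run N=3 times.
runs = {"r1": [1.0, 1.0, 0.9], "r2": [1.0, 0.0, 1.0]}
gated, flagged = split_stable(runs)
```

Row `r1` is stable and contributes its median to the gate. Row `r2` swings between 0 and 1 across identical runs, so it is flagged for cleanup and excluded from gating.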
I run "average pairwise_no_regression > 0.97" plus "no per-tag pairwise score < 0.92". Tighter than that triggers too many false positives; looser misses real regressions. Tune on your own historical data — find a real past regression, see what threshold would have caught it, set there.
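Those two thresholds translate into a short check. This sketch assumes each row carries its tags and a 0.0 or 1.0 `pairwise_no_regression` score. The row shape and the function name `pairwise_gate` are illustrative; the 0.97 and 0.92 values are the ones quoted above.

```python
from statistics import mean

AVG_FLOOR = 0.97  # average pairwise_no_regression must exceed this
TAG_FLOOR = 0.92  # no per-tag average may fall below this


def pairwise_gate(rows: list[dict]) -> list[str]:
    """rows: [{"tags": [...], "pairwise_no_regression": 0.0 or 1.0}, ...]"""
    failures = []
    overall = mean(r["pairwise_no_regression"] for r in rows)
    if overall <= AVG_FLOOR:
        failures.append(f"overall {overall:.3f} <= {AVG_FLOOR}")

    # Group scores by tag so a small slice cannot hide behind the average.
    by_tag: dict[str, list[float]] = {}
    for r in rows:
        for tag in r["tags"]:
            by_tag.setdefault(tag, []).append(r["pairwise_no_regression"])
    for tag, scores in sorted(by_tag.items()):
        tag_mean = mean(scores)
        if tag_mean < TAG_FLOOR:
            failures.append(f"tag={tag} {tag_mean:.3f} < {TAG_FLOOR}")
    return failures


# 90 booking rows with no regressions, 10 Spanish rows with 2 regressions.
rows = (
    [{"tags": ["booking"], "pairwise_no_regression": 1.0}] * 90
    + [{"tags": ["spanish"], "pairwise_no_regression": 1.0}] * 8
    + [{"tags": ["spanish"], "pairwise_no_regression": 0.0}] * 2
)
failures = pairwise_gate(rows)
```

The overall average here is 0.98, which clears the first threshold, but the Spanish slice sits at 0.80 and fails the per-tag floor. That is the case the second threshold exists for.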
Block merge for the named gate violations. Warn for soft signals (latency creep, cost creep). Override is allowed but requires explicit metadata on the PR — it's slow enough to discourage drive-by overrides, fast enough not to be a process tax.
Build the dataset first, gate second. (1) Curate 100 rows for the new vertical with behavioral expectations. (2) Run the existing eval pipeline against the new dataset on the current production version — that becomes your initial baseline. (3) Add per-vertical thresholds. (4) Wire it into CI. Total ramp: about a week.
The classical regression-test mental model — same input, same output, equality assertion — does not apply to non-deterministic agents, but the underlying engineering need does. The pattern that works is frozen dataset, semantic and pairwise evaluators, statistical gating with per-eval and per-tag thresholds, and a human-reviewed escape hatch for legitimate flips. Wire it into CI behind a label so the full suite doesn't run on every commit; promote baselines explicitly after merge; never let synthetic data replace production-curated edge cases.
Silent breakage is the failure mode that costs the most because it ships. A regression suite built this way is the cheapest insurance against it. Build it before you need it — by the time you need it, you've already paid the incident cost the suite would have prevented.