By Sagar Shankaran, Founder of CallSphere
A real workflow: user complaint → LangSmith trace → reproduce in dataset → fix → re-eval → ship. Principal-engineer notes, real numbers, honest tradeoffs.
Key takeaways
Most agent bugs in production look the same on the surface — "the bot said the wrong thing" — and the worst possible debugging response is to open the prompt file and start tweaking. The workflow that actually scales is trace → reproduce → fix → re-eval → ship, with every step anchored to a persistent artifact in your observability layer. In this post I walk through exactly how my team runs that loop with LangSmith — from a Slack ping at 10:42 a.m. to a green deploy at 3:15 p.m. the same day, with the regression case permanently pinned into a dataset so we never ship that bug again. The cost of getting this loop right: about a day of plumbing. The cost of not getting it right: every junior engineer reinventing prompt-tweak roulette every time a customer complains.
I have lost count of the number of agent post-mortems I have read where the resolution section says, in essence, "added a sentence to the system prompt." That is not a fix. That is a wish. Without a captured trace, a reproducible test case, and a re-evaluation gate, you have no idea whether the change helped, hurt, or just moved the failure mode somewhere else.
Agents are not deterministic functions. The same input can produce different outputs across temperature settings, model versions, tool latencies, and retrieval results. When a customer reports "the agent gave me wrong appointment availability," the only honest debugging stance is: I do not know what happened until I see the exact trace. Everything before that is guesswork dressed up as senior engineering judgment.
The workflow below is what I run on every reported defect on our voice and chat agent platform, regardless of whether the customer noticed or our online evals caught it first.
```mermaid
flowchart LR
    A[User complaint or alert] --> B[Find trace in LangSmith]
    B --> C[Reproduce locally from trace inputs]
    C --> D[Add failing case to regression dataset]
    D --> E[Fix code or prompt]
    E --> F["Run evaluate() against dataset"]
    F -->|score < threshold| E
    F -->|score >= threshold| G[Open PR with experiment link]
    G --> H[CI re-runs eval as merge gate]
    H --> I[Deploy + watch online evals]
    I -->|regression| B
    I -->|stable| J["Close incident (case stays in dataset forever)"]
    style A fill:#fee
    style J fill:#cfc
    style F fill:#ffd
```
Figure 1 — The closed-loop observability workflow. The critical property: every incident leaves a permanent artifact (a dataset row) so the same regression cannot ship twice.
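The five steps:

1. Trace: find the exact LangSmith run behind the complaint.
2. Reproduce: replay the run's inputs locally and pin the failing case into the regression dataset.
3. Fix: change the code or prompt.
4. Re-eval: run evaluate() against the dataset and confirm the score moves in the right direction without breaking other cases.
5. Ship: merge behind the CI eval gate, deploy, and watch online evals.

Finding the trace is the step most teams underinvest in. If your agent does not log a structured trace ID into the same place your support team reads tickets, you will spend the first 30 minutes of every incident hunting. We log the LangSmith run_id into Postgres next to the conversation row, and surface it as a one-click "View trace" link inside our internal admin console.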
When the support engineer flags a ticket, the trace is one click away. From there, LangSmith gives us:
| Without trace anchoring | With trace anchoring |
|---|---|
| "It said something weird around lunch" | run_id a3f9... at 12:14:08 UTC |
| 30+ min searching logs | 1 click from support ticket |
| Reproduction is statistical | Reproduction is deterministic |
| Fix is a guess | Fix is a measured delta |
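A minimal sketch of the "run_id next to the conversation row" idea, under loud assumptions: an in-memory SQLite table stands in for Postgres, and the table name, column names, and `base_url` are hypothetical. The `run_id` is passed in as a plain string; in a real agent it would come from whatever the tracer reports for the conversation.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    # In-memory SQLite is a stand-in for the Postgres table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE conversations ("
        " conversation_id TEXT PRIMARY KEY,"
        " langsmith_run_id TEXT NOT NULL)"
    )
    return conn


def record_trace(conn: sqlite3.Connection, conversation_id: str, run_id: str) -> None:
    # Store the trace ID in the same row support tooling already reads.
    conn.execute(
        "INSERT INTO conversations (conversation_id, langsmith_run_id) VALUES (?, ?)",
        (conversation_id, run_id),
    )
    conn.commit()


def trace_link(conn: sqlite3.Connection, conversation_id: str, base_url: str) -> Optional[str]:
    # Build the one-click "View trace" link, or None if nothing was recorded.
    row = conn.execute(
        "SELECT langsmith_run_id FROM conversations WHERE conversation_id = ?",
        (conversation_id,),
    ).fetchone()
    return f"{base_url}/{row[0]}" if row else None
```

The useful property is that the lookup key is the conversation ID the support engineer already has, so nobody needs to search logs by timestamp.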
Once we have the trace, we replay it. LangSmith stores the inputs, but we reconstruct the environment in a notebook or test runner. The pattern looks like this:
```python
from langsmith import Client
from my_agent import build_agent

client = Client()

# Pull the failing run
run = client.read_run("a3f9-...-run-id")
inputs = run.inputs       # the original user message + context
bad_output = run.outputs  # what the agent actually produced (the bad answer)

# Rebuild the agent at the same version + model
metadata = (run.extra or {}).get("metadata", {})
agent = build_agent(
    model="gpt-4o-2024-08-06",
    agent_version=metadata["agent_version"],
)

# Replay
replay = agent.invoke(inputs)
print("Original bad output:", bad_output)
print("Replay output:      ", replay)
```
If the replay reproduces the bug, great — we have a deterministic failure. If it does not, we have a non-determinism problem (temperature, retrieval drift, tool flakiness). Either way, we now know which class of bug we are dealing with, which dictates the fix strategy.
A practical tip: pin the model version with a date stamp (gpt-4o-2024-08-06, not gpt-4o). Floating model aliases are the single most common source of "I cannot reproduce" reports across the agent teams I have advised.
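The deterministic-versus-flaky distinction above can be sketched as a small helper that replays the same inputs several times. This is an illustration, not the platform's actual harness: `invoke` is any callable wrapping the agent, `attempts` is an arbitrary choice, and exact equality against the bad output is a crude stand-in for a real evaluator.

```python
from typing import Any, Callable


def classify_replay(
    invoke: Callable[[dict], Any],
    inputs: dict,
    bad_output: Any,
    attempts: int = 5,
) -> str:
    """Replay the same inputs several times and label the failure class."""
    outputs = [invoke(inputs) for _ in range(attempts)]
    hits = sum(1 for out in outputs if out == bad_output)
    if hits == attempts:
        return "deterministic"   # the bug reproduces every time
    if hits == 0:
        return "not-reproduced"  # suspect drift: model alias, retrieval, tools
    return "flaky"               # intermittent: temperature or tool flakiness
```

A "deterministic" label points at the prompt or code path; the other two labels point at the environment, which is where model pinning pays off.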
This is the step that turns a one-off bugfix into a durable defense. LangSmith lets you take any run and add it directly to a dataset, with the inputs preserved and a reference output you supply.
```python
from langsmith import Client

client = Client()

# Load the dataset if it exists, otherwise create it
ds_name = "voice-agent-regression-suite"
if client.has_dataset(dataset_name=ds_name):
    ds = client.read_dataset(dataset_name=ds_name)
else:
    ds = client.create_dataset(ds_name, description="All shipped regressions")

# Promote the failing run into the dataset
# (`inputs` is the failing run's inputs from the replay step)
client.create_example(
    dataset_id=ds.id,
    inputs=inputs,
    outputs={"reference_answer": "Available slots are Tue 3pm and Thu 11am."},
    metadata={
        "incident_id": "INC-2841",
        "promoted_from_run": "a3f9-...-run-id",
        "category": "appointment-availability",
    },
)
```
The reference output is the correct answer, written by a human (usually the engineer or a domain expert from the healthcare or real estate team that owns the agent). Once this row is in the dataset, every future evaluation run will exercise it. The bug becomes a permanent test case.
We currently have 412 regression rows in our voice agent dataset and 287 in chat. Each one represents a real customer complaint about a bug that, once fixed, cannot quietly come back without the CI gate screaming.
The actual code change is usually small — a prompt tweak, a tool argument fix, a retrieval threshold change. The discipline is in measuring it. Every fix attempt is a LangSmith Experiment, run against the dataset:
```python
from langsmith import evaluate
from my_evaluators import factual_match, no_hallucination, latency_ok


def my_agent_pred(inputs: dict) -> dict:
    # `agent` is the patched agent under test, built as in the replay step
    return {"output": agent.invoke(inputs)}


results = evaluate(
    my_agent_pred,
    data="voice-agent-regression-suite",
    evaluators=[factual_match, no_hallucination, latency_ok],
    experiment_prefix="fix-INC-2841-tighter-availability-prompt",
    metadata={"branch": "fix/availability-prompt", "commit": "9cd14e2"},
    max_concurrency=8,
)
print(results.to_pandas().describe())
```
LangSmith's Experiments view lets us diff this run against the previous baseline experiment side-by-side. We look at three things, in order:
Only when all three are green do we open the PR.
Different evaluators serve different purposes:
| Evaluator | Type | What it catches |
|---|---|---|
| factual_match | LLM-as-judge | Wrong appointment slot, wrong policy quoted |
| no_hallucination | RAG groundedness | Output not supported by retrieved docs |
| latency_ok | Heuristic | p95 above 1.2s |
| tool_call_correct | Structural | Wrong function called or wrong args |
| tone_appropriate | LLM-as-judge | Rude, robotic, off-brand |
Mixing structural and judge-based evaluators is what gives you a defensible "score." A 95% pass rate on factual_match alone tells you almost nothing if tool_call_correct is at 60%.
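To make the structural and heuristic rows of the table concrete, here is a sketch of what such checks could look like as plain functions. The dict shapes (`tool_call`, `name`, `args`) and the return format are assumptions for illustration and are not the contents of the `my_evaluators` module used above; only the 1.2s budget comes from the table.

```python
import statistics


def tool_call_correct(outputs: dict, reference: dict) -> dict:
    """Structural check: same tool name and same arguments as the reference.

    If neither side made a tool call, the row counts as correct.
    """
    got = outputs.get("tool_call") or {}
    want = reference.get("tool_call") or {}
    ok = got.get("name") == want.get("name") and got.get("args") == want.get("args")
    return {"key": "tool_call_correct", "score": 1.0 if ok else 0.0}


def latency_ok(latencies_s: list, budget_s: float = 1.2) -> dict:
    """Heuristic check over a whole run: p95 latency must not exceed the budget."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(latencies_s, n=20)[-1]
    return {"key": "latency_ok", "score": 1.0 if p95 <= budget_s else 0.0, "p95_s": p95}
```

Neither check calls a model, so both are cheap and repeatable, which is what makes them a useful counterweight to judge-based scores.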
The PR description includes a link to the LangSmith experiment, the commit, and the diff in scores. CI re-runs the eval suite (more on this in our companion piece on continuous evaluation in CI/CD) and blocks merge if any regression dataset row drops below threshold.
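The merge gate described above can be sketched as a function that turns per-row scores into a CI exit code. The row IDs, the score format, and the 0.8 default threshold are hypothetical; the source does not state the real threshold.

```python
def failing_rows(row_scores: dict, threshold: float) -> list:
    """Return the IDs of regression rows whose score fell below the threshold."""
    return sorted(row_id for row_id, score in row_scores.items() if score < threshold)


def merge_gate(row_scores: dict, threshold: float = 0.8) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    failures = failing_rows(row_scores, threshold)
    for row_id in failures:
        print(f"FAIL {row_id}: {row_scores[row_id]:.2f} < {threshold:.2f}")
    return 1 if failures else 0
```

A CI job would pass the return value to `sys.exit`, so a single failing row is enough to block the merge.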
After deploy we watch online evals — LangSmith's automatic evaluators that run against a sample of live production traffic — for 24 to 48 hours. If the same evaluator that flagged the regression starts flagging real traffic at an elevated rate, we roll back. If the rate stays at or below baseline, we close the incident.
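The rollback rule above reduces to comparing a flag rate against a baseline. A sketch under assumptions: the function name and the `tolerance` parameter are illustrative, and how the baseline rate is measured is left open.

```python
def post_deploy_action(
    flagged: int,
    sampled: int,
    baseline_rate: float,
    tolerance: float = 0.0,
) -> str:
    """Decide what to do after the watch window, given online-eval counts."""
    if sampled == 0:
        return "keep-watching"  # no sampled traffic yet, so no decision
    rate = flagged / sampled
    if rate > baseline_rate + tolerance:
        return "rollback"       # evaluator is flagging live traffic at an elevated rate
    return "close-incident"     # rate is at or below baseline
```

With small samples a nonzero `tolerance` avoids rolling back on noise, at the cost of reacting more slowly to a real regression.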
The whole loop, for a typical bug, takes 4–6 hours of engineering time. The first time we ran it, it took two days because nothing was wired up. Now it is muscle memory.
Across Q1 2026 on our voice and chat agents:
This workflow is not free. The costs:
The alternative is shipping bugs you have already shipped before, hoping nobody notices, and arguing in retros about whether the prompt change "felt" better. I will take the LangSmith bill every time.
You can, and a few teams do. The reason we standardized on LangSmith is the graph view of the run tree and the dataset → experiment loop being a single primitive. With raw logs you reconstruct the tree by hand and rebuild the comparison harness yourself. The point of this workflow is that the artifacts are the same shape end-to-end: a run, a dataset row, an experiment, a comparison. Tooling that has those four things native saves weeks.
We started seeing leverage at around 30 rows. Below that, the eval is too noisy to trust as a gate. Above 100 rows, the gate is sharp enough that engineers stop arguing with it. Above 500 rows, you start needing to think about evaluator cost and split into smoke vs. full suites.
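One way to see why small datasets make a noisy gate is the sampling error of a pass rate. The sketch below uses a normal-approximation 95% interval, which is a textbook estimate rather than the reasoning behind the row counts above, and the 90% pass rate is a hypothetical figure. The approximation is rough at small row counts.

```python
import math


def pass_rate_margin(pass_rate: float, n_rows: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_rows)


for n in (30, 100, 500):
    print(n, round(pass_rate_margin(0.9, n), 3))
# 30 0.107
# 100 0.059
# 500 0.026
```

At 30 rows a measured 90% could plausibly be anywhere from roughly 79% to 100%, so a few points of movement between experiments is not yet evidence of anything.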
Real problem. We mitigate three ways: (1) pin the judge model, (2) run each judged example three times and majority-vote, (3) calibrate the judge against a human-labeled sample quarterly. If a judge's agreement with humans drops below 80%, we retire it.
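Mitigations (2) and (3) can be sketched in a few lines. The boolean verdict format and the function names are assumptions for illustration; the 80% floor is the figure quoted above.

```python
def majority_vote(verdicts: list) -> bool:
    """Combine an odd number of pass/fail judge verdicts for one example."""
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of judge runs to avoid ties")
    return sum(verdicts) * 2 > len(verdicts)


def agreement_rate(judge: list, human: list) -> float:
    """Fraction of examples where the judge matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def should_retire(judge: list, human: list, floor: float = 0.80) -> bool:
    """Retire the judge when agreement with humans drops below the floor."""
    return agreement_rate(judge, human) < floor
```

For example, `majority_vote([True, False, True])` returns `True`, so a single dissenting judge run does not flip the verdict.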
LangSmith supports tag-based redaction and project-level access controls. We redact at the tracer-callback level (replace phone numbers, emails, full names with hashed tokens) before the trace ever leaves our VPC. The reproduction step then uses the redacted trace; the reference answer is written generically.
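A sketch of the "replace with hashed tokens" step as a plain text filter. It is deliberately crude: the regexes, the token format, and the `salt` parameter are assumptions, the phone pattern will also catch other long digit runs such as dates, and full names are not handled because they need more than a regex. How the filter is wired into a tracer callback is not shown.

```python
import hashlib
import re

# One combined pattern so each span is redacted exactly once (single pass).
PII_RE = re.compile(
    r"(?P<email>[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"
    r"|(?P<phone>\+?\d[\d\s().-]{7,}\d)"
)


def redact(text: str, salt: str) -> str:
    """Replace emails and phone-like numbers with stable hashed tokens."""

    def _replace(match: re.Match) -> str:
        kind = match.lastgroup  # "email" or "phone"
        digest = hashlib.sha256((salt + match.group(0)).encode("utf-8")).hexdigest()[:10]
        return f"<{kind}:{digest}>"

    return PII_RE.sub(_replace, text)
```

Because the token is a salted hash of the original value, the same phone number maps to the same token across traces, which keeps redacted conversations correlatable without exposing the value.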
Yes, with one extra step. We trace the STT transcript as the canonical input, and the dataset rows are text. For audio-specific bugs (e.g., the agent misheard "Bayer" as "buyer"), we add the audio file as an attached artifact and the reference transcript as the reference output, then run a separate STT-quality evaluator. This composability is why our voice and chat platform uses the same trace schema for both modalities.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.