From Trace to Production Fix: An End-to-End Observability Workflow for Agents
A real workflow: user complaint → LangSmith trace → reproduce in dataset → fix → ship → re-eval. Principal-engineer notes, real numbers, honest tradeoffs.
TL;DR
Most agent bugs in production look the same on the surface — "the bot said the wrong thing" — and the worst possible debugging response is to open the prompt file and start tweaking. The workflow that actually scales is trace → reproduce → fix → re-eval → ship, with every step anchored to a persistent artifact in your observability layer. In this post I walk through exactly how my team runs that loop with LangSmith — from a Slack ping at 10:42 a.m. to a green deploy at 3:15 p.m. the same day, with the regression case permanently pinned into a dataset so we never ship that bug again. The cost of getting this loop right: about a day of plumbing. The cost of not getting it right: every junior engineer reinventing prompt-tweak roulette every time a customer complains.
Why "Just Fix the Prompt" Is the Wrong Reflex
I have lost count of the number of agent post-mortems I have read where the resolution section says, in essence, "added a sentence to the system prompt." That is not a fix. That is a wish. Without a captured trace, a reproducible test case, and a re-evaluation gate, you have no idea whether the change helped, hurt, or just moved the failure mode somewhere else.
Agents are not deterministic functions. The same input can produce different outputs across temperature settings, model versions, tool latencies, and retrieval results. When a customer reports "the agent gave me wrong appointment availability," the only honest debugging stance is: I do not know what happened until I see the exact trace. Everything before that is guesswork dressed up as senior engineering judgment.
The workflow below is what I run on every reported defect on our voice and chat agent platform, regardless of whether the customer noticed or our online evals caught it first.
The Five-Step Loop
flowchart LR
A[User complaint or alert] --> B[Find trace in LangSmith]
B --> C[Reproduce locally from trace inputs]
C --> D[Add failing case to regression dataset]
D --> E[Fix code or prompt]
E --> F[Run evaluate() against dataset]
F -->|score < threshold| E
F -->|score >= threshold| G[Open PR with experiment link]
G --> H[CI re-runs eval as merge gate]
H --> I[Deploy + watch online evals]
I -->|regression| B
I -->|stable| J[Close incident — case stays in dataset forever]
style A fill:#fee
style J fill:#cfc
style F fill:#ffd
Figure 1 — The closed-loop observability workflow. The critical property: every incident leaves a permanent artifact (a dataset row) so the same regression cannot ship twice.
The five steps:
- Locate the trace — given a user, a session ID, or a timestamp, find the exact LangSmith run that produced the bad output.
- Reproduce — pull the trace inputs, replay locally with the same model + tools, confirm the failure deterministically (or characterize the variance).
- Promote to dataset — add the trace as a new example in a regression dataset, with a reference output and at least one evaluator attached.
- Fix and re-evaluate — change code, prompt, retrieval, or tool, then run evaluate() against the dataset and confirm the score moves in the right direction without breaking other cases.
- Ship behind the CI gate — open a PR; CI runs the full eval suite and blocks merge if regression scores drop below threshold; deploy; watch online evals for 24–48 hours.
Step 1 — Find the Trace
This is the step most teams underinvest in. If your agent does not log a structured trace ID into the same place your support team reads tickets, you will spend the first 30 minutes of every incident hunting. We log the LangSmith run_id into Postgres next to the conversation row, and surface it as a one-click "View trace" link inside our internal admin console.
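The wiring is small. Here is a rough sketch of the pattern; run_agent and save_turn are hypothetical stand-ins for our real helpers, and the LangSmith traceable decorator plus get_current_run_tree() is what exposes the run_id to persist:

```python
# Sketch: persist the LangSmith run_id next to the conversation row in Postgres.
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(name="conversation_turn")
def handle_turn(conversation_id: str, user_message: str) -> str:
    answer = run_agent(user_message)      # the actual agent call, traced as a child run
    run = get_current_run_tree()          # the run this function is being traced under
    save_turn(                            # hypothetical helper: writes the row the admin console links from
        conversation_id=conversation_id,
        answer=answer,
        langsmith_run_id=str(run.id) if run else None,
    )
    return answer
```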
When the support engineer flags a ticket, the trace is one click away. From there, LangSmith gives us:
- The full LLM call tree — every prompt, every tool call, every retrieval.
- Latencies per node (so we can tell "wrong answer" from "wrong answer caused by upstream tool timeout").
- Token usage per call, which sometimes reveals truncation as the root cause.
- Metadata tags (env, model, agent version) we attach at run start.
| Without trace anchoring | With trace anchoring |
|---|---|
| "It said something weird around lunch" | run_id a3f9... at 12:14:08 UTC |
| 30+ min searching logs | 1 click from support ticket |
| Reproduction is statistical | Reproduction is deterministic |
| Fix is a guess | Fix is a measured delta |
Step 2 — Reproduce From the Trace
Once we have the trace, we replay it. LangSmith stores the inputs, but we reconstruct the environment in a notebook or test runner. The pattern looks like this:
from langsmith import Client
from my_agent import build_agent

client = Client()

# Pull the failing run
run = client.read_run("a3f9-...-run-id")
inputs = run.inputs                # the original user message + context
original_output = run.outputs      # what the agent actually produced (the bad answer)

# Rebuild the agent at the same version + model
agent = build_agent(
    model="gpt-4o-2024-08-06",
    agent_version=run.extra["metadata"]["agent_version"],
)

# Replay
replay = agent.invoke(inputs)
print("Original bad output:", original_output)
print("Replay output:      ", replay)
If the replay reproduces the bug, great — we have a deterministic failure. If it does not, we have a non-determinism problem (temperature, retrieval drift, tool flakiness). Either way, we now know which class of bug we are dealing with, which dictates the fix strategy.
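When the single replay does not reproduce it, a crude but effective move is to replay the same inputs a couple of dozen times and count how often the bad behavior shows up. A minimal sketch, where looks_like_the_bug is a hypothetical predicate (for example, "mentions a slot that is not in the tool output"):

```python
# Replay the same trace inputs N times to characterize a non-deterministic failure.
N = 20
failures = 0
for _ in range(N):
    out = agent.invoke(inputs)
    if looks_like_the_bug(out):   # hypothetical check for the reported misbehavior
        failures += 1
print(f"Reproduced {failures}/{N} times ({failures / N:.0%})")
```

Either way the case gets promoted in Step 3; for flaky failures, the pass rate over repeated runs becomes the number the fix has to move.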
A practical tip: pin the model version with a date stamp (gpt-4o-2024-08-06, not gpt-4o). Floating model aliases are the single most common source of "I cannot reproduce" reports across the agent teams I have advised.
Step 3 — Promote the Trace to a Regression Dataset
This is the step that turns a one-off bugfix into a durable defense. LangSmith lets you take any run and add it directly to a dataset, with the inputs preserved and a reference output you supply.
from langsmith import Client

client = Client()

# Create the regression dataset if it does not exist yet, otherwise load it
ds_name = "voice-agent-regression-suite"
try:
    ds = client.create_dataset(ds_name, description="All shipped regressions")
except Exception:  # dataset already exists
    ds = client.read_dataset(dataset_name=ds_name)

# Promote the failing run into the dataset
client.create_example(
    dataset_id=ds.id,
    inputs=inputs,
    outputs={"reference_answer": "Available slots are Tue 3pm and Thu 11am."},
    metadata={
        "incident_id": "INC-2841",
        "promoted_from_run": "a3f9-...-run-id",
        "category": "appointment-availability",
    },
)
The reference output is the correct answer, written by a human (usually the engineer or a domain expert from the healthcare or real estate team that owns the agent). Once this row is in the dataset, every future evaluation run will exercise it. The bug becomes a permanent test case.
We currently have 412 regression rows in our voice agent dataset and 287 in chat. Each one represents a real customer complaint that, once shipped, will never quietly come back without the CI gate screaming.
Step 4 — Fix, Re-Evaluate, and Compare Experiments
The actual code change is usually small — a prompt tweak, a tool argument fix, a retrieval threshold change. The discipline is in measuring it. Every fix attempt is a LangSmith Experiment, run against the dataset:
from langsmith import evaluate
from my_evaluators import factual_match, no_hallucination, latency_ok

def my_agent_pred(inputs: dict) -> dict:
    # `agent` is the rebuilt agent from Step 2
    return {"output": agent.invoke(inputs)}

results = evaluate(
    my_agent_pred,
    data="voice-agent-regression-suite",
    evaluators=[factual_match, no_hallucination, latency_ok],
    experiment_prefix="fix-INC-2841-tighter-availability-prompt",
    metadata={"branch": "fix/availability-prompt", "commit": "9cd14e2"},
    max_concurrency=8,
)
print(results.to_pandas().describe())
LangSmith's Experiments view lets us diff this run against the previous baseline experiment side-by-side. We look at three things, in order:
- Did the failing case pass? (If not, the fix is wrong.)
- Did any previously passing cases regress? (If yes, the fix is too narrow or has a side effect.)
- Did the aggregate scores improve? (Sanity check.)
Only when all three are green do we open the PR.
What "Score" Means in Practice
Different evaluators serve different purposes:
| Evaluator | Type | What it catches |
|---|---|---|
| factual_match | LLM-as-judge | Wrong appointment slot, wrong policy quoted |
| no_hallucination | RAG groundedness | Output not supported by retrieved docs |
| latency_ok | Heuristic | p95 above 1.2s |
| tool_call_correct | Structural | Wrong function called or wrong args |
| tone_appropriate | LLM-as-judge | Rude, robotic, off-brand |
Mixing structural and judge-based evaluators is what gives you a defensible "score." A 95% pass rate on factual_match alone tells you almost nothing if tool_call_correct is at 60%.
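For concreteness, here is roughly what the two non-judge evaluators look like. These are sketches, not our production code: the threshold and the expected_tool field are illustrative, and whether child runs are available to traverse depends on how the target is traced.

```python
# Custom evaluators in the (run, example) form that evaluate() accepts.
from langsmith.schemas import Example, Run

def latency_ok(run: Run, example: Example) -> dict:
    # Heuristic: fail any run slower than the 1.2 s p95 budget.
    elapsed = (run.end_time - run.start_time).total_seconds() if run.end_time else 999
    return {"key": "latency_ok", "score": 1 if elapsed <= 1.2 else 0}

def tool_call_correct(run: Run, example: Example) -> dict:
    # Structural: did the agent call the tool the reference row says it should?
    expected = (example.outputs or {}).get("expected_tool")
    called = [c.name for c in (run.child_runs or []) if c.run_type == "tool"]
    return {"key": "tool_call_correct", "score": 1 if expected in called else 0}
```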
Step 5 — Ship It Behind the Gate
The PR description includes a link to the LangSmith experiment, the commit, and the diff in scores. CI re-runs the eval suite (more on this in our companion piece on continuous evaluation in CI/CD) and blocks merge if any regression dataset row drops below threshold.
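The gate itself is not sophisticated. A sketch of the CI step, assuming the 0/1 scoring convention above and the feedback.<key> column naming that to_pandas() produces (verify the exact column names against your own output; that is the part most likely to differ):

```python
# ci_eval_gate.py -- run the regression suite and fail the build on any drop.
import os
import sys

from langsmith import evaluate
from my_agent import build_agent
from my_evaluators import factual_match, no_hallucination, latency_ok

# Build the agent at the PR's code version; the env var name is illustrative.
agent = build_agent(model="gpt-4o-2024-08-06", agent_version=os.environ.get("GIT_SHA", "local"))

def my_agent_pred(inputs: dict) -> dict:
    return {"output": agent.invoke(inputs)}

results = evaluate(
    my_agent_pred,
    data="voice-agent-regression-suite",
    evaluators=[factual_match, no_hallucination, latency_ok],
    experiment_prefix="ci-gate",
)
df = results.to_pandas()
# Every regression row must pass; column name follows the feedback.<key> convention.
failing = df[df["feedback.factual_match"] < 1]
if not failing.empty:
    print(f"{len(failing)} regression rows below threshold; blocking merge")
    sys.exit(1)
```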
After deploy we watch online evals — LangSmith's automatic evaluators that run against a sample of live production traffic — for 24 to 48 hours. If the same evaluator that flagged the regression starts flagging real traffic at an elevated rate, we roll back. If the rate stays at or below baseline, we close the incident.
The whole loop, for a typical bug, takes 4–6 hours of engineering time. The first time we ran it it took two days because nothing was wired up. Now it is muscle memory.
Real Numbers From One Quarter
Across Q1 2026 on our voice and chat agents:
- 47 incidents triaged through this exact workflow.
- 41 resulted in a permanent dataset addition (the other 6 turned out to be user error or platform issues, not agent defects).
- 3 of the 41 fixes were caught and reverted by online evals before customers noticed wider impact; those three went back around the loop (the regression arrow in Figure 1) before closing.
- Mean time from complaint to verified fix: 5h 12m. The first quarter we measured this, it was 19h 40m.
- Zero "we shipped that bug again" incidents. The dataset prevents it structurally.
Honest Tradeoffs
This workflow is not free. The costs:
- LLM-as-judge evaluators are not cheap. A full eval suite run on our 700-row dataset costs about $4.20 in OpenAI credits and takes 6 minutes. CI runs it on every PR that touches the agent. That is real money. (We use a smaller "smoke" subset of 80 rows for non-agent PRs; see the sketch after this list.)
- Reference outputs require domain expertise. You cannot have a prompt engineer write the "correct" answer to a healthcare scheduling question. You need a clinician. We bake this into our glossary of terms and SOP docs.
- The dataset will grow forever. We rotate out cases that are no longer reachable (e.g., the underlying tool was removed) about once a quarter, but otherwise it just gets bigger. Plan for evaluator cost to grow ~30% YoY.
- Online eval coverage is a sampling tradeoff. We run online evals on 5% of traffic. Higher fidelity costs more. Lower fidelity misses more.
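About that smoke subset: one way to carve it out without maintaining a second dataset is LangSmith's dataset splits, tagging the 80 rows as a "smoke" split and filtering on it at evaluation time. A minimal sketch, assuming the split name and reusing the target function from Step 4:

```python
# Run only the "smoke" split of the regression dataset on cheap CI paths.
from langsmith import Client, evaluate
from my_evaluators import factual_match

client = Client()
smoke = client.list_examples(
    dataset_name="voice-agent-regression-suite",
    splits=["smoke"],            # the ~80-row subset tagged in the dataset
)
results = evaluate(
    my_agent_pred,               # same target function as the full run
    data=smoke,
    evaluators=[factual_match],
    experiment_prefix="smoke-gate",
)
```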
The alternative is shipping bugs you have already shipped before, hoping nobody notices, and arguing in retros about whether the prompt change "felt" better. I will take the LangSmith bill every time.
Frequently Asked Questions
Why anchor everything to LangSmith specifically — can I do this with raw logs?
You can, and a few teams do. The reason we standardized on LangSmith is the graph view of the run tree and the dataset → experiment loop being a single primitive. With raw logs you reconstruct the tree by hand and rebuild the comparison harness yourself. The point of this workflow is that the artifacts are the same shape end-to-end: a run, a dataset row, an experiment, a comparison. Tooling that has those four things native saves weeks.
How big should the regression dataset be before this is worth it?
We started seeing leverage at around 30 rows. Below that, the eval is too noisy to trust as a gate. Above 100 rows, the gate is sharp enough that engineers stop arguing with it. Above 500 rows, you start needing to think about evaluator cost and split into smoke vs. full suites.
What about flaky LLM-as-judge evaluators?
Real problem. We mitigate three ways: (1) pin the judge model, (2) run each judged example three times and majority-vote, (3) calibrate the judge against a human-labeled sample quarterly. If a judge's agreement with humans drops below 80%, we retire it.
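Mitigation (2) is mechanical. A sketch of the voting wrapper, where judge_once is a hypothetical call to the pinned judge model that returns 1 or 0 for a single candidate output:

```python
# Majority-vote wrapper around a flaky LLM-as-judge call.
from collections import Counter
from langsmith.schemas import Example, Run

N_VOTES = 3  # odd, so a majority always exists

def factual_match(run: Run, example: Example) -> dict:
    # judge_once is a hypothetical call to the pinned judge model, returning 1 or 0.
    votes = [judge_once(run.outputs, example.outputs) for _ in range(N_VOTES)]
    score = Counter(votes).most_common(1)[0][0]
    return {"key": "factual_match", "score": score}
```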
How do I handle traces with PII in them?
LangSmith supports tag-based redaction and project-level access controls. We redact at the tracer-callback level (replace phone numbers, emails, full names with hashed tokens) before the trace ever leaves our VPC. The reproduction step then uses the redacted trace; the reference answer is written generically.
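The mechanics, roughly: the LangSmith client accepts hide_inputs / hide_outputs hooks that run before anything is uploaded, so the scrubbing happens inside the VPC. The sketch below substitutes placeholder tokens rather than the hashed tokens described above, and the regexes are deliberately simplistic:

```python
# Scrub PII in the tracer before payloads leave the VPC.
import re
from langsmith import Client

PHONE = re.compile(r"\+?\d[\d\s\-().]{7,}\d")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(payload: dict) -> dict:
    def scrub(value):
        if isinstance(value, str):
            return EMAIL.sub("<email>", PHONE.sub("<phone>", value))
        if isinstance(value, dict):
            return {k: scrub(v) for k, v in value.items()}
        if isinstance(value, list):
            return [scrub(v) for v in value]
        return value
    return scrub(payload)

client = Client(hide_inputs=redact, hide_outputs=redact)
```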
Does this work for voice agents where input is audio, not text?
Yes, with one extra step. We trace the STT transcript as the canonical input, and the dataset rows are text. For audio-specific bugs (e.g., the agent misheard "Bayer" as "buyer"), we add the audio file as an attached artifact and the reference transcript as the reference output, then run a separate STT-quality evaluator. This composability is why our voice and chat platform uses the same trace schema for both modalities.