
Agent Tracing 101: Spans, Sessions, and the Hidden Failure Modes They Reveal

Tracing fundamentals for production AI agents — span hierarchy, session correlation, and the failure patterns that only show up when you trace every step.

TL;DR

You cannot debug an agent from logs. Logs tell you what happened; traces tell you why. A trace is a tree of spans — every LLM call, every tool invocation, every retrieval — with timing, inputs, outputs, and parent-child relationships intact. A session stitches multiple traces together across turns of the same conversation. Together, spans and sessions surface the failure modes that aggregated metrics hide: tool-selection drift, silent retrieval misses, prompt-injection that survives a refusal, latency that comes from one slow tool inside an otherwise fast agent. This post covers the span model, the session model, the patterns I look for, and runnable examples in the LangSmith Python and TypeScript SDKs.

If you are operating an agent in production without per-step traces, you are flying without instruments. The cost of fixing that is one decorator.

What a Trace Actually Is

A trace is the complete record of one logical agent execution. In LangSmith terms, it is a tree of Run objects rooted at one top-level run with is_root=true. Every child operation — an LLM call, a tool call, a retriever, a sub-chain, a custom function — is its own Run with a parent pointer.

The minimum information a span carries:

| Field | Example | Why it matters |
| --- | --- | --- |
| run_type | llm, tool, retriever, chain | Lets you filter and aggregate by class |
| name | gpt-4o-mini, search_kb | Human-readable identity |
| inputs | {"messages": [...]} | Reproducible from this alone |
| outputs | {"choices": [...]} | Plus token counts, cost |
| start_time / end_time | UTC, microsecond | Span duration is the diff |
| parent_run_id | UUID | Builds the tree |
| error | str or null | Caught exceptions |
| metadata | {"user_id": "u_482"} | What you'll filter by later |
| tags | ["prod", "tier:enterprise"] | Cheap categorical slicing |

The key insight: every span is independently queryable, but the value is in the tree. Asking "what happened in this conversation?" is a session question. Asking "why did this turn take 8 seconds?" is a span-tree question. Asking "which tool fails most often when the user is angry?" is a metadata-and-aggregation question. All three need the same underlying data, structured as a tree.

Span Hierarchy: The Anatomy of One Agent Turn

Here is what a single turn of a tool-using agent looks like as a span tree. This is not a toy — it is the shape you actually see in production.

graph TD
  A[ROOT: agent_turn<br/>run_type=chain<br/>2.4s] --> B[plan_step<br/>run_type=llm<br/>gpt-4o<br/>620ms]
  A --> C[tool_call: search_kb<br/>run_type=tool<br/>340ms]
  C --> C1[retriever: vector_db<br/>run_type=retriever<br/>180ms]
  C --> C2[rerank<br/>run_type=chain<br/>140ms]
  A --> D[tool_call: get_account<br/>run_type=tool<br/>90ms]
  A --> E[synthesis_step<br/>run_type=llm<br/>gpt-4o<br/>1.3s]
  E --> E1[guardrail: pii_filter<br/>run_type=chain<br/>40ms]
  E --> E2[guardrail: policy_check<br/>run_type=chain<br/>110ms]
  style A fill:#e6f3ff
  style B fill:#fff4e6
  style E fill:#fff4e6
  style C fill:#e8f5e8
  style D fill:#e8f5e8

Figure 1 — One agent turn produces a tree of 9 spans across 4 run types. The root span aggregates the timing; the leaves expose where the time and tokens went.

A few things this picture makes obvious that a flat log never will:

  • Critical path is visible. The 2.4s root is mostly the 1.3s synthesis LLM call. Optimizing the 340ms tool call would save you nothing the user perceives.
  • Tool latency is decomposable. search_kb is 340ms, of which 180ms is vector retrieval and 140ms is reranking. If you swap the reranker, you have a baseline.
  • Guardrails are separable. The two post-synthesis guardrails are 40ms and 110ms. You can ablate them in eval and measure quality vs. latency tradeoff.
  • Parallelism is auditable. Did search_kb and get_account actually run in parallel? Look at the start_time deltas (a quick check is sketched below). If they were sequential, you are leaving 90ms on the table per turn.
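
To answer that parallelism question from the API, here is a minimal sketch: pull the span tree for one turn and compare the tool spans' timestamps. The run ID is a placeholder; read_run with load_child_runs is assumed to return the full tree.

from langsmith import Client

client = Client()

# Pull the whole span tree for one turn; "<root-run-id>" is a placeholder.
root = client.read_run("<root-run-id>", load_child_runs=True)
tool_spans = sorted(
    (r for r in (root.child_runs or []) if r.run_type == "tool"),
    key=lambda r: r.start_time,
)
for earlier, later in zip(tool_spans, tool_spans[1:]):
    overlapped = later.start_time < earlier.end_time
    print(f"{earlier.name} -> {later.name}: {'parallel' if overlapped else 'sequential'}")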

Sessions: Multi-Turn Correlation

A trace covers one turn. A session covers a whole conversation. In LangSmith, sessions are constructed by stamping every root run in a conversation with the same session_id (also called thread_id in some clients) in metadata or via the dedicated session field.


Why this matters: most production failures are not single-turn failures. They are failures of consistency across turns. The agent picks the right tool on turn 1, forgets context on turn 4, contradicts itself on turn 7. You cannot see any of that without correlating the traces.

from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree

@traceable(run_type="chain", name="agent_turn")
def agent_turn(user_msg: str, session_id: str, history: list):
    """One turn of the agent. The traceable decorator emits the root span."""
    run = get_current_run_tree()
    # Stamp this run with the conversation id so sessions stitch
    run.add_metadata({"session_id": session_id, "turn_index": len(history)})

    # ... real agent logic here: plan, tool, synthesize ...
    response = call_model(history + [{"role": "user", "content": user_msg}])
    return response

# A whole conversation produces N traces, all stamped with the same session_id.
session_id = "sess_2026_05_06_482"
history = []
for user_msg in ["I need to reset my password.",
                 "I tried that, it says my email isn't found.",
                 "Yes, the email is [email protected]."]:
    response = agent_turn(user_msg, session_id, history)
    history += [{"role": "user", "content": user_msg},
                {"role": "assistant", "content": response}]

Now, in LangSmith, you filter metadata.session_id eq "sess_2026_05_06_482" and you get the entire conversation as one orderable list of trace trees. That is the unit of debugging for any agent that lives longer than one HTTP request.
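
The same query is scriptable. A minimal sketch, assuming the metadata stamp from the snippet above — list the project's root runs and keep the ones carrying this session_id, ordered by start time:

from langsmith import Client

client = Client()

# In practice, narrow this with a start_time filter; listing every root run is
# fine for a sketch but not for a busy project.
runs = client.list_runs(project_name="prod", is_root=True)
session = sorted(
    (r for r in runs
     if (r.extra or {}).get("metadata", {}).get("session_id") == "sess_2026_05_06_482"),
    key=lambda r: r.start_time,
)
for r in session:
    meta = (r.extra or {}).get("metadata", {})
    print(r.start_time, r.name, "turn", meta.get("turn_index"))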

The Failure Modes Only Tracing Reveals

This is the part that pays for the instrumentation. Six failure patterns I find in real production agents, all of which look fine in dashboards and obvious in traces.

1. Tool-Selection Drift

The agent has 12 tools. After a model upgrade, one specific tool gets called 30% less. Aggregate success metrics barely move because the fallback tool kind-of-works. Traces show the drift immediately:

from collections import Counter

from langsmith import Client

client = Client()

# Tool-call distribution for one window; rerun with the post-swap window and diff the histograms
runs = client.list_runs(
    project_name="prod",
    run_type="tool",
    filter='and(gte(start_time, "2026-04-15"), lte(start_time, "2026-05-01"))',
)
print(Counter(r.name for r in runs).most_common(15))

You compare the histograms across two windows. If search_internal_kb dropped from 22% of calls to 9%, you have your culprit. No dashboard exposed that; the per-tool span data did.

2. Silent Retrieval Misses

The retriever runs, returns 0 results, and the LLM hallucinates an answer instead of saying "I don't know." End-user satisfaction drops 4 points and nobody knows why. In traces, the retrieval span has outputs.documents = [] and the next LLM span just barrels ahead.

The fix is a 3-line filter on retrieval spans where the output array is empty, alerted on if the count exceeds 1% of retrievals. That alert has caught more silent regressions in our deployments than any LLM judge has.
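
A sketch of that filter. It assumes the retriever span stores its results under outputs["documents"]; adjust the key to whatever your retriever actually emits.

from langsmith import Client

client = Client()

retrievals = list(client.list_runs(project_name="prod", run_type="retriever"))
empty = [r for r in retrievals if not (r.outputs or {}).get("documents")]
miss_rate = len(empty) / max(len(retrievals), 1)
print(f"empty retrievals: {len(empty)}/{len(retrievals)} ({miss_rate:.1%})")
# Alert when miss_rate crosses the 1% threshold mentioned above.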

3. Recovered Prompt Injection

A user attempts a prompt injection. The system prompt + a guardrail catch it on turn 3. The agent refuses correctly. Looks like a win — except in turn 5, deep in the tool-call chain, you see a tool getting parameters that match the injected instruction. The refusal text was clean; the behavior leaked. Only span-level inspection of every tool call in the session catches this.
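
One way to run that inspection, as a sketch: walk every span tree in the suspect session and scan tool-span inputs for the injected instruction. The marker string is illustrative, and session_roots stands in for the session's root runs (e.g. from the session query shown earlier).

from langsmith import Client

client = Client()

INJECTED_MARKER = "ignore previous instructions"
session_roots = [...]  # root runs of the suspect session, e.g. from the session query above

def walk(run):
    """Yield a run and all of its descendants."""
    yield run
    for child in run.child_runs or []:
        yield from walk(child)

for root in session_roots:
    tree = client.read_run(root.id, load_child_runs=True)
    for span in walk(tree):
        if span.run_type == "tool" and INJECTED_MARKER in str(span.inputs).lower():
            print("injected instruction reached tool:", span.name, "in turn", root.id)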

4. Latency from One Slow Sub-Tool

p99 turn latency is 12 seconds. Median is 1.8 seconds. The dashboards say "p99 is bad," which everyone knew. Span data says the slow turns all share one specific tool — fetch_account_history — that sometimes calls a downstream API which sometimes does a full table scan. You wouldn't have known without span timing.

import { Client } from "langsmith";

const client = new Client();

// Find the spans contributing most to p99 latency
const runs = client.listRuns({
  projectName: "prod",
  filter: 'and(eq(run_type, "tool"), gt(latency, "5s"))',
  startTime: new Date(Date.now() - 24 * 3600 * 1000),
});

const byTool: Record<string, number[]> = {};
for await (const r of runs) {
  byTool[r.name] ??= [];
  byTool[r.name].push(r.total_time ?? 0);
}
for (const [name, times] of Object.entries(byTool)) {
  times.sort((a, b) => a - b);
  const p99 = times[Math.floor(times.length * 0.99)] ?? 0;
  console.log(name, "p99=", p99, "n=", times.length);
}

5. Cost Spikes from One Bad Prompt Edit

Token usage triples overnight. Aggregate cost spikes. Where? Filter LLM spans by metadata.prompt_version, group by version, plot tokens-per-call. The new prompt added 800 tokens of CoT scaffolding to every call. One revert later, costs are back. Without span-level token accounting tagged with prompt version, this is a multi-day investigation. With it, it's lunch.
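
A sketch of that grouping, assuming your LLM spans carry total_tokens and a metadata.prompt_version stamp:

from collections import defaultdict
from langsmith import Client

client = Client()

tokens_by_version = defaultdict(list)
for run in client.list_runs(project_name="prod", run_type="llm"):
    version = (run.extra or {}).get("metadata", {}).get("prompt_version", "unknown")
    tokens_by_version[version].append(run.total_tokens or 0)

for version, tokens in sorted(tokens_by_version.items()):
    print(version, "avg tokens/call:", round(sum(tokens) / max(len(tokens), 1)), "n:", len(tokens))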

6. Cross-Turn Contradictions

The agent says "your account is suspended" on turn 2 and "your account is active" on turn 6. CSAT tanks but no individual turn looks broken. Sessions catch this — you run a cross-turn consistency evaluator over every session that hit a low CSAT and you find that 18% of low-CSAT sessions contain at least one direct contradiction. That becomes a regression target.
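
A minimal version of that evaluator, using an OpenAI model as the judge over the ordered root runs from the session query earlier; the output key and judge prompt are illustrative, not a prescribed rubric.

from openai import OpenAI

judge = OpenAI()

def session_contradicts(session_roots) -> bool:
    """Ask a judge model whether any two assistant turns in one session contradict."""
    turns = [
        f"Turn {i}: {(r.outputs or {}).get('output', '')}"  # adjust the key to your root span's output shape
        for i, r in enumerate(session_roots)
    ]
    verdict = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Do any two of these assistant turns directly contradict each other? "
                       "Answer YES or NO.\n\n" + "\n".join(turns),
        }],
    )
    return "YES" in (verdict.choices[0].message.content or "").upper()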

The Trace-to-Eval Pipeline

Tracing isn't just for incident response. It is the raw material for evaluation. The flow:

flowchart LR
  A[Live agent] -->|@traceable| B[LangSmith Traces]
  B --> C{Filter}
  C -->|low feedback| D[Failure pool]
  C -->|tool error| D
  C -->|p99 latency| D
  C -->|random sample| E[Drift pool]
  D --> F[Annotation Queue]
  F --> G[Golden Dataset]
  G --> H[Evaluators on PR]
  H --> I[Deploy gate]
  I --> A
  E --> J[Distribution monitors]
  J --> K[Alerts]
  style B fill:#e6f3ff
  style G fill:#e8f5e8
  style I fill:#fff4e6

Figure 2 — Traces are the input to the entire evaluation loop. No traces, no golden dataset, no regression gate.
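
In code, the failure-pool-to-golden-dataset hop is short. A sketch using the LangSmith dataset APIs — the filter (tool spans that errored) and the dataset name are illustrative choices:

from langsmith import Client

client = Client()

dataset = client.create_dataset(
    "agent-failures-golden",
    description="Failed tool calls mined from prod traces",
)
for run in client.list_runs(project_name="prod", run_type="tool", error=True):
    client.create_example(
        inputs=run.inputs or {},
        outputs=run.outputs or {},
        dataset_id=dataset.id,
    )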


Instrumenting an Agent: Minimum Viable Tracing

The single decorator that makes everything else work, in Python and TypeScript:

import json

from langsmith import traceable

@traceable(run_type="chain", name="agent_turn")
def agent_turn(messages: list, tools: list) -> dict:
    plan = plan_step(messages, tools)       # nested @traceable call
    results = []
    for call in plan["choices"][0]["message"].get("tool_calls") or []:
        results.append(invoke_tool(call))   # nested @traceable call
    return synthesize(messages, results)    # nested @traceable call

@traceable(run_type="llm", name="plan_step")
def plan_step(messages, tools):
    return openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools,
    ).model_dump()

@traceable(run_type="tool", name="invoke_tool")
def invoke_tool(call):
    fn = call["function"]
    return tool_registry[fn["name"]](**json.loads(fn["arguments"]))

The TypeScript equivalent:
import { traceable } from "langsmith/traceable";

const planStep = traceable(
  async (messages: any[]) => openai.chat.completions.create({
    model: "gpt-4o", messages,
  }),
  { runType: "llm", name: "plan_step" },
);

const invokeTool = traceable(
  async (call: { name: string; args: any }) => toolRegistry[call.name](call.args),
  { runType: "tool", name: "invoke_tool" },
);

export const agentTurn = traceable(
  async (messages: any[]) => {
    const plan = await planStep(messages);
    const toolCalls = plan.choices[0].message.tool_calls ?? [];
    const results = await Promise.all(
      toolCalls.map((tc: any) =>
        invokeTool({ name: tc.function.name, args: JSON.parse(tc.function.arguments) }),
      ),
    );
    return synthesize(messages, results);
  },
  { runType: "chain", name: "agent_turn" },
);

That is it. With LANGSMITH_TRACING=true and LANGSMITH_API_KEY set, every call produces a tree, every span is queryable, every session is reconstructible. The total code change is roughly 12 lines.

What to Tag, and Why

Tagging is where rookies under-invest. The spans you don't tag are the spans you can't slice. My standard tagging surface for any production agent:

| Tag/Metadata | Example value | What it unlocks |
| --- | --- | --- |
| metadata.user_id | u_482 | Per-user incident replay |
| metadata.session_id | sess_... | Conversation reconstruction |
| metadata.tenant | acme_corp | Per-customer SLO tracking |
| metadata.prompt_version | v117 | Cost/quality regression by prompt |
| metadata.model | gpt-4o-2026-04 | Model swap A/B |
| tags | ["prod", "tier:enterprise"] | Cheap categorical filters |
| metadata.feature_flags | ["new_router"] | Experiment cohort isolation |

Without these, your traces are forensically rich but operationally blind. You can debug one call beautifully and answer zero questions about a population.
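
Stamping all of this is cheap: static values go on the decorator, per-request values go in at call time via langsmith_extra. A sketch with illustrative values:

from langsmith import traceable

@traceable(
    run_type="chain",
    name="agent_turn",
    tags=["prod"],
    metadata={"prompt_version": "v117", "model": "gpt-4o-2026-04"},
)
def agent_turn(messages: list) -> str:
    ...  # real agent logic here

agent_turn(
    [{"role": "user", "content": "Reset my password"}],
    langsmith_extra={
        "metadata": {"user_id": "u_482", "session_id": "sess_2026_05_06_482", "tenant": "acme_corp"},
        "tags": ["tier:enterprise"],
    },
)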

Common Mistakes I See

  • Tracing only the LLM call. The interesting failures are in the orchestration around it — tool selection, retrieval, guardrails. Trace the chain, not just the model.
  • Untagged sessions. A trace without a session_id is half-useful. Set it on the root span every turn, religiously.
  • Sampling too aggressively. "We sample 1% of traces" sounds responsible until your CFO-tier customer hits a bug and you have nothing. Sample 100% in prod for at least 30 days; then decide if you need to throttle.
  • Sensitive data in spans. messages may contain PHI, payment info, secrets. Use a pre-export redactor; LangSmith supports client-side hooks for this (see the sketch after this list). This is a compliance bug waiting to happen.
  • No retention policy. Traces are cheap individually and expensive in bulk. Decide on retention (30/90/365 days) up front, document it, expose it to InfoSec.
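
A minimal redaction sketch, assuming the hide_inputs / hide_outputs hooks on the LangSmith client; the regex-based redact helper is a stand-in for your own PII scrubber.

import re
from langsmith import Client

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(payload: dict) -> dict:
    # Replace anything that looks like an email before the span leaves the process.
    return {k: EMAIL.sub("<redacted-email>", str(v)) for k, v in payload.items()}

client = Client(hide_inputs=redact, hide_outputs=redact)
# Route traced functions through this client (e.g. langsmith_extra={"client": client})
# so every exported span passes through the redactor.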

How CallSphere Uses Tracing in Production

Inside CallSphere, every voice and chat turn produces a full trace tree — STT span, LLM span, tool spans, TTS span, telephony span — stitched into a session per conversation. We surface the worst 0.5% of sessions to the dashboard automatically, so the operations team grades real failures, not random samples. The same traces feed the golden dataset on a weekly cycle. That loop is the difference between an agent that improves measurably and an agent that drifts invisibly.

If you are running production agents across healthcare, real estate, sales, salon, after-hours, or IT helpdesk, the operating model that scales is the same: trace everything, session everything, mine traces for failures, gate deploys on golden-set regressions. Skip any step and the next incident will surprise you.

FAQ

Q: Should I trace 100% of production traffic? A: For the first 30 days, yes. Decide on sampling only after you have a real picture of trace volume and storage cost. Until then you don't know what you'll need.

Q: Won't tracing add latency? A: A correctly batched, async exporter adds ~1–3ms per span at p99 in our deployments. The LangSmith client batches and uploads in the background. Synchronous tracing in the critical path is a misconfiguration, not a property of tracing.

Q: How do I correlate a LangSmith session with a Datadog APM trace? A: Stamp both with the same correlation id — typically your request id or a generated UUID — in metadata on the LangSmith side and as a tag on the APM side. Then either side becomes a starting point for the other.

Q: Can I trace voice agents the same way? A: Yes. The trace shape is identical: STT span, LLM span(s), tool span(s), TTS span, with the root span representing one turn. The only voice-specific addition is timing of audio frames, which you can attach as metadata to the STT span.

Q: How long should I retain traces? A: 30 days for full-fidelity, 90 days for hashed/aggregated. PII-heavy verticals (healthcare, finance) often shorten this; expose the policy to InfoSec and align with your DPA. Whatever you pick, automate it — manual retention always slips.

Trace Your Agents With CallSphere

If you want a production-grade tracing surface that already speaks LangSmith conventions, with sessions stitched across voice and chat, see the products page or the observability glossary entry. Bring your worst session — we will trace it together.

Book a demo · See products · Browse the glossary
