By Sagar Shankaran, Founder of CallSphere
Final-answer accuracy hides broken reasoning. Build an eval pipeline that scores the reasoning trace itself — coherence, faithfulness to tools, dead-end detection.
Key takeaways
Final-answer-only evals are the agent equivalent of grading code by whether it compiles. You will catch the catastrophic failures and miss every silent class of bug: agents that arrive at the right answer through fabricated tool outputs, confused intermediate goals, or pure luck. As reasoning models like o3-2025-04-16 and gpt-5-2025-04-14 move into production agent loops, the trace itself becomes evaluable signal: a sequence of tool calls, intermediate assertions, and (where exposed) reasoning summaries you can score for coherence, faithfulness, redundancy, and tool-grounding. This post covers the architecture for a trace-evaluation pipeline, a working LangSmith evaluate() example with a custom trace evaluator, and the rubric we run in production across the agents on CallSphere. In our benchmarking, trace evaluators catch roughly 3× as many real defects as final-answer evals alone.
Three categories of bug routinely sneak past final-answer-correctness checks:
1. Right answer, wrong reasoning. The agent fabricates an intermediate fact ("the patient's insurance is Aetna" when the tool never returned that), then happens to land on a correct final answer because the user's actual insurance was Aetna. Score on final answer: pass. Score on faithfulness: fail. The next session with a different insurance, the agent fabricates again and ships a wrong recommendation.
2. Dead-end loops. The agent calls tool A, gets a result, then calls tool A again with the same args, then again, then finally calls tool B and proceeds. Final answer correct, latency 4× target, cost 4× target. Final-answer eval: pass. Trace eval: large redundancy penalty.
3. Hallucinated tool outputs. Particularly common with chatty fast models in the executor role. The agent "remembers" a tool result that was never actually returned by the tool — it's confabulating from prior context. Caught only by tool-grounding evaluators that diff what the trace claims against what the tool actually emitted.
In our internal benchmarking across the agents serving our healthcare and IT-helpdesk verticals, trace-quality evaluators caught 3.1× as many real defects as final-answer-only evaluators on the same dataset. Most of the defects were silent: customers got plausible answers backed by faulty reasoning, and the bugs only surfaced when reasoning patterns drifted enough to occasionally produce wrong final answers too.
Reasoning-trace evaluation is not the same as final-answer evaluation, and confusing the two leads to nonsensical metrics. Two orthogonal axes:
| Axis | Question | What signals it |
|---|---|---|
| Final-answer correctness | Did the agent give the user the right output? | Reference answer match, factual judge, schema validation |
| Reasoning-trace faithfulness | Did the agent get there through valid reasoning? | Coherence, tool-grounding, redundancy, dead-end rate |
You want both green. A trace that is internally coherent and tool-grounded but produces a wrong final answer means your tools or knowledge base are broken. A trace that is incoherent but produces a correct final answer means you got lucky and you'll regress unpredictably.
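The two axes combine into four outcomes. As a minimal sketch (the function name, the labels, and the 0.8 cut-off are illustrative assumptions, not values from our pipeline), the diagnosis can be written as:

```python
def diagnose(final_correct: bool, trace_score: float, threshold: float = 0.8) -> str:
    """Map the two axes to a diagnosis. `threshold` is a hypothetical cut-off."""
    trace_ok = trace_score >= threshold
    if final_correct and trace_ok:
        return "healthy"
    if trace_ok:
        # Sound reasoning, wrong output: suspect the tools or knowledge base.
        return "check tools / knowledge base"
    if final_correct:
        # Right output, unsound reasoning: likely to regress unpredictably.
        return "lucky pass"
    return "broken"


print(diagnose(final_correct=True, trace_score=0.55))  # lucky pass
```

The "lucky pass" quadrant is the one a final-answer-only eval cannot see.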
```mermaid
flowchart TD
    A[Agent run completes] --> B[Extract trace from LangSmith]
    B --> C[Normalize: tool calls, results, reasoning summary]
    C --> D[Trace evaluator suite]
    D --> E1[Coherence judge]
    D --> E2[Tool-grounding check]
    D --> E3[Redundancy detector]
    D --> E4[Dead-end detector]
    D --> E5[Constraint-satisfaction judge]
    E1 --> F[Aggregate trace_score]
    E2 --> F
    E3 --> F
    E4 --> F
    E5 --> F
    F --> G{trace_score < threshold?}
    G -->|yes| H[Flag for human review]
    G -->|no| I[Pass]
    H --> J[Add to regression dataset]
    style F fill:#ffd
    style H fill:#fcc
    style I fill:#cfc
```
Figure 1: The trace-eval pipeline. Each evaluator is independent and contributes to an aggregated trace_score. The redundancy, dead-end, and tool-grounding checks are structural (tool-grounding adds a small-LLM entailment step); the coherence and constraint-satisfaction checks are LLM-as-judge.
The pipeline is deliberately mixed: deterministic structural checks where we can afford them (cheap, low-noise) and judge-based checks where structure isn't enough (expressive, more expensive).
For OpenAI reasoning models, you get three layers of signal:
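1. The tool-call sequence, as ordered {tool, args, result} triples.
2. The intermediate narration the agent emits between tool calls.
3. The reasoning summary (reasoning.summary). The full reasoning is not exposed; the summary is.
LangSmith captures all three when you trace the agent end-to-end. Pull a run: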
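```python
from langsmith import Client

client = Client()
# The run ID is elided here; substitute a real one.
run = client.read_run("a3f9-...-run-id", load_child_runs=True)

trace = []
for child in run.child_runs or []:
    outputs = child.outputs or {}
    if child.run_type == "tool":
        # Derive latency from the run timestamps; end_time is None while a run is in flight.
        latency_ms = None
        if child.end_time is not None:
            latency_ms = (child.end_time - child.start_time).total_seconds() * 1000
        trace.append({
            "type": "tool",
            "name": child.name,
            "args": child.inputs,
            "result": child.outputs,
            "latency_ms": latency_ms,
        })
    elif child.run_type == "llm":
        # Where the reasoning summary lives depends on how the LLM call was traced.
        msg = outputs.get("generations", [[{}]])[0][0]
        reasoning = msg.get("message", {}).get("reasoning_summary")
        trace.append({
            "type": "llm",
            "model": (child.extra or {}).get("metadata", {}).get("ls_model_name"),
            "reasoning_summary": reasoning,
            "content": msg.get("text"),
        })
```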
That trace list — flat, ordered, with both tool I/O and reasoning summaries — is the input to every downstream evaluator.
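For orientation, here is what a normalized trace in that shape might look like. The tool names, IDs, and values are made up for illustration; only the keys mirror the extraction code above.

```python
# Illustrative data only: tool names, IDs, and values are invented.
sample_trace = [
    {"type": "llm", "model": "o3-2025-04-16",
     "reasoning_summary": "Need the patient's carrier before searching slots.",
     "content": "Let me check your insurance first."},
    {"type": "tool", "name": "get_insurance",
     "args": {"patient_id": "p-001"},
     "result": {"carrier": "Aetna"}, "latency_ms": 120.0},
    {"type": "tool", "name": "search_appointments",
     "args": {"carrier": "Aetna", "date": "2026-05-12"},
     "result": {"slots": ["10:30"]}, "latency_ms": 340.0},
    {"type": "llm", "model": "o3-2025-04-16",
     "reasoning_summary": "One in-network slot is available.",
     "content": "You can be seen at 10:30 on 2026-05-12."},
]


def split_trace(trace):
    """Separate tool I/O from model narration while preserving order."""
    tool_steps = [s for s in trace if s["type"] == "tool"]
    llm_steps = [s for s in trace if s["type"] == "llm"]
    return tool_steps, llm_steps


tool_steps, llm_steps = split_trace(sample_trace)
print([s["name"] for s in tool_steps])  # ['get_insurance', 'search_appointments']
```

Keeping the list flat and ordered means each evaluator can filter by type without re-walking the run tree.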
Each one is a small, focused evaluator. Composability beats one mega-judge.
Coherence: does the trace tell a story that makes sense? Each step's intent should follow from the previous step's outcome.
```python
import json

from openai import OpenAI

judge = OpenAI()  # reads OPENAI_API_KEY from the environment

COHERENCE_PROMPT = """You are evaluating the COHERENCE of an agent's reasoning trace.
Rules:
- Each step's intent must logically follow from the prior step's outcome.
- Goal shifts mid-trace WITHOUT a triggering observation are incoherent.
- Mark coherent (1.0), partially coherent (0.5), incoherent (0.0).
- Output JSON: {score: float, rationale: str}"""


def coherence_evaluator(trace: list[dict]) -> dict:
    resp = judge.chat.completions.create(
        model="gpt-4o-2024-08-06",  # PIN the judge model
        messages=[
            {"role": "system", "content": COHERENCE_PROMPT},
            # default=str keeps non-JSON values (e.g. timestamps) from raising
            {"role": "user", "content": json.dumps(trace, indent=2, default=str)},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```
Tool-grounding: does every factual claim the agent makes trace back to a tool result? This is structural: we extract claims from the agent's narration and check whether each appears in some prior tool's output.
```python
import json


def tool_grounding_evaluator(trace: list[dict]) -> dict:
    tool_outputs = [step["result"] for step in trace if step["type"] == "tool"]
    pool = json.dumps(tool_outputs, default=str).lower()
    # Use a small LLM to extract factual claims from the agent's narration.
    # extract_claims and claim_supported are helpers defined elsewhere.
    claims = extract_claims([s for s in trace if s["type"] == "llm"])
    grounded = sum(1 for c in claims if claim_supported(c, pool))
    score = grounded / max(len(claims), 1)
    return {
        "score": score,
        "rationale": f"{grounded}/{len(claims)} claims grounded in tool output",
    }
```
claim_supported is fuzzy — substring + small-LLM entailment check. The dollar amount, date, name, or ID in the claim must appear (or be entailed) somewhere in the tool output pool.
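The substring half of that check can be sketched as follows. This is an assumption-laden illustration, not our production helper: the regex, the function names, and the sample data are hypothetical, it covers only dollar amounts, ISO dates, and ID-like tokens, and names are left to the entailment step.

```python
import json
import re

# Hypothetical pattern: dollar amounts, ISO dates, and ID-like tokens such as "apt-778".
_SPECIFICS = re.compile(r"\$\d[\d,]*(?:\.\d+)?|\d{4}-\d{2}-\d{2}|\b[A-Za-z]+-\d+\b")


def specifics(claim: str) -> list[str]:
    """Pull the checkable tokens out of a claim, lowercased to match the pool."""
    return [m.lower() for m in _SPECIFICS.findall(claim)]


def claim_supported_substring(claim: str, pool: str) -> bool:
    """True only if every specific token in the claim appears in the pool.

    Claims with no specific tokens return False so they fall through to the
    small-LLM entailment check, which is not shown here.
    """
    tokens = specifics(claim)
    return bool(tokens) and all(t in pool for t in tokens)


pool = json.dumps([{"copay": "$40.00", "date": "2026-05-12"}]).lower()
print(claim_supported_substring("Your copay is $40.00 on 2026-05-12.", pool))  # True
print(claim_supported_substring("Appointment apt-778 is confirmed.", pool))    # False
```

Running the cheap substring pass first keeps the entailment model for the claims that actually need it.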
Redundancy: the detector counts repeated tool calls with identical or near-identical args. Three calls to search_appointments with the same date is almost always a dead-end loop.
```python
import json


def redundancy_evaluator(trace: list[dict]) -> dict:
    # Serialize args with sorted keys so key order does not hide a duplicate.
    calls = [(s["name"], json.dumps(s["args"], sort_keys=True))
             for s in trace if s["type"] == "tool"]
    dups = len(calls) - len(set(calls))
    score = max(0.0, 1.0 - (dups / max(len(calls), 1)))
    return {"score": score, "rationale": f"{dups} duplicate tool calls"}
```
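The evaluator above only catches byte-identical args. One way to approximate "near-identical" is to canonicalize args before comparing. The normalization rules below (lowercasing and collapsing whitespace in strings) are an assumption for illustration; what counts as equivalent depends on your tools.

```python
import json


def normalize_args(value):
    """Canonicalize args so trivially different calls compare equal."""
    if isinstance(value, dict):
        return {k: normalize_args(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return [normalize_args(v) for v in value]
    if isinstance(value, str):
        return " ".join(value.lower().split())
    return value


def near_duplicate_count(trace):
    """Count tool calls whose (name, normalized args) was already seen."""
    seen = set()
    dups = 0
    for step in trace:
        if step["type"] != "tool":
            continue
        key = (step["name"], json.dumps(normalize_args(step["args"]), sort_keys=True))
        if key in seen:
            dups += 1
        seen.add(key)
    return dups


calls = [
    {"type": "tool", "name": "search_kb", "args": {"query": "VPN reset"}},
    {"type": "tool", "name": "search_kb", "args": {"query": " vpn  reset "}},
    {"type": "tool", "name": "search_kb", "args": {"query": "printer"}},
]
print(near_duplicate_count(calls))  # 1
```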
Dead-end detection: a dead-end is a tool call whose result is never used. We trace data flow: each tool result must either feed a subsequent tool's args or appear in the final answer.
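A minimal sketch of that data-flow check, assuming "used" can be approximated by a value from the result reappearing in later tool args or in the final answer. The value matching, the three-character minimum, and the sample trace are illustrative assumptions; a production version would need stricter matching.

```python
import json


def leaf_values(obj):
    """Yield scalar leaves of a nested tool result as lowercase strings."""
    if isinstance(obj, dict):
        for v in obj.values():
            yield from leaf_values(v)
    elif isinstance(obj, (list, tuple)):
        for v in obj:
            yield from leaf_values(v)
    elif obj is not None and not isinstance(obj, bool):
        text = str(obj).strip().lower()
        if len(text) >= 3:  # skip very short values that match by accident
            yield text


def dead_end_evaluator(trace, final_answer=None):
    if final_answer is None:
        # Assumption: the last LLM step carries the final answer.
        llm_steps = [s for s in trace if s["type"] == "llm"]
        final_answer = (llm_steps[-1].get("content") or "") if llm_steps else ""
    tool_idx = [i for i, s in enumerate(trace) if s["type"] == "tool"]
    unused = 0
    for i in tool_idx:
        # Everything downstream of this call: later tool args plus the final answer.
        later_args = [trace[j]["args"] for j in tool_idx if j > i]
        downstream = (json.dumps(later_args, default=str) + " " + final_answer).lower()
        values = list(leaf_values(trace[i]["result"]))
        if values and not any(v in downstream for v in values):
            unused += 1
    score = 1.0 - unused / max(len(tool_idx), 1)
    return {"score": score, "rationale": f"{unused} tool result(s) never used downstream"}


demo = [
    {"type": "tool", "name": "get_patient", "args": {"phone": "555-0100"},
     "result": {"patient_id": "p-001"}},
    {"type": "tool", "name": "search_appointments", "args": {"patient_id": "p-001"},
     "result": {"slot": "10:30"}},
    {"type": "tool", "name": "get_weather", "args": {}, "result": {"forecast": "sunny"}},
    {"type": "llm", "content": "You are booked for 10:30."},
]
print(dead_end_evaluator(demo)["rationale"])  # 1 tool result(s) never used downstream
```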
Constraint satisfaction: for planner-driven agents (see the hybrid reasoning architecture), we extract the user's stated constraints and verify each was respected by the trace. "Aetna insurance" stated as a constraint → at least one tool call in the trace must have filtered or verified for Aetna.
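A deterministic approximation of that rule is sketched below, assuming the constraints have already been extracted as plain strings. The function name and sample data are hypothetical, and this substring check is a cheap pre-filter rather than a replacement for a judge: it cannot tell whether a constraint was applied correctly, only whether it reached a tool call.

```python
import json


def constraint_coverage(trace, constraints):
    """Fraction of stated constraints that appear in at least one tool call's args."""
    tool_args = [json.dumps(s["args"], default=str).lower()
                 for s in trace if s["type"] == "tool"]
    missed = [c for c in constraints if not any(c.lower() in a for a in tool_args)]
    score = 1.0 - len(missed) / max(len(constraints), 1)
    return {"score": score, "rationale": f"unverified constraints: {missed}"}


calls = [
    {"type": "tool", "name": "search_providers",
     "args": {"insurance": "Aetna", "zip": "94107"}, "result": {}},
]
print(constraint_coverage(calls, ["Aetna", "before 5pm"]))
```

Here "Aetna" reaches a tool call and "before 5pm" does not, so the score is 0.5.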
The whole point is to run these through LangSmith evaluate(), as part of the same eval loop you already have for final answers.
```python
import json

from langsmith import evaluate, Client
from langsmith.schemas import Example, Run


def trace_quality_evaluator(run: Run, example: Example) -> dict:
    trace = normalize_trace(run)  # the extraction shown above
    scores = {
        "coherence": coherence_evaluator(trace)["score"],
        "tool_grounding": tool_grounding_evaluator(trace)["score"],
        "redundancy": redundancy_evaluator(trace)["score"],
        "dead_end": dead_end_evaluator(trace)["score"],
        "constraints": constraint_evaluator(trace, example)["score"],
    }
    # Weighted aggregate. Tool-grounding is non-negotiable.
    weights = {"coherence": 0.20, "tool_grounding": 0.30,
               "redundancy": 0.15, "dead_end": 0.15, "constraints": 0.20}
    aggregate = sum(scores[k] * weights[k] for k in weights)
    return {
        "key": "trace_quality",
        "score": aggregate,
        "comment": json.dumps(scores),  # per-dimension breakdown for the UI
    }


def final_answer_evaluator(run: Run, example: Example) -> dict:
    # Standard rubric/exact-match against example.outputs
    ...


results = evaluate(
    lambda inp: build_agent().invoke(inp),
    data="agent-regression-suite",
    evaluators=[trace_quality_evaluator, final_answer_evaluator],
    experiment_prefix="trace-eval-2026-05-06",
    metadata={"judge": "gpt-4o-2024-08-06", "agent_planner": "o3-2025-04-16"},
    max_concurrency=8,
)
```
The evaluator returns a single trace_quality score plus a JSON comment with the per-dimension breakdown, so the LangSmith UI shows both the rolled-up number and the diagnosis when something regresses.
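It helps to work the weighted sum by hand once. With the weights above, a trace that fails tool-grounding completely but passes everything else still aggregates to 0.70. The cap in the sketch below is an optional extra and an assumption on our part for illustration (the `grounding_floor` parameter is hypothetical), not part of the evaluator shown above.

```python
WEIGHTS = {"coherence": 0.20, "tool_grounding": 0.30,
           "redundancy": 0.15, "dead_end": 0.15, "constraints": 0.20}


def aggregate(scores, grounding_floor=None):
    """Weighted sum, optionally capped when tool-grounding falls below a floor."""
    total = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)
    if grounding_floor is not None and scores["tool_grounding"] < grounding_floor:
        # Hypothetical hard gate: a fabricating trace cannot score above its grounding.
        total = min(total, scores["tool_grounding"])
    return round(total, 4)


fabricating = {"coherence": 1.0, "tool_grounding": 0.0,
               "redundancy": 1.0, "dead_end": 1.0, "constraints": 1.0}
print(aggregate(fabricating))                       # 0.7
print(aggregate(fabricating, grounding_floor=0.5))  # 0.0
```

Whether 0.70 trips your review threshold depends on where you set it, which is why the per-dimension breakdown in the comment matters.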
This is the rubric we hand to the judge model and re-use for human calibration:
| Dimension | 1.0 (pass) | 0.5 (partial) | 0.0 (fail) |
|---|---|---|---|
| Coherence | Every step follows from prior outcome | One unjustified goal shift | Multiple unjustified shifts or contradictory steps |
| Tool-grounding | All factual claims supported by a tool result | One unsupported claim | Multiple fabricated facts |
| Redundancy | Zero duplicate tool calls | 1–2 duplicates | 3+ duplicates or visible loop |
| Dead-end | Every tool result used downstream | One unused result | Multiple unused results |
| Constraint satisfaction | All stated constraints respected | One soft constraint violated | A hard constraint (insurance, time, identity) violated |
We re-calibrate the judge against human labels quarterly on a 60-row sample. In the last calibration the judge agreed with humans 87% of the time on coherence, 94% on tool-grounding (the deterministic component helps), and 91% on constraints. Below 80% agreement we retire the judge prompt and rewrite it.
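The calibration check itself is simple arithmetic. A sketch, assuming agreement means an exact match on the rubric's 0.0 / 0.5 / 1.0 scale (the function names and sample labels are illustrative):

```python
def agreement_rate(judge_labels, human_labels):
    """Share of rows where the judge and the human gave the same rubric score."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled row")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def should_retire(rate, floor=0.80):
    """Below the floor, the judge prompt gets rewritten."""
    return rate < floor


judge = [1.0, 0.5, 0.0, 1.0, 1.0]
human = [1.0, 0.5, 0.5, 1.0, 0.0]
rate = agreement_rate(judge, human)
print(rate, should_retire(rate))  # 0.6 True
```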
Three real examples from our last six months:
Healthcare scheduler: final-answer accuracy was 96%; trace-quality dropped from 0.91 to 0.78 after a prompt change. Root cause: the agent started fabricating provider availability from prior context instead of calling the availability tool. The tool-grounding evaluator flagged it within one PR. A final-answer eval would have caught it weeks later, once the fabrications started producing wrong answers.
IT helpdesk runbook agent: redundancy score dropped from 0.95 to 0.72. Investigation showed the agent looping on search_kb when the first result didn't have an exact match. Final answer was usually correct (it eventually escalated), but p95 latency went from 6s to 22s. Detected before deploy.
Real-estate qualifier: constraint-satisfaction score regressed silently for two days. The agent was asking the right qualifying questions but ignoring the stated price-cap constraint when scoring leads. Final-answer eval used a generic rubric; the constraint evaluator was the only one that caught it. We now re-run the demo flow trace-eval before any prompt change to that agent.
A few things we got wrong before we got them right:
Pin the judge to gpt-4o-2024-08-06, not gpt-4o. A floating judge alias means your historical scores are not comparable. We learned this the hard way when a silent OpenAI revision shifted our coherence scores by 4 points overnight.
You cannot evaluate the raw chain-of-thought. OpenAI exposes a summary of o3/gpt-5 reasoning via the Responses API. That summary plus the tool-call sequence plus the intermediate narration is what you evaluate. It's enough.
Evaluate the planner and the executor separately if you have a hybrid loop. The planner produces a structured plan you can score directly against a reference plan or rubric; the executor produces a tool-call trace you score with the pipeline above. Different failure modes, different evaluators.
Start with deterministic evaluators (redundancy, dead-end, tool-grounding) — they need no labels and catch a third of regressions on their own. Add LLM-as-judge evaluators next, calibrated against ~40 human-labeled traces. Don't try to build a perfect rubric on day one; iterate it like you'd iterate a system prompt.
The same pipeline works for agents built on non-OpenAI models. The deterministic evaluators are model-agnostic. The judge evaluators work with any capable judge model (we've cross-checked with Claude and gotten ~91% agreement with the gpt-4o judge on coherence). The thing you lose with some non-OpenAI providers is the structured reasoning summary; you fall back to evaluating the visible narration.
In CI, run a smoke subset on every PR (40 rows, ~2 minutes), the full suite on PRs touching the agent, and a nightly run on main as a baseline. Same pattern as the continuous evaluation gate. Trace-eval is just another evaluator in the suite; once you have the pipeline, it's free to add to existing CI.
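For the smoke subset to be comparable across PRs, the same rows should run every time. One way to get that, sketched here as an assumption rather than a description of our CI, is to rank example IDs by a hash and take the first 40. The function name and ID format are hypothetical.

```python
import hashlib


def smoke_subset(example_ids, size=40):
    """Pick a stable subset: the same IDs are chosen regardless of input order."""
    ranked = sorted(example_ids, key=lambda i: hashlib.sha256(i.encode()).hexdigest())
    return ranked[:size]


ids = [f"ex-{n}" for n in range(200)]
subset = smoke_subset(ids)
print(len(subset))  # 40
```

Hash ranking avoids a stored seed and keeps most of the subset unchanged when rows are added to the dataset.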
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.