
Evaluating Agent Reasoning Traces: Measuring Thought Quality Beyond Final Answers

Final-answer accuracy hides broken reasoning. Build an eval pipeline that scores the reasoning trace itself — coherence, faithfulness to tools, dead-end detection.

TL;DR

Final-answer-only evals are the agent equivalent of grading code by whether it compiles. You will catch the catastrophic failures and miss every silent class of bug — agents that arrive at the right answer through fabricated tool outputs, confused intermediate goals, or pure luck. As reasoning models like o3-2025-04-16 and gpt-5 move into production agent loops, the trace itself becomes evaluable signal: a sequence of tool calls, intermediate assertions, and (where exposed) reasoning summaries you can score for coherence, faithfulness, redundancy, and tool-grounding. This post covers the architecture of a trace-evaluation pipeline, a working LangSmith evaluate() example with a custom trace evaluator, and the rubric we run in production across the agents on CallSphere. It catches roughly 3× as many regressions as our final-answer evals alone.

The Failure Modes Final-Answer Evals Hide

Three categories of bug routinely sneak past final-answer-correctness checks:

1. Right answer, wrong reasoning. The agent fabricates an intermediate fact ("the patient's insurance is Aetna" when the tool never returned that), then happens to land on a correct final answer because the user's actual insurance was Aetna. Score on final answer: pass. Score on faithfulness: fail. The next session with a different insurance, the agent fabricates again and ships a wrong recommendation.

2. Dead-end loops. The agent calls tool A, gets a result, then calls tool A again with the same args, then again, then finally calls tool B and proceeds. Final answer correct, latency 4× target, cost 4× target. Final-answer eval: pass. Trace eval: large redundancy penalty.

3. Hallucinated tool outputs. Particularly common with chatty fast models in the executor role. The agent "remembers" a tool result that was never actually returned by the tool — it's confabulating from prior context. Caught only by tool-grounding evaluators that diff what the trace claims against what the tool actually emitted.

In our internal benchmarking across the agents serving our healthcare and IT-helpdesk verticals, trace-quality evaluators caught 3.1× as many real defects as final-answer-only evaluators on the same dataset. Most of the defects were silent: customers got plausible answers backed by faulty reasoning, and the bugs only surfaced when reasoning patterns drifted enough to occasionally produce wrong final answers too.

What "Trace Quality" Actually Means

Reasoning-trace evaluation is not the same as final-answer evaluation, and confusing the two leads to nonsensical metrics. Two orthogonal axes:

| Axis | Question | What signals it |
| --- | --- | --- |
| Final-answer correctness | Did the agent give the user the right output? | Reference-answer match, factual judge, schema validation |
| Reasoning-trace faithfulness | Did the agent get there through valid reasoning? | Coherence, tool-grounding, redundancy, dead-end rate |

You want both green. A trace that is internally coherent and tool-grounded but produces a wrong final answer means your tools or knowledge base are broken. A trace that is incoherent but produces a correct final answer means you got lucky and you'll regress unpredictably.

The Trace Eval Pipeline

flowchart TD
  A[Agent run completes] --> B[Extract trace from LangSmith]
  B --> C[Normalize: tool calls, results, reasoning summary]
  C --> D[Trace evaluator suite]
  D --> E1[Coherence judge]
  D --> E2[Tool-grounding check]
  D --> E3[Redundancy detector]
  D --> E4[Dead-end detector]
  D --> E5[Constraint-satisfaction judge]
  E1 --> F[Aggregate trace_score]
  E2 --> F
  E3 --> F
  E4 --> F
  E5 --> F
  F --> G{trace_score < threshold?}
  G -->|yes| H[Flag for human review]
  G -->|no| I[Pass]
  H --> J[Add to regression dataset]
  style F fill:#ffd
  style H fill:#fcc
  style I fill:#cfc

Figure 1 — The trace-eval pipeline. Each evaluator is independent and contributes to an aggregated trace_score; the redundancy and dead-end checks are fully deterministic, tool-grounding is structural with a small-LLM assist, and the rest (coherence, constraint satisfaction) are LLM-as-judge.
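The gate at the bottom of the flowchart is worth making concrete: runs whose aggregate trace_score falls below a threshold get flagged for review and folded into the regression dataset. A minimal sketch using the LangSmith client (the threshold value is illustrative; the dataset name matches the regression suite used later in this post):

from langsmith import Client

ls_client = Client()
TRACE_SCORE_THRESHOLD = 0.85  # illustrative; tune per agent and rubric weighting

def gate_run(run, trace_score: float) -> None:
    # Runs that clear the threshold pass silently; everything else is flagged
    # for human review and added to the regression dataset so the failure is
    # re-tested on every future eval run.
    if trace_score >= TRACE_SCORE_THRESHOLD:
        return
    ls_client.create_feedback(run.id, key="needs_human_review", score=trace_score)
    ls_client.create_example(
        inputs=run.inputs,
        outputs=run.outputs,
        dataset_name="agent-regression-suite",
    )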


The pipeline is deliberately mixed: deterministic structural checks where we can afford them (cheap, low-noise) and judge-based checks where structure isn't enough (expressive, more expensive).

Extracting the Trace

For OpenAI reasoning models, you get three layers of signal:

  1. Tool-call sequence — the structured list of {tool, args, result} triples.
  2. Reasoning summary — for o3/o4-mini, OpenAI exposes a summary of the reasoning trace via the Responses API (reasoning.summary). The full chain-of-thought is not exposed; the summary is (see the sketch after this list).
  3. Intermediate assistant messages — when the agent narrates its progress between tool calls.
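If you want the summary without going through LangSmith, a minimal sketch against the OpenAI Responses API looks like this (the prompt is illustrative; the response shapes follow OpenAI's reasoning-summary support in the Responses API, so adapt to your SDK version):

from openai import OpenAI

client = OpenAI()

# Ask for a reasoning summary alongside the normal output. The raw
# chain-of-thought is never returned; only summary items are.
resp = client.responses.create(
    model="o3-2025-04-16",
    reasoning={"effort": "medium", "summary": "auto"},
    input="Find the next available cardiology slot for an Aetna patient.",
)

# Summaries arrive as output items of type "reasoning".
summaries = [
    part.text
    for item in resp.output
    if item.type == "reasoning"
    for part in item.summary
]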

LangSmith captures all three when you trace the agent end-to-end. Pull a run:

from langsmith import Client

client = Client()
run = client.read_run("a3f9-...-run-id", load_child_runs=True)

trace = []
for child in run.child_runs:
    if child.run_type == "tool":
        trace.append({
            "type": "tool",
            "name": child.name,
            "args": child.inputs,
            "result": child.outputs,
            "latency_ms": child.total_time * 1000,
        })
    elif child.run_type == "llm":
        msg = child.outputs.get("generations", [[{}]])[0][0]
        reasoning = msg.get("message", {}).get("reasoning_summary")
        trace.append({
            "type": "llm",
            "model": child.extra.get("metadata", {}).get("ls_model_name"),
            "reasoning_summary": reasoning,
            "content": msg.get("text"),
        })

That trace list — flat, ordered, with both tool I/O and reasoning summaries — is the input to every downstream evaluator.
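For concreteness, a made-up two-step slice of a normalized trace looks like this; the keys are exactly the ones the extraction loop above emits:

example_trace = [
    {
        "type": "tool",
        "name": "search_appointments",
        "args": {"provider": "cardiology", "date": "2026-05-07"},
        "result": {"slots": ["09:30", "14:00"]},
        "latency_ms": 412.0,
    },
    {
        "type": "llm",
        "model": "o3-2025-04-16",
        "reasoning_summary": "Two open slots returned; offer the earlier one.",
        "content": "I can offer you 9:30 AM on May 7th with cardiology.",
    },
]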

The Five Trace Evaluators

Each one is a small, focused evaluator. Composability beats one mega-judge.

1. Coherence (LLM-as-judge)

Does the trace tell a story that makes sense? Each step's intent should follow from the previous step's outcome.

COHERENCE_PROMPT = """You are evaluating the COHERENCE of an agent's reasoning trace.

Rules:
- Each step's intent must logically follow from the prior step's outcome.
- Goal shifts mid-trace WITHOUT a triggering observation are incoherent.
- Mark coherent (1.0), partially coherent (0.5), incoherent (0.0).
- Output JSON: {score: float, rationale: str}"""

def coherence_evaluator(trace: list[dict]) -> dict:
    resp = judge.chat.completions.create(
        model="gpt-4o-2024-08-06",  # PIN the judge model
        messages=[
            {"role": "system", "content": COHERENCE_PROMPT},
            {"role": "user", "content": json.dumps(trace, indent=2)},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

2. Tool-grounding (structural)

Does every factual claim the agent makes trace back to a tool result? This is structural — we extract claims from the agent's narration and check whether each appears in some prior tool's output.

def tool_grounding_evaluator(trace: list[dict]) -> dict:
    tool_outputs = [step["result"] for step in trace if step["type"] == "tool"]
    pool = json.dumps(tool_outputs).lower()

    # Use a small LLM to extract factual claims from the agent's narration
    claims = extract_claims([s for s in trace if s["type"] == "llm"])

    grounded = sum(1 for c in claims if claim_supported(c, pool))
    score = grounded / max(len(claims), 1)
    return {
        "score": score,
        "rationale": f"{grounded}/{len(claims)} claims grounded in tool output",
    }

claim_supported is fuzzy — substring + small-LLM entailment check. The dollar amount, date, name, or ID in the claim must appear (or be entailed) somewhere in the tool output pool.
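A minimal sketch of that helper, reusing the judge client from the coherence evaluator (the small-model name and prompt wording are illustrative):

def claim_supported(claim: str, pool: str) -> bool:
    # Cheap path: exact substring hit against the lowercased tool-output pool.
    if claim.lower() in pool:
        return True
    # Fallback: small-model entailment check for claims that don't match verbatim.
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",  # illustrative small judge; pin a dated snapshot in practice
        messages=[
            {"role": "system", "content": (
                'Reply with JSON {"supported": true|false}: '
                "is the claim fully supported by the tool outputs?"
            )},
            {"role": "user", "content": f"Claim: {claim}\n\nTool outputs: {pool}"},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return bool(json.loads(resp.choices[0].message.content).get("supported", False))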

3. Redundancy detector (deterministic)

Counts repeated tool calls with identical or near-identical args. Three calls to search_appointments with the same date is almost always a dead-end loop.

def redundancy_evaluator(trace: list[dict]) -> dict:
    calls = [(s["name"], json.dumps(s["args"], sort_keys=True))
             for s in trace if s["type"] == "tool"]
    dups = len(calls) - len(set(calls))
    score = max(0.0, 1.0 - (dups / max(len(calls), 1)))
    return {"score": score, "rationale": f"{dups} duplicate tool calls"}

4. Dead-end detector (deterministic)

A dead-end is a tool call whose result is never used. We trace data flow: each tool result must either feed a subsequent tool's args or appear in the final answer.
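A minimal sketch of that check, under the assumption that token containment is a good-enough proxy for data flow (the real evaluator can be stricter about which values count):

import re

def dead_end_evaluator(trace: list[dict], final_answer: str = "") -> dict:
    # A tool result counts as "used" if any substantial token from it shows up
    # in a later tool call's args or in the final answer.
    tool_steps = [(i, s) for i, s in enumerate(trace) if s["type"] == "tool"]
    unused = 0
    for i, step in tool_steps:
        result_tokens = set(re.findall(r"[a-z0-9]{4,}", json.dumps(step["result"]).lower()))
        downstream = json.dumps([s["args"] for j, s in tool_steps if j > i]).lower()
        downstream += " " + final_answer.lower()
        if result_tokens and not any(t in downstream for t in result_tokens):
            unused += 1
    score = max(0.0, 1.0 - unused / max(len(tool_steps), 1))
    return {"score": score, "rationale": f"{unused} tool result(s) never used downstream"}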

5. Constraint satisfaction (LLM-as-judge)

For planner-driven agents (see the hybrid reasoning architecture), we extract the user's stated constraints and verify each was respected by the trace. "Aetna insurance" stated as a constraint → at least one tool call in the trace must have filtered or verified for Aetna.
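A minimal judge sketch, assuming the dataset example carries the original user request in example.inputs (prompt wording and field names are illustrative):

CONSTRAINT_PROMPT = """You are checking CONSTRAINT SATISFACTION for an agent's trace.
Given the user's request and the trace, extract the stated constraints and verify
each one was respected by at least one tool call or final assertion.
Score 1.0 if all constraints are respected, 0.5 if one soft constraint is violated,
0.0 if any hard constraint (insurance, time, identity) is violated.
Output JSON: {score: float, rationale: str}"""

def constraint_evaluator(trace: list[dict], example) -> dict:
    # example.inputs is assumed to hold the user request; adapt to your dataset schema.
    payload = {"user_request": example.inputs, "trace": trace}
    resp = judge.chat.completions.create(
        model="gpt-4o-2024-08-06",  # same pinned judge as the coherence check
        messages=[
            {"role": "system", "content": CONSTRAINT_PROMPT},
            {"role": "user", "content": json.dumps(payload, indent=2, default=str)},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)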

Wiring It Into LangSmith evaluate()

The whole point is to run these as part of the same eval loop you already have for final answers.


from langsmith import evaluate, Client
from langsmith.schemas import Example, Run

def trace_quality_evaluator(run: Run, example: Example) -> dict:
    trace = normalize_trace(run)  # the extraction shown above
    scores = {
        "coherence":      coherence_evaluator(trace)["score"],
        "tool_grounding": tool_grounding_evaluator(trace)["score"],
        "redundancy":     redundancy_evaluator(trace)["score"],
        "dead_end":       dead_end_evaluator(trace)["score"],
        "constraints":    constraint_evaluator(trace, example)["score"],
    }
    # Weighted aggregate. Tool-grounding is non-negotiable.
    weights = {"coherence": 0.20, "tool_grounding": 0.30,
               "redundancy": 0.15, "dead_end": 0.15, "constraints": 0.20}
    aggregate = sum(scores[k] * weights[k] for k in weights)
    return {
        "key": "trace_quality",
        "score": aggregate,
        "comment": json.dumps(scores),
    }

def final_answer_evaluator(run: Run, example: Example) -> dict:
    # Standard rubric/exact-match against example.outputs
    ...

results = evaluate(
    lambda inp: build_agent().invoke(inp),
    data="agent-regression-suite",
    evaluators=[trace_quality_evaluator, final_answer_evaluator],
    experiment_prefix="trace-eval-2026-05-06",
    metadata={"judge": "gpt-4o-2024-08-06", "agent_planner": "o3-2025-04-16"},
    max_concurrency=8,
)

The evaluator returns a single trace_quality score plus a JSON comment with the per-dimension breakdown, so the LangSmith UI shows both the rolled-up number and the diagnosis when something regresses.

The Rubric Table We Actually Score Against

This is the rubric we hand to the judge model and re-use for human calibration:

| Dimension | 1.0 (pass) | 0.5 (partial) | 0.0 (fail) |
| --- | --- | --- | --- |
| Coherence | Every step follows from prior outcome | One unjustified goal shift | Multiple unjustified shifts or contradictory steps |
| Tool-grounding | All factual claims supported by a tool result | One unsupported claim | Multiple fabricated facts |
| Redundancy | Zero duplicate tool calls | 1–2 duplicates | 3+ duplicates or visible loop |
| Dead-end | Every tool result used downstream | One unused result | Multiple unused results |
| Constraint satisfaction | All stated constraints respected | One soft constraint violated | A hard constraint (insurance, time, identity) violated |

We re-calibrate the judge against human labels quarterly on a 60-row sample. At the last calibration, the judge agreed with humans 87% of the time on coherence, 94% on tool-grounding (the structural check helps), and 91% on constraints. Below 80% agreement we retire the judge prompt and rewrite it.
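The agreement number itself is just the row-level match rate on the rubric scale; a minimal sketch (field names are illustrative):

def judge_human_agreement(rows: list[dict]) -> float:
    # Each row pairs a judge score with a human label on the same 1.0/0.5/0.0
    # rubric scale, e.g. {"dimension": "coherence", "judge": 1.0, "human": 0.5}.
    matches = sum(1 for r in rows if r["judge"] == r["human"])
    return matches / max(len(rows), 1)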

Catching the Silent Bugs

Three real examples from our last six months:

  1. Healthcare scheduler: final-answer accuracy was 96%; trace quality dropped from 0.91 to 0.78 after a prompt change. Root cause: the agent started fabricating provider availability from prior context instead of calling the availability tool. The tool-grounding evaluator flagged it within a single PR; the final-answer eval would only have caught it weeks later, once the fabricated availability started being wrong.

  2. IT helpdesk runbook agent: redundancy score dropped from 0.95 to 0.72. Investigation showed the agent looping on search_kb when the first result didn't have an exact match. Final answer was usually correct (it eventually escalated), but p95 latency went from 6s to 22s. Detected before deploy.

  3. Real-estate qualifier: constraint-satisfaction score regressed silently for two days. The agent was asking the right qualifying questions but ignoring the stated price-cap constraint when scoring leads. Final-answer eval used a generic rubric; the constraint evaluator was the only one that caught it. We now re-run the demo flow trace-eval before any prompt change to that agent.

Operational Notes

A few things we got wrong before we got them right:

  • Don't aggregate trace_quality and final_answer into one score. They diagnose different things and should gate independently. We tried a single weighted score for two months and it hid regressions where one axis went up and the other went down.
  • Pin the judge model with a date. gpt-4o-2024-08-06, not gpt-4o. A floating judge alias means your historical scores are not comparable. We learned this the hard way when an OpenAI silent rev shifted our coherence scores by 4 points overnight.
  • Cache reasoning-summary extraction. o3 reasoning summaries are not free to re-fetch and the trace is immutable once the run completes. We cache them keyed on run_id (a minimal sketch follows this list).
  • The redundancy evaluator catches more than redundancy. It's a great canary for "the executor model just got dumber" because dumber executors loop more. We weight it at 0.15 in the aggregate but treat it as a leading indicator beyond that weight.
  • Trace-eval cost is real. Roughly 1.4× our final-answer eval cost because of the per-dimension judge calls. Worth it; cheaper than missing regressions for a week.
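For the caching note above, a minimal in-process sketch: runs are immutable once complete, so run_id is a safe cache key (a persistent cache works the same way), and caching the normalized trace also covers the reasoning summaries it carries.

_TRACE_CACHE: dict[str, list[dict]] = {}

def cached_normalize_trace(run) -> list[dict]:
    # The run is immutable after completion, so run_id fully identifies the
    # normalized trace and the entry never needs invalidation.
    key = str(run.id)
    if key not in _TRACE_CACHE:
        _TRACE_CACHE[key] = normalize_trace(run)
    return _TRACE_CACHE[key]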

Frequently Asked Questions

Do reasoning models expose their full reasoning trace?

Not the raw chain-of-thought, no. OpenAI exposes a summary of o3/gpt-5 reasoning via the Responses API. That summary plus the tool-call sequence plus the intermediate narration is what you evaluate. It's enough.

Should I evaluate the planner trace and the executor trace separately?

Yes, if you have a hybrid loop. The planner produces a structured plan you can score directly against a reference plan or rubric; the executor produces a tool-call trace you score with the pipeline above. Different failure modes, different evaluators.

How do I bootstrap the rubric without a labeled dataset?

Start with deterministic evaluators (redundancy, dead-end, tool-grounding) — they need no labels and catch a third of regressions on their own. Add LLM-as-judge evaluators next, calibrated against ~40 human-labeled traces. Don't try to build a perfect rubric on day one; iterate it like you'd iterate a system prompt.

What about non-OpenAI models — can I do this with Claude or Gemini agents?

Yes. The deterministic evaluators are model-agnostic. The judge evaluators work with any capable judge model (we've cross-checked with Claude and gotten ~91% agreement with the gpt-4o judge on coherence). The thing you lose with some non-OpenAI providers is the structured reasoning summary; you fall back to evaluating the visible narration.

Where should I put trace-eval in the dev loop — on every PR?

Smoke subset on every PR (40 rows, ~2 minutes). Full suite on PRs touching the agent. Nightly run on main as a baseline. Same pattern as the continuous evaluation gate. Trace-eval is just another evaluator in the suite — once you have the pipeline, it's free to add to existing CI.
