By Sagar Shankaran, Founder of CallSphere
A 'did the agent answer correctly?' pass/fail hides broken tool calls, wasted tokens, and silent retries. Here is how to evaluate intermediate steps.
Key takeaways
If you only grade the final answer of a multi-step, tool-using agent, you are flying blind on roughly 60–80% of the actual behavior. The agent can hit the right answer for the wrong reason, retry a broken tool four times silently, hallucinate intermediate JSON that the next step ignores, or burn 18,000 tokens to do what should have cost 2,000. End-to-end pass/fail tells you none of this.
The fix is trajectory evaluation — grading the sequence of steps the agent actually took, not just the last token it emitted. In this post I walk through what trajectory evaluators are, the four failure modes only intermediate-step scoring catches, and how we wire them up in LangSmith for the voice and chat agents that power CallSphere.
Here is the canonical eval most teams run on day one:
input -> agent.invoke(input) -> output
score = LLM_judge(output, expected)
It is fast, it is cheap, it produces a single number, and it is deeply misleading the moment your agent calls more than one tool.
I have seen all four of these in production traces in the last 90 days:
- Right answer, wrong tool. The agent should have called get_appointment_by_id. It called search_appointments with a fuzzy query, got 47 rows back, and the LLM picked the right one by guessing the patient's name. Pass/fail: pass. Reality: the agent has no idea how to use the API and will fail on common names.
- Hallucinated intermediate JSON. The agent produced {"status": "confirmed", "id": "appt_8821"} from a reasoning step, and the next step happily passed appt_8821 to a downstream tool, which 404'd, was caught, retried with a different made-up ID, and finally succeeded by accident.

If your only signal is final answer correctness, none of these show up until your unit economics or your latency SLO breaks.
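The retry-then-succeed pattern is easy to surface once you inspect child steps rather than the final output. The following is a minimal sketch under stated assumptions: `StepRun` and `count_failed_tool_calls` are hypothetical names used for illustration, not LangSmith classes, and a real trace may record errors differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepRun:
    """Hypothetical stand-in for one node in a trace tree (not the LangSmith Run class)."""
    name: str
    run_type: str                      # e.g. "chain", "llm", "tool"
    error: Optional[str] = None        # set when the step raised or returned an error
    child_runs: list = field(default_factory=list)


def count_failed_tool_calls(root: StepRun) -> int:
    """Count tool steps anywhere in the tree that recorded an error."""
    failed = 1 if (root.run_type == "tool" and root.error) else 0
    return failed + sum(count_failed_tool_calls(c) for c in root.child_runs)


# Illustrative trace: two 404s, then a call that happens to succeed.
trace = StepRun("agent", "chain", child_runs=[
    StepRun("get_appointment_by_id", "tool", error="404 Not Found"),
    StepRun("get_appointment_by_id", "tool", error="404 Not Found"),
    StepRun("get_appointment_by_id", "tool"),
])

failed_calls = count_failed_tool_calls(trace)
print(f"failed tool calls hidden behind a passing answer: {failed_calls}")
```

A final-answer grader would score this trace as a pass; the failed-call count is the signal it throws away.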
A trajectory is the ordered list of steps an agent takes between the input and the final answer. For an LLM agent that means:
In LangSmith terms, the trajectory is the trace tree — the parent run plus every nested child run. Trajectory evaluation grades that tree, not just the root output.
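To make "trace tree" concrete, here is a small sketch that flattens a nested tree into an ordered list of steps. `TraceNode` and `flatten` are hypothetical names for illustration, and the sketch assumes children are stored in execution order; it is a stand-in, not the LangSmith data model.

```python
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    """Hypothetical stand-in for one run in a trace tree."""
    name: str
    run_type: str                      # e.g. "chain", "llm", "tool"
    child_runs: list = field(default_factory=list)


def flatten(root: TraceNode) -> list:
    """Depth-first, pre-order walk returning (depth, run_type, name) per step."""
    steps = []

    def walk(node: TraceNode, depth: int) -> None:
        steps.append((depth, node.run_type, node.name))
        for child in node.child_runs:
            walk(child, depth + 1)

    walk(root, 0)
    return steps


# Illustrative tree: one tool call sits inside a nested subgraph.
root = TraceNode("agent", "chain", [
    TraceNode("planner", "llm"),
    TraceNode("search_appointments", "tool"),
    TraceNode("confirm_subgraph", "chain", [
        TraceNode("get_appointment_by_id", "tool"),
    ]),
])

# The trajectory is the ordered list of tool names, at any depth.
tool_trajectory = [name for _, run_type, name in flatten(root) if run_type == "tool"]
print(tool_trajectory)
```

Scanning only the root's direct children would miss the nested `get_appointment_by_id` call, which is why the walk recurses.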
The LangSmith evaluation docs define three orthogonal evaluator families that operate on this tree:
| Evaluator type | What it grades | Catches |
|---|---|---|
| Final-output | run.outputs only | Is the answer right? |
| Trajectory | Ordered list of tool calls in run.child_runs | Did the agent take a sensible path? |
| Intermediate-step | A single nested run's inputs/outputs | Was that one tool call correct? |
You want all three. Skipping the middle two is what makes end-to-end metrics lie.
```mermaid
flowchart LR
    Q[User question] --> P{Planner LLM}
    P -->|tool_call| T1[search_appointments]
    T1 -->|47 rows| P2{Filter LLM}
    P2 -->|tool_call| T2[get_appointment_by_id]
    T2 -->|appt_8821| P3{Confirmer LLM}
    P3 -->|final| A[Answer]
    E1[[Eval: tool_choice_correct]] -.checks.-> P
    E2[[Eval: arg_schema_valid]] -.checks.-> T1
    E3[[Eval: no_redundant_calls]] -.checks.-> P2
    E4[[Eval: cost_under_budget]] -.checks.-> A
    E5[[Eval: final_answer_grounded]] -.checks.-> A
```
Each E-node is a separate evaluator. They run on the same trace but answer different questions:
- tool_choice_correct: did the planner call search_appointments or get_appointment_by_id first? This catches "right answer, wrong tool."
- arg_schema_valid: if appointment_id should be a UUID and the agent passed "the morning one", fail it. This catches hallucinated intermediate JSON before it propagates.
- cost_under_budget: check the run's total_tokens and total_cost, per the LangSmith observability docs. This catches the 12× blow-up.

Here is the trajectory evaluator pattern we use for the appointment-booking agent running in our healthcare deployments. It runs on every PR via CI, and on a 1% sample of production traces continuously.
```python
# evaluators/trajectory.py
from langsmith.evaluation import evaluate, EvaluationResult
from langsmith.schemas import Run, Example

EXPECTED_TOOLS_BY_INTENT = {
    "lookup_appointment": ["get_appointment_by_id"],
    "reschedule": ["get_appointment_by_id", "update_appointment"],
    "cancel": ["get_appointment_by_id", "cancel_appointment"],
}


def tool_choice_correct(run: Run, example: Example) -> EvaluationResult:
    """Did the agent call the right tools, in roughly the right order?"""
    intent = example.inputs["intent"]
    expected = EXPECTED_TOOLS_BY_INTENT[intent]
    actual = [
        child.name
        for child in (run.child_runs or [])
        if child.run_type == "tool"
    ]
    # Order-aware: every expected tool must appear, in order,
    # but extras are allowed (we score those separately).
    i = 0
    for tool in actual:
        if i < len(expected) and tool == expected[i]:
            i += 1
    score = 1.0 if i == len(expected) else 0.0
    return EvaluationResult(
        key="tool_choice_correct",
        score=score,
        comment=f"expected={expected} actual={actual}",
    )


def no_redundant_calls(run: Run, example: Example) -> EvaluationResult:
    """Penalize duplicate tool calls with identical args."""
    seen = set()
    dupes = 0
    for child in (run.child_runs or []):
        if child.run_type != "tool":
            continue
        key = (child.name, str(sorted((child.inputs or {}).items())))
        if key in seen:
            dupes += 1
        seen.add(key)
    return EvaluationResult(
        key="no_redundant_calls",
        score=1.0 if dupes == 0 else 0.0,
        comment=f"{dupes} duplicate tool calls",
    )
```
A few things worth noting:
- Walk run.child_runs recursively in production code; the snippet scans only the direct children for clarity. LangGraph subgraphs nest, so a flat scan misses the deep tool calls.
- tool_choice_correct uses subsequence matching, not exact equality. Real agents add steps a script-writer didn't anticipate (a clarification question, a retry after a 429), and we don't want to penalize that.
- comment is the field that shows up in the LangSmith UI. Spending 30 seconds on these strings pays back tenfold during triage.

Sometimes you want to grade a single node in the tree, for example, "did the planner pick valid arguments for get_appointment_by_id?" That is an intermediate-step evaluator:
```python
# evaluators/intermediate.py
import re

from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Run

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def appointment_id_well_formed(run: Run) -> EvaluationResult:
    """Walk the tree, find every get_appointment_by_id call,
    fail if any arg is not a UUID."""
    bad = []

    def walk(r: Run):
        if r.name == "get_appointment_by_id":
            appt_id = (r.inputs or {}).get("appointment_id", "")
            if not UUID_RE.match(str(appt_id)):
                bad.append(appt_id)
        for c in (r.child_runs or []):
            walk(c)

    walk(run)
    return EvaluationResult(
        key="appointment_id_well_formed",
        score=1.0 if not bad else 0.0,
        comment=f"bad ids: {bad}" if bad else "ok",
    )
```
This is the evaluator that catches the hallucinated-JSON failure mode. We run it on 100% of evals because it is essentially free — pure regex over trace metadata.
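The cost_under_budget check from the diagram follows the same shape. Below is a rough sketch, not our production evaluator: `RunUsage` is a hypothetical stand-in carrying only `total_tokens` and `total_cost`, the budget defaults are illustrative assumptions, and the function returns a plain dict that mirrors the key/score/comment fields used above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunUsage:
    """Hypothetical stand-in exposing the two usage fields this check reads."""
    total_tokens: Optional[int] = None
    total_cost: Optional[float] = None


def cost_under_budget(
    run: RunUsage,
    max_tokens: int = 2_000,      # assumed per-turn token budget, tune per agent
    max_cost: float = 0.02,       # assumed per-turn cost budget in dollars
) -> dict:
    """Score 1.0 when both token count and cost stay within budget."""
    # Missing usage data is treated as zero here; a stricter variant could fail instead.
    tokens = run.total_tokens or 0
    cost = float(run.total_cost or 0.0)
    within_budget = tokens <= max_tokens and cost <= max_cost
    return {
        "key": "cost_under_budget",
        "score": 1.0 if within_budget else 0.0,
        "comment": f"tokens={tokens}/{max_tokens} cost=${cost:.4f}/${max_cost:.4f}",
    }


print(cost_under_budget(RunUsage(total_tokens=1_800, total_cost=0.012)))
print(cost_under_budget(RunUsage(total_tokens=18_000, total_cost=0.15)))
```

Putting both numbers in the comment string keeps triage fast: the failing trace shows how far over budget it went without opening the run.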
```python
# run_evals.py
from langsmith.evaluation import evaluate

from evaluators.trajectory import tool_choice_correct, no_redundant_calls
from evaluators.intermediate import appointment_id_well_formed

# NOTE: `graph` is your compiled agent graph; build or import it before this point.


def my_agent(inputs: dict) -> dict:
    # Your real agent.invoke goes here.
    return graph.invoke({"input": inputs["question"]})


evaluate(
    my_agent,
    data="appointment-agent-eval-v3",  # LangSmith dataset
    evaluators=[
        tool_choice_correct,
        no_redundant_calls,
        appointment_id_well_formed,
    ],
    experiment_prefix="trajectory-suite",
    max_concurrency=8,
)
```
Last sprint we shipped a planner change to the cancellation flow. End-to-end accuracy moved from 94.1% to 94.6% — within noise. Procurement would have shipped it. The trajectory evals told a different story:
| Metric | Before | After |
|---|---|---|
| Final-answer accuracy | 94.1% | 94.6% |
| tool_choice_correct | 91.7% | 78.2% |
| no_redundant_calls | 96.4% | 88.9% |
| Mean tool calls / turn | 2.3 | 3.4 |
| p95 latency | 2.1s | 3.6s |
| Mean cost / turn | $0.014 | $0.022 |
Same final accuracy, 57% more cost, 71% more p95 latency, and a planner that no longer chooses tools correctly. End-to-end eval said ship; trajectory eval said roll back. We rolled back.
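For readers who want to check the arithmetic, here is a short sketch that recomputes the headline deltas from the table. The helper names are illustrative; the inputs are the table's own values.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change, in percent, for ratio-style metrics (cost, latency)."""
    return (after - before) / before * 100.0


def point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change, in percentage points, for metrics already in percent."""
    return after_pct - before_pct


cost_increase = pct_change(0.014, 0.022)          # mean cost / turn
latency_increase = pct_change(2.1, 3.6)           # p95 latency in seconds
tool_calls_increase = pct_change(2.3, 3.4)        # mean tool calls / turn
tool_choice_delta = point_change(91.7, 78.2)      # tool_choice_correct
final_answer_delta = point_change(94.1, 94.6)     # final-answer accuracy

print(f"cost: +{cost_increase:.0f}%")
print(f"p95 latency: +{latency_increase:.0f}%")
print(f"tool calls: +{tool_calls_increase:.0f}%")
print(f"tool_choice_correct: {tool_choice_delta:+.1f} points")
print(f"final-answer accuracy: {final_answer_delta:+.1f} points")
```

Relative change is used for cost and latency, while accuracy-style metrics are compared in percentage points, which is why a 0.5-point accuracy gain sits alongside a 13.5-point drop in tool choice.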
- A single evaluate_everything function returning a composite score is impossible to debug. One evaluator per failure mode, one score each. Composite views belong in the dashboard.

Every voice and chat agent shipped on CallSphere (healthcare booking, real estate qualification, after-hours escalation, IT helpdesk, salon, sales) runs trajectory evaluators on every release. We block deploys when tool_choice_correct regresses by more than 2 percentage points or when no_redundant_calls drops below 90%. The result: median tool calls per turn dropped 41% over six months, and p95 latency on the realtime voice path stays under 1.0s even as we add more tools per agent.
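The deploy gate described here can be expressed as a small check in CI. This is a sketch under assumptions, not our actual pipeline: `deploy_blockers` is a hypothetical helper, and it expects aggregate scores as percentages keyed by evaluator name.

```python
def deploy_blockers(
    baseline: dict,
    candidate: dict,
    max_tool_choice_drop_pts: float = 2.0,   # block if the regression exceeds this
    min_no_redundant_pct: float = 90.0,      # block if the score falls below this
) -> list:
    """Return human-readable reasons to block a deploy; an empty list means ship."""
    blockers = []

    drop = baseline["tool_choice_correct"] - candidate["tool_choice_correct"]
    if drop > max_tool_choice_drop_pts:
        blockers.append(
            f"tool_choice_correct regressed {drop:.1f} pts "
            f"(limit {max_tool_choice_drop_pts:.1f})"
        )

    if candidate["no_redundant_calls"] < min_no_redundant_pct:
        blockers.append(
            f"no_redundant_calls at {candidate['no_redundant_calls']:.1f}% "
            f"(floor {min_no_redundant_pct:.1f}%)"
        )

    return blockers


# The before/after numbers from the table above.
before = {"tool_choice_correct": 91.7, "no_redundant_calls": 96.4}
after = {"tool_choice_correct": 78.2, "no_redundant_calls": 88.9}

blockers = deploy_blockers(before, after)
for reason in blockers:
    print("BLOCK:", reason)
```

In CI, a non-empty list would typically translate into a non-zero exit code so the release job fails with the reasons in its log.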
If you are building agents with similar topology, the agent eval glossary on our site has runnable patterns for trajectory, intermediate-step, and cost-aware evaluators.
Q1: How is trajectory evaluation different from observability? Observability is seeing what happened. Trajectory evaluation is grading what happened against expected behavior. You need both — observability tells you the agent took 9 steps, trajectory eval tells you whether 9 was the right number.
Q2: Should I run trajectory evals in CI or in production? Both. CI gates regressions on a curated dataset before deploy. Production runs the same evaluators on a 1–5% sample of live traces so you catch drift, distribution shift, and bugs your dataset never imagined.
Q3: How big should my trajectory eval dataset be? For a single-purpose agent: 50–200 examples covering the top intents and the top three failure modes per intent. The marginal value drops fast past 200; the marginal value of adding new intents never drops.
Q4: What about agents that legitimately take different paths to the same answer? Use subsequence matching (every required tool appears in order, extras allowed) instead of exact-sequence matching, and pair it with a "minimum tool calls" metric. Two valid paths is fine; eight valid paths usually means your prompt is underspecified.
Q5: Do I need LangSmith specifically? No. Langfuse, Arize Phoenix, Braintrust, and homegrown OpenTelemetry pipelines all expose the trace tree. The pattern (final-output + trajectory + intermediate-step evaluators) is what matters. We use LangSmith because the SDK ergonomics for run.child_runs and evaluate() are the cleanest in the category.
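For the production sampling mentioned in Q2, one common approach is to hash the trace ID so the sampling decision is deterministic and reproducible. The sketch below is an assumption about how this could be done, not a description of any vendor's sampler; `in_sample` is a hypothetical helper.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")      # integer in [0, 2**64)
    # Integer comparison avoids float rounding at the edges (rate=0.0 or 1.0).
    return bucket < int(rate * 2**64)


# The same trace ID always gets the same decision, so reruns grade the same traces.
sampled = [tid for tid in (f"trace-{i}" for i in range(20_000)) if in_sample(tid, 0.01)]
print(f"sampled {len(sampled)} of 20000 traces")
```

Hashing instead of calling a random number generator means a trace that was evaluated once will be selected again on replay, which keeps drift comparisons apples to apples.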
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.