
Evaluating Multi-Step Tool-Using Agents: Why End-to-End Metrics Lie

A 'did the agent answer correctly?' pass/fail hides broken tool calls, wasted tokens, and silent retries. Here is how to evaluate intermediate steps.

TL;DR

If you only grade the final answer of a multi-step, tool-using agent, you are flying blind on roughly 60–80% of the actual behavior. The agent can hit the right answer for the wrong reason, retry a broken tool four times silently, hallucinate intermediate JSON that the next step ignores, or burn 18,000 tokens to do what should have cost 2,000. End-to-end pass/fail tells you none of this.

The fix is trajectory evaluation — grading the sequence of steps the agent actually took, not just the last token it emitted. In this post I walk through what trajectory evaluators are, the four failure modes only intermediate-step scoring catches, and how we wire them up in LangSmith for the voice and chat agents that power CallSphere.

Why End-to-End Metrics Lie

Here is the canonical eval most teams run on day one:

input  -> agent.invoke(input) -> output
score  = LLM_judge(output, expected)

It is fast, it is cheap, it produces a single number, and it is deeply misleading the moment your agent calls more than one tool.
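
In LangSmith terms, that day-one eval is a single final-output evaluator. A minimal sketch, with the evaluator name, the dataset keys, and a crude containment check standing in for the LLM judge (all illustrative):

# evaluators/final_output.py
from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Run, Example

def final_answer_correct(run: Run, example: Example) -> EvaluationResult:
    """Grade only the root output against the reference answer."""
    answer = str((run.outputs or {}).get("output", ""))
    expected = str((example.outputs or {}).get("expected", ""))
    # Stand-in for an LLM judge: crude containment check on the final text.
    score = 1.0 if expected and expected.lower() in answer.lower() else 0.0
    return EvaluationResult(key="final_answer_correct", score=score)

Keep this evaluator. The problem is everything it cannot see.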

I have seen all four of these in production traces in the last 90 days:

  1. Right answer, wrong tool. The agent was supposed to call get_appointment_by_id. It called search_appointments with a fuzzy query, got 47 rows back, and the LLM picked the right one by guessing the patient's name. Pass/fail: pass. Reality: the agent has no idea how to use the API and will fail on common names.
  2. Silent retry storms. A flaky webhook returned 500 three times. The agent retried, eventually succeeded, and answered correctly. End-to-end metric: pass. Trajectory: 3 wasted tool calls, 4× the latency, and a real chance the user hung up before the answer arrived.
  3. Hallucinated intermediate JSON. The agent emitted {"status": "confirmed", "id": "appt_8821"} from a reasoning step, and the next step happily passed appt_8821 to a downstream tool — which 404'd, was caught, retried with a different made-up ID, and finally succeeded by accident.
  4. Token blow-up. Same final answer, but one run cost 1,800 tokens and another cost 22,400 because the agent re-summarized the entire conversation history at every step. Output eval: identical. Cost: 12×.

If your only signal is final answer correctness, none of these show up until your unit economics or your latency SLO breaks.

What "Trajectory" Actually Means

A trajectory is the ordered list of steps an agent takes between the input and the final answer. For an LLM agent that means:

  • Each LLM call (with its messages and tool-call arguments)
  • Each tool invocation (with input args and return value)
  • Each retry, each fallback, each subgraph hop
  • Token usage and wall-clock latency at every node

In LangSmith terms, the trajectory is the trace tree — the parent run plus every nested child run. Trajectory evaluation grades that tree, not just the root output.
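
Before writing any evaluators, it helps to look at one tree by hand. A quick sketch using the SDK (the run ID is a placeholder; Client() reads the API key from your environment):

# inspect_trace.py
from langsmith import Client

client = Client()

# load_child_runs=True returns the nested child runs, not just the root
run = client.read_run("<run-id>", load_child_runs=True)

def show(r, depth=0):
    """Print run type, name, and token usage for every node in the tree."""
    print("  " * depth, r.run_type, r.name, r.total_tokens)
    for child in (r.child_runs or []):
        show(child, depth + 1)

show(run)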

The LangSmith evaluation docs define three orthogonal evaluator families that operate on this tree:

Evaluator type    | What it grades                               | Catches
Final-output      | run.outputs only                             | Is the answer right?
Trajectory        | Ordered list of tool calls in run.child_runs | Did the agent take a sensible path?
Intermediate-step | A single nested run's inputs/outputs         | Was that one tool call correct?

You want all three. Skipping the last two is what makes end-to-end metrics lie.

The Four Failure Modes Trajectory Evals Catch

flowchart LR
  Q[User question] --> P{Planner LLM}
  P -->|tool_call| T1[search_appointments]
  T1 -->|47 rows| P2{Filter LLM}
  P2 -->|tool_call| T2[get_appointment_by_id]
  T2 -->|appt_8821| P3{Confirmer LLM}
  P3 -->|final| A[Answer]

  E1[[Eval: tool_choice_correct]] -.checks.-> P
  E2[[Eval: arg_schema_valid]] -.checks.-> T1
  E3[[Eval: no_redundant_calls]] -.checks.-> P2
  E4[[Eval: cost_under_budget]] -.checks.-> A
  E5[[Eval: final_answer_grounded]] -.checks.-> A

Each E-node is a separate evaluator. They run on the same trace but answer different questions:

  • E1 — tool_choice_correct. Given the user input, would a senior engineer have called search_appointments or get_appointment_by_id first? This catches "right answer, wrong tool."
  • E2 — arg_schema_valid. Did the args the agent emitted match the tool's JSON schema? If appointment_id should be a UUID and the agent passed "the morning one", fail it. This catches hallucinated intermediate JSON before it propagates.
  • E3 — no_redundant_calls. Did the agent call the same tool with the same args twice in one turn? This catches silent retries and confused planner loops.
  • E4 — cost_under_budget. Did this trace exceed the per-turn token budget? Trace metadata exposes total_tokens and total_cost per the LangSmith observability docs. This catches the 12× blow-up.
  • E5 — final_answer_grounded. The classic LLM-as-judge: is the answer supported by the tool outputs that actually returned data?
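
E1 and E3 are implemented in full in the next section, and E2 is a close cousin of the UUID check further down. E4 is simple enough to sketch here, assuming a fixed per-turn budget (the 6,000-token threshold is illustrative, not a number from our deployments):

# evaluators/cost.py
from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Run, Example

TOKEN_BUDGET_PER_TURN = 6_000  # assumed budget; tune per agent and per channel

def cost_under_budget(run: Run, example: Example) -> EvaluationResult:
    """Fail the trace when total token usage for the turn exceeds the budget."""
    total_tokens = run.total_tokens or 0
    return EvaluationResult(
        key="cost_under_budget",
        score=1.0 if total_tokens <= TOKEN_BUDGET_PER_TURN else 0.0,
        comment=f"total_tokens={total_tokens} budget={TOKEN_BUDGET_PER_TURN}",
    )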

Wiring It Up in LangSmith

Here is the trajectory evaluator pattern we use for the appointment-booking agent running in our healthcare deployments. It runs on every PR via CI, and on a 1% sample of production traces continuously.

# evaluators/trajectory.py
from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Run, Example

EXPECTED_TOOLS_BY_INTENT = {
    "lookup_appointment": ["get_appointment_by_id"],
    "reschedule":         ["get_appointment_by_id", "update_appointment"],
    "cancel":             ["get_appointment_by_id", "cancel_appointment"],
}

def tool_choice_correct(run: Run, example: Example) -> EvaluationResult:
    """Did the agent call the right tools, in roughly the right order?"""
    intent = example.inputs["intent"]
    expected = EXPECTED_TOOLS_BY_INTENT[intent]

    actual = [
        child.name
        for child in (run.child_runs or [])
        if child.run_type == "tool"
    ]

    # Order-aware: every expected tool must appear, in order,
    # but extras are allowed (we score those separately).
    i = 0
    for tool in actual:
        if i < len(expected) and tool == expected[i]:
            i += 1
    score = 1.0 if i == len(expected) else 0.0

    return EvaluationResult(
        key="tool_choice_correct",
        score=score,
        comment=f"expected={expected} actual={actual}",
    )

def no_redundant_calls(run: Run, example: Example) -> EvaluationResult:
    """Penalize duplicate tool calls with identical args."""
    seen = set()
    dupes = 0
    for child in (run.child_runs or []):
        if child.run_type != "tool":
            continue
        key = (child.name, str(sorted((child.inputs or {}).items())))
        if key in seen:
            dupes += 1
        seen.add(key)

    return EvaluationResult(
        key="no_redundant_calls",
        score=1.0 if dupes == 0 else 0.0,
        comment=f"{dupes} duplicate tool calls",
    )

A few things worth noting:

  • We walk run.child_runs recursively in production code; the snippet above flattens for clarity. LangGraph subgraphs nest, so a flat scan misses the deep tool calls (a recursive sketch follows this list).
  • tool_choice_correct uses subsequence matching, not exact equality. Real agents add steps a script-writer didn't anticipate (a clarification question, a retry after a 429), and we don't want to penalize that.
  • comment is the field that shows up in the LangSmith UI. Spending 30 seconds on these strings pays back tenfold during triage.
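
Here is roughly what that recursive walk looks like (the helper name is ours, not a LangSmith API):

# evaluators/tree.py
from typing import Iterator
from langsmith.schemas import Run

def iter_tool_runs(run: Run) -> Iterator[Run]:
    """Yield every tool run in the trace tree, including runs nested in subgraphs."""
    for child in (run.child_runs or []):
        if child.run_type == "tool":
            yield child
        yield from iter_tool_runs(child)

In tool_choice_correct, the flat list comprehension then becomes actual = [r.name for r in iter_tool_runs(run)].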

Intermediate-Step Evaluators on a Single Tool Call

Sometimes you want to grade a single node in the tree — for example, "did the planner pick valid arguments for get_appointment_by_id?" That is an intermediate-step evaluator:

# evaluators/intermediate.py
import re
from langsmith.evaluation import EvaluationResult
from langsmith.schemas import Run

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def appointment_id_well_formed(run: Run) -> EvaluationResult:
    """Walk the tree, find every get_appointment_by_id call,
    fail if any arg is not a UUID."""
    bad = []
    def walk(r: Run):
        if r.name == "get_appointment_by_id":
            appt_id = (r.inputs or {}).get("appointment_id", "")
            if not UUID_RE.match(str(appt_id)):
                bad.append(appt_id)
        for c in (r.child_runs or []):
            walk(c)
    walk(run)

    return EvaluationResult(
        key="appointment_id_well_formed",
        score=1.0 if not bad else 0.0,
        comment=f"bad ids: {bad}" if bad else "ok",
    )

This is the evaluator that catches the hallucinated-JSON failure mode. We run it on 100% of evals because it is essentially free — pure regex over trace metadata.


Running the Suite

# run_evals.py
from langsmith.evaluation import evaluate
from evaluators.trajectory import tool_choice_correct, no_redundant_calls
from evaluators.intermediate import appointment_id_well_formed

def my_agent(inputs: dict) -> dict:
    # Your real agent goes here; `graph` is your compiled LangGraph app.
    return graph.invoke({"input": inputs["question"]})

evaluate(
    my_agent,
    data="appointment-agent-eval-v3",   # LangSmith dataset
    evaluators=[
        tool_choice_correct,
        no_redundant_calls,
        appointment_id_well_formed,
    ],
    experiment_prefix="trajectory-suite",
    max_concurrency=8,
)
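
To turn this into a CI gate, bind the evaluate(...) call to a variable (results = evaluate(...)), aggregate per-evaluator scores, and fail the build on regression. A hedged sketch; the row shape (a dict carrying an "evaluation_results" entry) reflects the langsmith SDK version we run, and the thresholds are examples:

# ci_gate.py
import sys
from collections import defaultdict

def gate(experiment_results, min_scores: dict[str, float]) -> None:
    """Compute the mean score per evaluator key and exit non-zero below the floor."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in experiment_results:  # rows produced by evaluate(...)
        for result in row["evaluation_results"]["results"]:
            totals[result.key] += result.score or 0.0
            counts[result.key] += 1
    for key, floor in min_scores.items():
        mean = totals[key] / max(counts[key], 1)
        if mean < floor:
            print(f"FAIL {key}: mean={mean:.3f} floor={floor}")
            sys.exit(1)

# gate(results, {"tool_choice_correct": 0.90, "no_redundant_calls": 0.90})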

Real Numbers From a Recent Release

Last sprint we shipped a planner change to the cancellation flow. End-to-end accuracy moved from 94.1% to 94.6%, within noise. On that signal alone we would have shipped it. The trajectory evals told a different story:

Metric                 | Before | After
Final-answer accuracy  | 94.1%  | 94.6%
tool_choice_correct    | 91.7%  | 78.2%
no_redundant_calls     | 96.4%  | 88.9%
Mean tool calls / turn | 2.3    | 3.4
p95 latency            | 2.1s   | 3.6s
Mean cost / turn       | $0.014 | $0.022

Same final accuracy, 57% more cost, 71% more p95 latency, and a planner that no longer chooses tools correctly. End-to-end eval said ship; trajectory eval said roll back. We rolled back.

Common Anti-Patterns to Avoid

  • Treating every retry as a failure. Some retries are correct behavior; exponential backoff on a transient 503 is not a bug. Score the pathological retries (identical args twice, or more than three attempts) and ignore the rest.
  • LLM-as-judge on intermediate steps. Tempting, but slow, expensive, and noisy. Use deterministic checks (schema validation, tool-name comparison, set membership) for intermediate steps and reserve LLM-judges for the final output.
  • One giant evaluator. A single evaluate_everything function returning a composite score is impossible to debug. One evaluator per failure mode, one score each. Composite views belong in the dashboard.
  • Skipping the dataset. Trajectory evaluators are useless without a dataset that has known-good trajectories or at least known-good intents. Build the dataset first, then the evaluator.

How CallSphere Uses This

Every voice and chat agent shipped on CallSphere — healthcare booking, real estate qualification, after-hours escalation, IT helpdesk, salon, sales — runs trajectory evaluators on every release. We block deploys when tool_choice_correct regresses by more than 2 percentage points or when no_redundant_calls drops below 90%. The result: median tool calls per turn dropped 41% over six months, and p95 latency on the realtime voice path stays under 1.0s even as we add more tools per agent.

If you are building agents with similar topology, the agent eval glossary on our site has runnable patterns for trajectory, intermediate-step, and cost-aware evaluators.

FAQ

Q1: How is trajectory evaluation different from observability? Observability is seeing what happened. Trajectory evaluation is grading what happened against expected behavior. You need both — observability tells you the agent took 9 steps, trajectory eval tells you whether 9 was the right number.

Q2: Should I run trajectory evals in CI or in production? Both. CI gates regressions on a curated dataset before deploy. Production runs the same evaluators on a 1–5% sample of live traces so you catch drift, distribution shift, and bugs your dataset never imagined.
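
For the production half of that answer, the sampling loop is only a few lines. A sketch, with the project name, sample rate, and the single evaluator all illustrative:

# sample_prod_traces.py
import random
from langsmith import Client
from evaluators.trajectory import no_redundant_calls

client = Client()

for root in client.list_runs(project_name="callsphere-prod", is_root=True):
    if random.random() > 0.02:  # keep roughly a 2% sample
        continue
    run = client.read_run(root.id, load_child_runs=True)  # pull the full trace tree
    result = no_redundant_calls(run, example=None)
    client.create_feedback(root.id, key=result.key, score=result.score,
                           comment=result.comment)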

Q3: How big should my trajectory eval dataset be? For a single-purpose agent: 50–200 examples covering the top intents and the top three failure modes per intent. The marginal value drops fast past 200; the marginal value of adding new intents never drops.

Q4: What about agents that legitimately take different paths to the same answer? Use subsequence matching (every required tool appears in order, extras allowed) instead of exact-sequence matching, and pair it with a "minimum tool calls" metric. Two valid paths is fine; eight valid paths usually means your prompt is underspecified.

Q5: Do I need LangSmith specifically? No — Langfuse, Arize Phoenix, Braintrust, and homegrown OpenTelemetry pipelines all expose the trace tree. The pattern (final-output + trajectory + intermediate-step evaluators) is what matters. We use LangSmith because the SDK ergonomics for run.child_runs and evaluate() are the cleanest in the category.
