Why Single-Number Accuracy Fails Agents

A non-agentic LLM is essentially a single-output function: input goes in, output comes out, you grade it. Agents are paths through state space. Two agents can produce identical correct answers via wildly different trajectories — one cheap and reliable, one a 47-step disaster that happened to get there. Single-number accuracy hides this completely.

In 2026 the eval stacks that work measure four things at once: outcome, trajectory, tool use, and cost. This piece walks through what each one measures, the open-source frameworks that implement them, and the dashboards that actually get watched.

The Four-Dimensional Eval

flowchart TB
    Run[Agent Run] --> Out[Outcome: did it succeed?]
    Run --> Tr[Trajectory: was the path good?]
    Run --> Tu[Tool Use: were calls correct?]
    Run --> C[Cost: was it efficient?]
    Out --> Score[Composite Score]
    Tr --> Score
    Tu --> Score
    C --> Score
    Score --> Gate[Release Gate]

Outcome

Did the final state of the world (database row, email sent, code change) match the goal? When the task is deterministic, this is the most objective of the four metrics: for SWE-bench, AppWorld, or Tau-Bench it is an exact match or a unit-test pass. Open-ended tasks have no exact answer to compare against, so you need an LLM judge with a rubric, ideally a stronger model than the agent under test.
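A deterministic outcome check can be very small. The sketch below assumes the final state of the world can be captured as a dictionary; the `outcome_passed` helper and the appointment fields are hypothetical illustrations, not taken from any of the benchmarks named above.

```python
from typing import Any, Mapping

_MISSING = object()  # sentinel so a missing key never equals a goal value of None


def outcome_passed(final_state: Mapping[str, Any], goal: Mapping[str, Any]) -> bool:
    """Return True when every field named in the goal has the expected value.

    Fields present in final_state but absent from goal are ignored, so the
    check only constrains what the task actually asked for.
    """
    return all(final_state.get(key, _MISSING) == expected
               for key, expected in goal.items())


# Hypothetical appointment-booking task: the goal pins only the fields that matter.
goal = {"patient_id": 17, "status": "booked", "slot": "2026-03-02T09:30"}
final_state = {"patient_id": 17, "status": "booked",
               "slot": "2026-03-02T09:30", "notes": ""}
print(outcome_passed(final_state, goal))
```

Comparing only the fields the goal names keeps the check stable when unrelated columns change between runs.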

Trajectory Score

Was each step a reasonable continuation of the previous step? In 2026 the standard is Anthropic's trajectory rubric: an LLM judge scores each (state, action) pair on a 1-5 scale and the trajectory score is the geometric mean. The geometric mean is pulled down by a single low score more than the arithmetic mean is, which is what you want: one obviously wrong step should not be averaged out by 19 fine ones.
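To make the aggregation concrete, here is a minimal sketch that compares the arithmetic and geometric means for 19 good steps and one bad one. The `trajectory_score` function name is an assumption for illustration; the per-step scores would come from the LLM judge.

```python
from statistics import fmean, geometric_mean


def trajectory_score(step_scores: list[int]) -> float:
    """Geometric mean of per-step judge scores (each on a 1-5 scale)."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one scored step")
    return geometric_mean(step_scores)


# 19 fine steps and one obviously wrong one.
scores = [5] * 19 + [1]
print(f"arithmetic mean: {fmean(scores):.2f}")
print(f"geometric mean:  {trajectory_score(scores):.2f}")
```

On a 1-5 scale the gap is real but modest: about 4.80 for the arithmetic mean against about 4.61 for the geometric mean in this example.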

Tool-Use Correctness

Three sub-metrics that most teams now track:

Selection accuracy — did the agent pick the right tool?
Argument correctness — were the arguments syntactically and semantically right?
Repetition rate — fraction of calls that duplicate a previous call's effect

Berkeley Function Calling Leaderboard V3 and Tau-Bench measure these directly with held-out test sets. For your own agent, you instrument every tool call and pipe the results to your eval harness.
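The three sub-metrics can be computed from a logged call sequence and a ground-truth sequence. The sketch below is one possible reading, with assumptions labeled: calls are aligned position by position, "duplicate effect" is approximated as an exact repeat of tool name plus arguments, and `ToolCall` and `tool_use_metrics` are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)

    def key(self) -> str:
        # Canonical form so that argument order does not affect equality.
        return json.dumps({"name": self.name, "args": self.args}, sort_keys=True)


def tool_use_metrics(actual: list[ToolCall], expected: list[ToolCall]) -> dict[str, float]:
    """Compare logged calls to a ground-truth sequence, position by position."""
    pairs = list(zip(actual, expected))
    right_tool = [(a, e) for a, e in pairs if a.name == e.name]
    right_args = [a for a, e in right_tool if a.args == e.args]

    # Assumption: a call repeats a previous effect if name and args are identical.
    seen: set[str] = set()
    repeats = 0
    for call in actual:
        if call.key() in seen:
            repeats += 1
        seen.add(call.key())

    return {
        "selection_accuracy": len(right_tool) / len(expected) if expected else 0.0,
        "argument_correctness": len(right_args) / len(right_tool) if right_tool else 0.0,
        "repetition_rate": repeats / len(actual) if actual else 0.0,
    }


expected = [
    ToolCall("lookup_patient", {"id": 17}),
    ToolCall("find_slots", {"date": "2026-03-02"}),
    ToolCall("book_slot", {"slot": "09:30"}),
]
actual = [
    ToolCall("lookup_patient", {"id": 17}),
    ToolCall("find_slots", {"date": "2026-03-03"}),  # right tool, wrong date
    ToolCall("book_slot", {"slot": "09:30"}),
    ToolCall("book_slot", {"slot": "09:30"}),        # duplicate booking attempt
]
metrics = tool_use_metrics(actual, expected)
print(metrics)
```

Positional alignment is deliberately strict. An agent that inserts one extra call early on will be penalized for every later position, so a production harness may prefer a sequence alignment instead.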

Cost

Per-task dollar cost, p50/p95 latency, and token consumption. This is the metric most teams forget until the bill arrives. By 2026 the better eval frameworks (Braintrust, LangSmith, Inspect AI, Arize Phoenix) emit cost as a first-class signal alongside outcome.
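If your framework does not report these numbers, they are simple to derive from per-run logs. The following is a sketch under stated assumptions: the token prices are placeholders rather than any provider's real rates, `RunCost` and `cost_summary` are hypothetical names, and percentiles use the nearest-rank definition.

```python
import math
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; substitute your provider's rates.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


@dataclass
class RunCost:
    input_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def usd(self) -> float:
        return (self.input_tokens * INPUT_USD_PER_MTOK
                + self.output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_summary(runs: list[RunCost]) -> dict[str, float]:
    if not runs:
        raise ValueError("no runs to summarize")
    latencies = [r.latency_s for r in runs]
    return {
        "usd_per_task": sum(r.usd for r in runs) / len(runs),
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in runs),
    }


runs = [
    RunCost(12_000, 800, 4.2),
    RunCost(15_000, 1_200, 5.1),
    RunCost(9_000, 600, 3.8),
    RunCost(60_000, 5_000, 21.0),  # one runaway trajectory dominates the tail
    RunCost(11_000, 700, 4.0),
]
summary = cost_summary(runs)
print(summary)
```

Reporting p95 next to p50 is what exposes the runaway run here: the median latency is 4.2 s while the 95th percentile is 21.0 s.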

The Per-Step vs Per-Trajectory Question

A common mistake: scoring trajectories only at the end. If an agent run is 100 steps and you evaluate only the final answer, a run with 99 disastrous steps gets the same score as a run with 99 clean steps that ends in the same place. Score per step. Aggregate at the trajectory level. Your dashboard should show both.
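One way to show both views is a per-trajectory report that carries the aggregate together with pointers back to the worst steps. This is a sketch only: `trajectory_report`, its field names, and the threshold of 2 for a "bad" step are illustrative assumptions.

```python
from statistics import geometric_mean


def trajectory_report(step_scores: list[int], bad_threshold: int = 2) -> dict:
    """Summarize per-step judge scores (1-5) into one dashboard row."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one scored step")
    worst = min(range(len(step_scores)), key=step_scores.__getitem__)
    return {
        "steps": len(step_scores),
        "trajectory_score": round(geometric_mean(step_scores), 2),
        "worst_step_index": worst,           # where to start debugging
        "worst_step_score": step_scores[worst],
        "bad_steps": sum(1 for s in step_scores if s <= bad_threshold),
    }


# Two runs that end with the same correct outcome.
clean = trajectory_report([5, 4, 5, 5])
messy = trajectory_report([5, 1, 2, 5])
print(clean)
print(messy)
```

Both runs would look identical under outcome-only scoring. The report separates them and names step 1 of the messy run as the first place to look.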

Eval Pipeline That Ships

sequenceDiagram
    participant CI as CI Pipeline
    participant H as Eval Harness
    participant A as Agent
    participant J as LLM Judge
    participant D as Dashboard
    CI->>H: trigger on PR
    H->>A: run task suite
    A->>H: trajectory + tool log
    H->>J: rubric-based grading
    J->>H: scores
    H->>D: emit metrics
    D->>CI: pass/fail gate

Three rules that make this stick in practice:

Determinism where you can get it: pin model versions, seed where supported, snapshot tool fixtures
Stratified test suites: split into unit, integration, regression, and adversarial — different gates for each
Cost in the gate: a PR that doubles cost should fail the gate even if outcome is unchanged
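The "cost in the gate" rule can be sketched as a comparison between baseline and candidate suite metrics. Everything specific here is an assumption for illustration: the `SuiteMetrics` fields, the tolerance values, and the 1.25x cost ceiling should be replaced with whatever your team agrees on.

```python
from dataclasses import dataclass


@dataclass
class SuiteMetrics:
    outcome_rate: float             # fraction of tasks whose final state matched the goal
    trajectory_score: float         # mean trajectory score on the 1-5 scale
    tool_selection_accuracy: float  # fraction of calls that picked the right tool
    cost_per_task_usd: float


# Hypothetical tolerances: largest allowed absolute drop per quality metric.
MAX_DROP = {
    "outcome_rate": 0.02,
    "trajectory_score": 0.10,
    "tool_selection_accuracy": 0.02,
}
MAX_COST_RATIO = 1.25  # candidate may cost at most 25% more than baseline


def release_gate(baseline: SuiteMetrics, candidate: SuiteMetrics) -> list[str]:
    """Return human-readable failure reasons; an empty list means the gate passes."""
    failures = []
    for name, max_drop in MAX_DROP.items():
        drop = getattr(baseline, name) - getattr(candidate, name)
        if drop > max_drop:
            failures.append(f"{name} dropped by {drop:.3f} (limit {max_drop})")
    if candidate.cost_per_task_usd > baseline.cost_per_task_usd * MAX_COST_RATIO:
        failures.append(
            f"cost_per_task_usd rose from {baseline.cost_per_task_usd:.3f} "
            f"to {candidate.cost_per_task_usd:.3f}"
        )
    return failures


baseline = SuiteMetrics(0.91, 4.4, 0.95, 0.040)
doubled_cost = SuiteMetrics(0.91, 4.4, 0.95, 0.080)  # same quality, twice the cost
print(release_gate(baseline, doubled_cost))
```

Returning reasons instead of a bare boolean lets the CI job print why a PR was blocked, which matters when the failure is cost and the outcome numbers look unchanged.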

The Open-Source Stack in 2026

Inspect AI (UK AI Security Institute, formerly the AI Safety Institute): open-source framework for frontier-grade evals with rubric-based model grading
Braintrust and LangSmith: managed (commercial) eval + tracing
Phoenix (Arize): open-source tracing with eval support
Promptfoo: lightweight, CI-friendly
DeepEval: Python-first, RAG-and-agent focused

What This Looks Like for a Voice Agent

CallSphere's healthcare voice agent runs through this stack on every model bump. The fixed-set tasks include "schedule appointment for new patient", "verify insurance for known patient", "handle reschedule with no available slot." Outcome is database state. Trajectory is judged. Tool-use accuracy is measured against ground-truth tool sequences. Cost includes both LLM and ASR/TTS minutes. A regression in any of the four dimensions blocks release.

Sources

Tau-Bench paper — https://arxiv.org/abs/2406.12045
Berkeley Function Calling Leaderboard — https://gorilla.cs.berkeley.edu/leaderboard.html
Inspect AI — https://inspect.ai-safety-institute.org.uk
LangSmith eval docs — https://docs.smith.langchain.com/evaluation
Anthropic trajectory rubric — https://www.anthropic.com/research

Agent Evaluation Beyond Accuracy: Trajectory, Tool-Use, and Cost Metrics — operator perspective

When teams move beyond single-number accuracy in agent evaluation, one question shows up first: where does the agent loop actually end? In practice, the boundary is rarely the model; it is the contract between the orchestrator and the tools it calls. What works in production looks unglamorous on paper: small specialized agents, explicit handoffs, deterministic retries, and dashboards that show you tool latency before they show you token spend.

Why this matters for AI voice + chat agents

Agentic AI in a real call center is a different beast than a single-LLM chatbot. Instead of one model answering one prompt, you orchestrate a small team: a router that decides intent, specialists that each own a vertical (booking, intake, billing, escalation), and tools that read and write to the same Postgres your CRM trusts. Handoffs are where most production bugs hide: when Agent A passes context to Agent B, anything that isn't explicit in the message gets lost, and the user feels it as the agent "forgetting." That's why the systems that hold up under load are the ones with typed tool schemas, deterministic state stored outside the conversation, and a hard ceiling on tool calls per session. The cost story is just as important: a multi-agent loop can quietly burn 10x the tokens of a single-LLM design if you let it think out loud at every step. The fix isn't a smarter model; it's smaller agents, shorter prompts, cached system messages, and evals that fail the build when p95 latency or per-session cost regresses. CallSphere runs this pattern across 6 verticals in production, and the rule has held every time: the agent you can debug in five minutes will outlast the agent that's "smarter" on a benchmark.

FAQs

Q: What's the hardest part of running agent evaluation beyond accuracy live?

A: Scaling comes from constraint, not capability. The deployments that hold up keep each agent narrow, cap tool calls per turn, cache the system prompt, and pin a smaller model for routing while reserving the larger model for synthesis. CallSphere's stack (37 agents, 90+ tools, 115+ DB tables, 6 verticals live) is sized that way on purpose.

Q: How do you run agent evaluation beyond accuracy before shipping?

A: Hard ceilings beat heuristics. A maximum step count, an idempotency key on every tool call, and a fallback to a deterministic script when confidence drops below a threshold are what keep the loop bounded. Evals that simulate noisy inputs catch the rest before they reach a real caller.
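A bounded loop with those three guards might look like the sketch below. It is illustrative only: `run_bounded`, the `(tool, args, confidence)` action shape, and the default limits are assumptions, and the idempotency key is modeled as an in-process cache, where a real system would typically also send the key to the tool backend.

```python
import hashlib
import json
from typing import Any, Callable, Optional

Action = tuple[str, dict[str, Any], float]  # (tool name, arguments, confidence)


def idempotency_key(session_id: str, tool: str, args: dict[str, Any]) -> str:
    """Stable key: the same session, tool, and arguments always hash the same."""
    payload = json.dumps({"s": session_id, "t": tool, "a": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_bounded(
    session_id: str,
    next_action: Callable[[list], Optional[Action]],
    execute: Callable[[str, dict[str, Any]], Any],
    max_steps: int = 8,
    min_confidence: float = 0.5,
) -> tuple[str, list]:
    """Run the agent loop; return ("done" | "fallback", history)."""
    results: dict[str, Any] = {}  # idempotency key -> cached tool result
    history: list = []
    for _ in range(max_steps):
        action = next_action(history)
        if action is None:                # agent signals it is finished
            return "done", history
        tool, args, confidence = action
        if confidence < min_confidence:   # hand off to the deterministic script
            return "fallback", history
        key = idempotency_key(session_id, tool, args)
        if key not in results:            # a repeated call never re-executes
            results[key] = execute(tool, args)
        history.append((tool, args, results[key]))
    return "fallback", history            # step ceiling reached


executed: list[str] = []


def fake_execute(tool: str, args: dict[str, Any]) -> dict[str, bool]:
    executed.append(tool)
    return {"ok": True}


def scripted_policy(history: list) -> Optional[Action]:
    # Hypothetical agent that asks for the same booking twice, then stops.
    script = [("book_slot", {"slot": "09:30"}, 0.9)] * 2
    return script[len(history)] if len(history) < len(script) else None


status, history = run_bounded("session-1", scripted_policy, fake_execute)
print(status, len(history), executed)
```

The scripted agent requests the same booking twice, but the tool runs once because the second request hits the cached result for its key.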

Q: Which CallSphere verticals already rely on agent evaluation beyond accuracy?

A: It's already in production. Today CallSphere runs this pattern in Salon and After-Hours Escalation, alongside its other live verticals: Healthcare, Real Estate, Sales, and IT Helpdesk. The same orchestrator code path serves voice and chat; the difference is the tool set the router exposes.
