By Sagar Shankaran, Founder of CallSphere
Accuracy alone misses what is actually wrong with your agent. The 2026 metrics teams use to evaluate agentic systems before and after deployment.
Key takeaways
A non-agentic LLM is essentially a single-output function: input goes in, output comes out, you grade it. Agents are paths through state space. Two agents can produce identical correct answers via wildly different trajectories — one cheap and reliable, one a 47-step disaster that happened to get there. Single-number accuracy hides this completely.
In 2026 the eval stacks that work measure four things at once: outcome, trajectory, tool use, and cost. This piece walks through what each one measures, the open-source frameworks that implement them, and the dashboards that actually get watched.
flowchart TB
Run[Agent Run] --> Out[Outcome: did it succeed?]
Run --> Tr[Trajectory: was the path good?]
Run --> Tu[Tool Use: were calls correct?]
Run --> C[Cost: was it efficient?]
Out --> Score[Composite Score]
Tr --> Score
Tu --> Score
C --> Score
Score --> Gate[Release Gate]
Outcome: did the final state of the world (database row, email sent, code change) match the goal? This is the only fully objective metric. For deterministic tasks (SWE-bench, AppWorld, Tau-Bench) it is an exact match or a unit-test pass. For open-ended tasks, you need a stronger LLM as judge, grading against a rubric.
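For a deterministic task, the outcome check can be as simple as an exact match on the fields the task cares about. The following is a minimal sketch under that assumption; `outcome_matches` and the appointment fields are hypothetical names chosen for illustration, not part of any framework named here.

```python
from typing import Any, Mapping


def outcome_matches(expected: Mapping[str, Any], actual: Mapping[str, Any]) -> bool:
    """Return True when every expected field is present in the final state
    with exactly the expected value. Extra fields in `actual` are ignored."""
    return all(key in actual and actual[key] == value for key, value in expected.items())


# Hypothetical goal and final database row for a scheduling task.
goal = {"patient_id": 17, "status": "scheduled", "slot": "2026-05-04T09:30"}
final_row = {"patient_id": 17, "status": "scheduled", "slot": "2026-05-04T09:30", "id": 901}

print(outcome_matches(goal, final_row))
```

Ignoring extra fields keeps the check stable when the schema gains columns the task does not care about, such as an auto-generated `id`.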
Trajectory: was each step a reasonable continuation of the previous step? In 2026 the standard is Anthropic's trajectory rubric: an LLM judge scores each (state, action) pair on a 1-5 scale and the trajectory score is the geometric mean. A geometric mean punishes a single bad step more heavily than an arithmetic mean does, which is what you want: one obviously wrong step should not be averaged out by 19 fine ones.
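To make the aggregation concrete, here is a minimal sketch of a geometric-mean trajectory score. It assumes per-step judge scores on the 1-5 scale are already available; `trajectory_score` is a hypothetical helper name, and the example scores are made up.

```python
import math
from typing import Sequence


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Geometric mean of per-step judge scores (each expected in 1..5)."""
    if not step_scores:
        raise ValueError("trajectory has no scored steps")
    if any(s <= 0 for s in step_scores):
        raise ValueError("scores must be positive for a geometric mean")
    # Average in log space so long trajectories do not overflow.
    return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))


scores = [5.0] * 19 + [1.0]  # illustrative: 19 good steps, one clearly wrong step
arithmetic = sum(scores) / len(scores)
geometric = trajectory_score(scores)
print(f"arithmetic={arithmetic:.2f} geometric={geometric:.2f}")
```

On this input the arithmetic mean is 4.80 and the geometric mean is about 4.61. The bad step pulls the geometric mean down further, but the gap is modest on a 20-step run, so the per-step scores are still worth keeping alongside the aggregate.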
Tool use: three sub-metrics that most teams now track:
Berkeley Function Calling Leaderboard V3 and Tau-Bench measure these directly with held-out test sets. For your own agent, you instrument every tool call and pipe the results to your eval harness.
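One way to instrument your own agent is to wrap each tool so that every call is logged, then compare the logged sequence against a ground-truth sequence. This is a sketch under assumptions: `TOOL_LOG`, `instrumented`, `sequence_accuracy`, and the two example tools are hypothetical, and position-by-position matching is only one of several reasonable scoring choices.

```python
import functools
from typing import Any, Callable, Dict, List

TOOL_LOG: List[Dict[str, Any]] = []  # hypothetical in-memory sink; a real harness would persist this


def instrumented(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call is recorded with its name, arguments, and outcome."""
    @functools.wraps(tool)
    def wrapper(**kwargs: Any) -> Any:
        record: Dict[str, Any] = {"tool": tool.__name__, "args": kwargs, "ok": False}
        TOOL_LOG.append(record)
        result = tool(**kwargs)  # an exception leaves ok=False and propagates
        record["ok"] = True
        return result
    return wrapper


def sequence_accuracy(expected: List[str], actual: List[str]) -> float:
    """Fraction of positions where the called tool matches the ground-truth tool.
    Extra or missing calls count as mismatches."""
    length = max(len(expected), len(actual))
    if length == 0:
        return 1.0
    hits = sum(1 for e, a in zip(expected, actual) if e == a)
    return hits / length


@instrumented
def lookup_patient(name: str) -> int:
    return 17  # stub standing in for a real lookup


@instrumented
def book_slot(patient_id: int, slot: str) -> str:
    return "confirmed"  # stub standing in for a real booking call


book_slot(patient_id=lookup_patient(name="Ada"), slot="09:30")
called = [r["tool"] for r in TOOL_LOG]
print(called, sequence_accuracy(["lookup_patient", "book_slot"], called))
```

Logging the record before the call runs means a tool that raises still shows up in the log with `ok` set to `False`, which is usually the call you most want to see.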
Cost: per-task dollar cost, p50/p95 latency, and token consumption. This is the metric most teams forget until the bill arrives. By 2026 the better eval frameworks (Braintrust, LangSmith, Inspect AI, Arize, Phoenix) emit cost as a first-class signal alongside outcome.
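If your framework does not emit these numbers, they are cheap to compute from per-run records. The sketch below assumes you log token counts and wall-clock latency per task; the prices, the `RunCost` record, and the sample runs are all hypothetical placeholders, not real rates or measurements.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RunCost:
    input_tokens: int
    output_tokens: int
    latency_s: float


# Hypothetical prices in dollars per million tokens; substitute your provider's rates.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00


def dollars(run: RunCost) -> float:
    """Dollar cost of one run from its token counts."""
    return (run.input_tokens * PRICE_IN_PER_M + run.output_tokens * PRICE_OUT_PER_M) / 1_000_000


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Illustrative runs only.
runs: List[RunCost] = [
    RunCost(12_000, 800, 2.1),
    RunCost(15_000, 1_200, 2.8),
    RunCost(90_000, 6_000, 14.5),  # one runaway trajectory
    RunCost(11_000, 700, 1.9),
]
latencies = [r.latency_s for r in runs]
print(f"mean cost/task=${sum(map(dollars, runs)) / len(runs):.4f}")
print(f"p50={percentile(latencies, 50)}s p95={percentile(latencies, 95)}s")
```

With these sample runs the p50 latency is 2.1 s while the p95 is 14.5 s: the single runaway trajectory barely moves the median but dominates the tail, which is why both are worth tracking.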
A common mistake is scoring trajectories only at the end. If an agent run is 100 steps and you evaluate only the final answer, a 99-step disaster gets the same score as a clean 99-step path that happened to end in the same place. Score per step, aggregate at the trajectory level, and show both on your dashboard.
sequenceDiagram
participant CI as CI Pipeline
participant H as Eval Harness
participant A as Agent
participant J as LLM Judge
participant D as Dashboard
CI->>H: trigger on PR
H->>A: run task suite
A->>H: trajectory + tool log
H->>J: rubric-based grading
J->>H: scores
H->>D: emit metrics
D->>CI: pass/fail gate
Three rules that make this stick in practice:
CallSphere's healthcare voice agent runs through this stack on every model bump. The fixed-set tasks include "schedule appointment for new patient", "verify insurance for known patient", "handle reschedule with no available slot." Outcome is database state. Trajectory is judged. Tool-use accuracy is measured against ground-truth tool sequences. Cost includes both LLM and ASR/TTS minutes. A regression in any of the four dimensions blocks release.
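A gate of this kind can be sketched as a comparison of candidate metrics against a stored baseline. This is an illustration, not CallSphere's implementation: the metric names, the 2% relative tolerance, and the numbers are all assumptions.

```python
from typing import Dict, List

# Hypothetical metric names; True means higher is better, False means lower is better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "outcome_pass_rate": True,
    "trajectory_score": True,
    "tool_use_accuracy": True,
    "cost_per_task_usd": False,
}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.02) -> List[str]:
    """Return the metrics where the candidate is worse than the baseline by more
    than `tolerance` (relative). An empty list means the gate passes."""
    failed = []
    for name, higher_better in HIGHER_IS_BETTER.items():
        base, cand = baseline[name], candidate[name]
        if higher_better:
            worse = cand < base * (1 - tolerance)
        else:
            worse = cand > base * (1 + tolerance)
        if worse:
            failed.append(name)
    return failed


# Illustrative numbers, not measurements.
baseline = {"outcome_pass_rate": 0.94, "trajectory_score": 4.3,
            "tool_use_accuracy": 0.97, "cost_per_task_usd": 0.11}
candidate = {"outcome_pass_rate": 0.95, "trajectory_score": 4.4,
             "tool_use_accuracy": 0.97, "cost_per_task_usd": 0.16}

failed = regressions(baseline, candidate)
print("BLOCK" if failed else "PASS", failed)
```

Here the candidate improves outcome and trajectory but costs noticeably more per task, so the gate blocks on `cost_per_task_usd` alone. A single composite score would likely have hidden that.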
When teams move agent evaluation beyond accuracy, one question shows up first: where does the agent loop actually end? In practice, the boundary is rarely the model; it is the contract between the orchestrator and the tools it calls. What works in production looks unglamorous on paper: small specialized agents, explicit handoffs, deterministic retries, and dashboards that show you tool latency before they show you token spend.
Agentic AI in a real call center is a different beast than a single-LLM chatbot. Instead of one model answering one prompt, you orchestrate a small team: a router that decides intent, specialists that own a vertical (booking, intake, billing, escalation), and tools that read and write to the same Postgres your CRM trusts. Handoffs are where most production bugs hide: when Agent A passes context to Agent B, anything that isn't explicit in the message gets lost, and the user feels it as the agent "forgetting." That's why the systems that hold up under load are the ones with typed tool schemas, deterministic state stored outside the conversation, and a hard ceiling on tool calls per session.
The cost story is just as important: a multi-agent loop can quietly burn 10x the tokens of a single-LLM design if you let it think out loud at every step. The fix isn't a smarter model; it's smaller agents, shorter prompts, cached system messages, and evals that fail the build when p95 latency or per-session cost regresses. CallSphere runs this pattern across 6 verticals in production, and the rule has held every time: the agent you can debug in five minutes will out-survive the agent that's "smarter" on a benchmark.
Q: What's the hardest part of running agent evaluation beyond accuracy live?
A: Scaling comes from constraint, not capability. The deployments that hold up keep each agent narrow, cap tool calls per turn, cache the system prompt, and pin a smaller model for routing while reserving the larger model for synthesis. CallSphere's stack — 37 agents · 90+ tools · 115+ DB tables · 6 verticals live — is sized that way on purpose.
Q: How do you evaluate agents beyond accuracy before shipping?
A: Hard ceilings beat heuristics. A maximum step count, an idempotency key on every tool call, and a fallback to a deterministic script when confidence drops below a threshold are what keep the loop bounded. Evals that simulate noisy inputs catch the rest before they reach a real caller.
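The three guards in that answer can be sketched in one small loop. Everything here is an assumption made for illustration: the ceiling of 8 steps, the 0.6 confidence threshold, the `next_step` policy interface, and the `send_reminder` tool are hypothetical, not CallSphere's orchestrator.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Optional, Tuple

MAX_STEPS = 8          # hypothetical hard ceiling on steps per session
MIN_CONFIDENCE = 0.6   # hypothetical threshold for falling back to a script

# A step is (tool_name, args, confidence); None means the agent is done.
Step = Tuple[str, Dict[str, Any], float]


def idempotency_key(session_id: str, tool: str, args: Dict[str, Any]) -> str:
    """Stable key so a repeated call with the same arguments runs only once."""
    payload = json.dumps([session_id, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_bounded(session_id: str,
                next_step: Callable[[List[Any]], Optional[Step]],
                tools: Dict[str, Callable[..., Any]],
                fallback: Callable[[], str]) -> str:
    seen: Dict[str, Any] = {}
    history: List[Any] = []
    for _ in range(MAX_STEPS):
        step = next_step(history)
        if step is None:
            return "done"
        tool, args, confidence = step
        if confidence < MIN_CONFIDENCE:
            return fallback()  # low confidence: hand off to the deterministic script
        key = idempotency_key(session_id, tool, args)
        if key not in seen:  # skip duplicate side effects
            seen[key] = tools[tool](**args)
        history.append(seen[key])
    return fallback()  # ceiling hit: hand off to the deterministic script


calls: List[int] = []


def send_reminder(patient_id: int) -> str:
    calls.append(patient_id)
    return "sent"


# A stuck policy that repeats the same call forever.
result = run_bounded("s1", lambda h: ("send_reminder", {"patient_id": 17}, 0.9),
                     {"send_reminder": send_reminder}, lambda: "scripted_handoff")
print(result, len(calls))
```

Even with a policy that never stops, the loop exits after `MAX_STEPS` iterations and the reminder is sent once, because the repeated calls share an idempotency key.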
Q: Which CallSphere verticals already rely on agent evaluation beyond accuracy?
A: It's already in production. Today CallSphere runs this pattern in Salon and After-Hours Escalation, alongside the other live verticals (Healthcare, Real Estate, Sales, IT Helpdesk). The same orchestrator code path serves voice and chat; the difference is the tool set the router exposes.
Want to see healthcare agents handle real traffic? Spin up a walkthrough at https://healthcare.callsphere.tech or grab 20 minutes on the calendar: https://calendly.com/sagar-callsphere/new-meeting.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.