Cost-Aware Agent Evaluation: Putting Token Spend, Latency, and Quality on the Same Dashboard
Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.
TL;DR
A single eval score — accuracy, BLEU, an LLM-as-judge rubric — is the most misleading number on your release dashboard. It tells you whether the agent answered correctly. It tells you nothing about whether you can afford to ship it. The release that scores 96 might cost 4× and run 2.3× slower than the release that scores 94, and neither of those is automatically the right call.
What you want is a Pareto view across three axes: cost (tokens or dollars per turn), latency (p50 and p95 wall-clock), and quality (whatever your eval says). The right release is the one that wins on the axis that matters for your product right now. This post shows how we build that dashboard for CallSphere agents using LangSmith trace metadata and a small amount of TypeScript and Python.
Why Quality-Only Eval Misleads
Three real release decisions from the last quarter — same eval suite, different right answers:
- Voice agent, after-hours escalation. Quality moved 92.1 → 93.4. p95 latency moved 1.1s → 1.9s. We rolled back. On a phone call, anything past 1.5s p95 is a hang-up risk; the +1.3 quality points were not worth losing 8% of calls.
- Chat agent, healthcare intake. Quality moved 89 → 91. Cost per turn moved $0.011 → $0.009. We shipped immediately. Better and cheaper, ship it.
- Sales SDR agent. Quality moved 88 → 92. Cost per turn moved $0.018 → $0.041. We shipped because outbound conversion is dollar-dominant — a 4-point quality bump pays back the cost increase in two extra meetings booked per week.
Same eval suite, three quality gains, three different decisions. The eval score alone does not contain enough information to decide. The dashboard does.
The Three Axes
```mermaid
graph TD
  R[Release candidate] --> Q[Quality<br/>LLM-as-judge,<br/>trajectory pass-rate,<br/>task completion]
  R --> C[Cost<br/>total_tokens,<br/>total_cost from LangSmith,<br/>$ per turn]
  R --> L[Latency<br/>p50, p95 wall-clock,<br/>time-to-first-token,<br/>tool-call latency]
  Q --> D{Pareto check<br/>vs current prod}
  C --> D
  L --> D
  D -->|better on at least<br/>one axis, no axis worse| S[Ship]
  D -->|worse on at least<br/>one axis, no axis better| H[Hold + investigate]
  D -->|better somewhere,<br/>worse somewhere| T[Product call:<br/>which axis matters?]
```
Three axes, three rules:
- Quality — your eval suite. Final-answer accuracy, trajectory pass-rate, task completion, LLM-as-judge — pick the one that correlates with the user outcome. We use a weighted blend of trajectory correctness and final-answer groundedness (a minimal sketch follows this list).
- Cost — `total_tokens` and `total_cost` from trace metadata, normalized to dollars per completed turn. Per the LangSmith observability docs, every trace exposes both fields automatically when models are configured with token pricing.
- Latency — p50 and p95 wall-clock in milliseconds. For voice, p95 is the gate. For async chat, p50 plus a long-tail SLO. For batch agents, total wall-clock matters more than either.
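The quality blend is the only non-obvious piece of that list, and it is small. A minimal sketch in Python, with hypothetical evaluator keys and weights (substitute whatever your eval suite actually emits):

```python
# Hypothetical evaluator keys and weights -- substitute the names your eval suite emits.
QUALITY_WEIGHTS = {"trajectory_correctness": 0.6, "final_answer_groundedness": 0.4}

def blended_quality(scores: dict[str, float]) -> float:
    """Weighted blend of per-evaluator scores, ignoring evaluators that did not run."""
    present = {key: w for key, w in QUALITY_WEIGHTS.items() if key in scores}
    total_weight = sum(present.values())
    if total_weight == 0:
        return 0.0
    return sum(scores[key] * w for key, w in present.items()) / total_weight

# blended_quality({"trajectory_correctness": 0.91, "final_answer_groundedness": 0.88}) ~= 0.898
```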
Reading the Numbers Out of LangSmith
The data you need is already in the trace tree. The LangSmith observability docs describe the exact fields:
```ts
// types from langsmith
interface Run {
  id: string;
  start_time: string;
  end_time: string;
  total_tokens: number | null;      // sum across all child LLM calls
  prompt_tokens: number | null;
  completion_tokens: number | null;
  total_cost: number | null;        // USD, computed from model pricing
  latency: number | null;           // ms
  outputs: Record<string, unknown>;
  child_runs: Run[];
}
```
A minimal cost-and-latency aggregator over an experiment's traces:
```ts
// scripts/cost-latency-agg.ts
import { Client } from 'langsmith';

interface AggRow {
  experiment: string;
  n: number;
  quality: number;      // mean of your scoring evaluator
  meanCost: number;     // USD per turn
  p50Latency: number;   // ms
  p95Latency: number;   // ms
  meanTokens: number;
}

const client = new Client();

async function aggregate(experimentName: string): Promise<AggRow> {
  const runs: any[] = [];
  for await (const r of client.listRuns({
    projectName: experimentName,
    executionOrder: 1, // top-level traces only
  })) {
    runs.push(r);
  }

  const costs = runs.map(r => r.total_cost ?? 0);
  const tokens = runs.map(r => r.total_tokens ?? 0);
  const latencies = runs
    .map(r => new Date(r.end_time).getTime() - new Date(r.start_time).getTime())
    .sort((a, b) => a - b);
  const qualityScores = runs.flatMap(r =>
    r.feedback_stats?.quality?.avg !== undefined
      ? [r.feedback_stats.quality.avg as number]
      : []
  );

  return {
    experiment: experimentName,
    n: runs.length,
    quality: mean(qualityScores),
    meanCost: mean(costs),
    meanTokens: mean(tokens),
    p50Latency: percentile(latencies, 0.5),
    p95Latency: percentile(latencies, 0.95),
  };
}

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

const percentile = (sorted: number[], p: number) =>
  sorted.length ? sorted[Math.floor((sorted.length - 1) * p)] : 0;
```
That is the entire ETL — no warehouse, no Spark job. Run it after every evaluate() call and shove the row into a tiny SQLite or Postgres table keyed by git SHA. The dashboard is a SQL query.
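The storage side is equally small. Here is a minimal sketch using Python's stdlib `sqlite3`, with one row per git SHA and experiment mirroring the AggRow fields above; the schema and column names are illustrative, not a prescribed layout:

```python
# store_agg.py -- tiny release-metrics store; schema and column names are illustrative.
import sqlite3

conn = sqlite3.connect("release_metrics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS release_metrics (
        git_sha        TEXT,
        experiment     TEXT,
        n              INTEGER,
        quality        REAL,
        mean_cost_usd  REAL,
        p50_latency_ms REAL,
        p95_latency_ms REAL,
        created_at     TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (git_sha, experiment)
    )
""")

def record(git_sha: str, row: dict) -> None:
    """Insert or replace the aggregated row produced after an evaluate() run."""
    conn.execute(
        "INSERT OR REPLACE INTO release_metrics "
        "(git_sha, experiment, n, quality, mean_cost_usd, p50_latency_ms, p95_latency_ms) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (git_sha, row["experiment"], row["n"], row["quality"],
         row["meanCost"], row["p50Latency"], row["p95Latency"]),
    )
    conn.commit()

# The "dashboard" is a query over this table: last ten candidates, three axes side by side.
DASHBOARD_SQL = """
    SELECT git_sha, quality, mean_cost_usd, p95_latency_ms
    FROM release_metrics
    ORDER BY created_at DESC
    LIMIT 10
"""
```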
Building the Pareto View
A release candidate dominates when it is at least as good as production on every axis and strictly better on at least one. Anything else is a tradeoff, and tradeoffs need a human.
```python
# pareto.py
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    quality: float        # higher is better
    cost: float           # lower is better, USD per turn
    p95_latency_ms: int   # lower is better

def dominates(a: Release, b: Release) -> bool:
    """a dominates b iff a >= b on every axis and a > b on at least one."""
    at_least_as_good = (
        a.quality >= b.quality
        and a.cost <= b.cost
        and a.p95_latency_ms <= b.p95_latency_ms
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost < b.cost
        or a.p95_latency_ms < b.p95_latency_ms
    )
    return at_least_as_good and strictly_better

def classify(candidate: Release, prod: Release) -> str:
    if dominates(candidate, prod):
        return "SHIP"
    if dominates(prod, candidate):
        return "REGRESSION"
    return "TRADEOFF"  # human must decide
Three outputs, three colors on the dashboard. Anything green ships automatically. Anything red blocks. Anything yellow goes to the product owner with the specific axis that regressed.
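Plugging two rows from the release table in the next section into `classify` shows the mechanics:

```python
prod      = Release(name="v2.3.1", quality=94.6, cost=0.0140, p95_latency_ms=1180)
candidate = Release(name="v2.3.2", quality=94.7, cost=0.0098, p95_latency_ms=1090)
tradeoff  = Release(name="v2.3.3", quality=95.8, cost=0.0211, p95_latency_ms=1420)

print(classify(candidate, prod))  # "SHIP" -- at least as good everywhere, strictly better somewhere
print(classify(tradeoff, prod))   # "TRADEOFF" -- better quality, worse cost and p95
```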
Real Dashboard, Real Numbers
This is the actual release table for our healthcare voice agent: the last five candidates measured against current production, v2.3.1.
| Candidate | Quality | $/turn | p95 latency | Decision |
|---|---|---|---|---|
| v2.3.1 (prod) | 94.6 | $0.0140 | 1,180ms | baseline |
| v2.3.2 | 94.7 | $0.0098 | 1,090ms | SHIP (dominates) |
| v2.3.3 | 95.8 | $0.0211 | 1,420ms | TRADEOFF (+1.2 quality, +51% cost, +20% p95) |
| v2.3.4 | 94.4 | $0.0132 | 1,050ms | TRADEOFF (-0.2 quality, -6% cost, -11% p95) |
| v2.3.5 | 96.1 | $0.0145 | 1,920ms | HOLD (p95 over 1.5s gate) |
| v2.3.6 | 95.0 | $0.0091 | 980ms | SHIP (dominates baseline + v2.3.2) |
Two automatic ships, one automatic hold (p95 budget), and two genuine tradeoffs that needed a human. Without the three-axis view, v2.3.5 (the highest-quality candidate) would have been the obvious choice — and it would have tanked our hang-up rate.
Latency Decomposition Matters as Much as Total
A single p95 number hides where the time goes. The right move is to attribute latency to layers so you know which knob to turn. We instrument three:
| Layer | Typical share of p95 | Lever |
|---|---|---|
| LLM inference | 40–60% | Smaller model on simple intents, prompt caching |
| Tool calls (DB, webhooks, RAG) | 25–45% | Add indexes, cache, reduce N+1 lookups |
| Orchestration overhead | 5–15% | Streaming, parallel tool calls, prune child runs |
The trace tree gives you all three for free. We pull `child_runs` grouped by `run_type` (`llm`, `tool`, `chain`) and sum the wall-clocks. When p95 regresses, the decomposition table tells you whether to look at the prompt, the database, or the graph.
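Here is a minimal sketch of that decomposition in Python, assuming each trace is a dict shaped like the Run interface earlier (ISO-8601 timestamps, a `run_type` on each child run); nested children and overlapping parallel calls are ignored for brevity:

```python
# latency_decomposition.py -- attribute one trace's wall-clock to llm / tool / overhead.
from collections import defaultdict
from datetime import datetime

def wall_clock_ms(run: dict) -> float:
    """Wall-clock of a run in ms from its ISO-8601 start/end timestamps."""
    start = datetime.fromisoformat(run["start_time"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(run["end_time"].replace("Z", "+00:00"))
    return (end - start).total_seconds() * 1000

def decompose(trace: dict) -> dict[str, float]:
    """Sum direct child-run wall-clock by run_type; the remainder is orchestration overhead."""
    by_type: dict[str, float] = defaultdict(float)
    for child in trace.get("child_runs", []):
        by_type[child.get("run_type", "chain")] += wall_clock_ms(child)
    total = wall_clock_ms(trace)
    # Parallel child runs can over-count, so clamp the remainder at zero.
    by_type["orchestration_overhead"] = max(0.0, total - sum(by_type.values()))
    return dict(by_type)
```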
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Cost Levers Worth Knowing
When the cost axis is the one regressing, in rough order of impact:
- Drop unnecessary context. Half of cost regressions come from accidentally re-summarizing the conversation history at every step. Audit `prompt_tokens` per step (see the sketch after this list).
- Right-size the model. Mini-class models on classification, planner-class on synthesis, frontier-class only when truly needed. Routing alone has saved us 35–50% on chat agents.
- Prompt caching. Cuts repeated-prefix cost by ~90%. Free win on long system prompts.
- Cap tool-call retries. Two retries, then escalate. Three or more retries is almost never worth the cost.
- Aggressive structured outputs. JSON-mode or grammar-constrained outputs cut completion tokens by 20–40% versus free-form text.
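The first lever is the easiest one to check mechanically. A minimal sketch, assuming traces are dicts shaped like the Run interface earlier with `prompt_tokens` on each child LLM run:

```python
# prompt_token_audit.py -- flag traces whose prompt keeps growing step over step.
def prompt_tokens_per_step(trace: dict) -> list[int]:
    """prompt_tokens for each direct child LLM call, in call order."""
    return [
        child.get("prompt_tokens") or 0
        for child in trace.get("child_runs", [])
        if child.get("run_type") == "llm"
    ]

def has_context_bloat(trace: dict, growth_factor: float = 1.5) -> bool:
    """True if any step's prompt is more than 1.5x the previous step's --
    a hint that history is being re-stuffed or re-summarized instead of trimmed."""
    steps = prompt_tokens_per_step(trace)
    return any(b > a * growth_factor for a, b in zip(steps, steps[1:]) if a > 0)
```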
How CallSphere Ships Releases on This
Every release of every agent in the CallSphere product line — voice and chat, healthcare, real estate, sales, IT helpdesk, salon, after-hours — runs through the three-axis gate. We post the SHIP/HOLD/TRADEOFF result in a Slack channel that engineering, product, and ops all watch. Engineering does not unilaterally ship tradeoffs; product calls them, with the cost and latency numbers in front of them.
The dashboard is the contract. It has eliminated the "we shipped because the score went up" failure mode and cut our rollback rate by roughly two-thirds.
Common Anti-Patterns
- Reporting only mean latency. Means hide tail behavior. p95 (and p99 for voice) is what users feel.
- Reporting cost in tokens. Tokens are not money — different models have wildly different prices. Always report dollars per turn.
- Single-number composite scoring. `0.5*quality - 0.3*cost - 0.2*latency` looks rigorous and is mostly garbage. The weights are arbitrary, the units don't match, and it hides the tradeoff. Show three numbers.
- No production sample. CI evals run on a curated dataset; production traffic does not match it. Sample 1–5% of prod traces through the same evaluators or the dashboard is fiction (a sketch follows this list).
- No SLO. The dashboard is colored thresholds, and thresholds need agreement. Pick a p95 latency budget, a cost-per-turn budget, and a quality floor before you start shipping releases.
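Closing the production-sample gap does not need much code either. A minimal sketch, assuming the same aggregator feeds per-turn costs into a rolling window; the sampling rate and the 15% drift threshold match the numbers in the FAQ below:

```python
# prod_cost_drift.py -- sample prod traces and alert on cost-per-turn drift.
import random

SAMPLE_RATE = 0.02  # 1-5% of production traces is plenty

def should_sample() -> bool:
    """Decide per trace whether it goes through the eval + cost aggregator."""
    return random.random() < SAMPLE_RATE

def cost_drift_alert(last_24h_costs: list[float],
                     trailing_7d_costs: list[float],
                     threshold: float = 0.15) -> bool:
    """True when mean cost-per-turn over the last 24h is >15% away from the 7-day baseline."""
    if not last_24h_costs or not trailing_7d_costs:
        return False
    recent = sum(last_24h_costs) / len(last_24h_costs)
    baseline = sum(trailing_7d_costs) / len(trailing_7d_costs)
    return baseline > 0 and abs(recent - baseline) / baseline > threshold
```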
FAQ
Q1: How do I get total_cost to show up if I'm using a custom model?
Configure the model's pricing in your LangSmith model registry, or set metadata={"cost": ...} on each LLM call. The trace tree will then aggregate it automatically.
Q2: What's a reasonable p95 latency budget?
For realtime voice, 1.0–1.5 seconds end-to-end. For chat, 3–5 seconds. For async or batch, whatever the user-facing SLA is. The number matters less than having one.
Q3: How do I weight cost vs quality?
Don't. Show both, classify the release as SHIP/REGRESSION/TRADEOFF, and let a product owner decide on tradeoffs with the actual numbers in front of them.
Q4: Should I run this on every PR?
Yes — but on a small dataset (50–200 examples) so it stays under 10 minutes. Run the full suite (1k+ examples) nightly and on release candidates.
Q5: How do I detect cost regressions in production, not just CI?
Sample production traces through the same aggregator on a rolling 24h window. Alert when mean cost-per-turn drifts more than 15% from the trailing 7-day baseline. We have caught two silent prompt regressions this quarter that way.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.