By Sagar Shankaran, Founder of CallSphere
Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.
Key takeaways
A single eval score — accuracy, BLEU, an LLM-as-judge rubric — is the most misleading number on your release dashboard. It tells you whether the agent answered correctly. It tells you nothing about whether you can afford to ship it. The release that scores 96 might cost 4× and run 2.3× slower than the release that scores 94, and neither of those is automatically the right call.
What you want is a Pareto view across three axes: cost (tokens or dollars per turn), latency (p50 and p95 wall-clock), and quality (whatever your eval says). The right release is the one that dominates on the axis that matters for your product right now. This post shows how we build that dashboard for CallSphere agents using LangSmith trace metadata and a tiny amount of Python.
Three real release decisions from the last quarter — same eval suite, different right answers:
Same delta in eval score, three different decisions. The eval score does not contain enough information to decide. The dashboard does.
graph TD
R[Release candidate] --> Q[Quality<br/>LLM-as-judge,<br/>trajectory pass-rate,<br/>task completion]
R --> C[Cost<br/>total_tokens,<br/>total_cost from LangSmith,<br/>$ per turn]
R --> L[Latency<br/>p50, p95 wall-clock,<br/>time-to-first-token,<br/>tool-call latency]
Q --> D{Pareto check<br/>vs current prod}
C --> D
L --> D
D -->|dominates on<br/>>=1 axis,<br/>tied on rest| S[Ship]
D -->|regresses on<br/>any axis| H[Hold + investigate]
D -->|tradeoff| T[Product call:<br/>which axis matters?]
Three axes, three rules:
Cost: total_tokens and total_cost from trace metadata, normalized to dollars per completed turn. Per the LangSmith observability docs, every trace exposes both fields automatically when models are configured with token pricing.
The data you need is already in the trace tree. The LangSmith observability docs document the exact fields:
// types from langsmith
interface Run {
  id: string;
  start_time: string;
  end_time: string;
  total_tokens: number | null;      // sum across all child LLM calls
  prompt_tokens: number | null;
  completion_tokens: number | null;
  total_cost: number | null;        // USD, computed from model pricing
  latency: number | null;           // ms
  outputs: Record<string, unknown>;
  child_runs: Run[];
}
A minimal cost-and-latency aggregator over an experiment's traces:
// scripts/cost-latency-agg.ts
import { Client } from 'langsmith';

interface AggRow {
  experiment: string;
  n: number;
  quality: number;    // mean of your scoring evaluator
  meanCost: number;   // USD per turn
  p50Latency: number; // ms
  p95Latency: number; // ms
  meanTokens: number;
}

const client = new Client();

async function aggregate(experimentName: string): Promise<AggRow> {
  const runs: any[] = [];
  for await (const r of client.listRuns({
    projectName: experimentName,
    executionOrder: 1, // top-level traces only
  })) {
    runs.push(r);
  }

  const costs = runs.map(r => r.total_cost ?? 0);
  const tokens = runs.map(r => r.total_tokens ?? 0);
  const latencies = runs
    .map(r => new Date(r.end_time).getTime() - new Date(r.start_time).getTime())
    .sort((a, b) => a - b);
  const qualityScores = runs.flatMap(r =>
    (r.feedback_stats?.quality?.avg !== undefined)
      ? [r.feedback_stats.quality.avg as number]
      : []
  );

  return {
    experiment: experimentName,
    n: runs.length,
    quality: mean(qualityScores),
    meanCost: mean(costs),
    meanTokens: mean(tokens),
    p50Latency: percentile(latencies, 0.5),
    p95Latency: percentile(latencies, 0.95),
  };
}

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

const percentile = (sorted: number[], p: number) =>
  sorted.length ? sorted[Math.floor((sorted.length - 1) * p)] : 0;
That is the entire ETL — no warehouse, no Spark job. Run it after every evaluate() call and shove the row into a tiny SQLite or Postgres table keyed by git SHA. The dashboard is a SQL query.
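The "tiny table keyed by git SHA" could look roughly like the sketch below, written with Python's standard-library sqlite3 module. The table name `release_metrics`, its column names, and the helpers `save_row` and `compare` are illustrative assumptions rather than CallSphere's actual schema; the row keys mirror the AggRow fields, and the numbers are placeholders.

```python
import sqlite3

# Hypothetical schema: one row per evaluated commit, columns mirroring AggRow.
SCHEMA = """
CREATE TABLE IF NOT EXISTS release_metrics (
    git_sha        TEXT PRIMARY KEY,
    experiment     TEXT NOT NULL,
    n              INTEGER NOT NULL,
    quality        REAL NOT NULL,
    mean_cost_usd  REAL NOT NULL,
    p50_latency_ms REAL NOT NULL,
    p95_latency_ms REAL NOT NULL,
    mean_tokens    REAL NOT NULL
)
"""


def save_row(conn: sqlite3.Connection, git_sha: str, row: dict) -> None:
    """Insert or overwrite the aggregated metrics for one git SHA."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO release_metrics VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (
            git_sha, row["experiment"], row["n"], row["quality"], row["meanCost"],
            row["p50Latency"], row["p95Latency"], row["meanTokens"],
        ),
    )
    conn.commit()


def compare(conn: sqlite3.Connection, prod_sha: str, candidate_sha: str) -> list:
    """The 'dashboard query': the three axes for prod and candidate, side by side."""
    return conn.execute(
        "SELECT git_sha, quality, mean_cost_usd, p95_latency_ms "
        "FROM release_metrics WHERE git_sha IN (?, ?) ORDER BY git_sha",
        (prod_sha, candidate_sha),
    ).fetchall()


# Placeholder SHAs and values for illustration only, not real measurements.
conn = sqlite3.connect(":memory:")
save_row(conn, "aaa1111", {"experiment": "prod", "n": 100, "quality": 90.0,
                           "meanCost": 0.02, "p50Latency": 500.0,
                           "p95Latency": 1200.0, "meanTokens": 1500.0})
save_row(conn, "bbb2222", {"experiment": "candidate", "n": 100, "quality": 91.0,
                           "meanCost": 0.015, "p50Latency": 450.0,
                           "p95Latency": 1100.0, "meanTokens": 1300.0})
for row in compare(conn, "aaa1111", "bbb2222"):
    print(row)
```

Using the SHA as the primary key means re-running an eval on the same commit overwrites the old row instead of duplicating it, which keeps the dashboard query a plain lookup.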
A release candidate dominates when it is at least as good as production on every axis and strictly better on at least one. Anything else is a tradeoff, and tradeoffs need a human.
# pareto.py
from dataclasses import dataclass


@dataclass
class Release:
    name: str
    quality: float       # higher is better
    cost: float          # lower is better, USD per turn
    p95_latency_ms: int  # lower is better


def dominates(a: Release, b: Release) -> bool:
    """a dominates b iff a >= b on every axis and a > b on at least one."""
    at_least_as_good = (
        a.quality >= b.quality
        and a.cost <= b.cost
        and a.p95_latency_ms <= b.p95_latency_ms
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost < b.cost
        or a.p95_latency_ms < b.p95_latency_ms
    )
    return at_least_as_good and strictly_better


def classify(candidate: Release, prod: Release) -> str:
    if dominates(candidate, prod):
        return "SHIP"
    if dominates(prod, candidate):
        return "REGRESSION"
    return "TRADEOFF"  # human must decide
Three outputs, three colors on the dashboard. Anything green ships automatically. Anything red blocks. Anything yellow goes to the product owner with the specific axis that regressed.
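For the yellow case, the "specific axis that regressed" can be computed next to the classification. The sketch below is one possible way to do it, assuming a hypothetical helper named `regressed_axes` that is not part of pareto.py. The `Release` dataclass is repeated so the snippet runs on its own, and the example values are the v2.3.1 and v2.3.3 rows of the release table in this post.

```python
from dataclasses import dataclass


@dataclass
class Release:
    name: str
    quality: float       # higher is better
    cost: float          # lower is better, USD per turn
    p95_latency_ms: int  # lower is better


def regressed_axes(candidate: Release, prod: Release) -> list:
    """Return the axes on which the candidate is strictly worse than prod."""
    worse = []
    if candidate.quality < prod.quality:
        worse.append("quality")
    if candidate.cost > prod.cost:
        worse.append("cost")
    if candidate.p95_latency_ms > prod.p95_latency_ms:
        worse.append("p95_latency")
    return worse


prod = Release("v2.3.1", 94.6, 0.0140, 1180)
cand = Release("v2.3.3", 95.8, 0.0211, 1420)
print(regressed_axes(cand, prod))  # ['cost', 'p95_latency']
```

An empty list means nothing got worse, so the candidate either dominates production or ties it on every axis.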
This is the actual release table for our healthcare voice agent across the last six candidates. Production is candidate v2.3.1.
| Candidate | Quality | $/turn | p95 latency | Decision |
|---|---|---|---|---|
| v2.3.1 (prod) | 94.6 | $0.0140 | 1,180ms | baseline |
| v2.3.2 | 94.7 | $0.0098 | 1,090ms | SHIP (dominates) |
| v2.3.3 | 95.8 | $0.0211 | 1,420ms | TRADEOFF (+1.2 quality, +51% cost, +20% p95) |
| v2.3.4 | 94.4 | $0.0132 | 1,050ms | TRADEOFF (-0.2 quality, -6% cost, -11% p95) |
| v2.3.5 | 96.1 | $0.0145 | 1,920ms | HOLD (p95 over 1.5s gate) |
| v2.3.6 | 95.0 | $0.0091 | 980ms | SHIP (dominates baseline + v2.3.2) |
Two automatic ships, one automatic hold (p95 budget), and two genuine tradeoffs that needed a human. Without the three-axis view, v2.3.5 (the highest-quality candidate) would have been the obvious choice — and it would have tanked our hang-up rate.
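Note that classify() as written has no notion of a hard budget, so on its own it would label v2.3.5 a TRADEOFF rather than a HOLD. One plausible way to reproduce the Decision column is to check the p95 budget before the Pareto comparison. That ordering, the constant `P95_BUDGET_MS`, and the function name `classify_with_budget` are assumptions for this sketch; the 1,500 ms value comes from the "p95 over 1.5s gate" note in the table.

```python
from dataclasses import dataclass

P95_BUDGET_MS = 1500  # assumed hard gate, taken from the table's "1.5s gate" note


@dataclass
class Release:
    name: str
    quality: float       # higher is better
    cost: float          # lower is better, USD per turn
    p95_latency_ms: int  # lower is better


def dominates(a: Release, b: Release) -> bool:
    """a is at least as good as b on every axis and strictly better on one."""
    at_least_as_good = (a.quality >= b.quality and a.cost <= b.cost
                        and a.p95_latency_ms <= b.p95_latency_ms)
    strictly_better = (a.quality > b.quality or a.cost < b.cost
                       or a.p95_latency_ms < b.p95_latency_ms)
    return at_least_as_good and strictly_better


def classify_with_budget(candidate: Release, prod: Release,
                         p95_budget_ms: int = P95_BUDGET_MS) -> str:
    """Hard latency gate first, then the three-way Pareto classification."""
    if candidate.p95_latency_ms > p95_budget_ms:
        return "HOLD"
    if dominates(candidate, prod):
        return "SHIP"
    if dominates(prod, candidate):
        return "REGRESSION"
    return "TRADEOFF"


prod = Release("v2.3.1", 94.6, 0.0140, 1180)
candidates = [
    Release("v2.3.2", 94.7, 0.0098, 1090),
    Release("v2.3.3", 95.8, 0.0211, 1420),
    Release("v2.3.4", 94.4, 0.0132, 1050),
    Release("v2.3.5", 96.1, 0.0145, 1920),
    Release("v2.3.6", 95.0, 0.0091, 980),
]
decisions = {c.name: classify_with_budget(c, prod) for c in candidates}
for name, decision in decisions.items():
    print(name, decision)
```

Run against the rows above, this yields SHIP, TRADEOFF, TRADEOFF, HOLD, SHIP, matching the table.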
A single p95 number hides where the time goes. The right move is to attribute latency to layers so you know which knob to turn. We instrument three:
| Layer | Typical share of p95 | Lever |
|---|---|---|
| LLM inference | 40–60% | Smaller model on simple intents, prompt caching |
| Tool calls (DB, webhooks, RAG) | 25–45% | Add indexes, cache, reduce N+1 lookups |
| Orchestration overhead | 5–15% | Streaming, parallel tool calls, prune child runs |
The trace tree gives you all three for free. We pull child_runs grouped by run_type (llm, tool, chain) and sum the wall-clocks. When p95 regresses, the decomposition table tells you whether to look at the prompt, the database, or the graph.
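The grouping step can be sketched in a few lines of Python. This version works on plain dicts shaped like the Run interface shown earlier (run_type, start_time, end_time, child_runs) rather than calling the LangSmith SDK, and the function names and the sample trace are hypothetical. It sums only direct children to avoid double-counting nested runs, and it treats whatever the children do not account for as overhead. That is an assumption which over-counts the layers if child runs overlap, for example with parallel tool calls.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def duration_ms(run: dict) -> float:
    """Wall-clock duration of a run in milliseconds, from ISO-8601 timestamps."""
    start = datetime.fromisoformat(run["start_time"])
    end = datetime.fromisoformat(run["end_time"])
    return (end - start) / timedelta(milliseconds=1)


def latency_by_layer(root: dict) -> dict:
    """Sum direct child durations per run_type; the remainder is 'overhead'."""
    totals = defaultdict(float)
    for child in root.get("child_runs", []):
        totals[child["run_type"]] += duration_ms(child)
    attributed = sum(totals.values())
    totals["overhead"] = max(duration_ms(root) - attributed, 0.0)
    return dict(totals)


# Made-up trace for illustration: 1,000 ms total, one LLM call, one tool call.
trace = {
    "run_type": "chain",
    "start_time": "2026-01-01T00:00:00.000",
    "end_time": "2026-01-01T00:00:01.000",
    "child_runs": [
        {"run_type": "llm", "start_time": "2026-01-01T00:00:00.050",
         "end_time": "2026-01-01T00:00:00.550", "child_runs": []},
        {"run_type": "tool", "start_time": "2026-01-01T00:00:00.600",
         "end_time": "2026-01-01T00:00:00.900", "child_runs": []},
    ],
}
layers = latency_by_layer(trace)
print(layers)
```

To get the p95 decomposition rather than a single trace's breakdown, the same function would be applied to the traces at or near the 95th percentile of total latency.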
When the cost axis is the one regressing, in rough order of impact:
prompt_tokens per step.
Every release of every agent in the CallSphere product line (voice and chat, healthcare, real estate, sales, IT helpdesk, salon, after-hours) runs through the three-axis gate. We post the SHIP/HOLD/TRADEOFF result in a Slack channel that engineering, product, and ops all watch. Engineering does not unilaterally ship tradeoffs; product calls them, with the cost and latency numbers in front of them.
The dashboard is the contract. It has eliminated the "we shipped because the score went up" failure mode and cut our rollback rate by roughly two-thirds.
A weighted score such as 0.5*quality - 0.3*cost - 0.2*latency looks rigorous and is mostly garbage. The weights are arbitrary, the units don't match, and it hides the tradeoff. Show three numbers.
Q1: How do I get total_cost to show up if I'm using a custom model? Configure the model's pricing in your LangSmith model registry, or set metadata={"cost": ...} on each LLM call. The trace tree will then aggregate it automatically.
Q2: What's a reasonable p95 latency budget? For realtime voice, 1.0–1.5 seconds end-to-end. For chat, 3–5 seconds. For async or batch, whatever the user-facing SLA is. The number matters less than having one.
Q3: How do I weight cost vs quality? Don't. Show both, classify the release as SHIP/REGRESSION/TRADEOFF, and let a product owner decide on tradeoffs with the actual numbers in front of them.
Q4: Should I run this on every PR? Yes — but on a small dataset (50–200 examples) so it stays under 10 minutes. Run the full suite (1k+ examples) nightly and on release candidates.
Q5: How do I detect cost regressions in production, not just CI? Sample production traces through the same aggregator on a rolling 24h window. Alert when mean cost-per-turn drifts more than 15% from the trailing 7-day baseline. We have caught two silent prompt regressions this quarter that way.
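The drift check in Q5 reduces to comparing two means. The sketch below assumes the per-turn costs for the rolling 24h window and the trailing 7-day baseline have already been pulled out of the aggregator as plain lists. The function names and the sample numbers are hypothetical; the 15% threshold is the one quoted above.

```python
from statistics import mean

DRIFT_THRESHOLD = 0.15  # alert when the mean moves more than 15% from baseline


def cost_drift(window_costs: list, baseline_costs: list) -> float:
    """Relative change of the window's mean cost per turn vs. the baseline mean."""
    baseline = mean(baseline_costs)
    if baseline == 0:
        raise ValueError("baseline mean cost is zero; drift is undefined")
    return (mean(window_costs) - baseline) / baseline


def should_alert(window_costs: list, baseline_costs: list,
                 threshold: float = DRIFT_THRESHOLD) -> bool:
    """True when cost per turn drifted beyond the threshold in either direction."""
    return abs(cost_drift(window_costs, baseline_costs)) > threshold


# Made-up USD-per-turn samples for illustration only.
baseline_7d = [0.014, 0.014, 0.014, 0.014, 0.014, 0.014, 0.014]
last_24h = [0.017, 0.018, 0.016]

drift = cost_drift(last_24h, baseline_7d)
print(f"drift: {drift:+.1%}, alert: {should_alert(last_24h, baseline_7d)}")
```

The check is symmetric on purpose: a sudden drop in cost per turn can be as much a sign of a silent prompt or routing change as a sudden rise.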
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.