Cost-Aware Agent Evaluation: Putting Token Spend, Latency, and Quality on the Same Dashboard
Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.
TL;DR
A single eval score — accuracy, BLEU, an LLM-as-judge rubric — is the most misleading number on your release dashboard. It tells you whether the agent answered correctly. It tells you nothing about whether you can afford to ship it. The release that scores 96 might cost 4× and run 2.3× slower than the release that scores 94, and neither of those is automatically the right call.
What you want is a Pareto view across three axes: cost (tokens or dollars per turn), latency (p50 and p95 wall-clock), and quality (whatever your eval says). The right release is the one that wins on the axis that matters for your product right now. This post shows how we build that dashboard for CallSphere agents using LangSmith trace metadata and a small amount of TypeScript and Python.
Why Quality-Only Eval Misleads
Three real release decisions from the last quarter — same eval suite, different right answers:
- Voice agent, after-hours escalation. Quality moved 92.1 → 93.4. p95 latency moved 1.1s → 1.9s. We rolled back. On a phone call, anything past 1.5s p95 is a hang-up risk; the +1.3 quality points were not worth losing 8% of calls.
- Chat agent, healthcare intake. Quality moved 89 → 91. Cost per turn moved $0.011 → $0.009. We shipped immediately. Better and cheaper, ship it.
- Sales SDR agent. Quality moved 88 → 92. Cost per turn moved $0.018 → $0.041. We shipped because outbound conversion is dollar-dominant — a 4-point quality bump pays back the cost increase in two extra meetings booked per week.
Same eval suite, three quality gains, three different decisions. The eval score alone does not contain enough information to decide. The dashboard does.
The Three Axes
```mermaid
graph TD
  R[Release candidate] --> Q[Quality<br/>LLM-as-judge,<br/>trajectory pass-rate,<br/>task completion]
  R --> C[Cost<br/>total_tokens,<br/>total_cost from LangSmith,<br/>$ per turn]
  R --> L[Latency<br/>p50, p95 wall-clock,<br/>time-to-first-token,<br/>tool-call latency]
  Q --> D{Pareto check<br/>vs current prod}
  C --> D
  L --> D
  D -->|better on at least<br/>one axis, no axis worse| S[Ship]
  D -->|worse on at least<br/>one axis, no axis better| H[Hold + investigate]
  D -->|better somewhere,<br/>worse somewhere| T[Product call:<br/>which axis matters?]
```
Three axes, three rules:
- Quality — your eval suite. Final-answer accuracy, trajectory pass-rate, task completion, LLM-as-judge — pick the one that correlates with the user outcome. We use a weighted blend of trajectory correctness and final-answer groundedness (a minimal sketch follows this list).
- Cost — `total_tokens` and `total_cost` from trace metadata, normalized to dollars per completed turn. Per the LangSmith observability docs, every trace exposes both fields automatically when models are configured with token pricing.
- Latency — p50 and p95 wall-clock in milliseconds. For voice, p95 is the gate. For async chat, p50 plus a long-tail SLO. For batch agents, total wall-clock matters more than either.
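The quality blend is the only non-obvious piece of that list, and it is small. A minimal sketch in Python, with hypothetical evaluator keys and weights (substitute whatever your eval suite actually emits):

```python
# Hypothetical evaluator keys and weights -- substitute the names your eval suite emits.
QUALITY_WEIGHTS = {"trajectory_correctness": 0.6, "final_answer_groundedness": 0.4}

def blended_quality(scores: dict[str, float]) -> float:
    """Weighted blend of per-evaluator scores, ignoring evaluators that did not run."""
    present = {key: w for key, w in QUALITY_WEIGHTS.items() if key in scores}
    total_weight = sum(present.values())
    if total_weight == 0:
        return 0.0
    return sum(scores[key] * w for key, w in present.items()) / total_weight

# blended_quality({"trajectory_correctness": 0.91, "final_answer_groundedness": 0.88}) ~= 0.898
```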
Reading the Numbers Out of LangSmith
The data you need is already in the trace tree. The LangSmith observability docs describe the exact fields:
```ts
// types from langsmith
interface Run {
  id: string;
  start_time: string;
  end_time: string;
  total_tokens: number | null;      // sum across all child LLM calls
  prompt_tokens: number | null;
  completion_tokens: number | null;
  total_cost: number | null;        // USD, computed from model pricing
  latency: number | null;           // ms
  outputs: Record<string, unknown>;
  child_runs: Run[];
}
```
A minimal cost-and-latency aggregator over an experiment's traces:
```ts
// scripts/cost-latency-agg.ts
import { Client } from 'langsmith';

interface AggRow {
  experiment: string;
  n: number;
  quality: number;      // mean of your scoring evaluator
  meanCost: number;     // USD per turn
  p50Latency: number;   // ms
  p95Latency: number;   // ms
  meanTokens: number;
}

const client = new Client();

async function aggregate(experimentName: string): Promise<AggRow> {
  const runs: any[] = [];
  for await (const r of client.listRuns({
    projectName: experimentName,
    executionOrder: 1, // top-level traces only
  })) {
    runs.push(r);
  }

  const costs = runs.map(r => r.total_cost ?? 0);
  const tokens = runs.map(r => r.total_tokens ?? 0);
  const latencies = runs
    .map(r => new Date(r.end_time).getTime() - new Date(r.start_time).getTime())
    .sort((a, b) => a - b);
  const qualityScores = runs.flatMap(r =>
    r.feedback_stats?.quality?.avg !== undefined
      ? [r.feedback_stats.quality.avg as number]
      : []
  );

  return {
    experiment: experimentName,
    n: runs.length,
    quality: mean(qualityScores),
    meanCost: mean(costs),
    meanTokens: mean(tokens),
    p50Latency: percentile(latencies, 0.5),
    p95Latency: percentile(latencies, 0.95),
  };
}

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

const percentile = (sorted: number[], p: number) =>
  sorted.length ? sorted[Math.floor((sorted.length - 1) * p)] : 0;
```
That is the entire ETL — no warehouse, no Spark job. Run it after every evaluate() call and shove the row into a tiny SQLite or Postgres table keyed by git SHA. The dashboard is a SQL query.
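The storage side is equally small. Here is a minimal sketch using Python's stdlib `sqlite3`, with one row per git SHA and experiment mirroring the AggRow fields above; the schema and column names are illustrative, not a prescribed layout:

```python
# store_agg.py -- tiny release-metrics store; schema and column names are illustrative.
import sqlite3

conn = sqlite3.connect("release_metrics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS release_metrics (
        git_sha        TEXT,
        experiment     TEXT,
        n              INTEGER,
        quality        REAL,
        mean_cost_usd  REAL,
        p50_latency_ms REAL,
        p95_latency_ms REAL,
        created_at     TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (git_sha, experiment)
    )
""")

def record(git_sha: str, row: dict) -> None:
    """Insert or replace the aggregated row produced after an evaluate() run."""
    conn.execute(
        "INSERT OR REPLACE INTO release_metrics "
        "(git_sha, experiment, n, quality, mean_cost_usd, p50_latency_ms, p95_latency_ms) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (git_sha, row["experiment"], row["n"], row["quality"],
         row["meanCost"], row["p50Latency"], row["p95Latency"]),
    )
    conn.commit()

# The "dashboard" is a query over this table: last ten candidates, three axes side by side.
DASHBOARD_SQL = """
    SELECT git_sha, quality, mean_cost_usd, p95_latency_ms
    FROM release_metrics
    ORDER BY created_at DESC
    LIMIT 10
"""
```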
Building the Pareto View
A release candidate dominates when it is at least as good as production on every axis and strictly better on at least one. Anything else is a tradeoff, and tradeoffs need a human.
```python
# pareto.py
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    quality: float        # higher is better
    cost: float           # lower is better, USD per turn
    p95_latency_ms: int   # lower is better

def dominates(a: Release, b: Release) -> bool:
    """a dominates b iff a >= b on every axis and a > b on at least one."""
    at_least_as_good = (
        a.quality >= b.quality
        and a.cost <= b.cost
        and a.p95_latency_ms <= b.p95_latency_ms
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost < b.cost
        or a.p95_latency_ms < b.p95_latency_ms
    )
    return at_least_as_good and strictly_better

def classify(candidate: Release, prod: Release) -> str:
    if dominates(candidate, prod):
        return "SHIP"
    if dominates(prod, candidate):
        return "REGRESSION"
    return "TRADEOFF"  # human must decide
Three outputs, three colors on the dashboard. Anything green ships automatically. Anything red blocks. Anything yellow goes to the product owner with the specific axis that regressed.
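Plugging two rows from the release table in the next section into `classify` shows the mechanics:

```python
prod      = Release(name="v2.3.1", quality=94.6, cost=0.0140, p95_latency_ms=1180)
candidate = Release(name="v2.3.2", quality=94.7, cost=0.0098, p95_latency_ms=1090)
tradeoff  = Release(name="v2.3.3", quality=95.8, cost=0.0211, p95_latency_ms=1420)

print(classify(candidate, prod))  # "SHIP" -- at least as good everywhere, strictly better somewhere
print(classify(tradeoff, prod))   # "TRADEOFF" -- better quality, worse cost and p95
```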
Real Dashboard, Real Numbers
This is the actual release table for our healthcare voice agent: the last five candidates measured against current production, v2.3.1.
| Candidate | Quality | $/turn | p95 latency | Decision |
|---|---|---|---|---|
| v2.3.1 (prod) | 94.6 | $0.0140 | 1,180ms | baseline |
| v2.3.2 | 94.7 | $0.0098 | 1,090ms | SHIP (dominates) |
| v2.3.3 | 95.8 | $0.0211 | 1,420ms | TRADEOFF (+1.2 quality, +51% cost, +20% p95) |
| v2.3.4 | 94.4 | $0.0132 | 1,050ms | TRADEOFF (-0.2 quality, -6% cost, -11% p95) |
| v2.3.5 | 96.1 | $0.0145 | 1,920ms | HOLD (p95 over 1.5s gate) |
| v2.3.6 | 95.0 | $0.0091 | 980ms | SHIP (dominates baseline + v2.3.2) |
Two automatic ships, one automatic hold (p95 budget), and two genuine tradeoffs that needed a human. Without the three-axis view, v2.3.5 (the highest-quality candidate) would have been the obvious choice — and it would have tanked our hang-up rate.
Latency Decomposition Matters as Much as Total
A single p95 number hides where the time goes. The right move is to attribute latency to layers so you know which knob to turn. We instrument three:
| Layer | Typical share of p95 | Lever |
|---|---|---|
| LLM inference | 40–60% | Smaller model on simple intents, prompt caching |
| Tool calls (DB, webhooks, RAG) | 25–45% | Add indexes, cache, reduce N+1 lookups |
| Orchestration overhead | 5–15% | Streaming, parallel tool calls, prune child runs |
The trace tree gives you all three for free. We pull `child_runs` grouped by `run_type` (`llm`, `tool`, `chain`) and sum the wall-clocks. When p95 regresses, the decomposition table tells you whether to look at the prompt, the database, or the graph.
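Here is a minimal sketch of that decomposition in Python, assuming each trace is a dict shaped like the Run interface earlier (ISO-8601 timestamps, a `run_type` on each child run); nested children and overlapping parallel calls are ignored for brevity:

```python
# latency_decomposition.py -- attribute one trace's wall-clock to llm / tool / overhead.
from collections import defaultdict
from datetime import datetime

def wall_clock_ms(run: dict) -> float:
    """Wall-clock of a run in ms from its ISO-8601 start/end timestamps."""
    start = datetime.fromisoformat(run["start_time"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(run["end_time"].replace("Z", "+00:00"))
    return (end - start).total_seconds() * 1000

def decompose(trace: dict) -> dict[str, float]:
    """Sum direct child-run wall-clock by run_type; the remainder is orchestration overhead."""
    by_type: dict[str, float] = defaultdict(float)
    for child in trace.get("child_runs", []):
        by_type[child.get("run_type", "chain")] += wall_clock_ms(child)
    total = wall_clock_ms(trace)
    # Parallel child runs can over-count, so clamp the remainder at zero.
    by_type["orchestration_overhead"] = max(0.0, total - sum(by_type.values()))
    return dict(by_type)
```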
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Cost Levers Worth Knowing
When the cost axis is the one regressing, in rough order of impact:
- Drop unnecessary context. Half of cost regressions come from accidentally re-summarizing the conversation history at every step. Audit `prompt_tokens` per step (see the sketch after this list).
- Right-size the model. Mini-class models on classification, planner-class on synthesis, frontier-class only when truly needed. Routing alone has saved us 35–50% on chat agents.
- Prompt caching. Cuts repeated-prefix cost by ~90%. Free win on long system prompts.
- Cap tool-call retries. Two retries, then escalate. Three or more retries is almost never worth the cost.
- Aggressive structured outputs. JSON-mode or grammar-constrained outputs cut completion tokens by 20–40% versus free-form text.
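The first lever is the easiest one to check mechanically. A minimal sketch, assuming traces are dicts shaped like the Run interface earlier with `prompt_tokens` on each child LLM run:

```python
# prompt_token_audit.py -- flag traces whose prompt keeps growing step over step.
def prompt_tokens_per_step(trace: dict) -> list[int]:
    """prompt_tokens for each direct child LLM call, in call order."""
    return [
        child.get("prompt_tokens") or 0
        for child in trace.get("child_runs", [])
        if child.get("run_type") == "llm"
    ]

def has_context_bloat(trace: dict, growth_factor: float = 1.5) -> bool:
    """True if any step's prompt is more than 1.5x the previous step's --
    a hint that history is being re-stuffed or re-summarized instead of trimmed."""
    steps = prompt_tokens_per_step(trace)
    return any(b > a * growth_factor for a, b in zip(steps, steps[1:]) if a > 0)
```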
How CallSphere Ships Releases on This
Every release of every agent in the CallSphere product line — voice and chat, healthcare, real estate, sales, IT helpdesk, salon, after-hours — runs through the three-axis gate. We post the SHIP/HOLD/TRADEOFF result in a Slack channel that engineering, product, and ops all watch. Engineering does not unilaterally ship tradeoffs; product calls them, with the cost and latency numbers in front of them.
The dashboard is the contract. It has eliminated the "we shipped because the score went up" failure mode and cut our rollback rate by roughly two-thirds.
Common Anti-Patterns
- Reporting only mean latency. Means hide tail behavior. p95 (and p99 for voice) is what users feel.
- Reporting cost in tokens. Tokens are not money — different models have wildly different prices. Always report dollars per turn.
- Single-number composite scoring. `0.5*quality - 0.3*cost - 0.2*latency` looks rigorous and is mostly garbage. The weights are arbitrary, the units don't match, and it hides the tradeoff. Show three numbers.
- No production sample. CI evals run on a curated dataset; production traffic does not match it. Sample 1–5% of prod traces through the same evaluators or the dashboard is fiction (a sketch follows this list).
- No SLO. The dashboard is colored thresholds, and thresholds need agreement. Pick a p95 latency budget, a cost-per-turn budget, and a quality floor before you start shipping releases.
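Closing the production-sample gap does not need much code either. A minimal sketch, assuming the same aggregator feeds per-turn costs into a rolling window; the sampling rate and the 15% drift threshold match the numbers in the FAQ below:

```python
# prod_cost_drift.py -- sample prod traces and alert on cost-per-turn drift.
import random

SAMPLE_RATE = 0.02  # 1-5% of production traces is plenty

def should_sample() -> bool:
    """Decide per trace whether it goes through the eval + cost aggregator."""
    return random.random() < SAMPLE_RATE

def cost_drift_alert(last_24h_costs: list[float],
                     trailing_7d_costs: list[float],
                     threshold: float = 0.15) -> bool:
    """True when mean cost-per-turn over the last 24h is >15% away from the 7-day baseline."""
    if not last_24h_costs or not trailing_7d_costs:
        return False
    recent = sum(last_24h_costs) / len(last_24h_costs)
    baseline = sum(trailing_7d_costs) / len(trailing_7d_costs)
    return baseline > 0 and abs(recent - baseline) / baseline > threshold
```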
FAQ
Q1: How do I get total_cost to show up if I'm using a custom model?
Configure the model's pricing in your LangSmith model registry, or set metadata={"cost": ...} on each LLM call. The trace tree will then aggregate it automatically.
Q2: What's a reasonable p95 latency budget?
For realtime voice, 1.0–1.5 seconds end-to-end. For chat, 3–5 seconds. For async or batch, whatever the user-facing SLA is. The number matters less than having one.
Q3: How do I weight cost vs quality?
Don't. Show both, classify the release as SHIP/REGRESSION/TRADEOFF, and let a product owner decide on tradeoffs with the actual numbers in front of them.
Q4: Should I run this on every PR?
Yes — but on a small dataset (50–200 examples) so it stays under 10 minutes. Run the full suite (1k+ examples) nightly and on release candidates.
Q5: How do I detect cost regressions in production, not just CI?
Sample production traces through the same aggregator on a rolling 24h window. Alert when mean cost-per-turn drifts more than 15% from the trailing 7-day baseline. We have caught two silent prompt regressions this quarter that way.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.