---
title: "Cost-Aware Agent Evaluation: Putting Token Spend, Latency, and Quality on the Same Dashboard"
description: "Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes."
canonical: https://callsphere.ai/blog/cost-aware-agent-evaluation-token-spend-latency-quality
category: "Agentic AI"
tags: ["Agent Cost Tracking", "LangSmith Cost", "Agent Latency Evaluation", "Pareto Frontier", "LLMOps", "Cost Optimization", "AI Engineering"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.450Z
---

# Cost-Aware Agent Evaluation: Putting Token Spend, Latency, and Quality on the Same Dashboard

> Eval scores alone mislead. Here is how we build a Pareto view across cost, latency, and quality so agent releases ship on signal, not vibes.

## TL;DR

A single eval score — accuracy, BLEU, an LLM-as-judge rubric — is the **most misleading number on your release dashboard**. It tells you whether the agent answered correctly. It tells you nothing about whether you can afford to ship it. The release that scores 96 might cost 4× as much and run 2.3× slower than the release that scores 94, and *neither of those is automatically the right call*.

What you want is a **Pareto view across three axes**: cost (tokens or dollars per turn), latency (p50 and p95 wall-clock), and quality (whatever your eval says). The right release is the one that dominates on the axis that matters for your product right now. This post shows how we build that dashboard for [CallSphere](/products) agents using LangSmith trace metadata and a tiny amount of Python.

## Why Quality-Only Eval Misleads

Three real release decisions from the last quarter — same eval suite, different right answers:

1. **Voice agent, after-hours escalation.** Quality moved 92.1 → 93.4. p95 latency moved 1.1s → 1.9s. We **rolled back**. On a phone call, anything past 1.5s p95 is a hang-up risk; the +1.3 quality points were not worth losing 8% of calls.
2. **Chat agent, healthcare intake.** Quality moved 89 → 91. Cost per turn moved $0.011 → $0.009. We **shipped immediately**. Better and cheaper, ship it.
3. **Sales SDR agent.** Quality moved 88 → 92. Cost per turn moved $0.018 → $0.041. We **shipped** because outbound conversion is dollar-dominant — a 4-point quality bump pays back the cost increase in two extra meetings booked per week.

Quality improved in all three cases, yet the right call was different each time. **The eval score does not contain enough information to decide.** The dashboard does.

## The Three Axes

```mermaid
graph TD
  R[Release candidate] --> Q["Quality<br/>LLM-as-judge,<br/>trajectory pass-rate,<br/>task completion"]
  R --> C["Cost<br/>total_tokens,<br/>total_cost from LangSmith,<br/>$ per turn"]
  R --> L["Latency<br/>p50, p95 wall-clock,<br/>time-to-first-token,<br/>tool-call latency"]

  Q --> D{"Pareto check<br/>vs current prod"}
  C --> D
  L --> D

  D -->|"dominates on >=1 axis, tied on rest"| S[Ship]
  D -->|"regresses on any axis"| H[Hold + investigate]
  D -->|tradeoff| T["Product call:<br/>which axis matters?"]
```

Three axes, three rules:

- **Quality** — your eval suite. Final-answer accuracy, trajectory pass-rate, task completion, LLM-as-judge — pick the one that correlates with the user outcome. We use a weighted blend of trajectory correctness and final-answer groundedness.
- **Cost** — `total_tokens` and `total_cost` from trace metadata, normalized to *dollars per completed turn*. Per the [LangSmith observability docs](https://docs.langchain.com/langsmith/observability), every trace exposes both fields automatically when models are configured with token pricing.
- **Latency** — p50 and p95 in milliseconds. For voice, p95 is the gate. For async chat, p50 + a long-tail SLO. For batch agents, total wall-clock matters more than either.

## Reading the Numbers Out of LangSmith

The data you need is already in the trace tree. The [LangSmith observability docs](https://docs.langchain.com/langsmith/observability) document the exact fields:

```typescript
// types from langsmith
interface Run {
  id: string;
  start_time: string;
  end_time: string;
  total_tokens: number | null;   // sum across all child LLM calls
  prompt_tokens: number | null;
  completion_tokens: number | null;
  total_cost: number | null;     // USD, computed from model pricing
  latency: number | null;        // ms
  outputs: Record<string, unknown>;
  child_runs: Run[];
}
```

A minimal cost-and-latency aggregator over an experiment's traces:

```typescript
// scripts/cost-latency-agg.ts
import { Client } from 'langsmith';

interface AggRow {
  experiment: string;
  n: number;
  quality: number;        // mean of your scoring evaluator
  meanCost: number;       // USD per turn
  p50Latency: number;     // ms
  p95Latency: number;     // ms
  meanTokens: number;
}

const client = new Client();

async function aggregate(experimentName: string): Promise<AggRow> {
  const runs: any[] = [];
  for await (const r of client.listRuns({
    projectName: experimentName,
    executionOrder: 1,        // top-level traces only
  })) {
    runs.push(r);
  }

  const costs = runs.map(r => r.total_cost ?? 0);
  const tokens = runs.map(r => r.total_tokens ?? 0);
  const latencies = runs
    .map(r => new Date(r.end_time).getTime() - new Date(r.start_time).getTime())
    .sort((a, b) => a - b);

  const qualityScores = runs.flatMap(r =>
    (r.feedback_stats?.quality?.avg !== undefined)
      ? [r.feedback_stats.quality.avg as number]
      : []
  );

  return {
    experiment: experimentName,
    n: runs.length,
    quality: mean(qualityScores),
    meanCost: mean(costs),
    meanTokens: mean(tokens),
    p50Latency: percentile(latencies, 0.5),
    p95Latency: percentile(latencies, 0.95),
  };
}

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0;

const percentile = (sorted: number[], p: number) =>
  sorted.length ? sorted[Math.floor((sorted.length - 1) * p)] : 0;
```

That is the entire ETL — no warehouse, no Spark job. Run it after every `evaluate()` call and shove the row into a tiny SQLite or Postgres table keyed by git SHA. The dashboard is a SQL query.
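
To make the "tiny table keyed by git SHA" step concrete, here is a minimal persistence sketch. It assumes `better-sqlite3` and an `AggRow` export from the aggregator script above; the file name, table name, and columns are illustrative, not our production schema.

```typescript
// scripts/record-release.ts — illustrative sketch, not a canonical schema.
import Database from 'better-sqlite3';
import type { AggRow } from './cost-latency-agg'; // hypothetical export of the interface above

const db = new Database('releases.db');
db.exec(`
  CREATE TABLE IF NOT EXISTS release_metrics (
    git_sha         TEXT PRIMARY KEY,
    experiment      TEXT NOT NULL,
    n               INTEGER,
    quality         REAL,
    mean_cost_usd   REAL,
    p50_latency_ms  REAL,
    p95_latency_ms  REAL,
    mean_tokens     REAL,
    recorded_at     TEXT DEFAULT (datetime('now'))
  )
`);

export function recordRelease(gitSha: string, row: AggRow): void {
  db.prepare(
    `INSERT OR REPLACE INTO release_metrics
       (git_sha, experiment, n, quality, mean_cost_usd, p50_latency_ms, p95_latency_ms, mean_tokens)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?)`
  ).run(
    gitSha, row.experiment, row.n, row.quality,
    row.meanCost, row.p50Latency, row.p95Latency, row.meanTokens
  );
}
```

From there the dashboard really is one `SELECT` over `release_metrics`, ordered by `recorded_at`.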

## Building the Pareto View

A **release candidate dominates** when it is at least as good as production on every axis and strictly better on at least one. Anything else is a tradeoff, and tradeoffs need a human.

```python
# pareto.py
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    quality: float       # higher is better
    cost: float          # lower is better, USD per turn
    p95_latency_ms: int  # lower is better

def dominates(a: Release, b: Release) -> bool:
    """a dominates b iff a >= b on every axis and a > b on at least one."""
    at_least_as_good = (
        a.quality >= b.quality
        and a.cost <= b.cost
        and a.p95_latency_ms <= b.p95_latency_ms
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost < b.cost
        or a.p95_latency_ms < b.p95_latency_ms
    )
    return at_least_as_good and strictly_better

def classify(candidate: Release, prod: Release) -> str:
    if dominates(candidate, prod):
        return "SHIP"
    if dominates(prod, candidate):
        return "REGRESSION"
    return "TRADEOFF"   # human must decide
```

Three outputs, three colors on the dashboard. Anything green ships automatically. Anything red blocks. Anything yellow goes to the product owner with the specific axis that regressed.

## Real Dashboard, Real Numbers

This is the actual release table for our healthcare voice agent across the last six candidates. Production is candidate `v2.3.1`.

| Candidate | Quality | $/turn | p95 latency | Decision |
| --- | --- | --- | --- | --- |
| v2.3.1 (prod) | 94.6 | $0.0140 | 1,180ms | baseline |
| v2.3.2 | 94.7 | $0.0098 | 1,090ms | **SHIP** (dominates) |
| v2.3.3 | 95.8 | $0.0211 | 1,420ms | TRADEOFF (+1.2 quality, +51% cost, +20% p95) |
| v2.3.4 | 94.4 | $0.0132 | 1,050ms | TRADEOFF (-0.2 quality, -6% cost, -11% p95) |
| v2.3.5 | 96.1 | $0.0145 | 1,920ms | HOLD (p95 over 1.5s gate) |
| v2.3.6 | 95.0 | $0.0091 | 980ms | **SHIP** (dominates baseline + v2.3.2) |

Two automatic ships, one automatic hold (p95 budget), and two genuine tradeoffs that needed a human. Without the three-axis view, v2.3.5 (the highest-quality candidate) would have been the obvious choice — and it would have tanked our hang-up rate.

## Latency Decomposition Matters as Much as Total

A single p95 number hides where the time goes. The right move is to attribute latency to **layers** so you know which knob to turn. We instrument three:

| Layer | Typical share of p95 | Lever |
| --- | --- | --- |
| LLM inference | 40–60% | Smaller model on simple intents, prompt caching |
| Tool calls (DB, webhooks, RAG) | 25–45% | Add indexes, cache, reduce N+1 lookups |
| Orchestration overhead | 5–15% | Streaming, parallel tool calls, prune child runs |

The trace tree gives you all three for free. We pull `child_runs` grouped by `run_type` (`llm`, `tool`, `chain`) and sum the wall-clocks. When p95 regresses, the decomposition table tells you whether to look at the prompt, the database, or the graph.
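
A rough sketch of that attribution, assuming each child run carries a `run_type` field (`llm`, `tool`, `chain`) alongside the start/end timestamps from the `Run` interface above; whatever the top-level trace spent outside LLM and tool calls is treated as orchestration overhead.

```typescript
// Per-layer latency attribution over one trace tree. run_type on child runs
// is an assumption here; the Run interface above only shows the timestamps.
function layerBreakdown(trace: Run): { llm: number; tool: number; orchestration: number } {
  const durationMs = (r: Run) =>
    new Date(r.end_time).getTime() - new Date(r.start_time).getTime();

  const totals = { llm: 0, tool: 0 };
  const walk = (run: Run) => {
    for (const child of run.child_runs ?? []) {
      const type = (child as Run & { run_type?: string }).run_type;
      if (type === 'llm') totals.llm += durationMs(child);
      if (type === 'tool') totals.tool += durationMs(child);
      walk(child);
    }
  };
  walk(trace);

  // Anything left over is graph transitions, serialization, queueing, etc.
  const orchestration = Math.max(durationMs(trace) - totals.llm - totals.tool, 0);
  return { ...totals, orchestration };
}
```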

## Cost Levers Worth Knowing

When the cost axis is the one regressing, in rough order of impact:

1. **Drop unnecessary context.** Half of cost regressions come from accidentally re-summarizing the conversation history at every step. Audit `prompt_tokens` per step (see the sketch after this list).
2. **Right-size the model.** Mini-class models on classification, planner-class on synthesis, frontier-class only when truly needed. Routing alone has saved us 35–50% on chat agents.
3. **Prompt caching.** Cuts repeated-prefix cost by ~90%. Free win on long system prompts.
4. **Cap tool-call retries.** Two retries, then escalate. Three+ retries is almost never worth the cost.
5. **Aggressive structured outputs.** JSON-mode or grammar-constrained outputs cut completion tokens by 20–40% versus free-form text.
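
For lever 1, a hedged sketch of the per-step audit: flag any step whose prompt grows much faster step-over-step than new user content would justify. The 1.5× threshold and the `run_type` field are assumptions for illustration, not part of the aggregator above.

```typescript
// Flags steps where prompt_tokens ballooned from one LLM call to the next —
// the usual signature of re-stuffing full history into every prompt.
function auditPromptGrowth(trace: Run, growthThreshold = 1.5): string[] {
  const llmSteps = (trace.child_runs ?? []).filter(
    (r) => (r as Run & { run_type?: string }).run_type === 'llm' && r.prompt_tokens != null
  );

  const warnings: string[] = [];
  for (let i = 1; i < llmSteps.length; i++) {
    const prev = llmSteps[i - 1].prompt_tokens!;
    const curr = llmSteps[i].prompt_tokens!;
    if (prev > 0 && curr / prev > growthThreshold) {
      warnings.push(`step ${i}: prompt_tokens grew ${prev} -> ${curr}`);
    }
  }
  return warnings;
}
```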

## How CallSphere Ships Releases on This

Every release of every agent in the [CallSphere product line](/products) — voice and chat, healthcare, real estate, sales, IT helpdesk, salon, after-hours — runs through the three-axis gate. We post the SHIP/HOLD/TRADEOFF result in a Slack channel that engineering, product, and ops all watch. Engineering does not unilaterally ship tradeoffs; product calls them, with the cost and latency numbers in front of them.

The dashboard is the contract. It has eliminated the "we shipped because the score went up" failure mode and cut our rollback rate by roughly two-thirds.

## Common Anti-Patterns

- **Reporting only mean latency.** Means hide tail behavior. p95 (and p99 for voice) is what users feel.
- **Reporting cost in tokens.** Tokens are not money — different models have wildly different prices. Always report dollars per turn.
- **Single-number composite scoring.** `0.5*quality - 0.3*cost - 0.2*latency` looks rigorous and is mostly garbage. The weights are arbitrary, the units don't match, and it hides the tradeoff. Show three numbers.
- **No production sample.** CI evals run on a curated dataset; production traffic does not match it. Sample 1–5% of prod traces through the same evaluators or the dashboard is fiction.
- **No SLO.** The dashboard's colors are just thresholds, and thresholds need agreement. Pick a p95 latency budget, a cost-per-turn budget, and a quality floor *before* you start shipping releases.

## FAQ

**Q1: How do I get `total_cost` to show up if I'm using a custom model?**
Configure the model's pricing in your LangSmith model registry, or set `metadata={"cost": ...}` on each LLM call. The trace tree will then aggregate it automatically.

**Q2: What's a reasonable p95 latency budget?**
For realtime voice, 1.0–1.5 seconds end-to-end. For chat, 3–5 seconds. For async or batch, whatever the user-facing SLA is. The number matters less than having one.

**Q3: How do I weight cost vs quality?**
Don't. Show both, classify the release as SHIP/REGRESSION/TRADEOFF, and let a product owner decide on tradeoffs with the actual numbers in front of them.

**Q4: Should I run this on every PR?**
Yes — but on a small dataset (50–200 examples) so it stays under 10 minutes. Run the full suite (1k+ examples) nightly and on release candidates.

**Q5: How do I detect cost regressions in production, not just CI?**
Sample production traces through the same aggregator on a rolling 24h window. Alert when mean cost-per-turn drifts more than 15% from the trailing 7-day baseline. We have caught two silent prompt regressions this quarter that way.
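
A minimal sketch of that check. The window sizes and the 15% threshold come straight from the answer above; the project name is illustrative, and the `startTime` filter on `listRuns` is an assumption about the client options rather than something shown earlier in this post.

```typescript
// Rolling production drift check: 24h mean cost-per-turn vs trailing 7-day baseline.
import { Client } from 'langsmith';

const client = new Client();

async function meanProdCost(hoursBack: number): Promise<number> {
  const since = new Date(Date.now() - hoursBack * 3600 * 1000);
  const costs: number[] = [];
  for await (const r of client.listRuns({
    projectName: 'prod-voice-agent',   // illustrative project name
    executionOrder: 1,                 // top-level traces only, as in the aggregator
    startTime: since,                  // assumed filter option; a filter string works too
  })) {
    const cost = (r as { total_cost?: number | null }).total_cost;
    if (cost != null) costs.push(cost);
  }
  return costs.length ? costs.reduce((a, b) => a + b, 0) / costs.length : 0;
}

async function costDriftAlert(): Promise<boolean> {
  const current = await meanProdCost(24);        // rolling 24h window
  const baseline = await meanProdCost(24 * 7);   // trailing 7-day baseline
  if (!baseline) return false;
  return Math.abs(current - baseline) / baseline > 0.15;   // >15% drift -> alert
}
```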

---

Source: https://callsphere.ai/blog/cost-aware-agent-evaluation-token-spend-latency-quality
