---
title: "Voice Agent Quality Metrics in 2026: WER, Latency, Grounding, and the Ones Most Teams Miss"
description: "The full metric set for evaluating production voice agents — STT word error rate, end-to-end latency budgets, RAG grounding, prosody, and the metrics that actually correlate with retention."
canonical: https://callsphere.ai/blog/voice-agent-quality-metrics-wer-latency-grounding
category: "Agentic AI"
tags: ["Voice Agents", "Agent Evaluation", "WER", "Production AI", "Conversational AI", "OpenAI Realtime API"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.628Z
---

# Voice Agent Quality Metrics in 2026: WER, Latency, Grounding, and the Ones Most Teams Miss

> The full metric set for evaluating production voice agents — STT word error rate, end-to-end latency budgets, RAG grounding, prosody, and the metrics that actually correlate with retention.

## TL;DR

Most voice agent teams measure two metrics — accuracy and latency — and call it done. That gets you to a working demo. It does not get you to a system that survives 50,000 calls a week without a clinic, a real estate office, or a help-desk customer noticing that something is off. The metric set that actually predicts retention is **layered**: STT, NLU, agent reasoning, TTS, and system, each with its own evaluators, each tied to user-facing outcomes like containment and CSAT. This post is the full metric model we run on [CallSphere](/products), with the formulas, instrumentation snippets, the comparison table that tells you which metric catches which class of bug, and the three metrics most teams skip until they get burned.

## Why "Accuracy" Is Not a Metric

Senior engineers at voice-agent startups still tell me their bot is "92% accurate." On what task? Measured how? Against which dataset? On audio recorded in which acoustic conditions? Single-number quality claims are the surest sign a team has not yet hit production scale. Once you have, you learn quickly that voice agents fail in five distinct layers and a single number cannot catch any of them well.

The layered model:

1. **STT layer** — did we hear the user correctly?
2. **NLU layer** — did we understand the intent?
3. **Agent reasoning layer** — did we choose the right action and produce a correct, grounded response?
4. **TTS layer** — did the response sound natural and intelligible?
5. **System layer** — did the whole stack respond fast enough and recover from interruptions?

On top of those five, **user-facing metrics** (containment, transfer rate, CSAT) tie the technical layers to the business outcome. A bug in any layer can kill the user-facing metric, which is why isolating the layer is the entire point.

## The Layered Metric Pipeline

```mermaid
flowchart TD
  A[Recorded session] --> S[STT layer]
  A --> SYS[System spans]
  S -->|transcript| N[NLU layer]
  N -->|intent + slots| R[Agent reasoning]
  R -->|response text| T[TTS layer]
  S --> M1[WER, CER]
  N --> M2[Intent F1, slot accuracy]
  R --> M3[Correctness, faithfulness, tool acc]
  T --> M4[MOS proxy, prosody, intelligibility]
  SYS --> M5[p50/p95, TTFA, barge-in success]
  M1 & M2 & M3 & M4 & M5 --> AGG[Per-call quality score]
  AGG --> UF[Containment / Transfer / CSAT proxy]
  style AGG fill:#ffd
  style UF fill:#cfc
```

*Figure 1 — Each technical layer produces its own metrics; the user-facing layer is what the business cares about. The pipeline is what lets you blame the right layer when CSAT drops.*

## Layer 1 — STT Metrics

The two foundational metrics are **Word Error Rate** and **Character Error Rate**:

```
WER = (S + D + I) / N
```

Where S, D, I are substitutions, deletions, insertions against a human reference transcript and N is the reference word count. CER is the same formula at the character level — useful when you care about names, addresses, or alphanumeric strings (insurance IDs, license plates) where one misheard letter changes meaning.

We track both because they catch different bugs:

- **WER spikes** point to acoustic mismatch (new microphone class, new accent group in the user pool, codec change in the SIP bridge).
- **CER spikes with steady WER** point to entity-level errors — the model is hearing words right but mangling proper nouns. This is where domain-specific spelling biasing pays for itself.
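CER is the same dynamic program run over characters instead of words. A minimal sketch (the function name `charErrorRate` is ours, and the rolling one-row DP is just a memory optimization over the full matrix):

```typescript
// Character Error Rate: Levenshtein distance over characters, divided by
// the reference length. Uses a rolling 1-D DP row for O(|hypothesis|) memory.
export function charErrorRate(reference: string, hypothesis: string): number {
  const r = reference.toLowerCase().replace(/\s+/g, " ");
  const h = hypothesis.toLowerCase().replace(/\s+/g, " ");
  let prev = Array.from({ length: h.length + 1 }, (_, j) => j);
  for (let i = 1; i <= r.length; i++) {
    const curr = [i];
    for (let j = 1; j <= h.length; j++) {
      const cost = r[i - 1] === h[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[h.length] / Math.max(r.length, 1);
}
```

One misheard character in an eight-character policy ID yields a CER of 0.125 while contributing a single substitution to WER — exactly the asymmetry the bullet above describes.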

A working WER implementation in TypeScript:

```ts
export function wordErrorRate(reference: string, hypothesis: string): number {
  const r = reference.toLowerCase().split(/\s+/);
  const h = hypothesis.toLowerCase().split(/\s+/);
  const dp: number[][] = Array.from({ length: r.length + 1 }, () =>
    Array(h.length + 1).fill(0)
  );
  for (let i = 0; i <= r.length; i++) dp[i][0] = i;
  for (let j = 0; j <= h.length; j++) dp[0][j] = j;
  for (let i = 1; i <= r.length; i++) {
    for (let j = 1; j <= h.length; j++) {
      const cost = r[i - 1] === h[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + cost // substitution
      );
    }
  }
  return dp[r.length][h.length] / Math.max(r.length, 1);
}
```

## Layer 2 — NLU Metrics

On the transcript side, the core metrics are **intent F1** (a confusion matrix over a human-labeled set, which catches routing failures) and **slot accuracy** (was the date, name, or quantity extracted correctly even when the intent was right?). Both are cheap to compute against a labeled set and belong in every release gate.

## Layer 3 — Agent Reasoning Metrics

The reasoning layer is scored with LLM-as-judge evaluators: **correctness** (did the agent answer the question?) and **groundedness** (is every factual claim in the response supported by the tool results?). The groundedness judge extracts the claims from a turn and checks each one against the retrieved evidence:

```python
def grounding_score(transcript: str, tool_results: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": GROUNDING_PROMPT.format(
            transcript=transcript, tool_results=tool_results
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    claims = json.loads(resp.choices[0].message.content)["claims"]
    return sum(c["grounded"] for c in claims) / max(len(claims), 1)
```

Pin the judge model with a date stamp. Calibrate quarterly against a 50-row human-labeled subset. If judge-human agreement drops below 0.85, retrain or swap the judge.
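The agreement check itself is trivial and worth automating; a sketch, assuming the judge and human verdicts for the calibration subset are parallel arrays of per-claim booleans (names are ours):

```typescript
// Fraction of claims where the LLM judge and the human labeler agree.
// Run quarterly on the human-labeled calibration subset; alert below 0.85.
export function judgeHumanAgreement(judge: boolean[], human: boolean[]): number {
  if (judge.length !== human.length || judge.length === 0) {
    throw new Error("label arrays must be parallel and non-empty");
  }
  const matches = judge.filter((verdict, i) => verdict === human[i]).length;
  return matches / judge.length;
}
```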

## Layer 4 — TTS Metrics

Underrated and under-instrumented at most teams. The metrics:

- **MOS proxy** — Mean Opinion Score is traditionally collected from human raters; a neural MOS predictor (e.g., a small wav2vec2-based model fine-tuned on labeled MOS data) gives you a continuous proxy on every utterance for ~$0.0003 each.
- **Prosody score** — does the pitch contour match the punctuation? Questions should rise; statements should fall. We use a prosody-aware classifier scored 0–1.
- **Intelligibility** — round-trip the synthesized audio through a separate STT and compute WER against the intended text. If the TTS pronounces "fifteen" as "fifty," round-trip WER catches it.
- **Phoneme stress accuracy** — for names and brand terms, we maintain a pronunciation lexicon and score adherence.

Round-trip intelligibility is the cheapest and most useful of these. Every release should run it on a fixed phrase list:

```ts
const targetPhrase = "Your appointment is at 3:15 PM on Tuesday, May 6th.";
const synthesized = await tts(targetPhrase);
const recovered = await stt(synthesized);
const intelligibility = 1 - wordErrorRate(targetPhrase, recovered);
// Target: > 0.97 on standard phrasebook
```

## Layer 5 — System Metrics

The plumbing layer. These are the metrics that predict abandonment more strongly than any reasoning metric:

| Metric | Definition | Our budget |
|---|---|---|
| Time-to-first-audio (TTFA) | Time from end-of-user-speech to first audio frame from agent | p50 ≤ 500 ms, p95 ≤ 800 ms |
| End-to-end latency | TTFA + response duration | p95 ≤ 3 s for short turns |
| Barge-in success rate | % of user interruptions where agent stops within 200 ms | ≥ 0.97 |
| Interruption recovery | % of post-barge-in turns where agent resumes the right task | ≥ 0.93 |
| Connection stability | Drops per 1000 sessions | < 4 |
| Tool latency p95 | Per-tool latency | varies, < 800 ms |

Barge-in success rate is the metric every team forgets and every user notices. We instrument it by sampling every 50th call and running an audio overlap detector offline.
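Once the overlap detector has emitted timestamps, scoring reduces to a threshold check. A sketch, assuming the detector produces (interruptStartMs, agentAudioStopMs) pairs per barge-in event — the event shape here is ours, not any SDK's:

```typescript
interface BargeInEvent {
  interruptStartMs: number;  // user speech detected over agent audio
  agentAudioStopMs: number;  // agent's output stream actually stopped
}

// A barge-in succeeds if the agent stopped within the 200 ms budget.
// An empty sample (no interruptions) counts as a pass, not a divide-by-zero.
export function bargeInSuccessRate(events: BargeInEvent[], budgetMs = 200): number {
  if (events.length === 0) return 1;
  const ok = events.filter(
    (e) => e.agentAudioStopMs - e.interruptStartMs <= budgetMs
  ).length;
  return ok / events.length;
}
```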

## The User-Facing Layer

The technical layers exist to serve the user-facing layer. Three metrics that actually correlate with revenue:

- **Containment rate** — % of calls fully resolved by the agent without human transfer. For our [healthcare and after-hours](/industries) deployments, baseline is 68%; mature deployments hit 84%.
- **Transfer rate by reason** — when transfers happen, *why*. "Out of scope" is fine. "User frustration" is a quality alarm.
- **CSAT proxy** — we run a 5-second post-call survey on a sampled subset, plus a sentiment classifier on the full transcript corpus. The classifier-derived proxy correlates 0.81 with the survey CSAT, which is good enough to use as a continuous gauge.
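The 0.81 figure is a plain Pearson correlation between the classifier proxy and the survey scores on the sampled overlap. A sketch of the computation:

```typescript
// Pearson correlation between two parallel numeric series, e.g. the
// sentiment-classifier CSAT proxy vs. surveyed CSAT on the same calls.
export function pearson(x: number[], y: number[]): number {
  const n = x.length;
  if (n !== y.length || n < 2) throw new Error("need parallel series, n >= 2");
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```

Recompute it whenever the sentiment classifier changes — the proxy is only as trustworthy as its last correlation check.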

## The Comparison Table — What Each Metric Catches

This is the table I print and put on the wall:

| Metric | What it catches | How to measure | Cost per 1k turns |
|---|---|---|---|
| WER | Acoustic mismatch, accent regressions | Levenshtein vs. human transcript | ~$0 (compute only) |
| CER | Entity-level mishearings | Char-level Levenshtein | ~$0 |
| Intent F1 | Routing failures | Confusion matrix on labeled set | ~$0 |
| Correctness | Wrong answers | LLM-as-judge | ~$2.40 |
| Groundedness | Hallucinated facts | LLM-as-judge over tool results | ~$3.10 |
| Tool-call correct | Wrong action taken | Structural diff on call args | ~$0 |
| MOS proxy | TTS quality regressions | Neural MOS predictor | ~$0.30 |
| Round-trip intelligibility | Mispronounced numbers/names | TTS → STT → WER | ~$0.40 |
| Prosody score | Robotic delivery | Classifier over pitch contour | ~$0.10 |
| TTFA p95 | Latency creep | Span timing on response.created | ~$0 |
| Barge-in success | User talks-over | Audio overlap detector | ~$0.05 |
| Containment | Business value | Session outcome label | ~$0 |
| CSAT proxy | User satisfaction | Sentiment classifier + survey | ~$0.20 |

The ones at the bottom — barge-in, prosody, intelligibility — are the most-skipped and the highest-leverage. Most teams add them only after a customer complaint forces it.

## Three Metrics Most Teams Skip Until They Get Burned

**1. Round-trip intelligibility.** I cannot count the number of voice agents I've heard say "your balance is fifty dollars" when the truth was "fifteen." A 30-line script catches every case.

**2. Confidence calibration.** If your routing rule says "transfer to human if confidence < 0.7" but your confidence scores are uncalibrated, the rule is noise. Run a reliability diagram quarterly.
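A reliability diagram is nothing more than binned (confidence, was-correct) pairs; the same bins give you expected calibration error as a single alarm number. A sketch, with our own field names:

```typescript
interface Turn { confidence: number; correct: boolean; }

// Bucket turns by confidence, compare each bucket's mean confidence to its
// empirical accuracy; ECE is the absolute gap weighted by bucket size.
export function expectedCalibrationError(turns: Turn[], bins = 10): number {
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = turns.filter(
      (t) => t.confidence >= lo &&
        (t.confidence < hi || (b === bins - 1 && t.confidence === 1))
    );
    if (inBin.length === 0) continue;
    const meanConf = inBin.reduce((a, t) => a + t.confidence, 0) / inBin.length;
    const acc = inBin.filter((t) => t.correct).length / inBin.length;
    ece += (inBin.length / turns.length) * Math.abs(meanConf - acc);
  }
  return ece;
}
```

If the bucket around 0.7 shows 40% accuracy, the "transfer below 0.7" rule is transferring the wrong calls.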

**3. Interruption recovery.** Barge-in success is necessary but not sufficient. The harder question is: after the user interrupted you, did you correctly figure out what they wanted, or did you just stop talking and stand there? We measure this by labeling a sample of post-interruption turns as "recovered correctly" or not. Our number sits around 0.93; six months ago it was 0.78.

## Instrumentation — Where the Metrics Live

We hold a dual store: hot metrics in Datadog (TTFA, p95, barge-in success — anything that needs alerting), and cold metrics in a Postgres analytics schema joined against the LangSmith trace ID so any spike can be drilled to the underlying session. The eval replay runner from our [companion realtime build piece](/blog/openai-realtime-voice-agents-eval-pipeline-2026) writes its outputs into the same schema, so you can chart "WER on PR branch vs. WER on main over the last 30 days" with a SQL query.

If you build only one dashboard, build the per-layer breakdown for the last 24 hours, with each layer's score, the trend arrow against the prior 7-day average, and a click-through to the top 10 worst-scoring sessions per layer. That single view replaces a dozen Slack alerts.

## How These Metrics Show Up On Our Demo

Every metric in this post is exercised on our [interactive voice demo](/demo) — try interrupting the agent mid-sentence, ask for a price the system should not know, mumble a date. The replay pipeline grades that session offline overnight and the result lands in the same dashboard our engineers ship against. The fact that all six [vertical industries](/industries) (healthcare, real estate, sales, salon, IT helpdesk, after-hours) share the same metric model is what lets us roll a model upgrade across all of them with one decision.

## Frequently Asked Questions

### Which metric is the single best predictor of retention?

In our data, **containment rate** correlates most strongly with renewals (0.71). Among technical metrics, **TTFA p95** is the strongest predictor of within-call abandonment (correlation -0.62 with completion). Counterintuitively, correctness scores correlate less strongly with retention than latency does — users will forgive a wrong answer faster than they will forgive a slow one, as long as the agent recovers gracefully.

### How do I weight the metrics into a single score?

Don't, except for narrow purposes. A composite "quality score" hides exactly the layered information that makes the metric model useful. We do publish a single number for executive dashboards — a weighted average where containment is 40%, correctness 25%, latency 20%, CSAT proxy 15% — but engineers debug against the per-layer breakdown.
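For the executive number, the weighting is a fixed dot product over the four normalized layer scores; a sketch using the weights above (note the latency input is a 0–1 score, where 1 means fully within budget, not milliseconds):

```typescript
// Executive composite: containment 40%, correctness 25%, latency 20%,
// CSAT proxy 15%. All inputs normalized to [0, 1].
export function compositeQuality(scores: {
  containment: number;
  correctness: number;
  latency: number;
  csatProxy: number;
}): number {
  return (
    0.40 * scores.containment +
    0.25 * scores.correctness +
    0.20 * scores.latency +
    0.15 * scores.csatProxy
  );
}
```

Engineers should never debug against this number — it exists so the dashboard has one line to trend.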

### How often should I re-label the reference dataset?

Quarterly for slow-moving domains (insurance, healthcare scheduling), monthly for fast-moving ones (sales scripts, promotions). The signal that you're overdue: judge-vs-human agreement drops, or eval scores look great while user CSAT drifts down.

### Should I measure prosody at all on a unified audio model?

Yes, especially after voice changes or model upgrades. The unified models occasionally regress on prosody for specific syntactic patterns (long appositive phrases, nested questions) without affecting any other metric. A 5-minute prosody check is cheap insurance.

### What's the smallest viable metric set for a new deployment?

WER, correctness, TTFA p95, barge-in success, and containment. Five metrics, five instrumentation points, covers ~80% of the failure modes you'll see in the first three months. Add the rest as you scale past 10k sessions/month.
