Voice Agent Quality Metrics in 2026: WER, Latency, Grounding, and the Ones Most Teams Miss
The full metric set for evaluating production voice agents — STT word error rate, end-to-end latency budgets, RAG grounding, prosody, and the metrics that actually correlate with retention.
## TL;DR
Most voice agent teams measure two metrics — accuracy and latency — and call it done. That gets you to a working demo. It does not get you to a system that survives 50,000 calls a week without a clinic, a real estate office, or a help-desk customer noticing that something is off. The metric set that actually predicts retention is layered: STT, NLU, agent reasoning, TTS, and system, each with its own evaluators, each tied to user-facing outcomes like containment and CSAT. This post is the full metric model we run on CallSphere, with the formulas, instrumentation snippets, the comparison table that tells you which metric catches which class of bug, and the three metrics most teams skip until they get burned.
Why "Accuracy" Is Not a Metric
Senior engineers at voice-agent startups still tell me their bot is "92% accurate." On what task? Measured how? Against which dataset? On audio recorded in which acoustic conditions? Single-number quality claims are the surest sign a team has not yet hit production scale. Once you have, you learn quickly that voice agents fail in five distinct layers and a single number cannot catch any of them well.
The layered model:
- STT layer — did we hear the user correctly?
- NLU layer — did we understand the intent?
- Agent reasoning layer — did we choose the right action and produce a correct, grounded response?
- TTS layer — did the response sound natural and intelligible?
- System layer — did the whole stack respond fast enough and recover from interruptions?
On top of those five, user-facing metrics (containment, transfer rate, CSAT) tie the technical layers to the business outcome. A bug in any layer can kill the user-facing metric, which is why isolating the layer is the entire point.
## The Layered Metric Pipeline
```mermaid
flowchart TD
  A[Recorded session] --> S[STT layer]
  A --> SYS[System spans]
  S -->|transcript| N[NLU layer]
  N -->|intent + slots| R[Agent reasoning]
  R -->|response text| T[TTS layer]
  S --> M1[WER, CER]
  N --> M2[Intent F1, slot accuracy]
  R --> M3[Correctness, faithfulness, tool acc]
  T --> M4[MOS proxy, prosody, intelligibility]
  SYS --> M5[p50/p95, TTFA, barge-in success]
  M1 & M2 & M3 & M4 & M5 --> AGG[Per-call quality score]
  AGG --> UF[Containment / Transfer / CSAT proxy]
  style AGG fill:#ffd
  style UF fill:#cfc
```
Figure 1 — Each technical layer produces its own metrics; the user-facing layer is what the business cares about. The pipeline is what lets you blame the right layer when CSAT drops.
## Layer 1 — STT Metrics
The two foundational metrics are Word Error Rate and Character Error Rate:
WER = (S + D + I) / N
Where S, D, I are substitutions, deletions, insertions against a human reference transcript and N is the reference word count. CER is the same formula at the character level — useful when you care about names, addresses, or alphanumeric strings (insurance IDs, license plates) where one misheard letter changes meaning.
We track both because they catch different bugs:
- WER spikes point to acoustic mismatch (new microphone class, new accent group in the user pool, codec change in the SIP bridge).
- CER spikes with steady WER point to entity-level errors — the model is hearing words right but mangling proper nouns. This is where domain-specific spelling biasing pays for itself.
A working WER implementation in TypeScript:
```ts
export function wordErrorRate(reference: string, hypothesis: string): number {
  // Tokenize into lowercase words; filter(Boolean) guards against empty strings
  // and leading/trailing whitespace producing phantom tokens.
  const r = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const h = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  // dp[i][j] = edit distance between the first i reference words
  // and the first j hypothesis words.
  const dp: number[][] = Array.from({ length: r.length + 1 }, () =>
    Array(h.length + 1).fill(0)
  );
  for (let i = 0; i <= r.length; i++) dp[i][0] = i; // all deletions
  for (let j = 0; j <= h.length; j++) dp[0][j] = j; // all insertions
  for (let i = 1; i <= r.length; i++) {
    for (let j = 1; j <= h.length; j++) {
      const cost = r[i - 1] === h[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,       // deletion
        dp[i][j - 1] + 1,       // insertion
        dp[i - 1][j - 1] + cost // match or substitution
      );
    }
  }
  return dp[r.length][h.length] / Math.max(r.length, 1);
}
```
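CER can reuse the same dynamic program by treating every character as a token. A minimal sketch built on the function above; stripping whitespace first (so spacing differences are not counted as errors) is an assumption, adjust it if spacing matters for your entities:

```ts
export function charErrorRate(reference: string, hypothesis: string): number {
  // Spell each string out as space-separated characters and reuse the word-level DP.
  const spellOut = (s: string) => s.replace(/\s+/g, "").split("").join(" ");
  return wordErrorRate(spellOut(reference), spellOut(hypothesis));
}
```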
Our production target on the unified `gpt-realtime-2025-08-28` model: WER < 0.06 on English mixed-accent calls, < 0.09 on the bilingual Spanish/English subset.
## Layer 2 — NLU Metrics
If you're using a unified realtime model, NLU is implicit — there's no separate intent classifier to score. But for any agent that routes to specialists or branches behavior on detected intent, you need:
- **Intent F1** — precision and recall on a held-out labeled set, computed per intent class. Macro-averaged is the metric to publish; micro-averaged hides minority-class failures.
- **Slot accuracy** — for slot-filling agents (date, party size, insurance ID), the percentage of slots correctly extracted.
- **Confidence calibration** — does the model's stated confidence track empirical accuracy? We bucket predictions by confidence and compute reliability curves quarterly.
A miscalibrated confidence score is more dangerous than low accuracy, because it breaks every downstream "should we transfer to a human" rule.
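The reliability curve behind that rule is cheap to compute offline. A minimal sketch, assuming each routed turn is logged with the model's stated confidence and a human-labeled correctness flag; the `Prediction` shape here is illustrative, not a CallSphere schema:

```ts
interface Prediction {
  confidence: number; // model-stated confidence in [0, 1]
  correct: boolean;   // did the predicted intent match the human label?
}

// Bucket predictions by stated confidence and compare against empirical accuracy.
// A well-calibrated model has each bucket's accuracy close to its midpoint.
export function reliabilityCurve(preds: Prediction[], buckets = 10) {
  return Array.from({ length: buckets }, (_, b) => {
    const lo = b / buckets;
    const hi = (b + 1) / buckets;
    const last = b === buckets - 1;
    const inBucket = preds.filter(
      (p) => p.confidence >= lo && (p.confidence < hi || (last && p.confidence <= hi))
    );
    const accuracy = inBucket.length
      ? inBucket.filter((p) => p.correct).length / inBucket.length
      : null; // empty bucket: no evidence either way
    return { range: [lo, hi], count: inBucket.length, accuracy };
  });
}
```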
## Layer 3 — Agent Reasoning Metrics
This is where most evaluation effort goes once the rest is stable. The metrics that matter:
- **Correctness** (LLM-as-judge against a reference answer) — did the agent give the right answer?
- **Faithfulness / Groundedness** — does every factual claim trace to a tool result or retrieved document? This is the metric that catches "the agent quoted a price that doesn't exist."
- **Tool-call correctness** — was the right tool called with the right arguments? Computed structurally against an expected call list.
- **Refusal appropriateness** — did the agent refuse when it should have, and not refuse when it shouldn't have?
A Python-side groundedness evaluator:
```python
import json

from openai import OpenAI

client = OpenAI()

# Literal braces in the JSON shape below are doubled so str.format leaves them intact.
GROUNDING_PROMPT = """You are auditing a voice agent transcript.
For each agent statement asserting a fact (price, date, policy, name),
return JSON: {{"claims": [{{"claim": str, "grounded": bool, "evidence": str}}]}}.
A claim is grounded ONLY if it is directly supported by the tool_results.

Agent transcript:
{transcript}

Tool results:
{tool_results}
"""

def grounding_score(transcript: str, tool_results: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": GROUNDING_PROMPT.format(
            transcript=transcript, tool_results=tool_results
        )}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    claims = json.loads(resp.choices[0].message.content)["claims"]
    # Fraction of asserted facts backed by a tool result.
    return sum(c["grounded"] for c in claims) / max(len(claims), 1)
```
Pin the judge model with a date stamp. Calibrate quarterly against a 50-row human-labeled subset. If judge-human agreement drops below 0.85, retrain or swap the judge.
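Tool-call correctness, unlike the two judge-based metrics, is a structural check. A minimal sketch, assuming each call is logged as a tool name plus a flat arguments object; the comparison here is order-insensitive across calls and exact on argument values, which you would loosen if your tools accept free-text arguments:

```ts
interface ToolCall {
  name: string;
  args: Record<string, string | number | boolean>;
}

// Every expected call must appear in the actual list with identical name and
// arguments. Extra actual calls are ignored here; count them separately if
// spurious calls matter for your agent.
export function toolCallAccuracy(expected: ToolCall[], actual: ToolCall[]): number {
  if (expected.length === 0) return 1;
  const matched = expected.filter((e) =>
    actual.some(
      (a) =>
        a.name === e.name &&
        Object.keys(a.args).length === Object.keys(e.args).length &&
        Object.entries(e.args).every(([k, v]) => a.args[k] === v)
    )
  ).length;
  return matched / expected.length;
}
```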
## Layer 4 — TTS Metrics
Underrated and under-instrumented at most teams. The metrics:
- **MOS proxy** — Mean Opinion Score is traditionally collected from human raters; a neural MOS predictor (e.g., a small wav2vec2-based model fine-tuned on labeled MOS data) gives you a continuous proxy on every utterance for ~$0.0003 each.
- **Prosody score** — does the pitch contour match the punctuation? Questions should rise; statements should fall. We use a prosody-aware classifier scored 0–1.
- **Intelligibility** — round-trip the synthesized audio through a separate STT and compute WER against the intended text. If the TTS pronounces "fifteen" as "fifty," round-trip WER catches it.
- **Phoneme stress accuracy** — for names and brand terms, we maintain a pronunciation lexicon and score adherence.
Round-trip intelligibility is the cheapest and most useful of these. Every release should run it on a fixed phrase list:
```ts
const targetPhrase = "Your appointment is at 3:15 PM on Tuesday, May 6th.";
const synthesized = await tts(targetPhrase);
const recovered = await stt(synthesized);
const intelligibility = 1 - wordErrorRate(targetPhrase, recovered);
// Target: > 0.97 on standard phrasebook
```
## Layer 5 — System Metrics
The plumbing layer. These are the metrics that predict abandonment more strongly than any reasoning metric:
| Metric | Definition | Our budget |
|---|---|---|
| Time-to-first-audio (TTFA) | Time from end-of-user-speech to first audio frame from agent | p50 ≤ 500 ms, p95 ≤ 800 ms |
| End-to-end latency | TTFA + response duration | p95 ≤ 3 s for short turns |
| Barge-in success rate | % of user interruptions where agent stops within 200 ms | ≥ 0.97 |
| Interruption recovery | % of post-barge-in turns where agent resumes the right task | ≥ 0.93 |
| Connection stability | Drops per 1000 sessions | < 4 |
| Tool latency p95 | Per-tool latency | varies, < 800 ms |
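The latency rows reduce to a timestamp diff plus a percentile. A minimal sketch of TTFA at an arbitrary percentile, assuming each turn span records when the user stopped speaking and when the first agent audio frame went out; the `TurnSpan` shape is illustrative:

```ts
interface TurnSpan {
  userSpeechEndMs: number;   // timestamp when end-of-user-speech was detected
  firstAudioFrameMs: number; // timestamp when the first synthesized frame was sent
}

export function ttfaPercentile(spans: TurnSpan[], pct: number): number {
  const ttfas = spans
    .map((s) => s.firstAudioFrameMs - s.userSpeechEndMs)
    .filter((ms) => ms >= 0) // drop clock-skewed or malformed spans
    .sort((a, b) => a - b);
  if (ttfas.length === 0) return NaN;
  // Nearest-rank percentile: no interpolation, sane on small samples.
  const idx = Math.min(ttfas.length - 1, Math.max(0, Math.ceil((pct / 100) * ttfas.length) - 1));
  return ttfas[idx];
}

// Budget check: ttfaPercentile(spans, 95) <= 800
```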
Barge-in success rate is the metric every team forgets and every user notices. We instrument it by sampling every 50th call and running an audio overlap detector offline.
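The offline check itself is small. A sketch of the overlap logic under the assumption that a VAD pass has already produced agent-audio and user-speech intervals for each sampled call; the interval shapes are illustrative, and the 200 ms budget mirrors the table above:

```ts
interface Interval { startMs: number; endMs: number; }

export function bargeInSuccess(
  agentAudio: Interval[],
  userSpeech: Interval[],
  stopBudgetMs = 200
): { interruptions: number; successes: number } {
  let interruptions = 0;
  let successes = 0;
  for (const u of userSpeech) {
    // An interruption is user speech starting while agent audio is still playing.
    const overlapping = agentAudio.find((a) => u.startMs > a.startMs && u.startMs < a.endMs);
    if (!overlapping) continue;
    interruptions++;
    // Success: the agent's audio stopped within the budget after the user started talking.
    if (overlapping.endMs - u.startMs <= stopBudgetMs) successes++;
  }
  return { interruptions, successes };
}
```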
## The User-Facing Layer
The technical layers exist to serve the user-facing layer. Three metrics that actually correlate with revenue:
- **Containment rate** — % of calls fully resolved by the agent without human transfer. For our [healthcare and after-hours](/industries) deployments, baseline is 68%; mature deployments hit 84%.
- **Transfer rate by reason** — when transfers happen, *why*. "Out of scope" is fine. "User frustration" is a quality alarm.
- **CSAT proxy** — we run a 5-second post-call survey on a sampled subset, plus a sentiment classifier on the full transcript corpus. The classifier-derived proxy correlates 0.81 with the survey CSAT, which is good enough to use as a continuous gauge.
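All three roll up from a per-session outcome label. A minimal sketch with an illustrative transfer-reason taxonomy; the real reason codes are deployment-specific:

```ts
type TransferReason = "out_of_scope" | "user_frustration" | "policy" | "error";
type SessionOutcome =
  | { kind: "contained" }
  | { kind: "transferred"; reason: TransferReason };

export function userFacingRollup(sessions: SessionOutcome[]) {
  const total = Math.max(sessions.length, 1);
  const contained = sessions.filter((s) => s.kind === "contained").length;
  const transfersByReason: Record<string, number> = {};
  for (const s of sessions) {
    if (s.kind === "transferred") {
      transfersByReason[s.reason] = (transfersByReason[s.reason] ?? 0) + 1;
    }
  }
  return { containmentRate: contained / total, transfersByReason };
}
```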
## The Comparison Table — What Each Metric Catches
This is the table I print and put on the wall:
| Metric | What it catches | How to measure | Cost per 1k turns |
|---|---|---|---|
| WER | Acoustic mismatch, accent regressions | Levenshtein vs. human transcript | ~$0 (compute only) |
| CER | Entity-level mishearings | Char-level Levenshtein | ~$0 |
| Intent F1 | Routing failures | Confusion matrix on labeled set | ~$0 |
| Correctness | Wrong answers | LLM-as-judge | ~$2.40 |
| Groundedness | Hallucinated facts | LLM-as-judge over tool results | ~$3.10 |
| Tool-call correct | Wrong action taken | Structural diff on call args | ~$0 |
| MOS proxy | TTS quality regressions | Neural MOS predictor | ~$0.30 |
| Round-trip intelligibility | Mispronounced numbers/names | TTS → STT → WER | ~$0.40 |
| Prosody score | Robotic delivery | Classifier over pitch contour | ~$0.10 |
| TTFA p95 | Latency creep | Span timing on response.created | ~$0 |
| Barge-in success | User talks-over | Audio overlap detector | ~$0.05 |
| Containment | Business value | Session outcome label | ~$0 |
| CSAT proxy | User satisfaction | Sentiment classifier + survey | ~$0.20 |
The ones at the bottom — barge-in, prosody, intelligibility — are the most-skipped and the highest-leverage. Most teams add them only after a customer complaint forces it.
## Three Metrics Most Teams Skip Until They Get Burned
**1. Round-trip intelligibility.** I cannot count the number of voice agents I've heard say "your balance is fifty dollars" when the truth was "fifteen." A 30-line script catches every case.
**2. Confidence calibration.** If your routing rule says "transfer to human if confidence < 0.7" but your confidence scores are uncalibrated, the rule is noise. Run a reliability diagram quarterly.
**3. Interruption recovery.** Barge-in success is necessary but not sufficient. The harder question is: after the user interrupted you, did you correctly figure out what they wanted, or did you just stop talking and stand there? We measure this by labeling a sample of post-interruption turns as "recovered correctly" or not. Our number sits around 0.93; six months ago it was 0.78.
## Instrumentation — Where the Metrics Live
We hold a dual store: hot metrics in Datadog (TTFA, p95, barge-in success — anything that needs alerting), and cold metrics in a Postgres analytics schema joined against the LangSmith trace ID so any spike can be drilled to the underlying session. The eval replay runner from our [companion realtime build piece](/blog/openai-realtime-voice-agents-eval-pipeline-2026) writes its outputs into the same schema, so you can chart "WER on PR branch vs. WER on main over the last 30 days" with a SQL query.
If you build only one dashboard, build the per-layer breakdown for the last 24 hours, with each layer's score, the trend arrow against the prior 7-day average, and a click-through to the top 10 worst-scoring sessions per layer. That single view replaces a dozen Slack alerts.
## How These Metrics Show Up On Our Demo
Every metric in this post is exercised on our [interactive voice demo](/demo) — try interrupting the agent mid-sentence, ask for a price the system should not know, mumble a date. The replay pipeline grades that session offline overnight and the result lands in the same dashboard our engineers ship against. The fact that all six [vertical industries](/industries) (healthcare, real estate, sales, salon, IT helpdesk, after-hours) share the same metric model is what lets us roll a model upgrade across all of them with one decision.
## Frequently Asked Questions
### Which metric is the single best predictor of retention?
In our data, **containment rate** correlates most strongly with renewals (0.71). Among technical metrics, **TTFA p95** is the strongest predictor of within-call abandonment (correlation -0.62 with completion). Counterintuitively, correctness scores correlate less strongly with retention than latency does — users will forgive a wrong answer faster than they will forgive a slow one, as long as the agent recovers gracefully.
### How do I weight the metrics into a single score?
Don't, except for narrow purposes. A composite "quality score" hides exactly the layered information that makes the metric model useful. We do publish a single number for executive dashboards — a weighted average where containment is 40%, correctness 25%, latency 20%, CSAT proxy 15% — but engineers debug against the per-layer breakdown.
### How often should I re-label the reference dataset?
Quarterly for slow-moving domains (insurance, healthcare scheduling), monthly for fast-moving ones (sales scripts, promotions). The signal that you're overdue: judge-vs-human agreement drops, or eval scores look great while user CSAT drifts down.
### Should I measure prosody at all on a unified audio model?
Yes, especially after voice changes or model upgrades. The unified models occasionally regress on prosody for specific syntactic patterns (long appositive phrases, nested questions) without affecting any other metric. A 5-minute prosody check is cheap insurance.
### What's the smallest viable metric set for a new deployment?
WER, correctness, TTFA p95, barge-in success, and containment. Five metrics, five instrumentation points, covers ~80% of the failure modes you'll see in the first three months. Add the rest as you scale past 10k sessions/month.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.