ASR Accuracy Monitoring (WER) for Voice AI in Production - 2026
Vendors quote 5% WER on benchmarks. In production, the same model hits 15-20% on noisy mobile calls and up to 50% in healthcare. Here is the live monitoring stack that catches drift before users notice.
A vendor's 5% WER benchmark comes from clean dictation in a quiet room. Your AI voice agent runs on cell-phone calls in salons with a hairdryer running, in hospitals with overhead intercoms, in cars at 70 mph. The same model that scored 5% on benchmarks scores 15-20% on real production traffic. Without continuous WER monitoring, you will not know your accuracy has slipped until your conversion rate drops.
What goes wrong
Vendors publish single WER numbers from controlled audio. Real telephony adds 8 kHz narrowband sampling, codec compression, background noise, accented speakers, and overlapping turns. The same provider can deliver 92% accuracy on headsets, 78% in conference rooms, and 65% on noisy mobile (Deepgram's own data). Worse, WER drifts: a model update by your STT vendor can move WER by 3-5 points overnight, without notice.
The second trap is using WER alone. WER counts word-level edits. "Schedule for tomorrow at 3" -> "Schedule for tomorrow at 2" is one substitution but a complete failure. Semantic Error Rate (SER) and intent-correctness are better signals for AI voice.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
How to detect
Sample 1-3% of calls per tenant. Run a high-quality reference model (Whisper-large-v3 or a paid commercial reference) on the recording and compare against your production STT output. Compute WER, character error rate, intent-match rate, and SER. Track P50/P95 per tenant per agent per day. Alert when 7-day rolling P50 WER moves >2 points or P95 moves >5 points.
```mermaid
flowchart TD
    A[Production STT - per turn] --> B[Persist transcript + audio clip]
    B --> C{Sample 1-3%?}
    C -->|Yes| D[Run reference model - Whisper-large]
    D --> E[Diff: WER, CER, SER, intent match]
    E --> F[Persist accuracy_samples]
    F --> G[Daily WER P50/P95 dashboard]
    G --> H{Drift > 2pt?}
    H -->|Yes| I[Alert - vendor regression]
```
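The drift check at the end of the pipeline can be sketched in a few lines. This is an illustrative implementation, not CallSphere's actual code: `drift_alert` and its nearest-rank percentile are assumptions, and the 2pt/5pt thresholds mirror the ones above.

```python
def percentile(values, p):
    """Nearest-rank percentile of a list of WER samples (0-100 scale)."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def drift_alert(daily_wer, window=7, p50_limit=2.0, p95_limit=5.0):
    """Compare the current rolling window's P50/P95 against the prior window.

    daily_wer: per-call WER samples grouped by day, newest last,
    e.g. [[5.1, 6.0, ...], [4.9, ...], ...]. Returns a list of alert strings.
    """
    if len(daily_wer) < 2 * window:
        return []  # not enough history for a baseline
    current = [w for day in daily_wer[-window:] for w in day]
    baseline = [w for day in daily_wer[-2 * window:-window] for w in day]
    alerts = []
    if percentile(current, 50) - percentile(baseline, 50) > p50_limit:
        alerts.append("P50 WER drift > 2pt: possible vendor regression")
    if percentile(current, 95) - percentile(baseline, 95) > p95_limit:
        alerts.append("P95 WER drift > 5pt: tail degradation")
    return alerts
```

Comparing a window against the previous window, rather than against a fixed baseline, keeps the alert robust to slow seasonal shifts while still catching an overnight vendor regression.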
CallSphere implementation
CallSphere monitors STT accuracy live across all six verticals. Each of our 37 agents calls one of 90+ tools, and every transcript is persisted with confidence scores into 115+ DB tables. We sample 2% of calls per tenant per day, run Whisper-large-v3 as our reference, and store WER + SER + intent-match for trend analysis. Healthcare tenants on /industries/healthcare get 5% sampling for compliance. Twilio handles the carrier audio; Deepgram and OpenAI handle production STT; we monitor both. Starter ($149/mo) gets vendor WER alerts; Growth ($499/mo) and Scale ($1499/mo) get per-agent SER and intent-match. 14-day trial. Affiliates 22%.
Build steps
- Persist every STT turn with audio_url, vendor_transcript, vendor_confidence, agent_id, tenant_id.
- Build a sampler that pulls 2% per (tenant, agent, day) and queues to a reference STT job.
- Run Whisper-large-v3 (or a commercial low-WER reference) on the audio clips.
- Compute WER (Levenshtein on words), CER, and SER (semantic similarity via embeddings).
- Persist to accuracy_samples and roll up to daily aggregates.
- Build a Grafana dashboard: vendor WER vs reference WER per agent per day with 7-day rolling band.
- Alert when rolling P50 moves >2pt or P95 moves >5pt.
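The WER and CER computations in the steps above reduce to a Levenshtein distance over words or characters. A minimal sketch; `levenshtein`, `wer`, and `cer` are illustrative names, and production code would also normalize punctuation and numerals before diffing.

```python
def levenshtein(a, b):
    """Edit distance (insert/delete/substitute) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count, in percent."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)

def cer(reference, hypothesis):
    """Character error rate over the raw strings."""
    return 100.0 * levenshtein(reference, hypothesis) / max(len(reference), 1)
```

Because the same distance function runs on word lists and on raw strings, WER and CER share one implementation; a library like jiwer adds the normalization transforms you would want in production.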
FAQ
Why not just trust vendor confidence scores? Vendor confidence is well-calibrated for the vendor's own model but useless for cross-vendor comparison. Run a reference and compute true WER.
What sampling rate is enough? Deepgram recommends 30 minutes to 3 hours of audio for a stable WER baseline. 1-3% per tenant per day usually clears that threshold for active tenants.
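Whether a given tenant clears that 30-minute floor is simple arithmetic. A sketch under assumptions (the function names and the default floor are illustrative, with the floor taken from the guidance above):

```python
def sampled_minutes(daily_calls, avg_call_minutes, sample_rate):
    """Minutes of audio the sampler sends to the reference model per day."""
    return daily_calls * avg_call_minutes * sample_rate

def sampling_sufficient(daily_calls, avg_call_minutes, sample_rate,
                        floor_minutes=30):
    """True when the daily sample clears the minimum-baseline floor."""
    return sampled_minutes(daily_calls, avg_call_minutes, sample_rate) >= floor_minutes
```

A tenant doing 500 four-minute calls a day clears the floor at 2% sampling (40 minutes); a tenant doing 100 three-minute calls at 1% does not (3 minutes), so low-volume tenants need a higher rate or a multi-day window.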
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
What is good production WER? Below 10% is excellent, 10-15% is acceptable, 15-20% is degraded, above 20% needs intervention. Healthcare conversational WER often runs 25-50% even with strong models.
Should I switch vendors when WER spikes? Not always. Spikes often correlate to a code change in your turn-taking logic or audio pre-processing. Investigate before swapping.
How does SER differ from WER? WER counts edits at word level. SER counts whether meaning is preserved. "3 PM" -> "3 AM" is one word edit but a full failure. Embed both transcripts and compare cosine similarity.
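As a sketch of that SER check, the bag-of-words cosine below stands in for real sentence embeddings: it flags the "3 PM" vs "3 AM" case, but it would also penalize harmless paraphrases, so in production you would swap the Counter vectors for an embedding model. The names and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def semantic_error(reference, hypothesis, threshold=0.8):
    """Flag a turn as a semantic error when similarity drops below threshold.

    Bag-of-words stand-in for a sentence encoder; replace the Counter
    vectors with embeddings to score paraphrases correctly.
    """
    u = Counter(reference.lower().split())
    v = Counter(hypothesis.lower().split())
    return cosine(u, v) < threshold
```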
Sources
- Deepgram - Speech Recognition Accuracy: Production Metrics
- Deepgram - Semantic Error Rate
- Google Cloud - Measure Speech Accuracy
- Speechmatics - Accuracy Benchmarking
Start a 14-day trial with WER monitoring, see pricing for SER on Growth, or book a demo. Healthcare on /industries/healthcare gets 5% sampling; partners earn 22% via the affiliate program.
## The production view

WER monitoring forces a tension most teams underestimate: agent handoff state. A single LLM call is easy. A booking agent that hands a confirmed slot to a billing agent, which hands a follow-up to an escalation agent: that is where context loss, hallucinated IDs, and double-bookings live. Solving it well means treating the conversation as a stateful workflow, not a chat.

## Shipping the agent to production

Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs **37 agents** across 6 verticals, each with its own eval suite: synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop.

Structured tools beat free-form text every time. Our **90+ function tools** all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine (booking → confirmation → SMS) so context survives turn boundaries.

The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in **115+ database tables** spanning all 6 verticals.

## FAQ

**What's the right way to scope the proof-of-concept?** Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Days two through five are shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass rate clears your internal bar. For WER monitoring, that means you're not starting from scratch: you're configuring an agent template that's already been hardened across thousands of conversations.

**How do you handle compliance and data isolation?** Real Estate, for example, runs as a 6-container pod (frontend, gateway, ai-worker, voice-server, NATS event bus, Redis) backed by Postgres `realestate_voice` with row-level security so multi-tenant data never crosses tenants.

**When does it make sense to switch from a managed model to a self-hosted one?** The honest answer: a managed platform scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest (observability, retries, multi-region routing) without your team owning the GPU layer.

## Talk to us

Want to see how this maps to your stack? Book a live walkthrough at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting), or try the vertical-specific demo at [salon.callsphere.tech](https://salon.callsphere.tech). 14-day trial, no credit card, pilot live in 3-5 business days.

Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.