
Domain Adaptation for AI Voice Agents (Vocabulary, ASR, TTS) in 2026

Mispronouncing 'metformin' destroys caller trust in 30 seconds. Domain adaptation drops Word Error Rate 2–30 points in healthcare and legal. We cover ASR vocabulary biasing, TTS pronunciation lexicons, and acoustic LoRA for voice agents.

TL;DR — A general ASR mispronounces "metformin" 12% of the time and "amortization" 18% of the time. Domain adaptation — vocabulary biasing on the ASR + a pronunciation lexicon on the TTS + (optionally) acoustic LoRA — drops Word Error Rate 2–30 points in regulated verticals. It's the difference between a real voice agent and a science demo.

What it does

Three layers need adaptation:

  1. ASR vocabulary biasing — at recognition time, boost the prior on a list of domain terms (drug names, plan names, internal codes).
  2. TTS pronunciation lexicon — explicit <phoneme> overrides for brand names and rare words (see the SSML sketch after this list).
  3. Acoustic LoRA (optional) — fine-tune the ASR or TTS encoder on domain audio if the dialect or recording channel is unusual (call-center mu-law audio, accented English, etc.).
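
A minimal sketch of layer 2, assuming an SSML-capable TTS (Amazon Polly, Azure, and others accept <phoneme>); the IPA string and the tts_client call are illustrative, not any vendor's actual API:

# SSML pronunciation override; the IPA here is illustrative, and
# tts_client is a stand-in for whatever TTS SDK you actually use.
ssml = """
<speak>
  We carry <phoneme alphabet="ipa" ph="oʊˈlæplɛks">Olaplex</phoneme> treatments.
</speak>
"""
tts_client.synthesize(ssml)  # hypothetical call; substitute your vendor's SDK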

How it works

flowchart TD
  AUDIO[Caller audio] --> ASR[ASR with biasing]
  VOCAB[Domain phrase list] --> ASR
  ASR --> TEXT[Transcript]
  TEXT --> LLM[Agent LLM]
  LLM --> TTS_TEXT[Agent reply]
  TTS_TEXT --> TTS[TTS with lexicon]
  LEX[Pronunciation lexicon] --> TTS
  TTS --> SPEECH[Audio out]

CallSphere implementation

CallSphere runs 6 verticals · 37 agents under hard latency budgets (<800 ms TTFT). Here is the domain adaptation we run today:

  • Healthcare — vocabulary biasing on a 1,200-term list (drug names, ICD-10 phrases, plan names). WER on drug names dropped from 14% to 3%. Post-call analytics still runs on GPT-4o-mini, but the ASR now feeds it cleaner text.
  • Behavioral Health — TTS lexicon for crisis hotline names ("988", "SAMHSA") + de-escalation phrasing aliases. Latency-sensitive: we use vocabulary biasing only, no acoustic LoRA.
  • Salon / Dental — TTS lexicon for coined brand names ("Olaplex", "Invisalign") and stylist first names per location.
  • OneRoof real-estate (OpenAI Agents SDK) — vocabulary biasing on neighborhood + street name lists per market.
  • Behavioral Health PHI redactor — open-source LoRA on Whisper-large-v3 for telephony-band audio (8 kHz, mu-law).

Across 90+ tools and 115+ DB tables, the vocabulary list is generated from the database at deploy time — 90+ tools mean 90+ schema names that need to be pronounceable.
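
A hedged sketch of that deploy-time step; the table names and sqlite3 source are illustrative, not CallSphere's actual schema:

import sqlite3

def build_phrase_list(db_path: str, weight: int = 3) -> list[str]:
    # Pull tenant-local names (providers, plans) into one phrase list.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name FROM providers UNION SELECT name FROM plans"
    ).fetchall()
    conn.close()
    # term:weight is the Deepgram keywords format shown in the next section
    return [f"{name}:{weight}" for (name,) in rows]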

Build steps with code

# Deepgram domain biasing (Python SDK v3)
# Note: the weighted keywords parameter applies to nova-2 and earlier;
# nova-3 replaces it with unweighted keyterm prompting (keyterm=[...]).
from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient()  # reads DEEPGRAM_API_KEY from the environment
options = PrerecordedOptions(
    model="nova-3",
    language="en-US",
    keywords=["metformin:5", "keytruda:5", "prior-auth:3"],  # weight per term
)
response = deepgram.listen.rest.v("1").transcribe_file({"buffer": audio}, options)

# OpenAI Realtime: pronunciation steering via session instructions
import json

event = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "whisper-1"},
        "instructions": "Pronounce 'Olaplex' as oh-LAP-lex.",
    },
}
ws.send(json.dumps(event))  # ws: an open Realtime API WebSocket connection

# Acoustic LoRA fine-tune of Whisper for telephony audio
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
peft_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, peft_config)
# train on (mu-law-resampled-audio, transcript) pairs
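
The training loop itself can stay minimal. A sketch under assumptions: the dataloader is hypothetical and yields Whisper log-mel input_features plus tokenized labels, and only the LoRA adapters receive gradients after get_peft_model:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:  # hypothetical loader of (audio features, transcript) pairs
    out = model(input_features=batch["input_features"], labels=batch["labels"])
    out.loss.backward()   # gradients flow only to the LoRA adapters
    optimizer.step()
    optimizer.zero_grad()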

Pitfalls

  • Over-biasing — too-high keyword weights cause the ASR to hallucinate domain terms in clean speech.
  • Lexicon drift — TTS lexicons rot; refresh quarterly or whenever a new product launches.
  • Forgetting per-region lists — neighborhood names, agent first names, and provider names are local. Generate per-tenant.
  • LoRA on the wrong layer — Whisper LoRA on attention QV is the safe default; targeting feed-forward breaks transcription.
  • Skipping noise-augmented training — call-center audio is noisy; train with reverb + telephony codec augmentation (see the sketch after this list) or your LoRA overfits clean studio audio.
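
A minimal sketch of that last augmentation, assuming torchaudio and 16 kHz mono tensors in [-1, 1]: band-limit to telephone bandwidth and round-trip through the mu-law codec (layer reverb and noise on top in practice):

import torchaudio.functional as F

def telephony_augment(wav, sr=16_000):
    wav = F.resample(wav, sr, 8_000)                           # telephone bandwidth
    codes = F.mu_law_encoding(wav, quantization_channels=256)  # 8-bit mu-law quantize
    wav = F.mu_law_decoding(codes, quantization_channels=256)  # and back: codec distortion
    return F.resample(wav, 8_000, sr)                          # restore the model's rate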

FAQ

Q: How big a vocabulary list is too big? Most ASRs handle 500–5,000 keywords cleanly. Past 5,000 you start eroding general performance.

Q: Vocabulary biasing vs acoustic LoRA? Biasing for terms; LoRA for accent/channel/dialect. They're orthogonal — combine them.


Q: Does this work with Whisper? Whisper supports prompt-based biasing (the initial_prompt parameter); acoustic LoRA needs Hugging Face Transformers + PEFT.
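
A quick sketch with the open-source openai-whisper package; initial_prompt is soft guidance toward domain spellings, not a hard constraint:

import whisper

model = whisper.load_model("large-v3")
result = model.transcribe(
    "call.wav",  # illustrative path
    initial_prompt="metformin, Keytruda, prior authorization, ICD-10",
)
print(result["text"])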

Q: How do I measure WER per vertical? Hold out 200 transcribed-and-corrected calls per vertical and compute WER weekly. Track per-term error rate too.
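
A sketch of the weekly computation using jiwer (one WER library among several; the strings are illustrative):

from jiwer import wer

refs = ["patient takes metformin twice daily"]    # human-corrected transcripts
hyps = ["patient takes met forming twice daily"]  # raw ASR output
print(f"WER: {wer(refs, hyps):.2%}")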

Q: When to skip acoustic LoRA? If your audio channel is clean and your accent distribution matches the base model's training data, biasing alone is enough.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.
