PCM16 24kHz vs Vapi Pipeline: Voice Quality Deep Dive
Why PCM16 24kHz is the right audio format for realtime voice AI. CallSphere's pipeline vs Vapi's opinionated stack — codecs, sample rate, intelligibility.
TL;DR
CallSphere standardizes on PCM16 mono at 24kHz end-to-end against the OpenAI Realtime API. That sample rate hits the sweet spot of intelligibility, model accuracy, and latency. Telephony often runs at 8kHz mu-law and most voice AI products upsample at the last hop — losing quality and shipping fewer phonemes to the model. Vapi.ai abstracts the audio pipeline behind its hosted stack, which is convenient but opaque: you cannot always inspect frame timing, sample rate, or codec at each hop. This post unpacks the audio pipeline at every stage and explains why 24kHz PCM16 produces better intent recognition, fewer hallucinations, and tighter turn-taking than the typical 8kHz path.
What PCM16 24kHz Actually Means
PCM16 is uncompressed linear pulse-code modulation with 16-bit signed samples. 24kHz means 24,000 samples per second per channel. Mono means one channel. At those parameters, one second of audio is exactly 48,000 bytes — large by compressed standards, but trivially streamable over modern network links.
Why these numbers? The Nyquist theorem says 24kHz sampling captures frequencies up to 12kHz. Human speech has critical perceptual content up to roughly 8kHz, with sibilants and sharp fricatives ("s", "f", "th") reaching higher. 24kHz captures every speech-relevant frequency cleanly, while 8kHz telephony cuts everything above 4kHz — which is why "fifteen" and "fifty" are confusable on a phone call.
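The arithmetic is easy to sanity-check in a few lines of plain JavaScript, using nothing beyond the numbers in the text:

```javascript
// PCM16 mono byte-rate and Nyquist arithmetic for the rates discussed above.
const BYTES_PER_SAMPLE = 2; // 16-bit signed integer
const CHANNELS = 1;         // mono

const bytesPerSecond = (sampleRate) => sampleRate * BYTES_PER_SAMPLE * CHANNELS;
const nyquist = (sampleRate) => sampleRate / 2; // highest capturable frequency

console.log(bytesPerSecond(24000)); // 48000 bytes/s, as stated above
console.log(nyquist(24000));        // 12000 Hz
console.log(nyquist(8000));         // 4000 Hz: why telephony loses sibilants
```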
Vapi's Audio Pipeline (What We Can Verify)
Vapi handles audio inside its hosted pipeline. Your bring-your-own STT and TTS providers each have a preferred sample rate: most STT providers (Deepgram, AssemblyAI) accept 16kHz, while most premium TTS (ElevenLabs, PlayHT) outputs at 22.05kHz or 44.1kHz. Vapi resamples between hops to normalize across them.
That resampling is invisible to you, the engineer. You don't see the audio frames. You see a "transcript" event and a "speech" event. When pronunciation goes wrong or a "1" gets transcribed as a "9," debugging is hard because you cannot inspect the raw frames the STT actually saw.
CallSphere's Pipeline at PCM16 24kHz
CallSphere standardizes on PCM16 24kHz from the moment the caller's audio reaches the gateway through the round-trip into OpenAI Realtime and back to the caller. There is no resampling at the model boundary, because the OpenAI Realtime API expects exactly that format. Fewer transformations mean fewer artifacts and lower latency.
The model in production is gpt-4o-realtime-preview-2025-06-03. It accepts PCM16 24kHz natively, performs server-side VAD on those frames, and emits PCM16 24kHz back. Analytics and post-call analysis run on gpt-4o-mini against the captured transcript.
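As a sketch of what pinning both directions to PCM16 looks like on the wire, here is the `session.update` payload a Realtime client would send right after connecting. The event shape follows the published Realtime API; treat this as illustrative, not CallSphere's actual code:

```javascript
// Build the session.update event that pins input and output to PCM16
// and enables server-side VAD, per the OpenAI Realtime API event shape.
function buildSessionUpdate() {
  return {
    type: 'session.update',
    session: {
      input_audio_format: 'pcm16',  // PCM16 24kHz mono, little-endian
      output_audio_format: 'pcm16',
      turn_detection: { type: 'server_vad' },
    },
  };
}

// A real client would send JSON.stringify(buildSessionUpdate()) over the
// Realtime WebSocket as soon as the connection opens.
```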
The end-to-end pipeline:
```mermaid
graph LR
    A[Caller / Phone or Browser] --> B[Twilio or WebRTC Gateway]
    B --> C[Frame Normalizer<br/>PCM16 24kHz mono]
    C --> D[OpenAI Realtime WS<br/>gpt-4o-realtime-preview-2025-06-03]
    D --> E[Server VAD + LLM]
    E --> F[Tool Calls / Function-Calling]
    F --> G[TTS Frames PCM16 24kHz]
    G --> B
    B --> A
    E -.transcript.-> H[gpt-4o-mini Analytics]
    H --> I[(PostgreSQL)]
```
Every arrow on the audio path is the same format. No quality loss between hops.
Sample Rate vs Intelligibility — The Numbers
| Sample Rate | Captures up to | Use Case | Perceived Quality |
|---|---|---|---|
| 8kHz mu-law | 4kHz | PSTN telephony | Phone-call clarity |
| 16kHz | 8kHz | Most STT providers | Acceptable for transcription |
| 22.05kHz | 11kHz | Older TTS, music demos | Good |
| 24kHz | 12kHz | OpenAI Realtime, modern voice AI | Excellent |
| 44.1kHz | 22kHz | Music CD quality | Overkill for speech |
| 48kHz | 24kHz | Studio recording | Overkill for speech |
24kHz is the right floor for voice AI because it captures the sibilant range that telephony 8kHz mangles. "Six" vs "fix," "fifteen" vs "fifty," "S" vs "F" all become reliably distinguishable. Function-calling tools that pass user-spoken numbers back to a backend benefit directly: fewer mistranscribed digits, fewer wrong appointments booked.
Why Bit Depth Matters Less Than You Think
PCM16 is 16-bit signed integer samples. The dynamic range is about 96dB, which comfortably exceeds the ~60dB of typical speech. You can technically downshift to 8-bit mu-law (G.711) for half the bandwidth, but mu-law's logarithmic companding introduces artifacts under aggressive AGC and noise suppression. PCM16 is the lingua franca of modern voice AI for a reason.
24-bit integer (or 32-bit float) is the studio standard, but the marginal accuracy gain at the model layer is inaudible, and the extra bandwidth buys nothing for speech.
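To make the companding concrete, here is a standard G.711 mu-law decoder in a few lines of JavaScript. The bias constant `0x84` and the sign/exponent/mantissa bit layout come from the G.711 spec; this is an illustrative reference implementation, not CallSphere's code:

```javascript
// Decode one G.711 mu-law byte to a 16-bit signed PCM sample.
function mulawToPcm16(byte) {
  const u = ~byte & 0xff;              // mu-law bytes are stored inverted
  const sign = u & 0x80;               // top bit: sign
  const exponent = (u >> 4) & 0x07;    // next 3 bits: segment
  const mantissa = u & 0x0f;           // low 4 bits: step within segment
  let magnitude = ((mantissa << 3) + 0x84) << exponent; // 0x84 = bias of 132
  magnitude -= 0x84;
  return sign ? -magnitude : magnitude;
}

// 0xff decodes to silence (0); 0x00 and 0x80 decode to the extremes (+/-32124),
// showing how step size grows logarithmically with amplitude.
```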
End-to-End Latency Budget at 24kHz
| Hop | Typical Time |
|---|---|
| Caller mic to gateway (WebRTC) | 30-80ms |
| Gateway to OpenAI Realtime (WS) | 20-60ms |
| Server VAD + first model token | 200-400ms |
| First TTS frame back to gateway | 100-300ms |
| Gateway to caller speaker | 30-80ms |
| Total target | <1000ms |
That under-1-second budget is what makes the conversation feel natural. PCM16 24kHz is part of why we can hit it — there is no resampling tax. Vapi can hit similar latencies but you have less visibility into where the budget is spent.
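The worst-case numbers in the table can be summed directly to confirm the budget holds. Hop names below are ours; the values are the upper bounds from the table:

```javascript
// Worst-case latency per hop, in milliseconds, from the table above.
const hops = {
  micToGateway: 80,       // caller mic to gateway (WebRTC)
  gatewayToRealtime: 60,  // gateway to OpenAI Realtime (WS)
  vadAndFirstToken: 400,  // server VAD + first model token
  firstTtsFrame: 300,     // first TTS frame back to gateway
  gatewayToSpeaker: 80,   // gateway to caller speaker
};

const worstCase = Object.values(hops).reduce((a, b) => a + b, 0);
console.log(worstCase); // 920ms: still under the 1000ms target
```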
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
Code Sketch: Frame Sizing
```javascript
const SAMPLE_RATE = 24000;
const FRAME_MS = 100;
const FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS / 1000; // 2400 samples per frame
const FRAME_BYTES = FRAME_SAMPLES * 2;               // 4800 bytes per frame (16-bit mono)
```
100ms frames are a common cadence for streaming into the OpenAI Realtime API, which accepts audio appends of arbitrary size. Bigger frames (200-500ms) lower per-frame overhead but raise floor latency. CallSphere's voice server defaults to 100ms because it lines up with the model's VAD cadence.
Common Quality Pitfalls and How CallSphere Avoids Them
Pitfall 1: Resample at every hop. Each resample is a small quality hit. Standardizing on 24kHz from gateway to model eliminates it.
Pitfall 2: Mismatched mono/stereo. OpenAI Realtime expects mono. Sending stereo causes silent failure or garbled output. CallSphere's frame normalizer enforces mono.
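A minimal downmix, assuming interleaved Int16 stereo input — illustrative of what a frame normalizer does, not CallSphere's production code:

```javascript
// Downmix interleaved stereo Int16 samples [L, R, L, R, ...] to mono
// by averaging the two channels per frame.
function stereoToMono(stereo /* Int16Array, interleaved */) {
  const mono = new Int16Array(stereo.length / 2);
  for (let i = 0; i < mono.length; i++) {
    // Sum fits in 32 bits; arithmetic shift averages and keeps the sign.
    mono[i] = (stereo[2 * i] + stereo[2 * i + 1]) >> 1;
  }
  return mono;
}
```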
Pitfall 3: AGC stacking. Caller's phone applies AGC, the gateway applies AGC, the noise-suppressor applies AGC. The result is pumping. CallSphere applies one AGC at the gateway and disables downstream.
Pitfall 4: Endianness drift. PCM16 is little-endian on the wire for OpenAI Realtime. Big-endian audio shows up as static. The frame normalizer enforces it.
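`DataView` makes byte order explicit, which is the safest way to serialize PCM16 for the wire. A sketch, not production code:

```javascript
// Serialize Int16 samples as little-endian bytes, the order the
// OpenAI Realtime API expects for PCM16 audio.
function toLittleEndianBytes(samples /* Int16Array */) {
  const buf = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buf);
  samples.forEach((s, i) => view.setInt16(i * 2, s, true)); // true = little-endian
  return new Uint8Array(buf);
}
```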
Where Vapi's Black Box Hurts
When a Vapi call goes wrong — wrong number transcribed, wrong appointment booked — you have a transcript. You don't have the audio frames the STT saw. You can't tell if the issue was upstream noise, codec resampling, or model error. CallSphere stores the raw PCM16 frames (gated on consent and retention policy) so engineers can replay any failed turn and pinpoint the layer that introduced the artifact.
FAQ
Why not 16kHz like most STT providers?
The OpenAI Realtime API expects 24kHz. Going 16kHz means upsampling at the model boundary, which introduces interpolation artifacts and adds latency. We standardize to the model's native rate.
Does PCM16 24kHz cost more bandwidth than Opus?
Yes — about 6x more. But PCM16 is on the inside of our pipeline (gateway to model), where bandwidth is cheap. The caller-facing leg uses Opus over WebRTC or compressed PSTN, so the user experience is unaffected.
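The ~6x figure falls out of simple arithmetic. The 64kbps Opus rate below is an assumed typical speech setting, not a measured number:

```javascript
// Bandwidth comparison: raw PCM16 24kHz mono vs a typical Opus voice bitrate.
const pcmKbps = (24000 * 16 * 1) / 1000; // sample rate * bit depth * channels = 384 kbps
const opusKbps = 64;                     // assumed common Opus setting for speech

console.log(pcmKbps);            // 384
console.log(pcmKbps / opusKbps); // 6: the ~6x figure above
```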
Can CallSphere accept other sample rates?
Yes. The frame normalizer at the gateway resamples 8kHz mu-law (PSTN) up to 24kHz with a high-quality polyphase filter. The boundary between caller and model is the only resample, and it happens once.
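For intuition, here is the 8kHz-to-24kHz rate conversion in its most naive form: linear interpolation at a fixed 3x factor. A real polyphase filter also low-pass filters to avoid imaging artifacts, which this sketch deliberately does not:

```javascript
// Naive 3x upsampler (8kHz -> 24kHz) using linear interpolation.
// Illustrates the rate conversion only; NOT the quality of a polyphase filter.
function upsampleLinear(input /* Int16Array @ 8kHz */, factor = 3) {
  const out = new Int16Array(input.length * factor);
  for (let i = 0; i < input.length; i++) {
    const a = input[i];
    const b = i + 1 < input.length ? input[i + 1] : a; // hold last sample at the edge
    for (let k = 0; k < factor; k++) {
      // Interpolate k/factor of the way from a to b.
      out[i * factor + k] = a + ((b - a) * k) / factor;
    }
  }
  return out;
}
```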
What about Opus vs PCM16 inside the data center?
Opus shines on lossy networks. Inside our service mesh, networks are not lossy, and PCM16 is decoder-free, which means lower CPU per frame and fewer artifacts.
Does this affect call recording quality?
Yes. CallSphere recordings are stored at 24kHz PCM16, which produces noticeably clearer playback and more accurate post-call analytics by the gpt-4o-mini analyzer.
Production Audit Tooling We Built Around the Pipeline
Standardizing on PCM16 24kHz unlocked a class of audit tooling that pays back the bandwidth cost. We built three internal tools on top of the consistent format. The first is a frame replayer — given any call ID, it pulls the stored PCM frames and replays them through a fresh agent session, showing exactly how the model would have responded with a different prompt. Engineers iterating on prompt quality use it before pushing changes to staging.
The second is a turn-by-turn diff viewer that aligns the caller's audio frames with the model's response audio frames on a shared timeline. When a customer reports a clipped reply or a mistimed barge-in, we open the diff viewer and pinpoint the millisecond. With Vapi-style abstractions, the same investigation usually ends in "we'll file a ticket with the platform."
The third is A/B audio quality scoring by gpt-4o-mini against the captured frames. The analyzer rates intelligibility on a 1-5 scale per turn, flags any turn below 3, and surfaces patterns (specific call routes, specific times of day, specific carriers). That single tool flagged a 6kHz upper-frequency dropout on a particular carrier last quarter; the fix was a one-line tap re-config at the gateway. Without consistent PCM16 24kHz, the pattern would not have been visible.
Tradeoffs and When You Wouldn't Pick This
Standardizing on PCM16 24kHz inside the data center costs ~6x more bandwidth than Opus would. For us, internal bandwidth is cheap and the engineering benefits dominate. If you are running on a tight bandwidth budget across regions, you would Opus-encode at the gateway and decode at the model. We don't, because the artifact-free pipeline is worth more than the bytes.
If your STT provider does not support 24kHz, you also cannot get this benefit; you'll hit a forced downsample. The OpenAI Realtime API's native support for 24kHz is the load-bearing assumption that makes the pattern work.
Try CallSphere
Hear the 24kHz pipeline in action. Book a demo or compare features.