Voice AI Latency Under Load: CallSphere <1s vs Vapi Spikes
CallSphere targets sub-1-second voice latency via OpenAI Realtime + server VAD. Vapi users report multi-second spikes under load. Architecture deep dive.
TL;DR
CallSphere targets sub-1-second end-to-end voice latency by running directly on the OpenAI Realtime API over WebSocket with server-side VAD, PCM16 24kHz audio, and a single low-jitter pipeline. Vapi.ai claims <500 ms in its marketing, while developers report occasional multi-second hangs under load, driven by the platform's multi-vendor pipeline (separate STT, LLM, and TTS providers stitched together). The architectural difference is straightforward: every additional vendor in the path adds queueing, retries, and jitter. CallSphere collapses the path. This post breaks down the latency budget for both architectures.
What "Voice Latency" Actually Means
Voice AI latency is not one number. It is at least four numbers:
- Time-to-first-byte (TTFB) of audio after the user finishes speaking.
- End-of-utterance to first-spoken-word of the agent.
- Median jitter turn-over-turn during a normal call.
- P99 latency under concurrent load.
The first two are what marketing pages quote. The last two are what kill production deployments. A platform that hits 400 ms median but 4 seconds at P99 will be unusable during peak hours, even though its homepage looks great.
The right way to evaluate is to measure all four under realistic concurrency. The architectural choices a platform makes determine what those four numbers can possibly look like.
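To make the distinction concrete, here is a minimal sketch in plain Python (standard library only, hypothetical numbers) of how three of the four metrics fall out of per-turn measurements:

```python
import statistics

# Hypothetical per-turn latencies (seconds): time from user end-of-speech
# to the agent's first audio, over ten turns of one test call.
turn_latencies = [0.42, 0.39, 0.55, 0.41, 3.80, 0.47, 0.44, 0.52, 0.40, 0.46]

median = statistics.median(turn_latencies)
# Turn-over-turn jitter: how much latency moves between consecutive turns.
jitter = statistics.median(
    abs(b - a) for a, b in zip(turn_latencies, turn_latencies[1:])
)
p99 = statistics.quantiles(turn_latencies, n=100)[98]

print(f"median={median:.2f}s  jitter={jitter:.2f}s  p99={p99:.2f}s")
# The single 3.8s outlier barely moves the median but dominates the P99.
```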
Vapi's Pipeline Architecture
Vapi.ai is a stitching layer on top of best-of-breed third-party voice components. A typical Vapi call runs:
- Telephony (Twilio or a Vapi-pooled number).
- STT (Deepgram, AssemblyAI, or a Vapi default).
- LLM (OpenAI, Anthropic, Groq, or another).
- TTS (ElevenLabs, PlayHT, Cartesia).
- Vapi orchestrator that coordinates the above.
Each hop is a separate API call across separate vendor networks. Each vendor has its own queue, its own throttling, and its own occasional regional issues. The architecture is flexible — you can swap providers — but the latency budget is the sum of the worst-case hops, not the best case.
Under load, the failure mode is tail latency cascading across vendors. If Deepgram is hot-spotting in us-east-1 and ElevenLabs has a brief queue, your Vapi call sees both. Developers have reported multi-second hangs on busy days, particularly when the upstream LLM is a different model than the one Vapi was tuned against.
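The compounding is easy to see with a toy model. The sketch below is illustrative only (invented per-hop numbers, not measured vendor data): each hop is usually near its median but lands in its tail about 1% of the time, and turn latency is the sum of the hops. With seven hops, roughly 1 in 15 turns hits somebody's tail (1 - 0.99^7 ≈ 6.8%); with three, about 1 in 34.

```python
import random

random.seed(7)

def hop_ms(median_ms: float, tail_penalty_ms: float, tail_p: float = 0.01) -> float:
    """One vendor hop: normally near its median, occasionally in its tail."""
    base = max(random.gauss(median_ms, median_ms * 0.2), 0.0)
    return base + (tail_penalty_ms if random.random() < tail_p else 0.0)

def turn_ms(hops: list[tuple[float, float]]) -> float:
    return sum(hop_ms(m, t) for m, t in hops)

# (median ms, tail penalty ms) per hop -- invented, for shape only.
multi_vendor = [(75, 500), (225, 1500), (100, 500), (350, 2000),
                (100, 500), (300, 1500), (75, 500)]
single_session = [(75, 500), (450, 1500), (75, 500)]

for name, hops in (("multi-vendor", multi_vendor), ("single-session", single_session)):
    samples = sorted(turn_ms(hops) for _ in range(20_000))
    print(f"{name:14s} P50={samples[10_000]:6.0f} ms  P99={samples[19_800]:6.0f} ms")
```

The medians stay close; the tails diverge. That is the shape the budget table below makes explicit.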
CallSphere's Pipeline Architecture
CallSphere collapses the pipeline into the OpenAI Realtime API directly:
- Telephony (Twilio for inbound/outbound; WebRTC for browser-to-agent).
- OpenAI Realtime API over a single WebSocket connection. The Realtime API ingests audio, runs speech recognition, runs the LLM, and emits TTS audio in the same session.
- CallSphere FastAPI backend that handles tool calls, persistence, and analytics.
The Realtime API uses server-side VAD for turn detection, raw PCM16 24kHz audio framing, and the gpt-4o-realtime-preview-2025-06-03 model. Because STT, LLM, and TTS run inside one OpenAI session, there are no vendor hops in the critical path. The latency floor is bounded by the Realtime API itself, not by the slowest of three independent providers.
CallSphere's end-to-end target is <1 second from user end-of-speech to agent first word, with the median well below that. It can hit that target consistently under load because the architecture has no junctions where vendor queues stack up.
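For concreteness, here is a minimal sketch of that single-socket shape, using the documented Realtime API events (session.update, response.audio.delta). It is an illustration of the architecture, not CallSphere's production code, and the exact header kwarg depends on your websockets version:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"

async def run_session() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets releases call this kwarg `extra_headers`.
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # One session handles audio in, speech recognition, inference, and
        # audio out -- no vendor hop in the critical path.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "turn_detection": {"type": "server_vad"},
                "input_audio_format": "pcm16",   # 24kHz PCM16 in
                "output_audio_format": "pcm16",  # 24kHz PCM16 out
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                pass  # base64 PCM16 chunk: stream straight to telephony egress
            elif event["type"] == "error":
                print(event)

asyncio.run(run_session())
```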
Latency Budget Breakdown
Below is a typical latency budget for both architectures during a normal call.
| Stage | Vapi (multi-vendor) | CallSphere (Realtime) |
|---|---|---|
| Telephony ingest | 50-100 ms | 50-100 ms |
| STT processing | 150-300 ms | bundled |
| Network hop to LLM | 50-150 ms | bundled |
| LLM first token | 200-500 ms | 200-400 ms |
| Network hop to TTS | 50-150 ms | bundled |
| TTS first chunk | 200-400 ms | 100-200 ms |
| Telephony egress | 50-100 ms | 50-100 ms |
| Total median | 750-1700 ms | 400-800 ms |
| P99 under load | 2000-4500 ms | 800-1500 ms |
The numbers are illustrative, drawn from public benchmarks and developer reports rather than a controlled lab test. The relevant point is the shape: Vapi pays a network-hop cost three separate times per turn that CallSphere does not pay at all.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
Mermaid: Latency Budget Comparison
```mermaid
graph LR
    subgraph Vapi
        A1[User speaks] --> A2[Telephony 50-100ms]
        A2 --> A3[STT 150-300ms]
        A3 --> A4[Network to LLM 50-150ms]
        A4 --> A5[LLM 200-500ms]
        A5 --> A6[Network to TTS 50-150ms]
        A6 --> A7[TTS 200-400ms]
        A7 --> A8[Telephony 50-100ms]
        A8 --> A9[User hears reply]
    end
    subgraph CallSphere
        B1[User speaks] --> B2[Telephony 50-100ms]
        B2 --> B3[OpenAI Realtime 300-600ms]
        B3 --> B4[Telephony 50-100ms]
        B4 --> B5[User hears reply]
    end
```
Why PCM16 24kHz Matters
Audio format choice matters more than most teams realize. Vapi's default pipeline often re-encodes audio between providers — Twilio mu-law in, Deepgram PCM, ElevenLabs MP3 streaming out, telephony mu-law back. Each re-encode adds buffer delay (typically 20-60 ms) and a small quality hit.
CallSphere uses PCM16 at 24kHz end-to-end inside the Realtime session. PCM16 is a raw, uncompressed format, so there is no codec dwell time. 24kHz is high enough to preserve consonant clarity (which dominates intelligibility) without the bandwidth overhead of 48kHz studio audio. The result: lower buffering latency and crisper audio, especially on flaky networks.
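The framing arithmetic is easy to verify. The sketch below assumes only the formats named above and a common 20 ms frame size:

```python
FRAME_MS = 20  # a common voice frame duration

def frame_bytes(sample_rate_hz: int, bytes_per_sample: int) -> int:
    return sample_rate_hz * bytes_per_sample * FRAME_MS // 1000

mulaw_8k = frame_bytes(8_000, 1)     # telephony mu-law: 160 bytes per frame
pcm16_24k = frame_bytes(24_000, 2)   # Realtime session: 960 bytes per frame

# PCM16 24kHz costs 384 kbps versus mu-law's 64 kbps, but carries zero
# codec dwell: each re-encode in a multi-vendor chain buffers at least one
# full frame (20 ms) before it can emit anything.
print(f"mu-law 8kHz: {mulaw_8k} B/frame, PCM16 24kHz: {pcm16_24k} B/frame")
```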
Why Server-Side VAD Matters Under Load
Voice Activity Detection (VAD) is what tells the platform when the user has stopped speaking. Two approaches exist:
- Client-side VAD: the browser or telephony layer detects silence and forwards a "user done" signal. Cheap, but jittery.
- Server-side VAD: the model itself decides when the user is done based on the audio stream. More accurate, but harder to implement.
OpenAI Realtime ships server-side VAD as a first-class feature. CallSphere uses it, which means turn boundaries are detected by the same model that will respond, with no extra coordination overhead. Vapi-based stacks typically rely on the STT vendor's VAD signal, which adds a round trip and is more sensitive to background noise.
Under load, server-side VAD also avoids a class of bugs where two specialist vendors disagree about whether the user has stopped speaking. CallSphere does not have to mediate that disagreement; there is only one decider.
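For reference, server-side VAD is configured on the same session.update event shown earlier. The fields below are the documented Realtime API turn-detection knobs; the values are illustrative defaults, not CallSphere's tuned settings:

```python
# Goes inside the "session" payload of a session.update event.
turn_detection = {
    "type": "server_vad",
    "threshold": 0.5,            # speech-probability cutoff; raise for noisy lines
    "prefix_padding_ms": 300,    # audio retained from before detected speech start
    "silence_duration_ms": 500,  # pause length that counts as end of turn
}
```

A shorter silence_duration_ms makes the agent feel snappier but risks cutting callers off mid-sentence. The point is that there is exactly one place to tune this trade-off, not two vendors to reconcile.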
Head-to-Head: Latency Architecture
| Dimension | CallSphere | Vapi |
|---|---|---|
| Vendor hops in path | 1 (OpenAI Realtime) | 3-4 (STT + LLM + TTS) |
| Audio format | PCM16 24kHz end-to-end | Re-encodes between vendors |
| VAD location | Server (OpenAI Realtime) | Usually STT-side |
| Median latency target | <1 second | <500ms claimed |
| P99 under load | 800-1500 ms | Multi-second reports |
| Recovery from vendor outage | Single provider (OpenAI) fail-over | Each vendor fails independently |
Why Concurrency Concentrates The Difference
The latency difference is largest under concurrency. When 50 calls are happening simultaneously:
- Vapi: each call hits 3-4 vendor APIs. If any vendor's queue is hot, all calls suffer. Tail latency expands.
- CallSphere: each call is one OpenAI Realtime session. The Realtime API has its own concurrency limits, but they are predictable and controllable.
For a small business with 5 simultaneous calls, both architectures perform well. For a 100-call concurrent inbound spike, the architectural differences dominate the user experience.
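The arithmetic behind that claim is short. Assuming, illustratively, a 2% chance that any single vendor serves a slow response on a given turn:

```python
p_slow = 0.02  # illustrative per-vendor tail probability per turn

for vendors in (1, 4):
    p_turn = 1 - (1 - p_slow) ** vendors
    print(f"{vendors} vendor(s): {p_turn:.1%} of turns degraded")
# 1 vendor(s): 2.0% of turns degraded
# 4 vendor(s): 7.8% of turns degraded
# At 50 concurrent calls, roughly 4 calls are in a degraded turn at any
# moment on the 4-vendor path, versus roughly 1 on the single-session path.
```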
Practical Steps to Test Latency Yourself
If you are evaluating voice AI platforms, run a load test that mirrors production:
- Place 20 simultaneous calls.
- Record the time from user-end-of-utterance to agent-first-word for each turn.
- Plot the distribution. Look at P50, P95, and P99.
- Repeat at peak hours of the platform's region (typically US business hours UTC-7 to UTC-4).
You will see the architectural truth quickly. Marketing claims do not survive a real load test, but architecture does.
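Here is a skeleton for that test, with asyncio driving the concurrency. Everything in it is a placeholder sketch: place_test_call is a hypothetical hook for your own telephony rig, and the simulated latencies exist only so the script runs end to end.

```python
import asyncio
import random

async def place_test_call(call_id: int) -> list[float]:
    # Replace with code that drives one scripted call and measures
    # end-of-utterance -> agent-first-word latency for every turn.
    await asyncio.sleep(0)
    return [max(random.gauss(0.6, 0.2), 0.1) for _ in range(8)]

async def load_test(concurrency: int = 20) -> None:
    per_call = await asyncio.gather(*(place_test_call(i) for i in range(concurrency)))
    latencies = sorted(t for call in per_call for t in call)
    n = len(latencies)
    for label, q in (("P50", 0.50), ("P95", 0.95), ("P99", 0.99)):
        print(f"{label}: {latencies[min(int(q * n), n - 1)]:.2f}s")

asyncio.run(load_test())
```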
FAQ
Is <500ms latency on Vapi a fair claim?
The number is achievable in best-case isolated tests with the right vendor combination. Sustained P99 under concurrent load tells a different story.
Why does CallSphere not advertise 300ms?
Honest measurement under load matters more than a glossy headline number. CallSphere's <1-second target is a real production figure measured over real customer calls, not a marketing artifact.
What about WebRTC? Does it help?
Yes, for some verticals. CallSphere's Real Estate platform uses WebRTC for browser-to-agent calls, which removes the telephony hop entirely. See the features page for the WebRTC architecture details.
Does the OpenAI Realtime API have its own latency issues?
It can spike during model rollouts. CallSphere version-pins the model (gpt-4o-realtime-preview-2025-06-03) to avoid surprise drift, and falls back to the previous version automatically during incidents.
How does this affect international calls?
Both platforms are sensitive to geography because OpenAI's regional endpoints matter. CallSphere routes traffic through the closest healthy Realtime endpoint to keep the budget tight.
Can I see latency metrics for my deployment?
Yes. CallSphere ships per-call latency metrics in the analytics dashboard. Book a demo and we will show you a live latency timeline for a real call.
Ship a Voice AI Stack That Stays Fast Under Load
Schedule a CallSphere demo and run your own load test. The architectural difference shows up in the first 10 calls.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available; no signup required.