
Voice AI Latency Under Load: CallSphere <1s vs Vapi Spikes

CallSphere targets sub-1-second voice latency via OpenAI Realtime + server VAD. Vapi reports multi-second spikes under load. Architecture deep dive.

TL;DR

CallSphere targets sub-1-second end-to-end voice latency by running directly on the OpenAI Realtime API over WebSocket with server-side VAD, PCM16 24kHz audio, and a single low-jitter pipeline. Vapi.ai advertises <500 ms, while developers report occasional multi-second hangs under load, driven by the platform's multi-vendor pipeline (separate STT, LLM, and TTS providers stitched together). The architectural difference is straightforward: every additional vendor in the path adds queueing, retries, and jitter. CallSphere collapses the path. This post breaks down the latency budget for both architectures.

What "Voice Latency" Actually Means

Voice AI latency is not one number. It is at least four numbers:

  • Time-to-first-byte (TTFB) of audio after the user finishes speaking.
  • End-of-utterance to first-spoken-word of the agent.
  • Median jitter turn-over-turn during a normal call.
  • P99 latency under concurrent load.

The first two are what marketing pages quote. The last two are what kill production deployments. A platform that hits a 400 ms median but 4 seconds at P99 will be unusable during peak hours, even though its homepage looks great.

The right way to evaluate is to measure all four under realistic concurrency. The architectural choices a platform makes determine what those four numbers can possibly look like.
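As a minimal sketch of that evaluation, here is how to turn raw per-turn measurements into the four numbers that matter. The sample latency values are invented for illustration, not real benchmark data.

```python
# Summarize per-turn latency the way the post recommends: median,
# tail percentiles, and turn-over-turn jitter, not one headline number.
from statistics import median

def latency_report(turn_latencies_ms):
    """Return P50, P95, P99, and mean turn-over-turn jitter (all in ms)."""
    ordered = sorted(turn_latencies_ms)
    n = len(ordered)

    def pct(p):  # nearest-rank percentile
        return ordered[min(n - 1, int(p / 100 * n))]

    # Jitter: mean absolute change between consecutive turns.
    diffs = [abs(b - a) for a, b in zip(turn_latencies_ms, turn_latencies_ms[1:])]
    jitter = sum(diffs) / len(diffs) if diffs else 0.0
    return {"p50": median(ordered), "p95": pct(95), "p99": pct(99), "jitter": jitter}

# Illustrative turn latencies: mostly fast, with two tail spikes.
turns = [420, 450, 410, 2600, 440, 430, 460, 415, 445, 3900]
print(latency_report(turns))
```

Note how two bad turns out of ten barely move the median but dominate P95/P99 and jitter, which is exactly why a median-only claim hides the production story.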

Vapi's Pipeline Architecture

Vapi.ai is a stitching layer on top of best-of-breed third-party voice components. A typical Vapi call runs:

  1. Telephony (Twilio or a Vapi-pooled number).
  2. STT (Deepgram, AssemblyAI, or a Vapi default).
  3. LLM (OpenAI, Anthropic, Groq, or another).
  4. TTS (ElevenLabs, PlayHT, Cartesia).
  5. Vapi orchestrator that coordinates the above.

Each hop is a separate API call across separate vendor networks. Each vendor has its own queue, its own throttling, and its own occasional regional issues. The architecture is flexible — you can swap providers — but the latency budget is the sum of the worst-case hops, not the best case.

Under load, the failure mode is tail latency cascading across vendors. If Deepgram is hot-spotting in us-east-1 and ElevenLabs has a brief queue, your Vapi call sees both. Developers have reported multi-second hangs on busy days, particularly when the upstream LLM is a different model than the one Vapi was tuned against.

CallSphere's Pipeline Architecture

CallSphere collapses the pipeline into the OpenAI Realtime API directly:

  1. Telephony (Twilio for inbound/outbound; WebRTC for browser-to-agent).
  2. OpenAI Realtime API over a single WebSocket connection. The Realtime API ingests audio, runs speech recognition, runs the LLM, and emits TTS audio in the same session.
  3. CallSphere FastAPI backend that handles tool calls, persistence, and analytics.

The Realtime API uses server-side VAD for turn detection, PCM16 24kHz audio framing for low-bitrate clarity, and the gpt-4o-realtime-preview-2025-06-03 model. Because STT, LLM, and TTS run inside one OpenAI session, there are no vendor hops in the critical path. The latency floor is bounded by the Realtime API itself, not by the slowest of three independent providers.

CallSphere's end-to-end target is <1 second from user end-of-speech to agent first word, with the median well below that. It can hit that consistently under load because the architecture has no junctions where vendor queues stack up.
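For a concrete picture of how little is in the critical path, here is the shape of the session configuration a Realtime-based stack sends once the WebSocket opens. The URL and event fields follow OpenAI's published Realtime API schema; the `silence_duration_ms` tuning value and the `instructions` string are hypothetical examples, not CallSphere's actual settings.

```python
# One WebSocket, one session.update event: STT, LLM, and TTS are all
# configured in a single place because they run in the same session.
import json

REALTIME_URL = (
    "wss://api.openai.com/v1/realtime"
    "?model=gpt-4o-realtime-preview-2025-06-03"  # version-pinned model
)

session_update = {
    "type": "session.update",
    "session": {
        "input_audio_format": "pcm16",   # raw PCM16 in...
        "output_audio_format": "pcm16",  # ...and out: no codec dwell time
        "turn_detection": {
            "type": "server_vad",        # model-side end-of-speech detection
            "silence_duration_ms": 500,  # hypothetical tuning value
        },
        "instructions": "You are a concise, friendly phone agent.",
    },
}

payload = json.dumps(session_update)
print(REALTIME_URL)
```

Everything that is a separate vendor contract in a multi-vendor stack is a field in one JSON event here.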

Latency Budget Breakdown

Below is a typical latency budget for both architectures during a normal call.

| Stage | Vapi (multi-vendor) | CallSphere (Realtime) |
| --- | --- | --- |
| Telephony ingest | 50-100 ms | 50-100 ms |
| STT processing | 150-300 ms | bundled |
| Network hop to LLM | 50-150 ms | bundled |
| LLM first token | 200-500 ms | 200-400 ms |
| Network hop to TTS | 50-150 ms | bundled |
| TTS first chunk | 200-400 ms | 100-200 ms |
| Telephony egress | 50-100 ms | 50-100 ms |
| Total median | 750-1700 ms | 400-800 ms |
| P99 under load | 2000-4500 ms | 800-1500 ms |

The numbers are illustrative, based on public benchmarks and developer reports rather than a controlled lab test. The relevant point is the shape: Vapi pays a vendor-hop cost three separate times that CallSphere does not pay at all.
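The totals in the table are just sums of the per-stage (low, high) ranges, which is easy to sanity-check:

```python
# Reproduce the budget table's totals by summing per-stage ranges in ms.
# Stages marked "bundled" in the table are folded into the Realtime column.
vapi = {
    "telephony_in": (50, 100), "stt": (150, 300), "net_to_llm": (50, 150),
    "llm_first_token": (200, 500), "net_to_tts": (50, 150),
    "tts_first_chunk": (200, 400), "telephony_out": (50, 100),
}
callsphere = {
    "telephony_in": (50, 100), "llm_first_token": (200, 400),
    "tts_first_chunk": (100, 200), "telephony_out": (50, 100),
}

def total(budget):
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

print("Vapi:", total(vapi))              # (750, 1700)
print("CallSphere:", total(callsphere))  # (400, 800)
```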


Mermaid: Latency Budget Comparison

graph LR
    subgraph Vapi
    A1[User speaks] --> A2[Telephony 50-100ms]
    A2 --> A3[STT 150-300ms]
    A3 --> A4[Network to LLM 50-150ms]
    A4 --> A5[LLM 200-500ms]
    A5 --> A6[Network to TTS 50-150ms]
    A6 --> A7[TTS 200-400ms]
    A7 --> A8[Telephony 50-100ms]
    A8 --> A9[User hears reply]
    end
    subgraph CallSphere
    B1[User speaks] --> B2[Telephony 50-100ms]
    B2 --> B3[OpenAI Realtime 300-600ms]
    B3 --> B4[Telephony 50-100ms]
    B4 --> B5[User hears reply]
    end

Why PCM16 24kHz Matters

Audio format choice matters more than most teams realize. Vapi's default pipeline often re-encodes audio between providers — Twilio mu-law in, Deepgram PCM, ElevenLabs MP3 streaming out, telephony mu-law back. Each re-encode adds buffer delay (typically 20-60 ms) and a small quality hit.

CallSphere uses PCM16 at 24kHz end-to-end inside the Realtime session. PCM16 is a raw, uncompressed format, so there is no codec dwell time. 24kHz is high enough to preserve consonant clarity (which dominates intelligibility) without the bitrate overhead of 48kHz studio audio. The result: lower buffering latency and crisper audio, especially on flaky networks.
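The framing arithmetic behind that choice is worth spelling out. The 20 ms frame duration here is an assumption (a common choice for real-time audio), not a documented CallSphere setting.

```python
# Bitrate and frame-size arithmetic for PCM16 at 24kHz.
SAMPLE_RATE = 24_000     # samples per second
BYTES_PER_SAMPLE = 2     # PCM16 = 16-bit signed integers
FRAME_MS = 20            # assumed frame duration, common in real-time audio

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE       # uncompressed stream rate
samples_per_frame = SAMPLE_RATE * FRAME_MS // 1000      # samples in one frame
bytes_per_frame = samples_per_frame * BYTES_PER_SAMPLE  # wire size of one frame

print(bytes_per_second, samples_per_frame, bytes_per_frame)
```

48 kB/s is modest by modern network standards, and because no codec sits in the path, each 960-byte frame can be forwarded as soon as it arrives rather than waiting in an encoder buffer.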

Why Server-Side VAD Matters Under Load

Voice Activity Detection (VAD) is what tells the platform when the user has stopped speaking. Two approaches exist:

  • Client-side VAD: the browser or telephony layer detects silence and forwards a "user done" signal. Cheap, but jittery.
  • Server-side VAD: the model itself decides when the user is done based on the audio stream. More accurate, harder to implement.

OpenAI Realtime ships server-side VAD as a first-class feature. CallSphere uses it, which means turn boundaries are detected by the same model that will respond, with no extra coordination overhead. Vapi-based stacks typically rely on the STT vendor's VAD signal, which adds round-trip and is more sensitive to background noise.

Under load, server-side VAD also avoids a class of bug where two specialist vendors disagree about whether the user has stopped speaking. CallSphere does not have to mediate that disagreement — there is only one decider.
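For contrast, here is a naive client-side energy VAD of the kind server-side VAD replaces: it declares the user done after a run of consecutive low-energy frames. The threshold and hangover values are illustrative, and real client-side VADs are more sophisticated, but the failure mode is visible in the shape: a fixed amplitude threshold is exactly what background noise defeats.

```python
# A naive client-side VAD over PCM16 frames, for illustration only.
import struct

def frame_energy(pcm16_bytes):
    """Mean absolute amplitude of one little-endian PCM16 frame."""
    samples = struct.unpack(f"<{len(pcm16_bytes) // 2}h", pcm16_bytes)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def detect_end_of_utterance(frames, threshold=500, hangover_frames=25):
    """Return the index of the frame where the user is judged done, or None.

    threshold: illustrative energy floor; hangover_frames: how many quiet
    frames in a row count as end-of-speech (25 x 20ms = 500ms of silence).
    """
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if frame_energy(frame) < threshold else 0
        if quiet >= hangover_frames:
            return i
    return None
```

Server-side VAD moves this decision into the model that will respond, so the turn boundary reflects the audio content, not a hand-tuned amplitude cutoff.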

Head-to-Head: Latency Architecture

| Dimension | CallSphere | Vapi |
| --- | --- | --- |
| Vendor hops in path | 1 (OpenAI Realtime) | 3-4 (STT + LLM + TTS) |
| Audio format | PCM16 24kHz end-to-end | Re-encodes between vendors |
| VAD location | Server (OpenAI Realtime) | Usually STT-side |
| Median latency target | <1 second | <500 ms claimed |
| P99 under load | 800-1500 ms | Multi-second reports |
| Recovery from vendor outage | OpenAI fail-over | Each vendor independent |

Why Concurrency Amplifies the Difference

The latency difference is largest under concurrency. When 50 calls are happening simultaneously:

  • Vapi: each call hits 3-4 vendor APIs. If any vendor's queue is hot, all calls suffer. Tail latency expands.
  • CallSphere: each call is one OpenAI Realtime session. The Realtime API has its own concurrency limits, but they are predictable and controllable.

For a small business with 5 simultaneous calls, both architectures perform well. For a 100-call concurrent inbound spike, the architectural differences dominate the user experience.
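A quick Monte Carlo makes the tail effect concrete: draw each vendor hop from the same heavy-tailed distribution and compare the P99 of a three-hop sum against a single bundled hop. The distributions are invented for illustration, not measured from either platform.

```python
# Why multiple vendor hops expand tail latency: the chance that AT LEAST
# one hop is queued grows with each hop (1 - 0.95**3 ~= 14% vs 5%).
import random

random.seed(7)  # deterministic for reproducibility

def hop_ms():
    # Mostly fast, but 5% of requests hit a hot vendor queue.
    if random.random() > 0.05:
        return random.uniform(100, 250)
    return random.uniform(1000, 3000)

def p99(samples):
    return sorted(samples)[int(len(samples) * 0.99)]

multi_vendor = [hop_ms() + hop_ms() + hop_ms() for _ in range(10_000)]
single_hop = [hop_ms() for _ in range(10_000)]

print("3-hop P99:", round(p99(multi_vendor)), "ms")
print("1-hop P99:", round(p99(single_hop)), "ms")
```

Both architectures have a slow tail, but the sum of three independent hops puts far more mass in it, which is why the gap between median and P99 widens fastest on the multi-vendor side.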

Practical Steps to Test Latency Yourself

If you are evaluating voice AI platforms, run a load test that mirrors production:

  1. Place 20 simultaneous calls.
  2. Record the time from user-end-of-utterance to agent-first-word for each turn.
  3. Plot the distribution. Look at P50, P95, and P99.
  4. Repeat at peak hours of the platform's region (typically US business hours UTC-7 to UTC-4).

You will see the architectural truth quickly. Marketing claims do not survive a real load test, but architecture does.
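The steps above can be sketched as a small asyncio harness. The `place_call` coroutine here is a simulated stand-in (it just sleeps for a random interval); in a real test you would replace it with your platform's SDK call and measure user-end-of-utterance to agent-first-word.

```python
# Concurrent load-test skeleton: fire N calls at once, collect per-call
# latency, and report the percentile distribution.
import asyncio
import random
import time

async def place_call(call_id):
    """Stand-in for one real call turn; returns latency in ms."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.3, 0.8))  # pretend the agent responds
    return (time.perf_counter() - start) * 1000

async def load_test(concurrency=20):
    # gather() launches all calls simultaneously, as step 1 requires.
    latencies = await asyncio.gather(*(place_call(i) for i in range(concurrency)))
    ordered = sorted(latencies)
    for label, p in (("P50", 50), ("P95", 95), ("P99", 99)):
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        print(label, round(ordered[idx]), "ms")
    return latencies

latencies = asyncio.run(load_test())
```

Run it against each candidate platform at the same concurrency and compare the P95/P99 lines, not the P50.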

FAQ

Is <500ms latency on Vapi a fair claim?

The number is achievable in best-case isolated tests with the right vendor combination. Sustained P99 under concurrent load tells a different story.

Why does CallSphere not advertise 300ms?

Honest measurement under load matters more than a glossy headline number. CallSphere's <1-second target is a real production figure measured over real customer calls, not a marketing artifact.

What about WebRTC? Does it help?

Yes for some verticals. CallSphere's Real Estate platform uses WebRTC for browser-to-agent calls, which removes the telephony hop entirely. See the features page for the WebRTC architecture details.

Does the OpenAI Realtime API have its own latency issues?

It can spike during model rollouts. CallSphere version-pins the model (gpt-4o-realtime-preview-2025-06-03) to avoid surprise drift, and falls back to the previous version automatically during incidents.

How does this affect international calls?

Both platforms are sensitive to geography because OpenAI's regional endpoints matter. CallSphere routes traffic through the closest healthy Realtime endpoint to keep the budget tight.

Can I see latency metrics for my deployment?

Yes. CallSphere ships per-call latency metrics in the analytics dashboard. Book a demo and we will show you a live latency timeline for a real call.

Ship a Voice AI Stack That Stays Fast Under Load

Schedule a CallSphere demo and run your own load test. The architectural difference shows up in the first 10 calls.

