Cold-Start Voice AI Performance: CallSphere vs Vapi Benchmarks
Detailed cold-start benchmarks for voice AI: WebSocket setup, model warmup, first-token latency. Compare CallSphere on K8s vs Vapi managed pipeline.
TL;DR
Cold start in voice AI is the time from the first SIP RING to the first agent token spoken. It matters most when call volume is bursty (think clinic morning rush, real estate Saturday surge, after-hours storm). Vapi ships a managed warm pool, which gives you a smooth ~400-600ms cold start at the cost of opacity. CallSphere runs on K8s with hostPath hot-reload, an OpenAI Realtime WebSocket pre-warmed per pod, and Twilio media streams; cold start is ~700ms-1.1s for the first call into a freshly scaled pod, ~250-400ms thereafter.
If you can predict surge, CallSphere's HPA (Horizontal Pod Autoscaler) plus a pre-warm sidecar gets you the same numbers as Vapi with full transparency.
What "Cold Start" Actually Means in Voice AI
Three things have to happen before the agent can speak:
1. Telephony attach — Twilio or your SIP trunk bridges media to your application.
2. Realtime session establish — open a WebSocket to OpenAI Realtime, send the session.update with system prompt, voice, and tools, and receive the session.created event.
3. First-token generation — once audio starts flowing, the model emits its first audible token.
Each adds latency. In a steady-state call (#2 already pre-warmed), only #1 and #3 contribute. In a true cold start (#2 not pre-warmed), all three stack.
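The stacking above can be sketched directly; the phase timings (in ms) are illustrative, not measured values:

```python
def cold_start_ms(phases: dict, session_prewarmed: bool) -> float:
    # Phase #1 (telephony attach) and #3 (first token) always contribute;
    # phase #2 (session establish) only stacks on a true cold start.
    total = phases["telephony_attach"] + phases["first_token"]
    if not session_prewarmed:
        total += phases["session_establish"]
    return total
```

For example, with phases of 100, 400, and 300 ms, a pre-warmed session totals 400 ms while a true cold start totals 800 ms.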
Vapi Cold-Start Approach
Vapi runs a managed warm pool of LLM connections. When a new call lands:
- Their SIP gateway picks an existing warm worker
- The worker has an OpenAI/Anthropic connection already open
- Their docs report time-to-first-audio under 500ms
Trade-offs:
- You do not control pool size
- Burst beyond pool capacity adds queue time you cannot inspect
- Latency spikes during cross-region failover
- No way to pre-warm by your own forecast
CallSphere Cold-Start Approach
CallSphere runs on k3s with hostPath volumes for backend hot-reload. The voice path is:
Twilio Media Streams (WebSocket)
↓
Python FastAPI agent server (per-pod)
↓
OpenAI Realtime API (WebSocket, gpt-4o-realtime-preview-2025-06-03)
Each pod boots with a prewarmer sidecar: it opens an OpenAI Realtime WebSocket, sends a no-op session.update, and parks the connection. When the first call hits the pod, the agent server reuses that connection.
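A minimal sketch of that park-and-reuse pattern, with the connection factory injected so the transport (e.g. a wrapper around a `websockets.connect(...)` call) stays out of the way; the class and method names here are hypothetical, not CallSphere's actual code:

```python
import json
from typing import Awaitable, Callable


class Prewarmer:
    """Per-pod sidecar pattern: open one Realtime connection at boot and park it."""

    def __init__(self, connect: Callable[[], Awaitable]):
        self._connect = connect  # factory returning an open WebSocket-like object
        self._parked = None

    async def warm(self) -> None:
        ws = await self._connect()
        # No-op session.update: establishes TLS + session state without audio config.
        await ws.send(json.dumps({"type": "session.update", "session": {}}))
        self._parked = ws

    def take(self):
        # The first call into the pod reuses the warm connection instead of dialing out.
        ws, self._parked = self._parked, None
        return ws
```

The factory injection also makes the prewarmer trivially testable with a fake connection, which matters when the real endpoint bills per audio second.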
Real numbers from production traces:
| Phase | Cold Pod | Warm Pod |
|---|---|---|
| Pod scheduling (K8s) | 8-15s | 0 |
| Container start | 2-4s | 0 |
| Prewarmer connect to OpenAI | 350-500ms | 0 (already open) |
| Twilio media bridge | 80-120ms | 80-120ms |
| First-token from model | 280-400ms | 200-280ms |
| Total cold start | 700ms-1.1s | 250-400ms |
The big number is K8s pod scheduling, which is why the right answer is predictive HPA: scale up before the surge using a forecast, not after.
Predictive Pre-Warm Strategy
CallSphere uses a Redis-backed surge predictor that runs every 60s and looks at:
- Trailing 5-minute call rate per vertical
- Day-of-week + hour-of-day baseline
- Active campaign queues (outbound batches)
If predicted next-5-minute load exceeds 70% of current capacity, it asks K8s to scale up by one pod. The new pod takes ~10s to schedule and its prewarmer connects in ~500ms, so it is warm by the time real traffic arrives.
import asyncio
import time

async def surge_predictor(vertical: str):
    # Runs every 60s; compares predicted next-5-minute load to 70% of capacity.
    while True:
        now = time.time()
        day, hour = time.strftime("%a"), time.strftime("%H")
        # Day-of-week + hour-of-day baseline (calls per 5 minutes)
        baseline = float(redis.get(f"baseline:{day}:{hour}") or 0)
        # Trailing 5-minute call rate for this vertical
        recent = redis.zcount(f"calls:{vertical}", now - 300, now)
        # Pending outbound campaign batches
        outbound_queue = await get_pending_outbound(vertical)
        predicted = max(baseline, recent * 1.2) + outbound_queue * 0.1
        capacity = current_pod_count() * PEAK_CALLS_PER_POD
        if predicted > capacity * 0.7:
            scale_up(vertical, +1)  # new pod is warm in ~10s + ~500ms prewarm
        await asyncio.sleep(60)
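The same policy can also be expressed as a direct replica-count target rather than incremental +1 steps. A sketch, with PEAK_CALLS_PER_POD and max_pods as illustrative values:

```python
import math

PEAK_CALLS_PER_POD = 10  # illustrative; tune per vertical
HEADROOM = 0.7           # the predictor's scale-up threshold

def pods_needed(predicted_calls: float, max_pods: int = 20) -> int:
    # Smallest replica count that keeps predicted load under HEADROOM * capacity,
    # clamped between the HPA floor (1) and the cluster cap.
    raw = math.ceil(predicted_calls / (PEAK_CALLS_PER_POD * HEADROOM))
    return max(1, min(max_pods, raw))
```

Solving for the target directly avoids the lag of repeated +1 steps when a large outbound batch lands all at once.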
Connection Reuse Inside the Pod
Inside one pod, multiple concurrent calls share the OpenAI Realtime WebSocket pool. A pool of 5 connections handles ~50 concurrent calls comfortably; the bottleneck is Twilio media stream concurrency per pod, not the LLM connection.
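Under the sharing model described above, pool assignment can be as simple as round-robin; this sketch assumes connections are multiplexed rather than checked out exclusively, and the class name is hypothetical:

```python
import itertools


class RealtimePool:
    """Round-robin over a fixed set of pre-opened Realtime connections."""

    def __init__(self, connections):
        self._connections = list(connections)
        self._rr = itertools.cycle(self._connections)

    def acquire(self):
        # Spread calls evenly across the pool; no checkin step needed
        # when sessions share sockets.
        return next(self._rr)
```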
Vapi vs CallSphere Cold-Start Comparison
| Metric | Vapi | CallSphere (warm pod) | CallSphere (cold pod) |
|---|---|---|---|
| First-audio target | <500ms | ~250-400ms | ~700-1100ms |
| Pre-warm visibility | Hidden | Per-pod metric | Per-pod metric |
| Predictive scaling | None exposed | Redis-driven HPA | Same |
| Surge cap | Pool dependent | K8s cluster cap | Same |
| Geo-region pinning | Vendor-side | Per-cluster | Per-cluster |
| Cold spike behavior | Queue + spike | Brief spike, predictable | Brief spike |
Cold-Start Timeline
sequenceDiagram
participant Caller
participant Twilio
participant K8s
participant Pod
participant OpenAI as OpenAI Realtime
Note over Caller,Twilio: First call after surge
Caller->>Twilio: SIP INVITE
Twilio->>K8s: WS connect (cold)
K8s->>Pod: Schedule + start (8-15s if cold)
Pod->>OpenAI: WebSocket session.update
OpenAI-->>Pod: session.created (350-500ms)
Pod->>Twilio: Audio bridge ready
Twilio->>Caller: Play hold tone (covers cold)
Pod->>OpenAI: Initial system audio frame
OpenAI-->>Pod: First-token audio (280-400ms)
Pod-->>Twilio: PCM16 24kHz greeting
Twilio-->>Caller: "Hi, this is..."
Practical Cold-Start Optimization Tips
- Use a hold tone for the first 600ms. It covers the perceptual gap, and callers already expect a brief ring or hold tone before an agent speaks.
- Pre-warm by HPA, not by always-on capacity. Always-on burns money during off-hours.
- Run prewarmer as a sidecar, not in the main process. Otherwise the first call into a pod pays the prewarmer cost.
- Pin pods to nodes with NVMe local volumes. Cuts container start time meaningfully on k3s.
- Use a separate WebSocket pool per vertical. Healthcare and Real Estate have wildly different system prompts; sharing forces re-init.
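The hold-tone tip can be sketched against Twilio Media Streams' outbound message shape (8 kHz mu-law, base64 payloads); the 20 ms frame size and the near-silence byte are assumptions, and production code would substitute real comfort-tone audio:

```python
import base64
import json


def hold_tone_frames(stream_sid: str, duration_ms: int = 600):
    """Yield outbound Twilio Media Stream messages to cover the cold-start gap."""
    # 8 kHz mu-law at 20 ms per frame = 160 bytes; 0xFF is mu-law near-silence.
    payload = base64.b64encode(b"\xff" * 160).decode()
    for _ in range(duration_ms // 20):
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": payload},
        })
```

Sending these frames as soon as the media WebSocket opens buys the prewarmer (or a cold session establish) its ~600ms without dead air.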
FAQ
Why doesn't CallSphere use always-on warm capacity like Vapi?
We do, but only at the floor. The HPA min-replicas is sized for baseline load. Above that, predictive scaling handles surge. Always-on for peak burns capacity 80% of the day.
Does the Realtime API charge for idle connections?
The WebSocket itself is free; you pay per audio second processed. A parked connection with no audio costs nothing.
Can you go below 250ms first-audio?
Yes, with edge regions and aggressive caching, but the user-perceptible threshold is ~300ms. Below that you stop noticing improvements unless the use case is extremely conversational (interview prep, language tutoring).
Is this measured end-to-end or just server-side?
End-to-end from Twilio's first media frame to the first PCM16 frame returned. Excludes carrier-side SIP delay (~50-150ms variable).
What happens during a region outage?
K8s rebalances, prewarmer rebuilds connections in 1-2s per pod, and the surge predictor over-provisions for 5 minutes after a healing event.
Try a Live Cold-Start Test
Run the live demo — the first call you trigger after the page idles is a real cold-start; subsequent calls are warm-pod numbers. /features lists per-vertical latency targets.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available; no signup required.