
Cold-Start Voice AI Performance: CallSphere vs Vapi Benchmarks

Detailed cold-start benchmarks for voice AI: WebSocket setup, model warmup, first-token latency. Compare CallSphere on K8s vs Vapi managed pipeline.

TL;DR

Cold start in voice AI is the time from the first SIP RING to the first agent token spoken. It matters most when call volume is bursty (think clinic morning rush, real estate Saturday surge, after-hours storm). Vapi ships a managed warm pool, which gives you a smooth ~400-600ms cold start at the cost of opacity. CallSphere runs on K8s with hostPath hot-reload, an OpenAI Realtime WebSocket pre-warmed per pod, and Twilio media streams; cold start is ~700ms-1.1s for the first call into a freshly scaled pod, ~250-400ms thereafter.

If you can predict surge, CallSphere's HPA (Horizontal Pod Autoscaler) plus a pre-warm sidecar gets you the same numbers as Vapi with full transparency.

What "Cold Start" Actually Means in Voice AI

Three things have to happen before the agent can speak:

  1. Telephony attach — Twilio or your SIP trunk has to bridge media to your application.
  2. Realtime session establish — open a WebSocket to OpenAI Realtime, send the session.update with system prompt, voice, and tools, and receive the session.created event.
  3. First-token generation — once audio starts flowing, the model has to emit its first audible token.

Each adds latency. In a steady-state call (#2 already pre-warmed), only #1 and #3 contribute. In a true cold start (#2 not pre-warmed), all three stack.
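
For reference, here is a rough sketch of step 2 in isolation: open the Realtime WebSocket, send session.update, and time how long session.created takes to arrive. It assumes the third-party websockets client; the endpoint URL, auth headers, and payload fields are illustrative rather than CallSphere's production code.

import asyncio
import json
import time

import websockets  # third-party: pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"

async def time_session_establish(api_key: str) -> float:
    """Step 2 on its own: connect, send session.update, wait for session.created."""
    headers = {"Authorization": f"Bearer {api_key}", "OpenAI-Beta": "realtime=v1"}
    t0 = time.monotonic()
    # Recent websockets releases name this argument additional_headers (older ones: extra_headers)
    async with websockets.connect(REALTIME_URL, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy", "instructions": "You are a phone agent."},
        }))
        async for raw in ws:
            if json.loads(raw).get("type") == "session.created":
                return time.monotonic() - t0
    raise RuntimeError("connection closed before session.created")

On a warm path this cost has already been paid, which is exactly what the prewarmer described below exploits.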

Vapi Cold-Start Approach

Vapi runs a managed warm pool of LLM connections. When a new call lands:

  • Their SIP gateway picks an existing warm worker
  • The worker has an OpenAI/Anthropic connection already open
  • Their docs report sub-500ms time-to-first-audio

Trade-offs:

  • You do not control pool size
  • Burst beyond pool capacity adds queue time you cannot inspect
  • Latency spikes during cross-region failover
  • No way to pre-warm by your own forecast

CallSphere Cold-Start Approach

CallSphere runs on k3s with hostPath volumes for backend hot-reload. The voice path is:

Twilio Media Streams (WebSocket)
  ↓
Python FastAPI agent server (per-pod)
  ↓
OpenAI Realtime API (WebSocket, gpt-4o-realtime-preview-2025-06-03)

Each pod boots with a prewarmer sidecar: it opens an OpenAI Realtime WebSocket, sends a no-op session.update, and parks the connection. When the first call hits the pod, the agent server reuses that connection.
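
A minimal sketch of that park-and-reuse handoff, written in-process for brevity (in production the warm step runs as a sidecar); connect_realtime is assumed to be a coroutine that opens the WebSocket and sends the no-op session.update:

import asyncio

class Prewarmer:
    """Sketch: park one warm OpenAI Realtime WebSocket and hand it to the first call."""

    def __init__(self, connect_realtime):
        # connect_realtime: coroutine function that opens the WS and sends the no-op session.update
        self._connect = connect_realtime
        self._parked = None

    async def warm(self):
        """Run at pod start: pay the 350-500ms connect cost before any call arrives."""
        if self._parked is None:
            self._parked = await self._connect()

    async def acquire(self):
        """First call reuses the parked connection; later calls connect on demand."""
        ws, self._parked = self._parked, None
        if ws is None:
            ws = await self._connect()
        asyncio.create_task(self.warm())  # re-park a fresh connection for the next cold path
        return ws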


Real numbers from production traces:

Phase                        | Cold Pod    | Warm Pod
Pod scheduling (K8s)         | 8-15s       | 0
Container start              | 2-4s        | 0
Prewarmer connect to OpenAI  | 350-500ms   | 0 (already open)
Twilio media bridge          | 80-120ms    | 80-120ms
First-token from model       | 280-400ms   | 200-280ms
Total cold start             | 700ms-1.1s  | 250-400ms

Totals count the call-facing path only (prewarmer connect + media bridge + first token); pod scheduling and container start are paid before the call lands.

The big number is K8s pod scheduling, which is why the right answer is predictive HPA: scale up before the surge using a forecast, not after.

Predictive Pre-Warm Strategy

CallSphere uses a Redis-backed surge predictor that runs every 60s and looks at:

  • Trailing 5-minute call rate per vertical
  • Day-of-week + hour-of-day baseline
  • Active campaign queues (outbound batches)

If predicted next-5-minute load > current capacity * 0.7, it asks K8s to scale +1 pod. The new pod takes ~10s to schedule and the prewarmer connects in ~500ms, so the pod is warm by the time real traffic hits.

import asyncio
import time

async def surge_predictor(redis, vertical: str):
    """Every 60s: compare predicted next-5-minute load to capacity and pre-scale.
    redis is assumed to be an async client (redis.asyncio.Redis)."""
    while True:
        now = time.time()
        day, hour = time.strftime("%a"), time.strftime("%H")

        # Day-of-week + hour-of-day baseline, stored as a plain number
        baseline = float(await redis.get(f"baseline:{day}:{hour}") or 0)
        # Trailing 5-minute call count (sorted set scored by call timestamp)
        recent = await redis.zcount(f"calls:{vertical}", now - 300, now)
        # Active outbound campaign queue depth
        outbound_queue = await get_pending_outbound(vertical)

        predicted = max(baseline, recent * 1.2) + outbound_queue * 0.1
        capacity = current_pod_count() * PEAK_CALLS_PER_POD

        if predicted > capacity * 0.7:
            scale_up(vertical, +1)  # ask K8s for one more pod (sketch below)

        await asyncio.sleep(60)
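
The scale_up call above is a thin wrapper over the Kubernetes API. One way to sketch it with the official kubernetes Python client is to raise the HPA floor for that vertical's Deployment; the namespace and resource names here are illustrative, not CallSphere's actual manifests, and a cluster with autoscaling/v2 is assumed:

from kubernetes import client, config

def scale_up(vertical: str, delta: int, namespace: str = "voice") -> None:
    """Ask K8s for more pods by raising the HPA minReplicas floor for this vertical."""
    config.load_incluster_config()                 # predictor runs inside the cluster
    autoscaling = client.AutoscalingV2Api()
    name = f"callsphere-agent-{vertical}"          # illustrative HPA name

    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(name, namespace)
    hpa.spec.min_replicas = (hpa.spec.min_replicas or 1) + delta
    autoscaling.patch_namespaced_horizontal_pod_autoscaler(name, namespace, hpa)

Raising minReplicas, rather than patching the Deployment's replica count directly, keeps the HPA in charge so it can still scale back down after the surge window passes.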

Connection Reuse Inside the Pod

Inside one pod, multiple concurrent calls share the OpenAI Realtime WebSocket pool. A pool of 5 connections handles ~50 concurrent calls comfortably; the bottleneck is Twilio media stream concurrency per pod, not the LLM connection.
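
A simplified sketch of that per-pod pool, illustrative rather than the production implementation: each new call attaches to the least-loaded of a handful of already-warm connections.

import asyncio

class RealtimePool:
    """Per-pod pool sketch: a handful of warm Realtime WebSockets shared across calls."""

    def __init__(self, connections, max_calls_per_conn: int = 10):
        # e.g. 5 pre-opened, session.update'd connections -> ~50 concurrent calls
        self._load = {conn: 0 for conn in connections}
        self._max = max_calls_per_conn
        self._lock = asyncio.Lock()

    async def lease(self):
        """Attach a new call to the least-loaded connection; refuse when the pod is saturated."""
        async with self._lock:
            conn = min(self._load, key=self._load.get)
            if self._load[conn] >= self._max:
                raise RuntimeError("pod saturated; let the load balancer route elsewhere")
            self._load[conn] += 1
            return conn

    async def release(self, conn):
        """Detach a finished call so the slot can be reused."""
        async with self._lock:
            self._load[conn] -= 1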

Vapi vs CallSphere Cold-Start Comparison

Metric               | Vapi           | CallSphere (warm pod)    | CallSphere (cold pod)
First-audio target   | <500ms         | ~250-400ms               | ~700-1100ms
Pre-warm visibility  | Hidden         | Per-pod metric           | Per-pod metric
Predictive scaling   | None exposed   | Redis-driven HPA         | Same
Surge cap            | Pool dependent | K8s cluster cap          | Same
Geo-region pinning   | Vendor-side    | Per-cluster              | Per-cluster
Cold spike behavior  | Queue + spike  | Brief spike, predictable | Brief spike

Cold-Start Timeline

sequenceDiagram
    participant Caller
    participant Twilio
    participant K8s
    participant Pod
    participant OpenAI as OpenAI Realtime

    Note over Caller,Twilio: First call after surge
    Caller->>Twilio: SIP INVITE
    Twilio->>K8s: WS connect (cold)
    K8s->>Pod: Schedule + start (8-15s if cold)
    Pod->>OpenAI: WebSocket session.update
    OpenAI-->>Pod: session.created (350-500ms)
    Pod->>Twilio: Audio bridge ready
    Twilio->>Caller: Play hold tone (covers cold)
    Pod->>OpenAI: Initial system audio frame
    OpenAI-->>Pod: First-token audio (280-400ms)
    Pod-->>Twilio: PCM16 24kHz greeting
    Twilio-->>Caller: "Hi, this is..."

Practical Cold-Start Optimization Tips

  • Use a hold tone for the first 600ms. It covers the perceptual gap, and callers read a brief tone as normal call handling (see the sketch after this list).
  • Pre-warm by HPA, not by always-on capacity. Always-on burns money during off-hours.
  • Run prewarmer as a sidecar, not in the main process. Otherwise the first call into a pod pays the prewarmer cost.
  • Pin pods to nodes with NVMe local volumes. Cuts container start time meaningfully on k3s.
  • Use a separate WebSocket pool per vertical. Healthcare and Real Estate have wildly different system prompts; sharing forces re-init.
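
For the hold-tone tip above, one way to do it with Twilio's Python TwiML helper is to queue a short Play before connecting the media stream; the audio URL and stream endpoint below are placeholders:

from twilio.twiml.voice_response import VoiceResponse, Connect

def incoming_call_twiml() -> str:
    """TwiML for an inbound call: brief hold tone, then bridge to the agent's media stream."""
    response = VoiceResponse()
    # ~600ms comfort tone masks prewarm and bridge latency on a cold pod (placeholder URL)
    response.play("https://assets.example.com/hold-tone-600ms.wav")
    connect = Connect()
    connect.stream(url="wss://agent.example.com/twilio/media")  # placeholder stream endpoint
    response.append(connect)
    return str(response)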

FAQ

Why doesn't CallSphere use always-on warm capacity like Vapi?

We do, but only at the floor. The HPA min-replicas is sized for baseline load. Above that, predictive scaling handles surge. Always-on for peak burns capacity 80% of the day.

Does the Realtime API charge for idle connections?

The WebSocket itself is free; you pay per audio second processed. A parked connection with no audio costs nothing.

Can you go below 250ms first-audio?

Yes, with edge regions and aggressive caching, but the user-perceptible threshold is ~300ms. Below that you stop noticing improvements unless the use case is extremely conversational (interview prep, language tutoring).

Is this measured end-to-end or just server-side?

End-to-end from Twilio's first media frame to the first PCM16 frame returned. Excludes carrier-side SIP delay (~50-150ms variable).

What happens during a region outage?

K8s rebalances, prewarmer rebuilds connections in 1-2s per pod, and the surge predictor over-provisions for 5 minutes after a healing event.

Try a Live Cold-Start Test

Run the live demo — the first call you trigger after the page idles is a real cold-start; subsequent calls are warm-pod numbers. /features lists per-vertical latency targets.

