Technical Guides

Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls

How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs.

TL;DR

Scaling voice AI past 50 concurrent calls is where most stacks break. Vapi abstracts scaling behind a managed plane; you pay per minute and trust the vendor's pool, which usually works but occasionally spikes during platform-wide load. CallSphere runs on k3s with horizontal pod autoscaling, OpenAI Realtime connection pools, and Twilio Media Streams with sticky session routing per call. The Sales platform ships with 5 concurrent outbound calls by default; the broader platform is tuned per vertical to 100+ concurrent inbound calls on commodity hardware.

This post is the SRE-grade walk-through: what breaks, where to put autoscalers, and which knobs are load-bearing.

What "100 Concurrent Calls" Actually Costs

A single concurrent voice call burns:

  • ~48 KB/s model-bound audio per direction (PCM16 at 24 kHz mono is 24,000 samples/s × 2 bytes); with the Twilio media leg included, roughly 120 KB/s per call combined
  • ~2-4 outbound LLM tokens per second of speech
  • One OpenAI Realtime WebSocket session for its lifetime
  • One Twilio Media Stream socket
  • Light CPU (mostly waiting), modest memory (~30-50 MB per call for state)

Multiply by 100 and you have:

  • 12 MB/s aggregate audio
  • 200-400 LLM tokens/s
  • 100 WebSockets to OpenAI
  • 100 WebSockets to Twilio
  • 3-5 GB working memory for state

This is not a heavy workload. The hard part is connection lifecycle correctness under churn.
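The back-of-envelope math above is easy to script. A minimal sketch using the per-call figures from the list (illustrative numbers, not a benchmark):

```python
# Aggregate load for N concurrent calls, from the per-call figures above.
CALLS = 100
AUDIO_KB_S = 120        # combined audio per call, KB/s
TOKENS_S = (2, 4)       # LLM tokens/s per call, low/high
MEM_MB = (30, 50)       # steady-state memory per call, MB

audio_mb_s = CALLS * AUDIO_KB_S / 1000            # aggregate audio, MB/s
tokens_s = tuple(CALLS * t for t in TOKENS_S)     # aggregate tokens/s
mem_gb = tuple(CALLS * m / 1000 for m in MEM_MB)  # working memory, GB

print(audio_mb_s)   # 12.0
print(tokens_s)     # (200, 400)
print(mem_gb)       # (3.0, 5.0)
```

Adjust the constants for your own codec and model choices; the shape of the calculation stays the same.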

Vapi Concurrency Approach

Vapi runs a multi-tenant managed plane. From a customer's perspective:

  • Concurrency limit is implicit; on Team ($99/mo) you typically see soft caps around 50 concurrent
  • Enterprise raises the cap on contract negotiation
  • Pool warmup is hidden; you cannot pre-scale for a known surge
  • Latency spikes correlate with platform-wide load, not just yours

Strengths: zero ops; small teams set it up once and never think about scaling again.

Weaknesses: opaque caps, no surge planning, geographic pinning is vendor-side, capacity contention with other tenants during their surges.

CallSphere Concurrency Approach

CallSphere is a single-tenant or VPC-deployable platform on k3s. The capacity stack is:


Twilio Media Streams ── load balancer ──┐
                                         ▼
                              ┌──────────────────────┐
                              │  Voice Agent Pods    │
                              │  (FastAPI per pod)   │
                              │  HPA min=2 max=20    │
                              └──────────────────────┘
                                         │
                          OpenAI Realtime WS pool (per-pod)
                                         │
                        Postgres (Prisma) + Redis (state cache)

Each pod is sized for 30-50 concurrent calls. With max=20 replicas, the platform handles 600-1000 concurrent. With cluster autoscaling on top, it scales further but the practical sweet spot is 100-200 concurrent per single-cluster deployment.
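A quick way to sanity-check replica counts against the 30-call HPA target and the min/max bounds (a sketch of the arithmetic, not platform code):

```python
import math

def replicas_needed(expected_calls: int, target_per_pod: int = 30,
                    min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Replica count the HPA converges to for a given steady load."""
    wanted = math.ceil(expected_calls / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))

print(replicas_needed(100))   # 4 pods for 100 concurrent calls
print(replicas_needed(1000))  # wants 34, clamped to max=20
```

Note that at max=20 replicas the 50-call hard cap per pod is what yields the 1000-concurrent ceiling, while the 30-call target leaves headroom for surges before pods saturate.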

HPA Tuning That Matters

The naive CPU-based HPA does not work for voice — calls are I/O bound, CPU stays low. CallSphere uses a custom metric: active_calls_per_pod.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_calls_per_pod
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60

Three knobs do all the work:

  • Target = 30 calls/pod with hard cap at 50
  • Scale-up window 30s — fast enough for surge
  • Scale-down 300s — slow enough to avoid thrash
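For the HPA to act on active_calls_per_pod, each pod has to export it in a form Prometheus can scrape. A minimal sketch of the counter logic, assuming a production pod would use prometheus_client.Gauge behind a /metrics endpoint rather than this hand-rolled class:

```python
class ActiveCallGauge:
    """Per-pod counter behind the active_calls_per_pod metric.

    Sketch only: a real pod would register a prometheus_client.Gauge
    and let the Prometheus adapter feed the HPA.
    """

    METRIC = "active_calls_per_pod"

    def __init__(self) -> None:
        self._active = 0

    def call_started(self) -> None:
        self._active += 1

    def call_ended(self) -> None:
        self._active = max(0, self._active - 1)  # never go negative on double-end

    def exposition(self) -> str:
        # Prometheus text exposition format: one gauge sample per scrape
        return f"# TYPE {self.METRIC} gauge\n{self.METRIC} {self._active}"

g = ActiveCallGauge()
g.call_started(); g.call_started(); g.call_ended()
print(g.exposition().splitlines()[-1])  # active_calls_per_pod 1
```

The important property is that the metric tracks call lifecycle events, not request counts; a WebSocket that stays open for ten minutes is still one call.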

Connection Pool Per Pod

Each pod runs an asyncio task that maintains a pool of OpenAI Realtime WebSockets. When a call arrives:

import asyncio

class RealtimePool:
    """Per-pod pool of warm OpenAI Realtime WebSocket sessions.
    RealtimeSession (defined elsewhere) wraps one Realtime WebSocket."""

    def __init__(self, size: int = 8):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> RealtimeSession:
        # 50ms budget: prefer a warm session, never queue behind one
        try:
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            return await self._create_session()

    async def _create_session(self) -> RealtimeSession:
        # Cold path: open a fresh Realtime WebSocket when the pool is drained
        ...

    async def release(self, session: RealtimeSession):
        # Return healthy sessions to the pool; close excess or unhealthy ones
        if self.pool.qsize() < self.target_size and session.healthy:
            await session.reset()
            await self.pool.put(session)
        else:
            await session.close()
The 50ms acquire timeout is intentional — if the pool is exhausted, create rather than queue, and trust HPA to add capacity.
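The create-don't-queue behavior can be exercised end to end with a stub session. FakeSession and StubPool below are illustrative stand-ins; the real RealtimeSession wraps an OpenAI WebSocket:

```python
import asyncio

class FakeSession:
    def __init__(self, warm: bool):
        self.warm = warm          # True if handed out from the pool
        self.healthy = True
    async def reset(self): ...
    async def close(self): ...

class StubPool:
    """Same acquire semantics as RealtimePool, with stub sessions."""
    def __init__(self, size: int = 2):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> FakeSession:
        try:
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            return FakeSession(warm=False)   # cold path: create, don't queue

async def main():
    p = StubPool(size=1)
    await p.pool.put(FakeSession(warm=True))  # one warm session available
    first = await p.acquire()    # warm hit from the pool
    second = await p.acquire()   # pool drained -> cold create after 50ms
    return first.warm, second.warm

print(asyncio.run(main()))  # (True, False)
```

The second acquire pays the 50ms timeout plus session-creation latency exactly once; subsequent calls land on pool capacity that HPA-driven pods have since added.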

Outbound Concurrency Cap

For outbound campaigns (Sales platform), CallSphere caps 5 concurrent by default per campaign. Reason: outbound dial pacing matters for STIR/SHAKEN reputation, and downstream carriers throttle aggressive dialers. The cap is per-campaign, so two simultaneous campaigns with 5 each = 10 total concurrent outbound.
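The per-campaign cap maps naturally onto an asyncio.Semaphore. A minimal sketch, assuming dial() stands in for the real outbound call placement:

```python
import asyncio

class Campaign:
    """Caps in-flight outbound dials at max_concurrent (default 5, as above)."""

    def __init__(self, max_concurrent: int = 5):
        self.slots = asyncio.Semaphore(max_concurrent)
        self._active = 0
        self.peak = 0             # highest observed concurrency

    async def dial(self, lead: str) -> None:
        async with self.slots:    # blocks when 5 dials are already in flight
            self._active += 1
            self.peak = max(self.peak, self._active)
            await asyncio.sleep(0.01)   # stand-in for call setup + talk time
            self._active -= 1

async def run_campaign(leads):
    c = Campaign()
    await asyncio.gather(*(c.dial(lead) for lead in leads))
    return c.peak

peak = asyncio.run(run_campaign([f"lead-{i}" for i in range(20)]))
print(peak)  # 5: twenty leads, never more than five in flight
```

Because the semaphore is owned by the campaign object, two simultaneous campaigns naturally get independent caps of 5 each.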

For inbound, no cap — let HPA do its job.

Vapi vs CallSphere Concurrency Comparison

| Dimension | Vapi | CallSphere |
|---|---|---|
| Default concurrency cap | ~50 (Team) | None on inbound; 5/campaign outbound |
| Scaling approach | Vendor-managed pool | K8s HPA on active_calls_per_pod |
| Visibility | Dashboard | Prometheus + Grafana, full metrics |
| Multi-tenant noise | Yes | Single-tenant or VPC option |
| Surge pre-scale | Not exposed | Predictive HPA + manual override |
| Geographic redundancy | Vendor regions | Per-cluster, you choose regions |
| Cost at 100 concurrent | $0.30-0.33/min × usage | Compute + per-minute LLM + Twilio |
| Failure isolation | Tenant blast radius | Pod-level isolation |

Scaling Architecture

graph TB
    Caller1[Caller 1] --> Twilio
    Caller2[Caller 2] --> Twilio
    CallerN[Caller N] --> Twilio
    Twilio[Twilio Media Streams] --> LB[Ingress LB]
    LB --> Pod1[Voice Pod 1<br/>30 calls]
    LB --> Pod2[Voice Pod 2<br/>28 calls]
    LB --> Pod3[Voice Pod 3<br/>32 calls]
    Pod1 --> Pool1[Realtime WS Pool]
    Pod2 --> Pool2[Realtime WS Pool]
    Pod3 --> Pool3[Realtime WS Pool]
    Pool1 --> OpenAI[OpenAI Realtime API]
    Pool2 --> OpenAI
    Pool3 --> OpenAI
    HPA[HPA Controller] -.->|active_calls_per_pod| Pod1
    HPA -.-> Pod2
    HPA -.-> Pod3
    Metrics[Prometheus] -->|scrape| Pod1
    Metrics --> Pod2
    Metrics --> Pod3
    Metrics --> HPA
    Pod1 --> Redis[(Redis state)]
    Pod2 --> Redis
    Pod3 --> Redis
    Pod1 --> PG[(Postgres)]
    Pod2 --> PG
    Pod3 --> PG

Failure Modes and Their Fixes

  • Pod restart mid-call. Sticky session routing through Twilio's persistent media WS keeps the call attached; state replay from Redis on the new pod completes the recovery.
  • OpenAI Realtime spike. Pool acquire times out, pod creates a new session, retries up to 2x then degrades to TTS-only fallback voice with a static script.
  • Twilio rate-limit on outbound. Campaign throttles to 1 concurrent and emits a Slack alert.
  • Database hot row on call_logs. Writes batched every 250ms per pod, flushed on call end.
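The last fix, 250ms write batching, looks roughly like this per pod. A sketch, assuming flush_fn stands in for the real Prisma/Postgres bulk insert:

```python
import asyncio

class CallLogBatcher:
    """Buffers call_logs rows and flushes them as one bulk write."""

    def __init__(self, flush_fn, interval: float = 0.25):
        self.flush_fn = flush_fn      # async bulk-insert callable
        self.interval = interval      # 250ms cadence from the text above
        self.buffer: list[dict] = []

    def add(self, row: dict) -> None:
        self.buffer.append(row)       # cheap in-memory append, no DB write

    async def flush(self) -> None:
        if self.buffer:
            batch, self.buffer = self.buffer, []
            await self.flush_fn(batch)  # one bulk write per batch

    async def run(self) -> None:
        # Periodic flush loop; flush() is also called directly on call end
        while True:
            await asyncio.sleep(self.interval)
            await self.flush()

async def demo():
    written = []
    async def bulk_insert(rows):
        written.append(len(rows))
    b = CallLogBatcher(bulk_insert)
    for i in range(3):
        b.add({"call_id": i})
    await b.flush()                   # as on call end
    return written

print(asyncio.run(demo()))  # [3]
```

Three log events become one write, which is what keeps the hot call_logs row from serializing 100 concurrent calls behind row locks.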

FAQ

What is the actual single-pod ceiling?

~50 concurrent calls on a 4-vCPU pod. We tested to 80 but quality of voice activity detection degrades above 60.

How do you handle Twilio's 1000 calls/sec API limit on outbound?

Outbound queue with token-bucket rate limiter at 30 calls/sec, plenty of headroom.
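The token-bucket shape is standard: refill at the target rate, spend one token per dial, and have the caller requeue on refusal. A minimal single-threaded sketch (not the production limiter):

```python
import time

class TokenBucket:
    """Paces outbound dials at ~30 calls/s with a burst of 30."""

    def __init__(self, rate: float = 30.0, burst: float = 30.0):
        self.rate = rate              # tokens refilled per second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def try_dial(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller requeues and retries later

bucket = TokenBucket()
allowed = sum(bucket.try_dial() for _ in range(100))
print(allowed)  # ~30: the initial burst drains, then dials are refused
```

Run at 30 calls/s against Twilio's documented outbound ceiling, the limiter leaves a wide safety margin even if several campaigns share the queue.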

Does CallSphere shard by tenant?

At the cluster level, yes — large customers get their own k3s namespace. Within a namespace, pods are pooled across tenants of the same vertical.

What is the per-call memory footprint?

30-50 MB steady, ~100 MB during a tool-heavy multi-handoff turn.

Can you burst beyond cluster capacity?

Yes — overflow campaigns get queued in Redis and replay when capacity opens. Inbound never queues; if all pods are at cap, we add more pods within ~30s.

Plan Your Capacity

The /features page documents per-vertical concurrency defaults, and the /demo interactive flow shows pod-level metrics live during a multi-call test session.
