Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls
How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs.
TL;DR
Scaling voice AI past 50 concurrent calls is where most stacks break. Vapi abstracts scaling behind a managed plane; you pay per minute and trust the vendor's pool — which usually works, occasionally spikes during platform-wide load. CallSphere runs on k3s with horizontal pod autoscaling, OpenAI Realtime connection pools, and Twilio Media Streams with sticky session routing per call. The Sales platform ships with 5 concurrent outbound by default; the broader platform tunes per vertical to 100+ inbound concurrent on commodity hardware.
This post is the SRE-grade walk-through: what breaks, where to put autoscalers, and which knobs are load-bearing.
What "100 Concurrent Calls" Actually Costs
A single concurrent voice call burns:
- ~48 KB/s of audio per direction (PCM16 at 24 kHz mono); with inbound plus outbound and base64 WebSocket framing, budget ~120 KB/s per call on the wire
- ~2-4 outbound LLM tokens per second of speech
- One OpenAI Realtime WebSocket session for its lifetime
- One Twilio Media Stream socket
- Light CPU (mostly waiting), modest memory (~30-50 MB per call for state)
Multiply by 100 and you have:
- 12 MB/s aggregate audio
- 200-400 LLM tokens/s
- 100 WebSockets to OpenAI
- 100 WebSockets to Twilio
- 3-5 GB working memory for state
This is not a heavy workload. The hard part is connection lifecycle correctness under churn.
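The multiplication above can be sketched as a back-of-envelope calculator. The per-call figures are this post's estimates, not measurements, and the function name is illustrative:

```python
# Back-of-envelope capacity math for N concurrent voice calls,
# using the per-call estimates from this post.

def capacity_estimate(calls: int) -> dict:
    per_call_audio_kbs = 120      # combined in+out audio on the wire, KB/s
    per_call_tokens_s = (2, 4)    # LLM tokens per second of speech
    per_call_mem_mb = (30, 50)    # steady-state call state, MB
    return {
        "audio_mb_s": calls * per_call_audio_kbs / 1000,
        "tokens_s": (calls * per_call_tokens_s[0], calls * per_call_tokens_s[1]),
        "websockets": calls * 2,  # one OpenAI Realtime + one Twilio socket per call
        "memory_gb": (calls * per_call_mem_mb[0] / 1000,
                      calls * per_call_mem_mb[1] / 1000),
    }

print(capacity_estimate(100))
```

Running it for 100 calls reproduces the aggregate figures above.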
Vapi Concurrency Approach
Vapi runs a multi-tenant managed plane. From a customer's perspective:
- Concurrency limit is implicit; on Team ($99/mo) you typically see soft caps around 50 concurrent
- Enterprise raises the cap on contract negotiation
- Pool warmup is hidden; you cannot pre-scale for a known surge
- Latency spikes correlate with platform-wide load, not just yours
Strengths: zero ops; small teams hit this and forget it.
Weaknesses: opaque caps, no surge planning, geographic pinning is vendor-side, capacity contention with other tenants during their surges.
CallSphere Concurrency Approach
CallSphere is a single-tenant or VPC-deployable platform on k3s. The capacity stack is:
Twilio Media Streams ── load balancer ──┐
                                        ▼
                            ┌──────────────────────┐
                            │  Voice Agent Pods    │
                            │  (FastAPI per pod)   │
                            │  HPA min=2 max=20    │
                            └──────────────────────┘
                                        │
                        OpenAI Realtime WS pool (per-pod)
                                        │
                    Postgres (Prisma) + Redis (state cache)
Each pod is sized for 30-50 concurrent calls. With max=20 replicas, the platform handles 600-1000 concurrent. With cluster autoscaling on top, it scales further but the practical sweet spot is 100-200 concurrent per single-cluster deployment.
HPA Tuning That Matters
The naive CPU-based HPA does not work for voice — calls are I/O bound, CPU stays low. CallSphere uses a custom metric: active_calls_per_pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_calls_per_pod
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
Three knobs do all the work:
- Target = 30 calls/pod with hard cap at 50
- Scale-up window 30s — fast enough for surge
- Scale-down 300s — slow enough to avoid thrash
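For the HPA to see active_calls_per_pod, each pod has to publish it and a custom-metrics adapter (e.g. prometheus-adapter) has to map the scraped series into the Kubernetes metrics API. A minimal dependency-free sketch of the pod side, rendering the Prometheus text exposition line that would be served from /metrics (class name and wiring are illustrative, not CallSphere's actual code):

```python
import threading

class ActiveCallsGauge:
    """Tracks calls attached to this pod and renders the /metrics line."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        # Called when a Twilio media stream attaches to this pod
        with self._lock:
            self._value += 1

    def dec(self):
        # Called on hangup / stream close
        with self._lock:
            self._value -= 1

    def render(self) -> str:
        # Prometheus text exposition format, one sample per scrape
        return f"active_calls_per_pod {self._value}"

g = ActiveCallsGauge()
g.inc(); g.inc()
print(g.render())  # active_calls_per_pod 2
```

In practice a library like prometheus_client would replace the hand-rolled gauge; the point is that the metric is a plain per-pod counter, averaged by the HPA across replicas.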
Connection Pool Per Pod
Each pod runs an asyncio task that maintains a pool of OpenAI Realtime WebSockets. When a call arrives:
import asyncio

class RealtimePool:
    """Per-pod pool of warm OpenAI Realtime WebSocket sessions.

    RealtimeSession and _create_session wrap the actual WebSocket
    lifecycle and are defined elsewhere in the pod.
    """

    def __init__(self, size: int = 8):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> "RealtimeSession":
        try:
            # Grab a warm session if one frees up within 50 ms...
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            # ...otherwise open a fresh WebSocket rather than queue the caller
            return await self._create_session()

    async def release(self, session: "RealtimeSession"):
        if self.pool.qsize() < self.target_size and session.healthy:
            await session.reset()  # clear conversation state before reuse
            await self.pool.put(session)
        else:
            await session.close()
The 50ms acquire timeout is intentional — if the pool is exhausted, create rather than queue, and trust HPA to add capacity.
Outbound Concurrency Cap
For outbound campaigns (Sales platform), CallSphere caps outbound at 5 concurrent calls per campaign by default. Reason: outbound dial pacing matters for STIR/SHAKEN reputation, and downstream carriers throttle aggressive dialers. The cap is per-campaign, so two simultaneous campaigns with 5 each = 10 total concurrent outbound.
For inbound, no cap — let HPA do its job.
Vapi vs CallSphere Concurrency Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| Default concurrency cap | ~50 (Team) | None on inbound; 5/campaign outbound |
| Scaling approach | Vendor-managed pool | K8s HPA on active_calls_per_pod |
| Visibility | Dashboard | Prometheus + Grafana, full metrics |
| Multi-tenant noise | Yes | Single-tenant or VPC option |
| Surge pre-scale | Not exposed | Predictive HPA + manual override |
| Geographic redundancy | Vendor regions | Per-cluster, you choose regions |
| Cost at 100 concurrent | $0.30-0.33/min × usage | Compute + per-minute LLM + Twilio |
| Failure isolation | Tenant blast radius | Pod-level isolation |
Scaling Architecture
graph TB
Caller1[Caller 1] --> Twilio
Caller2[Caller 2] --> Twilio
CallerN[Caller N] --> Twilio
Twilio[Twilio Media Streams] --> LB[Ingress LB]
LB --> Pod1[Voice Pod 1<br/>30 calls]
LB --> Pod2[Voice Pod 2<br/>28 calls]
LB --> Pod3[Voice Pod 3<br/>32 calls]
Pod1 --> Pool1[Realtime WS Pool]
Pod2 --> Pool2[Realtime WS Pool]
Pod3 --> Pool3[Realtime WS Pool]
Pool1 --> OpenAI[OpenAI Realtime API]
Pool2 --> OpenAI
Pool3 --> OpenAI
HPA[HPA Controller] -.->|active_calls_per_pod| Pod1
HPA -.-> Pod2
HPA -.-> Pod3
Metrics[Prometheus] -->|scrape| Pod1
Metrics --> Pod2
Metrics --> Pod3
Metrics --> HPA
Pod1 --> Redis[(Redis state)]
Pod2 --> Redis
Pod3 --> Redis
Pod1 --> PG[(Postgres)]
Pod2 --> PG
Pod3 --> PG
Failure Modes and Their Fixes
- Pod restart mid-call. Sticky session routing through Twilio's persistent media WS keeps the call attached; state replay from Redis on the new pod completes the recovery.
- OpenAI Realtime spike. Pool acquire times out, pod creates a new session, retries up to 2x then degrades to TTS-only fallback voice with a static script.
- Twilio rate-limit on outbound. Campaign throttles to 1 concurrent and emits a Slack alert.
- Database hot row on call_logs. Writes batched every 250ms per pod, flushed on call end.
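The 250ms write-batching in the last bullet can be sketched as a buffer with a periodic flush task. This is a simplified illustration (the flushed list stands in for a single multi-row INSERT into call_logs):

```python
import asyncio

class BatchedLogWriter:
    """Buffers call_logs rows per pod; flushes every `interval` seconds."""

    def __init__(self, interval: float = 0.25):
        self.interval = interval
        self.buffer: list = []
        self.flushed: list = []  # stand-in for batched DB writes

    def add(self, row: dict):
        self.buffer.append(row)

    def flush(self):
        # One multi-row INSERT instead of a write per event
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []

    async def run(self, stop: asyncio.Event):
        while not stop.is_set():
            await asyncio.sleep(self.interval)
            self.flush()
        self.flush()  # final flush on call end
```

Batching trades up to 250ms of write latency for far fewer row-lock contentions on the hot table.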
FAQ
What is the actual single-pod ceiling?
~50 concurrent calls on a 4-vCPU pod. We tested to 80 but quality of voice activity detection degrades above 60.
How do you handle Twilio's 1000 calls/sec API limit on outbound?
Outbound queue with token-bucket rate limiter at 30 calls/sec, plenty of headroom.
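A token-bucket limiter of that shape is a few lines of code. A minimal sketch, assuming 30 tokens/s refill and one token per outbound API call (class and defaults are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate: float = 30.0, burst: float = 30.0):
        self.rate = rate        # tokens added per second
        self.capacity = burst   # max stored tokens
        self.tokens = burst
        self.last = time.monotonic()  # monotonic clock avoids wall-time jumps

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Callers that fail try_acquire() stay in the outbound queue and retry, so the Twilio API never sees more than the configured rate.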
Does CallSphere shard by tenant?
At the cluster level, yes — large customers get their own k3s namespace. Within a namespace, pods are pooled across tenants of the same vertical.
What is the per-call memory footprint?
30-50 MB steady, ~100 MB during a tool-heavy multi-handoff turn.
Can you burst beyond cluster capacity?
Yes — overflow campaigns get queued in Redis and replay when capacity opens. Inbound never queues; if all pods are at cap, we add more pods within ~30s.
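The overflow-and-replay behavior is essentially a FIFO queue drained in capacity-sized chunks. A sketch where a Python deque stands in for the Redis list (the class and method names are illustrative):

```python
from collections import deque

class OverflowQueue:
    """FIFO of queued outbound calls, replayed when capacity opens."""

    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, call_id: str):
        self._q.append(call_id)  # Redis equivalent: push onto a list

    def replay(self, capacity: int) -> list:
        # Pop up to `capacity` queued calls once pods have headroom
        out = []
        while self._q and len(out) < capacity:
            out.append(self._q.popleft())
        return out
```

Backing this with Redis (rather than pod memory) is what lets queued campaigns survive pod restarts.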
Plan Your Capacity
The /features page documents per-vertical concurrency defaults, and the /demo interactive flow shows pod-level metrics live during a multi-call test session.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.