Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls
How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs.
TL;DR
Scaling voice AI past 50 concurrent calls is where most stacks break. Vapi abstracts scaling behind a managed plane; you pay per minute and trust the vendor's pool — which usually works, occasionally spikes during platform-wide load. CallSphere runs on k3s with horizontal pod autoscaling, OpenAI Realtime connection pools, and Twilio Media Streams with sticky session routing per call. The Sales platform ships with 5 concurrent outbound by default; the broader platform tunes per vertical to 100+ inbound concurrent on commodity hardware.
This post is the SRE-grade walk-through: what breaks, where to put autoscalers, and which knobs are load-bearing.
What "100 Concurrent Calls" Actually Costs
A single concurrent voice call burns:
- ~48 KB/s of audio per direction (PCM16 at 24 kHz mono); with inbound plus outbound and base64 WebSocket framing, budget ~120 KB/s per call on the wire
- ~2-4 outbound LLM tokens per second of speech
- One OpenAI Realtime WebSocket session for its lifetime
- One Twilio Media Stream socket
- Light CPU (mostly waiting), modest memory (~30-50 MB per call for state)
Multiply by 100 and you have:
- 12 MB/s aggregate audio
- 200-400 LLM tokens/s
- 100 WebSockets to OpenAI
- 100 WebSockets to Twilio
- 3-5 GB working memory for state
This is not a heavy workload. The hard part is connection lifecycle correctness under churn.
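The multiplication above can be sketched as a back-of-envelope calculator. The per-call figures are this post's estimates, not measurements, and the function name is illustrative:

```python
# Back-of-envelope capacity math for N concurrent voice calls,
# using the per-call estimates from this post.

def capacity_estimate(calls: int) -> dict:
    per_call_audio_kbs = 120      # combined in+out audio on the wire, KB/s
    per_call_tokens_s = (2, 4)    # LLM tokens per second of speech
    per_call_mem_mb = (30, 50)    # steady-state call state, MB
    return {
        "audio_mb_s": calls * per_call_audio_kbs / 1000,
        "tokens_s": (calls * per_call_tokens_s[0], calls * per_call_tokens_s[1]),
        "websockets": calls * 2,  # one OpenAI Realtime + one Twilio socket per call
        "memory_gb": (calls * per_call_mem_mb[0] / 1000,
                      calls * per_call_mem_mb[1] / 1000),
    }

print(capacity_estimate(100))
```

Running it for 100 calls reproduces the aggregate figures above.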
Vapi Concurrency Approach
Vapi runs a multi-tenant managed plane. From a customer's perspective:
- Concurrency limit is implicit; on Team ($99/mo) you typically see soft caps around 50 concurrent
- Enterprise raises the cap on contract negotiation
- Pool warmup is hidden; you cannot pre-scale for a known surge
- Latency spikes correlate with platform-wide load, not just yours
Strengths: zero ops; small teams hit this and forget it.
Weaknesses: opaque caps, no surge planning, geographic pinning is vendor-side, capacity contention with other tenants during their surges.
CallSphere Concurrency Approach
CallSphere is a single-tenant or VPC-deployable platform on k3s. The capacity stack is:
Twilio Media Streams ── load balancer ──┐
                                        ▼
                            ┌──────────────────────┐
                            │  Voice Agent Pods    │
                            │  (FastAPI per pod)   │
                            │  HPA min=2 max=20    │
                            └──────────────────────┘
                                        │
                        OpenAI Realtime WS pool (per-pod)
                                        │
                    Postgres (Prisma) + Redis (state cache)
Each pod is sized for 30-50 concurrent calls. With max=20 replicas, the platform handles 600-1000 concurrent. With cluster autoscaling on top, it scales further but the practical sweet spot is 100-200 concurrent per single-cluster deployment.
HPA Tuning That Matters
The naive CPU-based HPA does not work for voice — calls are I/O bound, CPU stays low. CallSphere uses a custom metric: active_calls_per_pod.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_calls_per_pod
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
Three knobs do all the work:
- Target = 30 calls/pod with hard cap at 50
- Scale-up window 30s — fast enough for surge
- Scale-down 300s — slow enough to avoid thrash
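For the HPA to see active_calls_per_pod, each pod has to publish it and a custom-metrics adapter (e.g. prometheus-adapter) has to map the scraped series into the Kubernetes metrics API. A minimal dependency-free sketch of the pod side, rendering the Prometheus text exposition line that would be served from /metrics (class name and wiring are illustrative, not CallSphere's actual code):

```python
import threading

class ActiveCallsGauge:
    """Tracks calls attached to this pod and renders the /metrics line."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        # Called when a Twilio media stream attaches to this pod
        with self._lock:
            self._value += 1

    def dec(self):
        # Called on hangup / stream close
        with self._lock:
            self._value -= 1

    def render(self) -> str:
        # Prometheus text exposition format, one sample per scrape
        return f"active_calls_per_pod {self._value}"

g = ActiveCallsGauge()
g.inc(); g.inc()
print(g.render())  # active_calls_per_pod 2
```

In practice a library like prometheus_client would replace the hand-rolled gauge; the point is that the metric is a plain per-pod counter, averaged by the HPA across replicas.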
Connection Pool Per Pod
Each pod runs an asyncio task that maintains a pool of OpenAI Realtime WebSockets. When a call arrives:
import asyncio

class RealtimePool:
    """Per-pod pool of warm OpenAI Realtime WebSocket sessions.

    RealtimeSession and _create_session wrap the actual WebSocket
    lifecycle and are defined elsewhere in the pod.
    """

    def __init__(self, size: int = 8):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> "RealtimeSession":
        try:
            # Grab a warm session if one frees up within 50 ms...
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            # ...otherwise open a fresh WebSocket rather than queue the caller
            return await self._create_session()

    async def release(self, session: "RealtimeSession"):
        if self.pool.qsize() < self.target_size and session.healthy:
            await session.reset()  # clear conversation state before reuse
            await self.pool.put(session)
        else:
            await session.close()
The 50ms acquire timeout is intentional — if the pool is exhausted, create rather than queue, and trust HPA to add capacity.
Outbound Concurrency Cap
For outbound campaigns (Sales platform), CallSphere caps outbound at 5 concurrent calls per campaign by default. Reason: outbound dial pacing matters for STIR/SHAKEN reputation, and downstream carriers throttle aggressive dialers. The cap is per-campaign, so two simultaneous campaigns with 5 each = 10 total concurrent outbound.
For inbound, no cap — let HPA do its job.
Vapi vs CallSphere Concurrency Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| Default concurrency cap | ~50 (Team) | None on inbound; 5/campaign outbound |
| Scaling approach | Vendor-managed pool | K8s HPA on active_calls_per_pod |
| Visibility | Dashboard | Prometheus + Grafana, full metrics |
| Multi-tenant noise | Yes | Single-tenant or VPC option |
| Surge pre-scale | Not exposed | Predictive HPA + manual override |
| Geographic redundancy | Vendor regions | Per-cluster, you choose regions |
| Cost at 100 concurrent | $0.30-0.33/min × usage | Compute + per-minute LLM + Twilio |
| Failure isolation | Tenant blast radius | Pod-level isolation |
Scaling Architecture
graph TB
Caller1[Caller 1] --> Twilio
Caller2[Caller 2] --> Twilio
CallerN[Caller N] --> Twilio
Twilio[Twilio Media Streams] --> LB[Ingress LB]
LB --> Pod1[Voice Pod 1<br/>30 calls]
LB --> Pod2[Voice Pod 2<br/>28 calls]
LB --> Pod3[Voice Pod 3<br/>32 calls]
Pod1 --> Pool1[Realtime WS Pool]
Pod2 --> Pool2[Realtime WS Pool]
Pod3 --> Pool3[Realtime WS Pool]
Pool1 --> OpenAI[OpenAI Realtime API]
Pool2 --> OpenAI
Pool3 --> OpenAI
HPA[HPA Controller] -.->|active_calls_per_pod| Pod1
HPA -.-> Pod2
HPA -.-> Pod3
Metrics[Prometheus] -->|scrape| Pod1
Metrics --> Pod2
Metrics --> Pod3
Metrics --> HPA
Pod1 --> Redis[(Redis state)]
Pod2 --> Redis
Pod3 --> Redis
Pod1 --> PG[(Postgres)]
Pod2 --> PG
Pod3 --> PG
Failure Modes and Their Fixes
- Pod restart mid-call. Sticky session routing through Twilio's persistent media WS keeps the call attached; state replay from Redis on the new pod completes the recovery.
- OpenAI Realtime spike. Pool acquire times out, pod creates a new session, retries up to 2x then degrades to TTS-only fallback voice with a static script.
- Twilio rate-limit on outbound. Campaign throttles to 1 concurrent and emits a Slack alert.
- Database hot row on call_logs. Writes batched every 250ms per pod, flushed on call end.
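The 250ms write-batching in the last bullet can be sketched as a buffer with a periodic flush task. This is a simplified illustration (the flushed list stands in for a single multi-row INSERT into call_logs):

```python
import asyncio

class BatchedLogWriter:
    """Buffers call_logs rows per pod; flushes every `interval` seconds."""

    def __init__(self, interval: float = 0.25):
        self.interval = interval
        self.buffer: list = []
        self.flushed: list = []  # stand-in for batched DB writes

    def add(self, row: dict):
        self.buffer.append(row)

    def flush(self):
        # One multi-row INSERT instead of a write per event
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []

    async def run(self, stop: asyncio.Event):
        while not stop.is_set():
            await asyncio.sleep(self.interval)
            self.flush()
        self.flush()  # final flush on call end
```

Batching trades up to 250ms of write latency for far fewer row-lock contentions on the hot table.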
FAQ
What is the actual single-pod ceiling?
~50 concurrent calls on a 4-vCPU pod. We tested to 80 but quality of voice activity detection degrades above 60.
How do you handle Twilio's 1000 calls/sec API limit on outbound?
Outbound queue with token-bucket rate limiter at 30 calls/sec, plenty of headroom.
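A token-bucket limiter of that shape is a few lines of code. A minimal sketch, assuming 30 tokens/s refill and one token per outbound API call (class and defaults are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, rate: float = 30.0, burst: float = 30.0):
        self.rate = rate        # tokens added per second
        self.capacity = burst   # max stored tokens
        self.tokens = burst
        self.last = time.monotonic()  # monotonic clock avoids wall-time jumps

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Callers that fail try_acquire() stay in the outbound queue and retry, so the Twilio API never sees more than the configured rate.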
Does CallSphere shard by tenant?
At the cluster level, yes — large customers get their own k3s namespace. Within a namespace, pods are pooled across tenants of the same vertical.
What is the per-call memory footprint?
30-50 MB steady, ~100 MB during a tool-heavy multi-handoff turn.
Can you burst beyond cluster capacity?
Yes — overflow campaigns get queued in Redis and replay when capacity opens. Inbound never queues; if all pods are at cap, we add more pods within ~30s.
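The overflow-and-replay behavior is essentially a FIFO queue drained in capacity-sized chunks. A sketch where a Python deque stands in for the Redis list (the class and method names are illustrative):

```python
from collections import deque

class OverflowQueue:
    """FIFO of queued outbound calls, replayed when capacity opens."""

    def __init__(self):
        self._q: deque = deque()

    def enqueue(self, call_id: str):
        self._q.append(call_id)  # Redis equivalent: push onto a list

    def replay(self, capacity: int) -> list:
        # Pop up to `capacity` queued calls once pods have headroom
        out = []
        while self._q and len(out) < capacity:
            out.append(self._q.popleft())
        return out
```

Backing this with Redis (rather than pod memory) is what lets queued campaigns survive pod restarts.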
Plan Your Capacity
The /features page documents per-vertical concurrency defaults, and the /demo interactive flow shows pod-level metrics live during a multi-call test session.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.