Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls
By Sagar Shankaran, Founder of CallSphere
How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs.
Key takeaways
TL;DR
Scaling voice AI past 50 concurrent calls is where most stacks break. Vapi abstracts scaling behind a managed plane: you pay per minute and trust the vendor's pool, which usually works but occasionally shows latency spikes during platform-wide load. CallSphere runs on k3s with horizontal pod autoscaling, OpenAI Realtime connection pools, and Twilio Media Streams with sticky session routing per call. The Sales platform ships with a default of 5 concurrent outbound calls per campaign; the broader platform is tuned per vertical to 100+ concurrent inbound calls on commodity hardware.
This post is the SRE-grade walk-through: what breaks, where to put autoscalers, and which knobs are load-bearing.
What "100 Concurrent Calls" Actually Costs
A single concurrent voice call burns:
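- ~48 KB/s of raw audio per direction (PCM16 at 24 kHz mono is 24,000 samples/s × 2 bytes); budget ~120 KB/s for inbound + outbound combined to leave room for base64 and framing overhead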
- ~2-4 outbound LLM tokens per second of speech
- One OpenAI Realtime WebSocket session for its lifetime
- One Twilio Media Stream socket
- Light CPU (mostly waiting), modest memory (~30-50 MB per call for state)
Multiply by 100 and you have:
- 12 MB/s aggregate audio
- 200-400 LLM tokens/s
- 100 WebSockets to OpenAI
- 100 WebSockets to Twilio
- 3-5 GB working memory for state
This is not a heavy workload. The hard part is connection lifecycle correctness under churn.
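As a quick sanity check on the multiplication above, here is a minimal sketch that scales the per-call figures to N concurrent calls. The function name `aggregate_load` and its parameter names are illustrative, not part of any CallSphere API; the per-call numbers are the estimates listed above.

```python
def aggregate_load(calls: int,
                   audio_kb_per_s: float = 120.0,
                   tokens_per_s: tuple = (2, 4),
                   memory_mb: tuple = (30, 50)) -> dict:
    """Scale the per-call estimates to `calls` concurrent calls."""
    return {
        # Combined inbound + outbound audio, converted from KB/s to MB/s.
        "audio_mb_per_s": calls * audio_kb_per_s / 1000,
        # (low, high) LLM token rate across all calls.
        "tokens_per_s": (calls * tokens_per_s[0], calls * tokens_per_s[1]),
        # (low, high) working memory for call state, converted from MB to GB.
        "memory_gb": (calls * memory_mb[0] / 1000, calls * memory_mb[1] / 1000),
        # One socket per call on each side.
        "openai_websockets": calls,
        "twilio_websockets": calls,
    }


if __name__ == "__main__":
    print(aggregate_load(100))
```

Changing `calls` to 200 or 500 shows the same conclusion: bandwidth and memory stay modest, while the socket count grows linearly with concurrency.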
Vapi Concurrency Approach
Vapi runs a multi-tenant managed plane. From a customer's perspective:
- Concurrency limit is implicit; on Team ($99/mo) you typically see soft caps around 50 concurrent
- Enterprise raises the cap on contract negotiation
- Pool warmup is hidden; you cannot pre-scale for a known surge
- Latency spikes correlate with platform-wide load, not just yours
Strengths: zero ops; small teams can set this up and forget it.
Weaknesses: opaque caps, no surge planning, geographic pinning is vendor-side, capacity contention with other tenants during their surges.
CallSphere Concurrency Approach
CallSphere is a single-tenant or VPC-deployable platform on k3s. The capacity stack is:
```text
Twilio Media Streams ── load balancer ──┐
                                        ▼
                             ┌──────────────────────┐
                             │  Voice Agent Pods    │
                             │  (FastAPI per pod)   │
                             │  HPA min=2 max=20    │
                             └──────────────────────┘
                                        │
                        OpenAI Realtime WS pool (per-pod)
                                        │
                     Postgres (Prisma) + Redis (state cache)
```
Each pod is sized for 30-50 concurrent calls. With max=20 replicas, that works out to 600-1000 concurrent calls (20 × 30 to 20 × 50). With cluster autoscaling on top, it scales further, but the practical sweet spot is 100-200 concurrent calls per single-cluster deployment.
HPA Tuning That Matters
The naive CPU-based HPA does not work for voice — calls are I/O bound, CPU stays low. CallSphere uses a custom metric: active_calls_per_pod.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_calls_per_pod
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```
Three knobs do all the work:
- Target of 30 calls/pod, leaving headroom below the hard cap of 50
- Scale-up stabilization window of 30s: fast enough for a surge
- Scale-down stabilization window of 300s: slow enough to avoid thrash
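To see how these knobs interact, the sketch below applies the replica formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), followed by the 50%-per-period scale-up policy. It is a simplification under stated assumptions: it ignores the controller's tolerance band and the stabilization windows, and the function names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, avg_calls_per_pod: float,
                     target: float = 30, min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Core HPA formula, clamped to the min/max replica bounds."""
    raw = math.ceil(current_replicas * avg_calls_per_pod / target)
    return max(min_replicas, min(max_replicas, raw))


def apply_scale_up_policy(current_replicas: int, desired: int,
                          percent: int = 50) -> int:
    """Percent policy: grow by at most `percent` of current replicas per period."""
    ceiling = math.ceil(current_replicas * (1 + percent / 100))
    return min(desired, ceiling)


if __name__ == "__main__":
    # Two pods at the 50-call hard cap: the formula asks for 4 replicas,
    # but the 50% policy only allows 3 in the first 30-second period.
    want = desired_replicas(2, 50)
    print(want, apply_scale_up_policy(2, want))
```

The practical consequence is that a surge starting from the minimum of 2 replicas needs several 30-second periods to reach its target, which is one reason the per-pod target sits well below the hard cap.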
Connection Pool Per Pod
Each pod runs an asyncio task that maintains a pool of OpenAI Realtime WebSockets. When a call arrives:
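```python
from __future__ import annotations

import asyncio


class RealtimePool:
    """Per-pod pool of idle sessions. RealtimeSession and _create_session() are not shown here."""

    def __init__(self, size: int = 8):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> RealtimeSession:
        # Wait at most 50 ms for an idle session, then open a fresh one.
        try:
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            return await self._create_session()

    async def release(self, session: RealtimeSession) -> None:
        # Recycle healthy sessions while the pool has room; close everything else.
        if self.pool.qsize() < self.target_size and session.healthy:
            await session.reset()
            try:
                self.pool.put_nowait(session)
                return
            except asyncio.QueueFull:
                pass  # the pool filled up while reset() was awaiting
        await session.close()
```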
The 50ms acquire timeout is intentional — if the pool is exhausted, create rather than queue, and trust HPA to add capacity.
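One way a call handler might use such a pool is through an async context manager, so the session is always released even when the call fails mid-turn. This is a sketch, not CallSphere's actual handler: `leased_session` is a hypothetical helper, and `StubPool` is a stand-in that only mimics the `acquire`/`release` interface for demonstration.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def leased_session(pool):
    """Borrow a session from `pool` and always hand it back."""
    session = await pool.acquire()
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions (e.g. the caller hangs up).
        await pool.release(session)


class StubPool:
    """Demonstration stand-in exposing the same acquire/release interface."""

    def __init__(self):
        self.released = []

    async def acquire(self):
        return "session-1"

    async def release(self, session):
        self.released.append(session)


async def handle_call(pool) -> None:
    async with leased_session(pool) as session:
        # Real code would stream audio through `session` here.
        raise RuntimeError("caller hung up mid-turn")


if __name__ == "__main__":
    pool = StubPool()
    try:
        asyncio.run(handle_call(pool))
    except RuntimeError:
        pass
    print(pool.released)
```

Tying release to the context manager keeps leaked sessions from silently shrinking the pool under churn.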
Outbound Concurrency Cap
For outbound campaigns (Sales platform), CallSphere caps 5 concurrent by default per campaign. Reason: outbound dial pacing matters for STIR/SHAKEN reputation, and downstream carriers throttle aggressive dialers. The cap is per-campaign, so two simultaneous campaigns with 5 each = 10 total concurrent outbound.
For inbound, no cap — let HPA do its job.
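The per-campaign outbound cap can be sketched with one semaphore per campaign. This is an illustration of the idea rather than CallSphere's implementation; `CampaignLimiter`, `dial`, and the `place_call` callback are hypothetical names.

```python
import asyncio


class CampaignLimiter:
    """Caps concurrent outbound calls per campaign (5 by default)."""

    def __init__(self, per_campaign_cap: int = 5):
        self._cap = per_campaign_cap
        self._semaphores = {}

    def _semaphore(self, campaign_id: str) -> asyncio.Semaphore:
        # Create each campaign's semaphore lazily on first use.
        if campaign_id not in self._semaphores:
            self._semaphores[campaign_id] = asyncio.Semaphore(self._cap)
        return self._semaphores[campaign_id]

    async def dial(self, campaign_id: str, place_call):
        # `place_call` is a zero-argument coroutine function supplied by the caller.
        async with self._semaphore(campaign_id):
            return await place_call()
```

Because each campaign has its own semaphore, two campaigns running at once can reach 10 concurrent calls in total while each stays at or below 5.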
Vapi vs CallSphere Concurrency Comparison
| Dimension | Vapi | CallSphere |
|---|---|---|
| Default concurrency cap | ~50 (Team) | None on inbound; 5/campaign outbound |
| Scaling approach | Vendor-managed pool | K8s HPA on active_calls_per_pod |
| Visibility | Dashboard | Prometheus + Grafana, full metrics |
| Multi-tenant noise | Yes | Single-tenant or VPC option |
| Surge pre-scale | Not exposed | Predictive HPA + manual override |
| Geographic redundancy | Vendor regions | Per-cluster, you choose regions |
| Cost at 100 concurrent | $0.30-0.33/min × usage | Compute + per-minute LLM + Twilio |
| Failure isolation | Tenant blast radius | Pod-level isolation |
Scaling Architecture
```mermaid
graph TB
    Caller1[Caller 1] --> Twilio
    Caller2[Caller 2] --> Twilio
    CallerN[Caller N] --> Twilio
    Twilio[Twilio Media Streams] --> LB[Ingress LB]
    LB --> Pod1[Voice Pod 1<br/>30 calls]
    LB --> Pod2[Voice Pod 2<br/>28 calls]
    LB --> Pod3[Voice Pod 3<br/>32 calls]
    Pod1 --> Pool1[Realtime WS Pool]
    Pod2 --> Pool2[Realtime WS Pool]
    Pod3 --> Pool3[Realtime WS Pool]
    Pool1 --> OpenAI[OpenAI Realtime API]
    Pool2 --> OpenAI
    Pool3 --> OpenAI
    HPA[HPA Controller] -.->|active_calls_per_pod| Pod1
    HPA -.-> Pod2
    HPA -.-> Pod3
    Metrics[Prometheus] -->|scrape| Pod1
    Metrics --> Pod2
    Metrics --> Pod3
    Metrics --> HPA
    Pod1 --> Redis[(Redis state)]
    Pod2 --> Redis
    Pod3 --> Redis
    Pod1 --> PG[(Postgres)]
    Pod2 --> PG
    Pod3 --> PG
```
Failure Modes and Their Fixes
- Pod restart mid-call. Sticky session routing through Twilio's persistent media WS keeps the call attached; state replay from Redis on the new pod completes the recovery.
- OpenAI Realtime spike. Pool acquire times out, pod creates a new session, retries up to 2x then degrades to TTS-only fallback voice with a static script.
- Twilio rate-limit on outbound. Campaign throttles to 1 concurrent and emits a Slack alert.
- Database hot row on call_logs. Writes batched every 250ms per pod, flushed on call end.
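The batched-write fix in the last item can be sketched as a small buffer that is flushed on a timer and again on demand at call end. This is a minimal illustration under assumptions: `CallLogBatcher` and the `write_batch` callback are hypothetical names, and the callback stands in for whatever bulk insert the service actually performs.

```python
import asyncio


class CallLogBatcher:
    """Buffers call_logs rows and writes them in batches."""

    def __init__(self, write_batch, interval: float = 0.25):
        self._write_batch = write_batch  # async callable taking a list of rows
        self._interval = interval        # 250 ms by default
        self._buffer = []

    def add(self, row: dict) -> None:
        self._buffer.append(row)

    async def flush(self) -> None:
        # Swap the buffer out first so rows added during the write are kept.
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        await self._write_batch(batch)

    async def run(self) -> None:
        # Periodic flush loop; one final flush when the task is cancelled.
        try:
            while True:
                await asyncio.sleep(self._interval)
                await self.flush()
        finally:
            await self.flush()
```

A call handler would call `add()` during the call and `await flush()` when the call ends, so the final rows do not wait for the next timer tick.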
FAQ
What is the actual single-pod ceiling?
~50 concurrent calls on a 4-vCPU pod. We tested up to 80, but voice activity detection quality degrades above 60.
How do you handle Twilio's 1000 calls/sec API limit on outbound?
Outbound queue with token-bucket rate limiter at 30 calls/sec, plenty of headroom.
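For readers unfamiliar with the technique, a token bucket at 30 calls/sec can be sketched as follows. This is a generic illustration, not CallSphere's limiter: the class name, the burst capacity of 30, and the injectable `clock` parameter (used here to make the behavior testable) are all assumptions.

```python
import time


class TokenBucket:
    """Allows up to `rate` operations per second with bursts up to `capacity`."""

    def __init__(self, rate: float = 30.0, capacity: float = 30.0,
                 clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

An outbound worker would call `try_acquire()` before each dial attempt and leave the call in the queue when it returns `False`.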
Does CallSphere shard by tenant?
At the cluster level, yes — large customers get their own k3s namespace. Within a namespace, pods are pooled across tenants of the same vertical.
What is the per-call memory footprint?
30-50 MB steady, ~100 MB during a tool-heavy multi-handoff turn.
Can you burst beyond cluster capacity?
Yes — overflow campaigns get queued in Redis and replay when capacity opens. Inbound never queues; if all pods are at cap, we add more pods within ~30s.
Plan Your Capacity
The /features page documents per-vertical concurrency defaults, and the /demo interactive flow shows pod-level metrics live during a multi-call test session.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.