---
title: "Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls"
description: "How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs."
canonical: https://callsphere.ai/blog/voice-ai-concurrency-scaling-callsphere-vs-vapi
category: "Technical Guides"
tags: ["Voice AI Scaling", "Kubernetes", "Concurrency", "CallSphere", "Vapi", "HPA", "Twilio"]
author: "CallSphere Team"
published: 2026-04-18T00:00:00.000Z
updated: 2026-05-08T22:32:09.630Z
---

# Voice AI Concurrency at Scale: CallSphere vs Vapi 100+ Calls

> How to scale a voice AI platform to 100+ concurrent calls. K8s HPA, OpenAI Realtime pooling, Twilio media streams. CallSphere vs Vapi capacity tradeoffs.

## TL;DR

Scaling voice AI past 50 concurrent calls is where most stacks break. **Vapi** abstracts scaling behind a managed plane; you pay per minute and trust the vendor's pool, which usually works but occasionally spikes during platform-wide load. **CallSphere** runs on **k3s with horizontal pod autoscaling, OpenAI Realtime connection pools, and Twilio Media Streams** with sticky session routing per call. The Sales platform ships with 5 concurrent outbound calls by default; the broader platform is tuned per vertical to 100+ concurrent inbound calls on commodity hardware.

This post is the SRE-grade walk-through: what breaks, where to put autoscalers, and which knobs are load-bearing.

## What "100 Concurrent Calls" Actually Costs

A single concurrent voice call burns:

- ~120 KB/s of combined audio (PCM16 at 24 kHz mono is 48 KB/s per direction on the OpenAI leg, plus ~8 KB/s per direction of 8 kHz μ-law on the Twilio leg)
- ~2-4 outbound LLM tokens per second of speech
- One OpenAI Realtime WebSocket session for its lifetime
- One Twilio Media Stream socket
- Light CPU (mostly waiting), modest memory (~30-50 MB per call for state)

Multiply by 100 and you have:

- 12 MB/s aggregate audio
- 200-400 LLM tokens/s
- 100 WebSockets to OpenAI
- 100 WebSockets to Twilio
- 3-5 GB working memory for state

This is not a heavy workload. The hard part is **connection lifecycle correctness** under churn.
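
A quick back-of-envelope check in Python reproduces those aggregates from the per-call figures (the constants are the estimates above, not measurements):

```python
# Per-call estimates from the list above.
CALLS = 100
AUDIO_KBPS = 120            # combined inbound + outbound audio, KB/s
TOKENS_PER_SEC = (2, 4)     # LLM tokens per second of speech
MEM_MB = (30, 50)           # working state per call, MB

print(f"audio:  {CALLS * AUDIO_KBPS / 1000:.0f} MB/s aggregate")
print(f"tokens: {CALLS * TOKENS_PER_SEC[0]}-{CALLS * TOKENS_PER_SEC[1]} tokens/s")
print(f"memory: {CALLS * MEM_MB[0] / 1000:.0f}-{CALLS * MEM_MB[1] / 1000:.0f} GB state")
# -> 12 MB/s, 200-400 tokens/s, 3-5 GB
```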

## Vapi Concurrency Approach

Vapi runs a multi-tenant managed plane. From a customer's perspective:

- The concurrency limit is implicit; on Team ($99/mo) you typically see soft caps around 50 concurrent calls
- Enterprise raises the cap on contract negotiation
- Pool warmup is hidden; you cannot pre-scale for a known surge
- Latency spikes correlate with platform-wide load, not just yours

**Strengths:** zero ops; small teams can set it and forget it.

**Weaknesses:** opaque caps, no surge planning, geographic pinning is vendor-side, capacity contention with other tenants during their surges.

## CallSphere Concurrency Approach

CallSphere is a single-tenant or VPC-deployable platform on k3s. The capacity stack is:

```
Twilio Media Streams ── load balancer ──┐
                                         ▼
                              ┌──────────────────────┐
                              │  Voice Agent Pods    │
                              │  (FastAPI per pod)   │
                              │  HPA min=2 max=20    │
                              └──────────────────────┘
                                         │
                          OpenAI Realtime WS pool (per-pod)
                                         │
                        Postgres (Prisma) + Redis (state cache)
```

Each pod is sized for **30-50 concurrent calls**. With max=20 replicas, the platform handles 600-1000 concurrent. With cluster autoscaling on top, it scales further but the practical sweet spot is 100-200 concurrent per single-cluster deployment.
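
To see how per-pod sizing maps to replica counts, here is an illustrative calculation using the 30-call HPA target and 50-call hard cap described in the next section:

```python
import math

TARGET_PER_POD = 30  # HPA averageValue target (next section)
HARD_CAP = 50        # per-pod ceiling

for concurrent in (100, 200, 600, 1000):
    settle = math.ceil(concurrent / TARGET_PER_POD)  # where the HPA settles
    floor = math.ceil(concurrent / HARD_CAP)         # fewest pods that can carry it
    print(f"{concurrent:>4} calls -> ~{settle} pods at target (floor {floor})")

# At 1000 calls the target wants 34 pods; with maxReplicas=20 the HPA caps out
# and each pod runs toward its 50-call ceiling instead.
```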

### HPA Tuning That Matters

The naive CPU-based HPA does not work for voice — calls are I/O bound, CPU stays low. CallSphere uses a **custom metric**: `active_calls_per_pod`.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-agent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: voice-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_calls_per_pod
        target:
          type: AverageValue
          averageValue: "30"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 30
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
```

Three knobs do all the work:

- **Target = 30 calls/pod** with hard cap at 50
- **Scale-up window 30s** — fast enough for surge
- **Scale-down 300s** — slow enough to avoid thrash
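
For that metric to exist, each pod has to export its live call count, and a metrics pipeline (for example, Prometheus plus the Prometheus Adapter) has to surface it to the HPA as a Pods metric. A minimal sketch of the pod-side half, assuming `prometheus_client`; the adapter rule mapping `active_calls` to `active_calls_per_pod` is assumed and not shown:

```python
from prometheus_client import Gauge, start_http_server

# Exported as "active_calls"; an adapter rule (assumed) maps it to the
# "active_calls_per_pod" Pods metric the HPA manifest references.
ACTIVE_CALLS = Gauge("active_calls", "Calls currently attached to this pod")

def on_call_connected() -> None:
    ACTIVE_CALLS.inc()

def on_call_ended() -> None:
    ACTIVE_CALLS.dec()

# Expose /metrics for Prometheus to scrape.
start_http_server(9090)
```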

### Connection Pool Per Pod

Each pod runs an asyncio task that maintains a pool of OpenAI Realtime WebSockets. When a call arrives:

```python
import asyncio

# RealtimeSession (defined elsewhere) wraps one OpenAI Realtime WebSocket.
class RealtimePool:
    def __init__(self, size: int = 8):
        self.pool: asyncio.Queue = asyncio.Queue(maxsize=size)
        self.target_size = size

    async def acquire(self) -> RealtimeSession:
        try:
            # Fast path: grab a warm session; wait at most 50 ms.
            return await asyncio.wait_for(self.pool.get(), timeout=0.05)
        except asyncio.TimeoutError:
            # Cold path: pool exhausted, open a fresh Realtime WS.
            return await self._create_session()

    async def release(self, session: RealtimeSession):
        if self.pool.qsize() < self.target_size:
            # Hand the warm session back for reuse by the next call.
            await self.pool.put(session)
        else:
            await session.close()
```

A call handler acquires a session at connect and releases it at hangup. The 50 ms acquire timeout means a cold or exhausted pool falls through to creating a fresh session rather than blocking call setup.

Putting it together, the request path under load looks like this:

```mermaid
flowchart TD
    Caller1[Caller 1] --> Twilio
    Caller2[Caller 2] --> Twilio
    CallerN[Caller N] --> Twilio
    Twilio[Twilio Media Streams] --> LB[Ingress LB]
    LB --> Pod1["Voice Pod 1<br/>30 calls"]
    LB --> Pod2["Voice Pod 2<br/>28 calls"]
    LB --> Pod3["Voice Pod 3<br/>32 calls"]
    Pod1 --> Pool1[Realtime WS Pool]
    Pod2 --> Pool2[Realtime WS Pool]
    Pod3 --> Pool3[Realtime WS Pool]
    Pool1 --> OpenAI[OpenAI Realtime API]
    Pool2 --> OpenAI
    Pool3 --> OpenAI
    HPA[HPA Controller] -.->|active_calls_per_pod| Pod1
    HPA -.-> Pod2
    HPA -.-> Pod3
    Metrics[Prometheus] -->|scrape| Pod1
    Metrics --> Pod2
    Metrics --> Pod3
    Metrics --> HPA
    Pod1 --> Redis[(Redis state)]
    Pod2 --> Redis
    Pod3 --> Redis
    Pod1 --> PG[(Postgres)]
    Pod2 --> PG
    Pod3 --> PG
```

## Failure Modes and Their Fixes

- **Pod restart mid-call.** Sticky session routing through Twilio's persistent media WS keeps the call attached; state replay from Redis on the new pod completes the recovery.
- **OpenAI Realtime spike.** Pool acquire times out, pod creates a new session, retries up to 2x then degrades to TTS-only fallback voice with a static script.
- **Twilio rate-limit on outbound.** Campaign throttles to 1 concurrent and emits a Slack alert.
- **Database hot row on call_logs.** Writes batched every 250ms per pod, flushed on call end.
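
The batching fix in the last item is easy to sketch: one buffer per pod, a 250 ms flush timer, and an immediate flush on call end. A minimal sketch; the `_write` body stands in for a single multi-row INSERT:

```python
import asyncio
from typing import Any

class CallLogBatcher:
    """Buffers call_logs rows; flushes at most every 250 ms."""

    def __init__(self, flush_interval: float = 0.25):
        self.flush_interval = flush_interval
        self._buffer: list[dict[str, Any]] = []
        self._lock = asyncio.Lock()

    async def add(self, row: dict[str, Any]) -> None:
        async with self._lock:
            self._buffer.append(row)

    async def flush_now(self) -> None:
        # Called on call end so the final row lands immediately.
        async with self._lock:
            rows, self._buffer = self._buffer, []
        if rows:
            await self._write(rows)

    async def run(self) -> None:
        while True:
            await asyncio.sleep(self.flush_interval)
            await self.flush_now()

    async def _write(self, rows: list[dict[str, Any]]) -> None:
        ...  # hypothetical: one multi-row INSERT into call_logs
```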

## FAQ

### What is the actual single-pod ceiling?

~50 concurrent calls on a 4-vCPU pod. We tested to 80, but voice activity detection quality degrades above 60.

### How do you handle Twilio's 1000 calls/sec API limit on outbound?

An outbound queue with a token-bucket rate limiter at 30 calls/sec leaves plenty of headroom.
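
At 30 calls/sec, a token bucket is a few lines of asyncio. An illustrative sketch, not the production limiter:

```python
import asyncio
import time

class TokenBucket:
    def __init__(self, rate: float = 30.0, burst: float = 30.0):
        self.rate = rate          # refill rate: 30 tokens (calls) per second
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    async def take(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for one token to accrue.
            await asyncio.sleep((1 - self.tokens) / self.rate)
```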

### Does CallSphere shard by tenant?

At the cluster level, yes — large customers get their own k3s namespace. Within a namespace, pods are pooled across tenants of the same vertical.

### What is the per-call memory footprint?

30-50 MB steady, ~100 MB during a tool-heavy multi-handoff turn.

### Can you burst beyond cluster capacity?

Yes — overflow campaigns get queued in Redis and replay when capacity opens. Inbound never queues; if all pods are at cap, we add more pods within ~30s.
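
A minimal sketch of that overflow path using redis-py's asyncio client; the queue key and `dispatch_outbound_call` dispatcher are hypothetical:

```python
import redis.asyncio as redis

r = redis.Redis()

async def enqueue_overflow(payload: str) -> None:
    # Campaign calls beyond current capacity wait in a Redis list.
    await r.lpush("campaign:overflow", payload)

async def drain_overflow() -> None:
    # Run this worker only while pods report headroom.
    while True:
        _key, payload = await r.brpop("campaign:overflow")
        await dispatch_outbound_call(payload)  # hypothetical dispatcher
```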

## Plan Your Capacity

The [/features](/features) page documents per-vertical concurrency defaults, and the [/demo](/demo) interactive flow shows pod-level metrics live during a multi-call test session.

