Scaling AI Voice Agents to 1000+ Concurrent Calls: Architecture Guide
Architecture patterns for scaling AI voice agents to 1000+ concurrent calls — horizontal scaling, connection pooling, and queue management.
Ten calls is easy; a thousand is a different animal
A voice agent that handles ten calls on a single pod is a prototype. A voice agent that handles a thousand simultaneous calls is a distributed system with all the problems that come with it — sticky sessions, connection limits, queue back-pressure, graceful drain, regional failover. The transition from ten to a thousand is where most teams ship an outage.
This post walks through the architecture patterns CallSphere uses to scale its voice plane horizontally without losing the sub-second latency budget.
Scaling path: 1 pod × 20-40 calls → horizontal scaling (50-200 pods) → sticky routing → regional failover → global queue drain
Architecture overview
┌──────────────────────────────────────┐
│        Twilio / SIP carriers         │
└────────────────┬─────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│        Global Anycast ingress        │
│    (session affinity by Call SID)    │
└────────────────┬─────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│  Pod 1  │ │  Pod 2  │ │  Pod N  │
│ 30 calls│ │ 30 calls│ │ 30 calls│
└─────┬───┘ └────┬────┘ └────┬────┘
      │          │           │
      └──────────┴───────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│         OpenAI Realtime API          │
│     (org-level concurrent limit)     │
└──────────────────────────────────────┘
Prerequisites
- Kubernetes (or equivalent container orchestrator).
- An ingress that supports WebSocket session affinity.
- Autoscaling based on custom metrics (active calls per pod).
- A global control plane for routing and failover.
Step-by-step walkthrough
1. Right-size the per-pod call count
One FastAPI process can handle 20-40 concurrent Realtime sessions before event-loop contention bites. Use that as your per-pod capacity.
flowchart LR
GIT(["Git push"])
CI["GitHub Actions<br/>build plus test"]
REG[("Container registry<br/>GHCR or ECR")]
HELM["Helm chart<br/>values per env"]
K8S{"Kubernetes cluster"}
DEP["Deployment<br/>rolling update"]
SVC["Service plus Ingress"]
HPA["HPA<br/>CPU and queue depth"]
POD[("Inference pods<br/>GPU node pool")]
USERS(["Production traffic"])
GIT --> CI --> REG --> HELM --> K8S
K8S --> DEP --> POD
K8S --> SVC --> POD
K8S --> HPA --> POD
SVC --> USERS
style CI fill:#4f46e5,stroke:#4338ca,color:#fff
style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
style USERS fill:#059669,stroke:#047857,color:#fff
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-edge
spec:
  replicas: 30
  # apps/v1 requires a selector that matches the pod template labels.
  selector:
    matchLabels: {app: voice-edge}
  template:
    metadata:
      labels: {app: voice-edge}
    spec:
      containers:
        - name: edge
          image: ghcr.io/yourco/voice-edge:latest
          resources:
            requests: {cpu: "1", memory: "1Gi"}
            limits: {cpu: "2", memory: "2Gi"}
          readinessProbe:
            httpGet: {path: /ready, port: 8080}
2. Use sticky routing keyed by Call SID
apiVersion: v1
kind: Service
metadata:
  name: voice-edge
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
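Note that ClientIP affinity pins traffic by source IP, which only approximates per-call stickiness. For HTTP ingress, use cookie-based affinity and include the Call SID in the routing header.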
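One way to key routing on the Call SID itself is rendezvous (highest-random-weight) hashing in the routing layer. The sketch below is illustrative only: `pick_pod` and the pod names are hypothetical, and it assumes the router already has a current list of ready pods.

```python
import hashlib


def pick_pod(call_sid: str, pods: list[str]) -> str:
    """Return the pod that owns this call, using rendezvous hashing.

    Every router instance computes the same answer for the same Call SID,
    and removing a pod only remaps the calls that were on that pod.
    """
    if not pods:
        raise ValueError("no ready pods")

    def score(pod: str) -> int:
        # Hash the (pod, call) pair; the pod with the highest score wins.
        digest = hashlib.sha256(f"{pod}|{call_sid}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(pods, key=score)
```

Because the result depends only on the Call SID and the pod set, the webhook request and the later media WebSocket for the same call resolve to the same pod without shared state.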
3. Scale on active calls, not CPU
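CPU is a lagging indicator. Expose a per-pod active-calls gauge (voice_active_calls below) and scale on it directly. Pods metrics in autoscaling/v2 are read through the custom metrics API, so the cluster needs an adapter (for example Prometheus Adapter) that serves this metric.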
from prometheus_client import Gauge

ACTIVE = Gauge("voice_active_calls", "concurrent calls on this pod")

async def on_call_start():
    ACTIVE.inc()

async def on_call_end():
    ACTIVE.dec()
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-edge-hpa
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: voice-edge}
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric: {name: voice_active_calls}
        target: {type: AverageValue, averageValue: "25"}
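To see what this target implies, the HPA's core calculation for an AverageValue target can be approximated as the total metric value divided by the target, rounded up and clamped to the replica bounds. The function below is a simplified sketch, not the controller's code: `desired_replicas` is a hypothetical helper, and it ignores the HPA's tolerance band and stabilization windows.

```python
def desired_replicas(total_active_calls: int, target_per_pod: int = 25,
                     min_replicas: int = 10, max_replicas: int = 200) -> int:
    """Approximate the HPA result for an AverageValue target on a Pods metric."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    # Integer ceiling of total / target, avoiding float rounding.
    wanted = -(-total_active_calls // target_per_pod)
    # The HPA never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, wanted))
```

With the manifest's values, `desired_replicas(1000)` gives 40, and anything above 5,000 concurrent calls is capped at the 200-replica ceiling.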
4. Implement graceful drain
On shutdown, stop accepting new calls but keep existing sessions alive until they end or hit a max drain timeout.
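import signal

from fastapi import FastAPI, Request, Response

app = FastAPI()
shutting_down = False

def handle_sigterm(*_):
    # Only flip the flag; in-flight sessions keep running until they end.
    global shutting_down
    shutting_down = True

# Assumes the ASGI server leaves this handler in place; some servers install their own.
signal.signal(signal.SIGTERM, handle_sigterm)

@app.post("/voice")
async def voice(req: Request):
    if shutting_down:
        # Refuse new calls while draining.
        return Response(status_code=503)
    return accept_call(req)  # accept_call: the application's own call-setup handler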
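The "max drain timeout" half of this step can be sketched with plain asyncio. This is a minimal illustration under assumptions: `SessionTracker` is a hypothetical helper, and it assumes each live call runs as its own asyncio task that is registered when the call starts.

```python
import asyncio


class SessionTracker:
    """Tracks live call tasks so shutdown can wait for them."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def track(self, task: asyncio.Task) -> None:
        self._tasks.add(task)
        # Forget the task automatically once the call ends.
        task.add_done_callback(self._tasks.discard)

    async def drain(self, timeout: float) -> int:
        """Wait up to `timeout` seconds for live calls, then cancel the rest.

        Returns the number of calls that had to be cancelled.
        """
        if not self._tasks:
            return 0
        _, pending = await asyncio.wait(set(self._tasks), timeout=timeout)
        for task in pending:
            task.cancel()
        if pending:
            # Let cancelled tasks finish unwinding before the process exits.
            await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

Whatever timeout is chosen has to fit inside the pod's terminationGracePeriodSeconds, because Kubernetes sends SIGKILL once that period expires (30 seconds by default).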
5. Handle OpenAI concurrent limits
OpenAI rate-limits concurrent Realtime sessions per org. Track usage in Redis and back-pressure at the ingress if you are at the ceiling.
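import os
import redis.asyncio as redis

r = redis.Redis()  # connection settings omitted
MAX_ORG_CONCURRENT = int(os.environ["MAX_ORG_CONCURRENT"])  # your org's session ceiling

async def try_reserve_slot() -> bool:
    # INCR is atomic, so two pods cannot both claim the last slot.
    count = await r.incr("openai:active")
    if count > MAX_ORG_CONCURRENT:
        await r.decr("openai:active")  # over the ceiling: give the slot back
        return False
    return True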
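A reserved slot also has to be returned when the call ends, or the counter drifts upward until every call is rejected. One way to make that hard to forget is an async context manager. This is a sketch, not CallSphere's code: `reserved_slot` is a hypothetical name, and `r` is assumed to be any client with awaitable `incr`/`decr` methods, such as `redis.asyncio.Redis`.

```python
from contextlib import asynccontextmanager


@asynccontextmanager
async def reserved_slot(r, limit: int, key: str = "openai:active"):
    """Yield True if a slot was reserved, False if the org is at its ceiling."""
    count = await r.incr(key)
    if count > limit:
        # Over the ceiling: undo the increment and report failure.
        await r.decr(key)
        yield False
        return
    try:
        yield True
    finally:
        # Runs on normal hang-up and on exceptions, so the slot is not leaked.
        await r.decr(key)
```

A call handler would wrap the whole session in `async with reserved_slot(r, limit) as ok:` and reject the call when `ok` is false. A pod that is killed outright never runs the `finally` block, so a production version would likely need some reconciliation of the counter; that part is not shown here.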
6. Multi-region for disaster recovery
Run the full stack in two regions. Use Twilio's regional endpoints and Anycast DNS for failover.
Production considerations
- Connection pooling: keep HTTP clients alive across calls; do not recreate per session.
- Memory: audio buffers and transcripts grow during long calls; cap them.
- Queue depth: post-call workers must drain faster than inflow.
- Chaos testing: kill pods under load; make sure ongoing calls survive failover.
- Observability: p95 latency per pod, queue depth, OpenAI quota usage.
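For the memory point above, a bounded container is usually enough to keep a long call from growing without limit. The class below is a minimal sketch: `BoundedTranscript` and its 200-turn default are illustrative assumptions, not values from CallSphere's implementation.

```python
from collections import deque


class BoundedTranscript:
    """Keeps only the most recent turns of a call transcript."""

    def __init__(self, max_turns: int = 200) -> None:
        self._turns: deque[str] = deque(maxlen=max_turns)
        self.dropped = 0  # how many old turns have been evicted

    def append(self, turn: str) -> None:
        if len(self._turns) == self._turns.maxlen:
            self.dropped += 1  # the oldest turn is about to be evicted
        self._turns.append(turn)

    def recent(self) -> list[str]:
        return list(self._turns)
```

The same `deque(maxlen=...)` pattern applies to audio frame buffers; exporting the `dropped` count as a metric shows how often the cap is actually hit.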
CallSphere's real implementation
CallSphere's voice edge runs on Kubernetes with FastAPI pods co-located with Twilio's media regions. Each pod handles 20-40 concurrent Realtime sessions using gpt-4o-realtime-preview-2025-06-03 at 24kHz PCM16 with server VAD. Autoscaling is driven by the active_calls Prometheus metric, graceful drain is wired to SIGTERM, and OpenAI org-level concurrency is tracked in Redis so back-pressure kicks in before the API returns 429s.
The multi-agent verticals — 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10 plus RAG IT helpdesk tools, and the 5-specialist ElevenLabs sales pod — all share the same edge plane, distinguished only by which tool schema they load at session setup. OpenAI Agents SDK handoffs stay inside one session, so scaling doesn't break multi-agent handoffs. CallSphere supports 57+ languages and sub-second end-to-end latency at scale.
Common pitfalls
- Scaling on CPU: you will under-provision under bursty voice load.
- Re-creating HTTP clients per call: socket exhaustion.
- No graceful drain: rolling deploys will kill live calls.
- Single region: a regional outage = full outage.
- Skipping rate-limit awareness: you will hit OpenAI 429s in production.
FAQ
How many pods do I need for 1000 concurrent calls?
At 25 calls/pod, about 40 pods; with 20% headroom, plan for roughly 48.
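The arithmetic behind that answer, as a small helper (`pods_needed` is a hypothetical name; the 25 calls/pod and 20% headroom defaults come from the answer above):

```python
def pods_needed(concurrent_calls: int, calls_per_pod: int = 25,
                headroom_pct: int = 20) -> int:
    """Pods required for a given call volume, rounded up, plus headroom."""
    # Integer ceiling division avoids float rounding surprises.
    base = -(-concurrent_calls // calls_per_pod)
    return -(-base * (100 + headroom_pct) // 100)


print(pods_needed(1000))  # 40 base pods, 48 with 20% headroom
```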
What about stateful DB connections?
Use pgbouncer or a managed pool; do not open per-call.
Can I run this on Fargate or Cloud Run?
Fargate yes. Cloud Run is a poor fit: WebSocket connections there are subject to the request timeout, so long-lived call sessions can be cut off.
What is the bottleneck past 1000 calls?
Usually OpenAI quota and DB connections, not CPU.
How do I test scaling?
Use a WebSocket load generator that simulates Twilio Media Streams.
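The message side of such a generator can be sketched without any network code. The function below yields JSON messages shaped like a Media Streams session (connected, start, media frames, stop). Treat the field set as an assumption: it is a simplified subset that should be checked against Twilio's Media Streams documentation, and `fake_media_stream` is a hypothetical name.

```python
import base64
import json
from typing import Iterator

FRAME_BYTES = 160  # 20 ms of 8 kHz, 8-bit mu-law audio


def fake_media_stream(call_sid: str, stream_sid: str, frames: int) -> Iterator[str]:
    """Yield JSON messages shaped like a Twilio Media Streams session."""
    yield json.dumps({"event": "connected", "protocol": "Call", "version": "1.0.0"})
    yield json.dumps({
        "event": "start",
        "streamSid": stream_sid,
        "start": {
            "callSid": call_sid,
            "streamSid": stream_sid,
            "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1},
        },
    })
    # 0xFF is silence in mu-law; real tests may prefer recorded speech.
    silence = base64.b64encode(b"\xff" * FRAME_BYTES).decode("ascii")
    for i in range(frames):
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"track": "inbound", "timestamp": str(i * 20), "payload": silence},
        })
    yield json.dumps({"event": "stop", "streamSid": stream_sid})
```

A load test would open one WebSocket per simulated call, send these messages with a 20 ms pause between media frames, and ramp the number of connections while watching latency and the active-calls metric.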
Next steps
Planning a high-concurrency rollout? Book a demo, explore the technology page, or compare pricing.