---
title: "Scaling AI Voice Agents to 1000+ Concurrent Calls: Architecture Guide"
description: "Architecture patterns for scaling AI voice agents to 1000+ concurrent calls — horizontal scaling, connection pooling, and queue management."
canonical: https://callsphere.ai/blog/scaling-ai-voice-agents-1000-concurrent-calls
category: "Technical Guides"
tags: ["AI Voice Agent", "Technical Guide", "Scaling", "Architecture", "Kubernetes", "Load Balancing", "Performance"]
author: "CallSphere Team"
published: 2026-04-08T00:00:00.000Z
updated: 2026-06-05T21:06:28.931Z
---

# Scaling AI Voice Agents to 1000+ Concurrent Calls: Architecture Guide

> Architecture patterns for scaling AI voice agents to 1000+ concurrent calls — horizontal scaling, connection pooling, and queue management.

## Ten calls is easy, a thousand is a different animal

A voice agent that handles ten calls on a single pod is a prototype. A voice agent that handles a thousand simultaneous calls is a distributed system with all the problems that come with it — sticky sessions, connection limits, queue back-pressure, graceful drain, regional failover. The transition from ten to a thousand is where most teams ship an outage.

This post walks through the architecture patterns CallSphere uses to scale its voice plane horizontally without losing the sub-second latency budget.

```
1 pod × 20-40 calls  →  horizontal scaling
50-200 pods          →  sticky routing
sticky routing       →  regional failover
regional failover    →  global queue drain
```

## Architecture overview

```
┌──────────────────────────────────────┐
│ Twilio / SIP carriers                │
└────────────────┬─────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│ Global Anycast ingress               │
│ (session affinity by Call SID)       │
└────────────────┬─────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Pod 1   │ │ Pod 2   │ │ Pod N   │
│ 30 calls│ │ 30 calls│ │ 30 calls│
└─────┬───┘ └────┬────┘ └────┬────┘
      │          │           │
      └──────────┴───────────┘
                 │
                 ▼
┌──────────────────────────────────────┐
│ OpenAI Realtime API                  │
│ (org-level concurrent limit)         │
└──────────────────────────────────────┘
```

## Prerequisites

- Kubernetes (or equivalent container orchestrator).
- An ingress that supports WebSocket session affinity.
- Autoscaling based on custom metrics (active calls per pod).
- A global control plane for routing and failover.

## Step-by-step walkthrough

### 1. Right-size the per-pod call count

One FastAPI process can handle 20-40 concurrent Realtime sessions before event-loop contention bites. Use that as your per-pod capacity.

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions
build plus test"]
    REG[("Container registry
GHCR or ECR")]
    HELM["Helm chart
values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment
rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA
active calls per pod"]
    POD[("Voice edge pods
FastAPI")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: voice-edge
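spec:
  replicas: 30
  selector:
    matchLabels: {app: voice-edge}   # required for apps/v1 Deployments
  template:
    metadata:
      labels: {app: voice-edge}      # must match the selector above
    spec:
      containers: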
        - name: edge
          image: ghcr.io/yourco/voice-edge:latest
          resources:
            requests: {cpu: "1", memory: "1Gi"}
            limits: {cpu: "2", memory: "2Gi"}
          readinessProbe:
            httpGet: {path: /ready, port: 8080}
```
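
To hold each pod to the per-pod capacity described at the start of this step, the application can refuse calls once it is full. The following is a minimal sketch, not CallSphere's implementation: `CallSlots` is a hypothetical helper, and the capacity of 30 is an assumed midpoint of the 20-40 range.

```python
class CallSlots:
    """Non-blocking admission control for one pod (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.active = 0

    def try_acquire(self) -> bool:
        # Reject immediately instead of queueing: a live caller cannot wait.
        if self.active >= self.capacity:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        # Clamp at zero so a double release cannot create phantom capacity.
        self.active = max(0, self.active - 1)


# Assumed value: midpoint of the 20-40 sessions-per-process range.
slots = CallSlots(capacity=30)
```

Call `slots.try_acquire()` when a call arrives and `slots.release()` when it ends. Because an asyncio event loop runs handlers on one thread, a plain integer is enough here; a multi-threaded server would need a lock.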

### 2. Use sticky routing keyed by Call SID

```yaml
apiVersion: v1
kind: Service
metadata:
  name: voice-edge
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
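spec:
  type: LoadBalancer                  # the NLB annotation only applies to this type
  selector: {app: voice-edge}         # assumes the pods carry this label
  ports:
    - {port: 8080, targetPort: 8080}  # matches the readiness probe port
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600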
```

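Note that `sessionAffinity: ClientIP` pins by source address, not by Call SID, so treat it as a coarse fallback: carrier traffic can arrive from a small set of addresses and pile onto the same few pods. At an HTTP ingress, use cookie-based affinity where the client honors cookies, and carry the Call SID in a routing header so the ingress can hash on it.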
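
If you control the routing layer, one way to key on the Call SID is rendezvous (highest-random-weight) hashing. This is a swapped-in technique for illustration, not something the setup above provides out of the box. The function name `pick_pod` and the pod names are hypothetical.

```python
import hashlib


def pick_pod(call_sid: str, pods: list[str]) -> str:
    """Map a Call SID to one pod using rendezvous hashing.

    Every router instance computes the same answer without shared state,
    and removing a pod only remaps the calls that were on that pod.
    """
    if not pods:
        raise ValueError("no pods available")

    def weight(pod: str) -> int:
        # Hash the (call, pod) pair; the pod with the highest weight wins.
        digest = hashlib.sha256(f"{call_sid}|{pod}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(pods, key=weight)
```

The useful property is stability: a webhook and the media WebSocket for the same call resolve to the same pod, and scaling the pool up or down leaves most existing assignments untouched.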

### 3. Scale on active calls, not CPU

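CPU is a lagging indicator: a pod can be at its session limit while CPU is still low. Expose an active-calls gauge (`voice_active_calls` below) and scale on it directly. The `Pods` metric type in the HPA also needs a custom metrics adapter, such as Prometheus Adapter, so the autoscaler can read the value.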

```python
from prometheus_client import Gauge
ACTIVE = Gauge("voice_active_calls", "concurrent calls on this pod")

async def on_call_start():
    ACTIVE.inc()

async def on_call_end():
    ACTIVE.dec()
```

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: voice-edge-hpa
spec:
  scaleTargetRef: {apiVersion: apps/v1, kind: Deployment, name: voice-edge}
  minReplicas: 10
  maxReplicas: 200
  metrics:
    - type: Pods
      pods:
        metric: {name: voice_active_calls}
        target: {type: AverageValue, averageValue: "25"}
```
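
To see what the HPA above will do at a given load, it helps to run the numbers. The sketch below is a simplified version of the documented HPA calculation (desired = ceil(current × metric / target), skipped when the ratio is within the default 10% tolerance). It ignores stabilization windows and scaling policies, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(
    current_replicas: int,
    avg_calls_per_pod: float,
    target: float = 25.0,      # averageValue from the HPA manifest
    tolerance: float = 0.1,    # Kubernetes default tolerance
    min_replicas: int = 10,
    max_replicas: int = 200,
) -> int:
    """Simplified HPA math for a Pods metric with an AverageValue target."""
    ratio = avg_calls_per_pod / target
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the autoscaler leaves the count alone.
        return current_replicas
    desired = math.ceil(current_replicas * avg_calls_per_pod / target)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 30 pods averaging 35 calls each scale to 42 pods, while 30 pods averaging 26 calls stay at 30 because the ratio is inside the tolerance band.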

### 4. Implement graceful drain

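On shutdown, stop accepting new calls but keep existing sessions alive until they end or hit a max drain timeout. Set `terminationGracePeriodSeconds` on the pod to at least that timeout; the Kubernetes default of 30 seconds is shorter than most calls, and the pod is killed when it expires.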

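```python
import signal

from fastapi import FastAPI, Request, Response

app = FastAPI()
shutting_down = False

def handle_sigterm(*_):
    # Only flip the flag; in-flight calls keep running until they finish.
    global shutting_down
    shutting_down = True

# Some ASGI servers install their own SIGTERM handler at startup;
# if yours does, set the flag from its shutdown hook instead.
signal.signal(signal.SIGTERM, handle_sigterm)

@app.post("/voice")
async def voice(req: Request):  # typed, so FastAPI injects the request object
    if shutting_down:
        # Refuse new calls so they are routed to another pod.
        return Response(status_code=503)
    # accept_call is the application's own call-setup handler (not shown).
    return accept_call(req)
```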
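
The handler above stops new calls, but something still has to wait for existing sessions and enforce the max drain timeout. Here is a minimal sketch of that wait loop. `wait_for_drain` and its parameters are hypothetical, and `get_active` stands for whatever returns the pod's current call count.

```python
import asyncio
import time
from typing import Callable


async def wait_for_drain(
    get_active: Callable[[], int],
    max_drain_s: float,
    poll_s: float = 1.0,
) -> bool:
    """Wait until no calls remain or the drain timeout expires.

    Returns True if every call ended on its own, False if the
    timeout was reached with calls still active.
    """
    deadline = time.monotonic() + max_drain_s
    while get_active() > 0:
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(poll_s)
    return True
```

Run it from the application's shutdown path after the flag is set. A `False` result means calls are about to be cut, which is worth logging so you can tune the timeout.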

### 5. Handle OpenAI concurrent limits

OpenAI rate-limits concurrent Realtime sessions per org. Track usage in Redis and apply back-pressure at the ingress once you reach the ceiling.

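```python
import redis.asyncio as redis

r = redis.Redis()          # connection settings are deployment-specific
MAX_ORG_CONCURRENT = 100   # placeholder; use your org's actual limit

async def try_reserve_slot() -> bool:
    count = await r.incr("openai:active")  # atomic across pods
    if count > MAX_ORG_CONCURRENT:
        await r.decr("openai:active")      # undo the over-limit increment
        return False
    return True
```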
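
A reservation is only half the story: every reserved slot has to be released when the session ends, including on errors. One way to guarantee that is a context manager. This is a sketch under assumptions: `realtime_slot` and `NoCapacity` are hypothetical names, and `InMemoryCounter` is a stand-in for the Redis client so the example runs without a server.

```python
import asyncio
from contextlib import asynccontextmanager


class InMemoryCounter:
    """Stand-in for a Redis client; implements only incr and decr."""

    def __init__(self) -> None:
        self._values: dict[str, int] = {}

    async def incr(self, key: str) -> int:
        self._values[key] = self._values.get(key, 0) + 1
        return self._values[key]

    async def decr(self, key: str) -> int:
        self._values[key] = self._values.get(key, 0) - 1
        return self._values[key]


class NoCapacity(Exception):
    """Raised when the org-level session limit is already reached."""


@asynccontextmanager
async def realtime_slot(r, limit: int, key: str = "openai:active"):
    count = await r.incr(key)
    if count > limit:
        await r.decr(key)  # roll back before rejecting
        raise NoCapacity(f"{count - 1} sessions already active")
    try:
        yield
    finally:
        await r.decr(key)  # always release, even if the call errors out
```

Wrap the session in `async with realtime_slot(r, MAX_ORG_CONCURRENT):` and turn `NoCapacity` into a 503. Note that a pod that crashes mid-call never runs the `finally` block, so a plain counter can drift upward; a periodic reconciliation or expiring per-call keys would be needed to correct that.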

### 6. Multi-region for disaster recovery

Run the full stack in two regions. Use Twilio's regional endpoints and Anycast DNS for failover.
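
The routing decision behind failover can be small. Below is a sketch of a control-plane function that prefers the caller's home region and falls back to the least-loaded healthy one. The `Region` fields, the function name, and the region names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    healthy: bool
    active_calls: int
    capacity: int


def choose_region(regions: list[Region], preferred: str) -> Optional[str]:
    """Pick a region for a new call, or None if nothing can take it."""
    candidates = [
        r for r in regions if r.healthy and r.active_calls < r.capacity
    ]
    if not candidates:
        return None
    for r in candidates:
        if r.name == preferred:
            return r.name
    # Preferred region is down or full: use the least-utilized alternative.
    return min(candidates, key=lambda r: r.active_calls / r.capacity).name
```

With `Region("us-east", False, 0, 1000)` and `Region("us-west", True, 400, 1000)`, a call that prefers `us-east` is sent to `us-west`. Each region needs enough spare capacity to absorb the other's traffic, or failover just moves the outage.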

## Production considerations

- **Connection pooling**: keep HTTP clients alive across calls; do not recreate per session.
- **Memory**: audio buffers and transcripts grow during long calls; cap them.
- **Queue depth**: post-call workers must drain faster than inflow.
- **Chaos testing**: kill pods under load; make sure ongoing calls survive failover.
- **Observability**: p95 latency per pod, queue depth, OpenAI quota usage.
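
For the memory point above, a fixed-size buffer is usually simpler than cleanup logic. This sketch uses `collections.deque` with `maxlen`, which discards the oldest entry when full. `CallBuffers` is a hypothetical class and both default limits are placeholders to tune per workload.

```python
from collections import deque


class CallBuffers:
    """Per-call buffers that cannot grow without bound."""

    def __init__(
        self,
        max_audio_frames: int = 500,      # placeholder limit
        max_transcript_turns: int = 200,  # placeholder limit
    ) -> None:
        # A deque with maxlen drops the oldest item on overflow.
        self.audio: deque[bytes] = deque(maxlen=max_audio_frames)
        self.transcript: deque[tuple[str, str]] = deque(
            maxlen=max_transcript_turns
        )

    def add_audio(self, frame: bytes) -> None:
        self.audio.append(frame)

    def add_turn(self, speaker: str, text: str) -> None:
        self.transcript.append((speaker, text))
```

If the full transcript is needed after the call, stream turns to durable storage as they arrive and keep only the recent window in memory.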

## CallSphere's real implementation

CallSphere's voice edge runs on Kubernetes with FastAPI pods co-located with Twilio's media regions. Each pod handles 20-40 concurrent Realtime sessions using `gpt-4o-realtime-preview-2025-06-03` at 24kHz PCM16 with server VAD. Autoscaling is driven by the `active_calls` Prometheus metric, graceful drain is wired to SIGTERM, and OpenAI org-level concurrency is tracked in Redis so back-pressure kicks in before the API returns 429s.

The multi-agent verticals (14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10 IT helpdesk tools plus RAG, and the 5-specialist ElevenLabs sales pod) all share the same edge plane, distinguished only by which tool schema they load at session setup. OpenAI Agents SDK handoffs stay inside one session, so scaling doesn't break multi-agent handoffs. CallSphere supports 57+ languages and sub-second end-to-end latency at scale.

## Common pitfalls

- **Scaling on CPU**: you will under-provision under bursty voice load.
- **Re-creating HTTP clients per call**: socket exhaustion.
- **No graceful drain**: rolling deploys will kill live calls.
- **Single region**: a regional outage = full outage.
- **Skipping rate-limit awareness**: you will hit OpenAI 429s in production.

## FAQ

### How many pods do I need for 1000 concurrent calls?

At 25 calls/pod, 1000 calls need 40 pods; with 20% headroom, plan for 48.
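
The same arithmetic as a small helper, so you can plug in your own per-pod capacity. `pods_needed` is a hypothetical name, and headroom is taken as a whole-number percentage to keep the math exact.

```python
import math


def pods_needed(
    concurrent_calls: int,
    calls_per_pod: int,
    headroom_pct: int = 20,
) -> int:
    """Pods required for a target concurrency, including headroom."""
    base = math.ceil(concurrent_calls / calls_per_pod)
    return math.ceil(base * (100 + headroom_pct) / 100)
```

`pods_needed(1000, 25)` gives 48. At the edges of the 20-40 range, the answer moves between 60 pods (20 calls each) and 30 pods (40 calls each), which is why measuring real per-pod capacity matters.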

### What about stateful DB connections?

Use pgbouncer or a managed pool; do not open per-call.

### Can I run this on Fargate or Cloud Run?

Fargate, yes. Cloud Run is a poorer fit: WebSocket connections there are subject to the request timeout, so long-lived sessions can be cut off.

### What is the bottleneck past 1000 calls?

Usually OpenAI quota and DB connections, not CPU.

### How do I test scaling?

Use a WebSocket load generator that simulates Twilio Media Streams.
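
The message side of such a generator can be built with the standard library. The sketch below yields the JSON messages for one simulated call, modeled on the `connected`, `start`, `media`, and `stop` events of Twilio Media Streams with a trimmed field set; check the exact fields against Twilio's current documentation. `simulated_call` is a hypothetical name, and the audio is plain mu-law silence.

```python
import base64
import json
from typing import Iterator

FRAME_MS = 20
FRAME_BYTES = 160                 # 8 kHz mu-law, 1 byte per sample, 20 ms
SILENCE = b"\xff" * FRAME_BYTES   # 0xFF is silence in mu-law encoding


def simulated_call(
    call_sid: str, stream_sid: str, duration_ms: int
) -> Iterator[str]:
    """Yield the JSON messages one simulated call would send, in order."""
    yield json.dumps(
        {"event": "connected", "protocol": "Call", "version": "1.0.0"}
    )
    yield json.dumps({
        "event": "start",
        "streamSid": stream_sid,
        "start": {
            "callSid": call_sid,
            "streamSid": stream_sid,
            "mediaFormat": {
                "encoding": "audio/x-mulaw",
                "sampleRate": 8000,
                "channels": 1,
            },
        },
    })
    payload = base64.b64encode(SILENCE).decode("ascii")
    for i in range(duration_ms // FRAME_MS):
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"timestamp": str(i * FRAME_MS), "payload": payload},
        })
    yield json.dumps({"event": "stop", "streamSid": stream_sid})
```

To drive load, send these over any WebSocket client, sleeping `FRAME_MS` between media frames so the pacing matches a real call, and run many such coroutines concurrently. Silence exercises connection handling and buffering; to stress the speech path, swap in recorded mu-law audio.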

## Next steps

Planning a high-concurrency rollout? [Book a demo](https://callsphere.tech/contact), explore the [technology page](https://callsphere.tech/technology), or compare [pricing](https://callsphere.tech/pricing).


---

Source: https://callsphere.ai/blog/scaling-ai-voice-agents-1000-concurrent-calls
