Horizontal Scaling for AI Agents: Running Thousands of Concurrent Agent Sessions

Why Horizontal Scaling Matters for AI Agents

A single AI agent server can typically handle 50 to 200 concurrent conversations before response latency degrades. Each conversation involves holding context in memory, making LLM API calls that block for seconds, and streaming responses back to clients. Vertical scaling — adding more CPU and RAM to one machine — hits a ceiling quickly because the bottleneck is I/O-bound concurrency, not raw compute.

Horizontal scaling adds more server instances behind a load balancer so that thousands of concurrent sessions are distributed across a fleet. The challenge is that AI agent conversations are stateful — each turn depends on the history of previous turns. Designing around this statefulness is the core engineering problem.

Stateless Agent Design

The first principle is to externalize all conversation state. Instead of holding session data in memory on the server process, persist it to a shared store like Redis or a database after every turn:

flowchart LR
    USERS(["Traffic"])
    LB["Geo LB plus<br/>Anycast"]
    EDGE["Edge cache plus<br/>rate limit"]
    APP["Stateless app pods<br/>HPA on QPS"]
    QUEUE[(Async work queue)]
    WORKER["Worker pool<br/>GPU or CPU"]
    CACHE[("Redis cache<br/>LLM responses")]
    DB[("Read replicas<br/>and primary")]
    OBS[(Observability)]
    USERS --> LB --> EDGE --> APP
    APP --> CACHE
    APP --> QUEUE --> WORKER
    APP --> DB
    APP --> OBS
    style LB fill:#4f46e5,stroke:#4338ca,color:#fff
    style WORKER fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style CACHE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#0ea5e9,stroke:#0369a1,color:#fff

import json

import redis.asyncio as redis  # async client, so Redis calls do not block the event loop
from fastapi import FastAPI, Request

app = FastAPI()
r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

SESSION_TTL = 3600  # 1 hour

async def get_session(session_id: str) -> dict:
    # Load the stored conversation, or start an empty one for a new session
    data = await r.get(f"agent:session:{session_id}")
    if data is None:
        return {"messages": [], "metadata": {}}
    return json.loads(data)

async def save_session(session_id: str, session: dict):
    # Write the session back and refresh its TTL
    await r.setex(
        f"agent:session:{session_id}",
        SESSION_TTL,
        json.dumps(session),
    )

@app.post("/chat/{session_id}")
async def chat(session_id: str, request: Request):
    body = await request.json()
    session = await get_session(session_id)
    session["messages"].append({"role": "user", "content": body["message"]})

    # Call the LLM with the full conversation history.
    # call_agent is the application's own async LLM wrapper (not shown here).
    response = await call_agent(session["messages"], session["metadata"])

    session["messages"].append({"role": "assistant", "content": response})
    await save_session(session_id, session)
    return {"response": response}

With this pattern, any server instance can handle any request for any session. The server itself holds no state between requests — it reads state from Redis, processes the turn, and writes state back.
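
To make this concrete, the sketch below runs two hypothetical server instances against one shared store. `InMemoryStore` is an illustrative stand-in for Redis (it is not a Redis API), and `echo_agent` is a placeholder for the real LLM call. Both names are assumptions made for this example only.

```python
import asyncio
import json
from typing import Optional


class InMemoryStore:
    """Illustrative stand-in for a shared store such as Redis (hypothetical)."""

    def __init__(self) -> None:
        self._data = {}

    async def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set(self, key: str, value: str) -> None:
        self._data[key] = value


async def echo_agent(messages: list) -> str:
    """Placeholder for the LLM call: reports how many user turns it has seen."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"turn {user_turns}"


class AgentInstance:
    """One stateless server process: it keeps no session data between calls."""

    def __init__(self, name: str, store: InMemoryStore) -> None:
        self.name = name
        self.store = store

    async def handle_turn(self, session_id: str, message: str) -> str:
        key = f"agent:session:{session_id}"
        raw = await self.store.get(key)
        session = json.loads(raw) if raw else {"messages": [], "metadata": {}}
        session["messages"].append({"role": "user", "content": message})
        reply = await echo_agent(session["messages"])
        session["messages"].append({"role": "assistant", "content": reply})
        await self.store.set(key, json.dumps(session))
        return reply


async def demo() -> list:
    store = InMemoryStore()
    pod_a = AgentInstance("pod-a", store)
    pod_b = AgentInstance("pod-b", store)
    # Consecutive turns of the same session land on different instances.
    return [
        await pod_a.handle_turn("s1", "hello"),
        await pod_b.handle_turn("s1", "and again"),
    ]


replies = asyncio.run(demo())
```

The second reply counts two user turns even though `pod-b` never saw the first message, because the history came from the shared store. One thing this sketch does not address: two requests for the same session arriving at the same time could overwrite each other's write, so a production version would likely need a per-session lock or an optimistic check.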

Load Balancer Configuration

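For stateless agent servers, round-robin or least-connections load balancing works well. An established WebSocket connection stays on one pod for its lifetime, but if your agent streams over WebSockets and a client has to reconnect mid-response, you need session affinity (sticky sessions) to send it back to the same pod for the duration of that streaming response: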

# Kubernetes Ingress with sticky sessions for WebSocket
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "agent-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000

The cookie-based affinity ensures a client reconnects to the same pod during an active streaming session, while new sessions are distributed evenly across the fleet.
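
For intuition about how new sessions get spread across the fleet, here is a minimal sketch of least-connections selection. The `Backend` class and the pod names are hypothetical; a real load balancer tracks connection counts itself, so this only illustrates the selection rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    active_connections: int = 0


def pick_least_connections(backends: List[Backend]) -> Backend:
    """Return the backend with the fewest open connections (ties: first listed)."""
    return min(backends, key=lambda b: b.active_connections)


def route_new_session(backends: List[Backend]) -> str:
    """Assign a new session to the least-loaded backend and record the connection."""
    chosen = pick_least_connections(backends)
    chosen.active_connections += 1
    return chosen.name


pods = [Backend("pod-a", 12), Backend("pod-b", 3), Backend("pod-c", 7)]
first = route_new_session(pods)   # pod-b has the fewest connections (3)
second = route_new_session(pods)  # pod-b again: 4 is still below 7 and 12
```

Least-connections tends to suit agent traffic because sessions vary widely in length. Round-robin counts requests rather than open connections, so it can stack several long conversations on one pod.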

Auto-Scaling Based on Concurrent Connections

CPU-based auto-scaling is a poor fit for AI agent workloads because servers spend most of their time waiting on LLM API calls. Instead, scale based on active connection count or request concurrency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_ws_connections
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120

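With these settings, the autoscaler adds pods when the fleet averages more than 100 active WebSocket connections per pod, by at most 5 pods per minute after a 30-second stabilization window. It removes at most 2 pods every 2 minutes after a 5-minute window, so scale-down is conservative and less likely to drop live conversations. Note that active_ws_connections is not a built-in metric: the application has to export it, and a custom metrics adapter has to serve it to the autoscaler.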
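
The replica count the autoscaler aims for follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change while the ratio is within a tolerance (0.1 by default). The sketch below applies that formula with the target of 100 and the 3 to 50 replica bounds from the manifest above. The function name is hypothetical, and the rate limits from the `behavior` section are deliberately left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg: float,
    target_avg: float = 100.0,
    min_replicas: int = 3,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA scaling formula, then clamp to the configured bounds."""
    ratio = current_avg / target_avg
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


busy = desired_replicas(10, 150)     # 150 connections per pod on 10 pods
steady = desired_replicas(10, 105)   # within 10% of the target
capped = desired_replicas(40, 200)   # formula asks for 80, maxReplicas is 50
quiet = desired_replicas(4, 20)      # formula asks for 1, minReplicas is 3
```

In the first case the formula asks for 15 pods, which is 5 more than the current 10, so the scale-up policy of 5 pods per 60 seconds would allow it in a single step.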

Graceful Shutdown and Connection Draining

When scaling down, pods must finish in-flight conversations before terminating. Configure a preStop hook and a generous termination grace period:

spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - "curl -s localhost:8000/drain && sleep 90"

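The drain endpoint tells the server to stop accepting new connections and wait for active conversations to complete their current turn before shutting down. Time spent in the preStop hook counts against the termination grace period, so the 90-second sleep leaves roughly 30 of the 120 seconds for the process to exit after it receives SIGTERM.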
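
The article does not show the server side of the drain endpoint, so the following is only one possible shape for it, written with plain asyncio. `DrainController` and its methods are hypothetical names. The idea is that the chat handler calls `try_begin_turn()` before each turn and `end_turn()` after it, while the `/drain` handler awaits `drain()`.

```python
import asyncio


class DrainController:
    """Tracks in-flight turns and refuses new ones once draining starts."""

    def __init__(self) -> None:
        self.draining = False
        self._active = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no turns in flight yet

    def try_begin_turn(self) -> bool:
        """Register a new turn, or return False if the server is draining."""
        if self.draining:
            return False
        self._active += 1
        self._idle.clear()
        return True

    def end_turn(self) -> None:
        """Mark a turn finished; signal idle when none remain."""
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def drain(self, timeout: float) -> bool:
        """Stop accepting turns; return True if all finished before the timeout."""
        self.draining = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def drain_demo() -> tuple:
    ctl = DrainController()
    accepted = ctl.try_begin_turn()  # a turn is in flight when draining starts

    async def finish_turn() -> None:
        await asyncio.sleep(0.05)  # simulated LLM call
        ctl.end_turn()

    task = asyncio.create_task(finish_turn())
    drained = await ctl.drain(timeout=1.0)
    rejected = not ctl.try_begin_turn()  # new turns are refused after drain
    await task
    return accepted, drained, rejected


result = asyncio.run(drain_demo())
```

A handler that gets `False` from `try_begin_turn()` could answer with HTTP 503 so the client retries and the load balancer routes the next attempt to another pod. That response code is a design suggestion, not something the original configuration specifies.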

FAQ

How many concurrent sessions can a single Python agent server handle?

With asyncio and an async framework like FastAPI, a single server can handle 100 to 300 concurrent sessions when the primary bottleneck is waiting on LLM API responses. The actual limit depends on memory per session (typically 50 to 200 KB of conversation context) and the timeout duration of LLM calls.
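
As a rough check on the memory side of that estimate, the arithmetic can be worked through with the figures quoted above. The helper name is hypothetical, and the calculation covers conversation context only, not interpreter or buffer overhead.

```python
def session_memory_mb(sessions: int, kb_per_session: float) -> float:
    """Context memory in MB for a number of concurrent sessions (1 MB = 1024 KB)."""
    return sessions * kb_per_session / 1024


low = session_memory_mb(100, 50)    # fewest sessions, smallest context
high = session_memory_mb(300, 200)  # most sessions, largest context
```

Across those ranges the context itself comes to roughly 5 to 59 MB per server, which suggests that time spent waiting on LLM calls is more likely to set the ceiling than raw context size.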

Should I use sticky sessions or fully stateless routing?

Use fully stateless routing when you externalize all session state to Redis or a database — this gives maximum flexibility for scaling. Use sticky sessions only for the duration of a single WebSocket streaming response, not for the entire conversation lifecycle.

What happens to active conversations during a deployment rollout?

Configure rolling updates with maxSurge: 1 and maxUnavailable: 0 so that new pods come up before old ones terminate. Combined with connection draining and a termination grace period, active conversations complete their current turn on the old pod, and the next turn routes to a new pod.
