Horizontal Scaling for AI Agents: Running Thousands of Concurrent Agent Sessions

Learn how to horizontally scale AI agent systems to handle thousands of concurrent sessions using stateless design, session affinity, load balancing, and auto-scaling strategies that maintain conversation coherence under heavy load.

Why Horizontal Scaling Matters for AI Agents

A single AI agent server can typically handle 50 to 200 concurrent conversations before response latency degrades. Each conversation involves holding context in memory, making LLM API calls that block for seconds, and streaming responses back to clients. Vertical scaling — adding more CPU and RAM to one machine — hits a ceiling quickly because the bottleneck is I/O-bound concurrency, not raw compute.

Horizontal scaling adds more server instances behind a load balancer so that thousands of concurrent sessions are distributed across a fleet. The challenge is that AI agent conversations are stateful — each turn depends on the history of previous turns. Designing around this statefulness is the core engineering problem.

Stateless Agent Design

The first principle is to externalize all conversation state. Instead of holding session data in memory on the server process, persist it to a shared store like Redis or a database after every turn:

flowchart LR
    USERS(["Traffic"])
    LB["Geo LB plus<br/>Anycast"]
    EDGE["Edge cache plus<br/>rate limit"]
    APP["Stateless app pods<br/>HPA on QPS"]
    QUEUE[(Async work queue)]
    WORKER["Worker pool<br/>GPU or CPU"]
    CACHE[("Redis cache<br/>LLM responses")]
    DB[("Read replicas<br/>and primary")]
    OBS[(Observability)]
    USERS --> LB --> EDGE --> APP
    APP --> CACHE
    APP --> QUEUE --> WORKER
    APP --> DB
    APP --> OBS
    style LB fill:#4f46e5,stroke:#4338ca,color:#fff
    style WORKER fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style CACHE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#0ea5e9,stroke:#0369a1,color:#fff
import json

import redis.asyncio as redis  # async client, so Redis calls do not block the event loop
from fastapi import FastAPI, Request

app = FastAPI()
r = redis.Redis(host="redis-cluster", port=6379, decode_responses=True)

SESSION_TTL = 3600  # 1 hour

async def get_session(session_id: str) -> dict:
    """Load a session from Redis, or start an empty one if none exists."""
    data = await r.get(f"agent:session:{session_id}")
    if data is None:
        return {"messages": [], "metadata": {}}
    return json.loads(data)

async def save_session(session_id: str, session: dict) -> None:
    """Write the session back and refresh its TTL."""
    await r.setex(
        f"agent:session:{session_id}",
        SESSION_TTL,
        json.dumps(session),
    )

@app.post("/chat/{session_id}")
async def chat(session_id: str, request: Request):
    body = await request.json()
    session = await get_session(session_id)
    session["messages"].append({"role": "user", "content": body["message"]})

    # Call LLM with full conversation history.
    # call_agent is the application's own async LLM wrapper (not shown here).
    response = await call_agent(session["messages"], session["metadata"])

    session["messages"].append({"role": "assistant", "content": response})
    await save_session(session_id, session)
    return {"response": response}

With this pattern, any server instance can handle any request for any session. The server itself holds no state between requests — it reads state from Redis, processes the turn, and writes state back.
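
To make the "any instance can serve any session" property concrete, here is a minimal sketch that replaces Redis with an in-memory dictionary and the LLM call with a placeholder string. The names `InMemoryStore` and `AgentInstance` are hypothetical and exist only for this illustration; the key format mirrors the one used above.

```python
import json
from typing import Dict, Optional


class InMemoryStore:
    """Stand-in for Redis: a shared key-value store holding JSON strings."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class AgentInstance:
    """One stateless server process; it keeps no session data of its own."""

    def __init__(self, name: str, store: InMemoryStore) -> None:
        self.name = name
        self.store = store

    def handle_turn(self, session_id: str, message: str) -> str:
        key = f"agent:session:{session_id}"
        raw = self.store.get(key)
        session = json.loads(raw) if raw is not None else {"messages": [], "metadata": {}}
        session["messages"].append({"role": "user", "content": message})

        # Placeholder for the LLM call: report how much history was visible.
        reply = f"{self.name} saw {len(session['messages'])} message(s)"

        session["messages"].append({"role": "assistant", "content": reply})
        self.store.set(key, json.dumps(session))
        return reply


if __name__ == "__main__":
    store = InMemoryStore()
    pod_a = AgentInstance("pod-a", store)
    pod_b = AgentInstance("pod-b", store)
    print(pod_a.handle_turn("s1", "hello"))        # first turn lands on pod-a
    print(pod_b.handle_turn("s1", "still there?"))  # second turn lands on pod-b
```

The second turn is served by a different instance yet sees the full history, because the history lives in the shared store. One thing this sketch does not address: two requests for the same session arriving at the same time would both read, modify, and write the history, and the later write would overwrite the earlier one.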

Load Balancer Configuration

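For stateless agent servers, round-robin or least-connections load balancing works well. An established WebSocket connection stays on one pod for its lifetime, but if your agent streams over WebSockets you still want session affinity (sticky sessions) so that a client that reconnects during a streaming response, or makes HTTP requests alongside it, lands on the same pod: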

# Kubernetes Ingress with sticky sessions for WebSocket
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agent-ingress
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "agent-route"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
spec:
  rules:
    - host: agents.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: agent-service
                port:
                  number: 8000

Cookie-based affinity routes a client back to the same pod for as long as the agent-route cookie is valid (300 seconds here), which covers reconnects during an active streaming session. Clients without the cookie are distributed across the fleet as usual.
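
The difference between the two balancing strategies mentioned in this section can be sketched in a few lines. This is an illustration of the selection logic only, not how any particular load balancer is implemented; `Backend` and the pod names are hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class Backend:
    name: str
    active: int = 0  # open connections currently assigned to this pod


def round_robin(backends: List[Backend]) -> Iterator[Backend]:
    """Hand out backends in a fixed rotation, ignoring their current load."""
    return cycle(backends)


def pick_least_connections(backends: List[Backend]) -> Backend:
    """Pick the backend with the fewest open connections (first one wins ties)."""
    return min(backends, key=lambda b: b.active)


if __name__ == "__main__":
    pods = [Backend("pod-a", active=3), Backend("pod-b", active=1), Backend("pod-c", active=2)]

    rotation = round_robin(pods)
    print([next(rotation).name for _ in range(4)])

    chosen = pick_least_connections(pods)
    chosen.active += 1  # the new connection is now counted against that pod
    print(chosen.name, chosen.active)
```

With long-lived streaming connections, round-robin can keep sending new sessions to a pod that still holds many open streams, while least-connections takes the current load into account.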

Auto-Scaling Based on Concurrent Connections

CPU-based auto-scaling is a poor fit for AI agent workloads because servers spend most of their time waiting on LLM API calls. Instead, scale based on active connection count or request concurrency:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-deployment
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_ws_connections
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 5
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120

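This targets an average of 100 active WebSocket connections per pod. Scale-up is aggressive (a 30-second stabilization window, up to 5 pods added per minute), while scale-down is conservative (a 5-minute stabilization window, at most 2 pods removed every 2 minutes) to avoid dropping live conversations. Note that active_ws_connections is a custom metric: the application has to export it, and a custom metrics adapter has to expose it to the autoscaler.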
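
For intuition about how the target value translates into a replica count, the Kubernetes documentation gives the core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that formula with the min and max from the manifest above. It deliberately ignores the autoscaler's tolerance band, stabilization windows, and rate-limiting policies, so treat it as an approximation of the steady-state answer rather than a simulation.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_avg_connections: float,
    target_avg_connections: float = 100,
    min_replicas: int = 3,
    max_replicas: int = 50,
) -> int:
    """Core HPA formula, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_avg_connections / target_avg_connections)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(10, 150))  # 10 pods averaging 150 connections each
    print(desired_replicas(40, 200))  # demand beyond what maxReplicas allows
    print(desired_replicas(3, 40))    # quiet period, held at minReplicas
```

In the first case the formula asks for 15 pods, which is 5 more than the current 10, so the scale-up policy of 5 pods per 60 seconds would allow it within a single period.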

Graceful Shutdown and Connection Draining

When scaling down, pods must finish in-flight conversations before terminating. Configure a preStop hook and a generous termination grace period:

spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: agent
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - "curl -s localhost:8000/drain && sleep 90"

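The drain endpoint tells the server to stop accepting new connections and wait for active conversations to complete their current turn before shutting down. The 90-second sleep keeps the container running while that happens, and it fits inside the 120-second termination grace period, after which Kubernetes kills the pod regardless.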
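
What the server does behind a drain endpoint is left to the application. One possible shape, sketched here with plain asyncio, is a small controller that counts in-flight turns, refuses new ones once draining starts, and waits until the count reaches zero. `DrainController` and its methods are hypothetical names for this illustration; wiring it to an HTTP route is not shown.

```python
import asyncio
from contextlib import asynccontextmanager


class DrainController:
    """Tracks in-flight turns and lets a shutdown hook wait for them."""

    def __init__(self) -> None:
        self.draining = False
        self.active = 0
        self._idle = asyncio.Event()
        self._idle.set()  # no turns in flight yet

    @asynccontextmanager
    async def turn(self):
        """Wrap one conversation turn; rejected once draining has begun."""
        if self.draining:
            raise RuntimeError("draining: not accepting new turns")
        self.active += 1
        self._idle.clear()
        try:
            yield
        finally:
            self.active -= 1
            if self.active == 0:
                self._idle.set()

    async def drain(self, timeout: float) -> bool:
        """Stop accepting turns; return True if all in-flight turns finished in time."""
        self.draining = True
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo() -> bool:
    controller = DrainController()

    async def slow_turn() -> None:
        async with controller.turn():
            await asyncio.sleep(0.05)  # stands in for an LLM call

    task = asyncio.create_task(slow_turn())
    await asyncio.sleep(0)  # let the turn start before draining
    finished = await controller.drain(timeout=1.0)
    await task
    return finished


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real service the timeout passed to `drain` would need to stay below the preStop sleep, and the readiness probe would typically start failing at the same moment so the load balancer stops routing new sessions to the pod.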

FAQ

How many concurrent sessions can a single Python agent server handle?

With asyncio and an async framework like FastAPI, a single server can handle 100 to 300 concurrent sessions when the primary bottleneck is waiting on LLM API responses. The actual limit depends on memory per session (typically 50 to 200 KB of conversation context) and the timeout duration of LLM calls.
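
The memory side of that estimate is simple arithmetic. The sketch below multiplies the session count by the per-session context size, using the ranges quoted above and treating 1 MB as 1024 KB; `session_memory_mb` is a hypothetical helper for this calculation only.

```python
def session_memory_mb(sessions: int, kb_per_session: float) -> float:
    """Total conversation-context memory in MB (1 MB taken as 1024 KB)."""
    return sessions * kb_per_session / 1024


if __name__ == "__main__":
    low = session_memory_mb(100, 50)    # fewest sessions, smallest context
    high = session_memory_mb(300, 200)  # most sessions, largest context
    print(f"{low:.1f} MB to {high:.1f} MB")
```

Across the quoted ranges this comes to roughly 5 to 59 MB, which suggests that conversation context alone is unlikely to be the limiting factor on a typical server.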

Should I use sticky sessions or fully stateless routing?

Use fully stateless routing when you externalize all session state to Redis or a database — this gives maximum flexibility for scaling. Use sticky sessions only for the duration of a single WebSocket streaming response, not for the entire conversation lifecycle.

What happens to active conversations during a deployment rollout?

Configure rolling updates with maxSurge: 1 and maxUnavailable: 0 so that new pods come up before old ones terminate. Combined with connection draining and a termination grace period, active conversations complete their current turn on the old pod, and the next turn routes to a new pod.

