Learn Agentic AI

Database Scaling for AI Agents: Connection Pooling, Read Replicas, and Sharding

Master database scaling techniques for AI agent platforms including PgBouncer connection pooling, read/write splitting with replicas, horizontal sharding strategies, and migration patterns that keep your agents responsive under heavy load.

The Database Bottleneck in AI Agent Systems

AI agent platforms hit database bottlenecks faster than typical web applications. Each agent conversation generates a high volume of reads and writes — fetching conversation history, persisting each turn, logging tool calls, storing retrieved documents, and updating session metadata. A platform with 1,000 concurrent agent sessions might produce 10,000 to 50,000 database operations per minute.

PostgreSQL handles this well up to a point, but three specific problems emerge at scale: connection exhaustion, read contention, and table size.

Connection Pooling with PgBouncer

PostgreSQL creates a new process for each client connection. With 50 agent worker pods each maintaining a pool of 20 connections, you need 1,000 PostgreSQL backend processes — each consuming 5 to 10 MB of RAM. At this scale, the database server spends more resources managing connections than executing queries.
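The arithmetic above can be sketched directly. The per-connection memory figures are rough rules of thumb from the paragraph, not measured values:

```python
# Back-of-the-envelope cost of direct connections vs. pooled connections.
# The 5-10 MB per-backend figure is an estimate, not a measurement.

PODS = 50             # agent worker pods
POOL_PER_POD = 20     # connections each pod keeps open
MB_PER_BACKEND = 7.5  # midpoint of the 5-10 MB per-process estimate

direct_backends = PODS * POOL_PER_POD
direct_ram_mb = direct_backends * MB_PER_BACKEND

# With a pooler in front, the same 1,000 client connections can share
# a much smaller server-side pool (e.g. max_db_connections = 100).
pooled_backends = 100
pooled_ram_mb = pooled_backends * MB_PER_BACKEND

print(direct_backends, direct_ram_mb)  # 1000 backends, ~7500 MB
print(pooled_backends, pooled_ram_mb)  # 100 backends, ~750 MB
```

The tenfold drop in backend processes is the whole argument for a pooler: the query workload is unchanged, but the per-connection overhead shrinks by an order of magnitude.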


PgBouncer sits between your application and PostgreSQL, multiplexing hundreds of application connections over a smaller number of actual database connections:

# pgbouncer.ini mounted as ConfigMap
[databases]
agents = host=postgres-primary port=5432 dbname=agents

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

# Transaction pooling: connection returned after each transaction
pool_mode = transaction
default_pool_size = 50
max_client_conn = 2000
max_db_connections = 100
reserve_pool_size = 10
reserve_pool_timeout = 3

# Timeout settings for long-running agent queries
query_timeout = 30
client_idle_timeout = 300
server_idle_timeout = 60

Transaction pooling mode is the right choice for AI agent workloads. Each agent request grabs a connection, runs its transaction (read history, write new turn), and immediately returns the connection. This lets 2,000 application connections share 100 actual PostgreSQL connections.

Deploy PgBouncer as a sidecar or separate pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: pgbouncer
          image: edoburu/pgbouncer:1.22.0
          ports:
            - containerPort: 6432
          volumeMounts:
            - name: config
              mountPath: /etc/pgbouncer
          resources:
            requests:
              cpu: "200m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"

Read/Write Splitting

Most AI agent database operations are reads — fetching conversation history, looking up tool configurations, loading prompt templates. Sending reads to replicas frees the primary for writes:


from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session
import random

write_engine = create_engine(
    "postgresql://user:pass@pgbouncer-primary:6432/agents",
    pool_size=20,
    pool_pre_ping=True,
)
read_engines = [
    create_engine(
        f"postgresql://user:pass@pgbouncer-replica-{i}:6432/agents",
        pool_size=20,
        pool_pre_ping=True,
    )
    for i in range(3)
]

def get_write_session() -> Session:
    return Session(bind=write_engine)

def get_read_session() -> Session:
    # Simple random load balancing across replicas
    engine = random.choice(read_engines)
    return Session(bind=engine)

# Usage in agent code
def get_conversation_history(session_id: str) -> list:
    with get_read_session() as session:
        return session.execute(
            text(
                "SELECT role, content, created_at "
                "FROM messages WHERE session_id = :sid "
                "ORDER BY created_at"
            ),
            {"sid": session_id},
        ).fetchall()

def save_message(session_id: str, role: str, content: str):
    with get_write_session() as session:
        session.execute(
            text(
                "INSERT INTO messages (session_id, role, content) "
                "VALUES (:sid, :role, :content)"
            ),
            {"sid": session_id, "role": role, "content": content},
        )
        session.commit()

One caveat: replication lag means a message you just wrote may not appear on a replica for 10 to 100 milliseconds. For the critical path where you write a user message and immediately need to read the full history to send to the LLM, read from the primary. Use replicas for dashboard queries, analytics, and search operations where slight staleness is acceptable.
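One way to automate this routing is a small "primary stickiness" layer: after a write for a session, send that session's reads to the primary until the replication-lag window has safely passed. A minimal sketch, assuming a single-process worker; the `LAG_WINDOW_S` value is illustrative and should be tuned to your observed replica lag:

```python
import time

# Assumed upper bound on replica lag; tune to what you actually observe.
LAG_WINDOW_S = 0.5

# session_id -> monotonic timestamp of that session's last write
_last_write_at: dict = {}

def record_write(session_id: str) -> None:
    """Call after every committed write for this session."""
    _last_write_at[session_id] = time.monotonic()

def use_primary_for_read(session_id: str) -> bool:
    """True if this session wrote recently enough that a replica may be stale."""
    last = _last_write_at.get(session_id)
    return last is not None and (time.monotonic() - last) < LAG_WINDOW_S
```

In a multi-worker deployment the timestamp map would need to live somewhere shared (e.g. Redis), since consecutive requests for one session may land on different pods.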

Sharding by Tenant or Session

When a single PostgreSQL instance cannot hold all conversation data even with replicas, shard horizontally. The most natural shard key for AI agent platforms is tenant ID (for multi-tenant platforms) or a hash of the session ID:

import hashlib

from sqlalchemy import create_engine

SHARD_COUNT = 4
shard_engines = {
    i: create_engine(f"postgresql://user:pass@shard-{i}:6432/agents")
    for i in range(SHARD_COUNT)
}

def get_shard(session_id: str) -> int:
    hash_val = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return hash_val % SHARD_COUNT

def get_session_engine(session_id: str):
    shard = get_shard(session_id)
    return shard_engines[shard]

This ensures all messages for a given session live on the same shard, so fetching conversation history never requires cross-shard queries.
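You can sanity-check that this routing is deterministic and reasonably balanced without touching a database. This sketch repeats the `get_shard` logic from above over synthetic session IDs:

```python
import hashlib
from collections import Counter

SHARD_COUNT = 4

def get_shard(session_id: str) -> int:
    hash_val = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return hash_val % SHARD_COUNT

# Determinism: the same session ID always maps to the same shard.
assert all(get_shard(f"sess-{i}") == get_shard(f"sess-{i}") for i in range(100))

# Balance: over many sessions, shards receive roughly equal shares.
counts = Counter(get_shard(f"sess-{i}") for i in range(10_000))
print(counts)  # each shard should land near 2,500 sessions
```

Note that the mapping depends on `SHARD_COUNT`: changing it remaps most existing sessions, which is why adding shards later requires the migration process described below rather than a config change.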

Migrating to a Sharded Architecture

The migration from a single database to shards follows this sequence: add the shard column to your schema, deploy the routing layer to write to the correct shard while still reading from the original database, backfill historical data to shards, then switch reads to the sharded path. Run both paths in parallel with comparison logging before cutting over fully.
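The "run both paths in parallel with comparison logging" step can be sketched as a small wrapper: read from both the legacy database and the sharded path, serve the legacy result, and log any mismatch. The function names `read_legacy` and `read_sharded` are illustrative stand-ins for your real data-access functions:

```python
import logging

logger = logging.getLogger("shard-migration")

def compare_reads(session_id, read_legacy, read_sharded):
    """Serve the legacy result; log whenever the sharded path disagrees."""
    legacy = read_legacy(session_id)
    try:
        sharded = read_sharded(session_id)
        if sharded != legacy:
            logger.warning("shard mismatch for session %s", session_id)
    except Exception:
        # A failing sharded read must never break the live request path.
        logger.exception("sharded read failed for session %s", session_id)
    return legacy  # legacy stays authoritative until cutover
```

Once the mismatch rate drops to zero over a representative traffic window, flip the return value to the sharded result and, after a soak period, retire the legacy path.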

FAQ

When should I introduce PgBouncer versus just increasing PostgreSQL max_connections?

Introduce PgBouncer when you have more than 200 concurrent connections or more than 10 application pods. Increasing max_connections beyond 200 to 300 degrades PostgreSQL performance because each connection is a separate process with its own memory allocation. PgBouncer multiplexes connections efficiently with minimal resource overhead.

How do I handle replication lag when an agent writes a message and immediately reads it back?

For the write-then-read-immediately pattern, always read from the primary. Route only non-critical reads — dashboards, analytics, search — to replicas. Alternatively, some connection poolers support "primary stickiness" where reads after a write in the same transaction or within a short time window automatically route to the primary.

Is sharding worth the complexity for AI agent platforms?

Only if a single PostgreSQL instance (with replicas) cannot handle your data volume. Most platforms can handle millions of conversations on a single well-tuned PostgreSQL instance with proper indexes and partitioning. Shard when you exceed 1 to 2 TB of conversation data or need to isolate tenants for compliance reasons.


#DatabaseScaling #PostgreSQL #ConnectionPooling #PgBouncer #Sharding #AIAgents #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

