---
title: "Connection Pooling for AI Applications: Reusing HTTP Connections Across LLM Calls"
description: "Learn to configure HTTP connection pooling with httpx and aiohttp for AI applications. Reduce latency, manage connection limits, and optimize DNS caching for LLM API calls."
canonical: https://callsphere.ai/blog/connection-pooling-ai-applications-reusing-http-connections-llm-calls
category: "Learn Agentic AI"
tags: ["Python", "Connection Pooling", "httpx", "aiohttp", "Performance"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-06-01T20:10:11.060Z
---

# Connection Pooling for AI Applications: Reusing HTTP Connections Across LLM Calls

> Learn to configure HTTP connection pooling with httpx and aiohttp for AI applications. Reduce latency, manage connection limits, and optimize DNS caching for LLM API calls.

## Why Connection Pooling Matters for LLM Applications

Every new HTTPS connection to an LLM API costs a TCP handshake (one round trip), a TLS handshake (two more round trips with TLS 1.2, one with TLS 1.3), and possibly a DNS lookup. With a 50ms round-trip time and TLS 1.2, that is 150ms of overhead before you send a single byte of your prompt. When your agent makes 20 sequential LLM calls per user request, each on a fresh connection, that overhead adds up to 3 seconds of pure connection setup.

Connection pooling eliminates this by reusing established TCP connections across multiple requests. Once the initial connection is established, subsequent requests skip the handshake entirely and start transmitting immediately.
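The arithmetic above can be checked with a back-of-the-envelope sketch. The function name and the default round-trip counts are illustrative assumptions (one round trip for TCP, two for a TLS 1.2 handshake). It models handshake cost only and ignores DNS and the time spent on the request itself.

```python
def setup_overhead_ms(rtt_ms: float, calls: int, reuse: bool,
                      tcp_round_trips: int = 1, tls_round_trips: int = 2) -> float:
    """Estimate total connection-setup time for a series of sequential calls."""
    per_connection = rtt_ms * (tcp_round_trips + tls_round_trips)
    # With a pooled connection, only the first call pays for the handshakes.
    connections_opened = 1 if reuse else calls
    return per_connection * connections_opened


print(setup_overhead_ms(50.0, 20, reuse=False))  # 3000.0
print(setup_overhead_ms(50.0, 20, reuse=True))   # 150.0
```

Under these assumptions, reusing one connection cuts setup cost from 3 seconds to 150ms for the same 20 calls.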

## httpx Connection Pool Configuration

httpx is a widely used async HTTP client for Python, and the official OpenAI and Anthropic Python SDKs are built on it. It provides fine-grained control over connection pooling.

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus
classify"]
    PLAN["Plan and tool
selection"]
    AGENT["Agent loop
LLM plus tools"]
    GUARD{"Guardrails
and policy"}
    EXEC["Execute and
verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus
next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
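import os

import httpx

# Assumption: the API key is supplied via an environment variable
API_KEY = os.environ["OPENAI_API_KEY"]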

# Configure a connection pool tuned for LLM API access
limits = httpx.Limits(
    max_connections=100,        # Total connections across all hosts
    max_keepalive_connections=20,  # Idle connections to keep alive
    keepalive_expiry=30.0,      # Seconds before idle conn is closed
)

client = httpx.AsyncClient(
    limits=limits,
    timeout=httpx.Timeout(
        connect=5.0,    # Max time to establish connection
        read=60.0,      # Max time to read response (LLMs are slow)
        write=10.0,     # Max time to send request
        pool=10.0,      # Max time waiting for available connection
    ),
    http2=True,  # HTTP/2 multiplexes requests over a single conn
    headers={"Authorization": f"Bearer {API_KEY}"},
)
```

The critical parameters:

- **max_connections** controls how many simultaneous TCP connections the client maintains. Set this to match your concurrency level.
- **max_keepalive_connections** determines how many idle connections stay alive between bursts of requests.
- **keepalive_expiry** sets how many seconds an idle connection is kept before it is closed. Longer values avoid more handshakes but hold sockets open for longer.
- **http2** enables multiplexing multiple requests over a single connection, which is particularly effective for LLM APIs.
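To see how a cap such as `max_connections` shapes concurrency, here is a toy model that uses only the standard library. It is a sketch, not how httpx implements its pool: an `asyncio.Semaphore` stands in for the connection limit, and `run_calls` and `fake_llm_call` are hypothetical names.

```python
import asyncio


async def run_calls(total_calls: int, max_connections: int) -> int:
    """Run fake LLM calls under a connection cap; return peak concurrency."""
    slots = asyncio.Semaphore(max_connections)  # stands in for the pool limit
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with slots:  # wait here if every "connection" is busy
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for network time
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(total_calls)))
    return peak


print(asyncio.run(run_calls(total_calls=20, max_connections=5)))  # 5
```

Twenty calls are submitted at once, but no more than five ever run together. The rest wait for a free slot, which is the queuing behavior a real pool limit produces.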

## Lifecycle Management: Application-Scoped Clients

The most common mistake is creating a new client per request. Each new client starts with an empty pool, so every request pays the full handshake cost again. Always scope the client to your application lifetime.

```python
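import os
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

# Assumption: the API key is supplied via an environment variable
API_KEY = os.environ["OPENAI_API_KEY"]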

class LLMService:
    """LLM service with connection pool lifecycle management."""

    def __init__(self):
        self._client: httpx.AsyncClient | None = None

    async def start(self):
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(
                max_connections=50,
                max_keepalive_connections=10,
            ),
            # httpx.Timeout needs a default (first argument) unless all four values are set
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            http2=True,
            base_url="https://api.openai.com/v1",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )

    async def stop(self):
        if self._client:
            await self._client.aclose()

    async def complete(self, messages: list[dict]) -> str:
        if self._client is None:
            raise RuntimeError("LLMService.start() must be called before complete()")
        response = await self._client.post(
            "/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

llm_service = LLMService()

@asynccontextmanager
async def lifespan(app: FastAPI):
    await llm_service.start()
    yield
    await llm_service.stop()

app = FastAPI(lifespan=lifespan)
```

## aiohttp Connection Pooling

aiohttp uses `TCPConnector` to manage connection pools. It offers additional options like DNS caching.

```python
import os

import aiohttp

# Assumption: the API key is supplied via an environment variable
API_KEY = os.environ["OPENAI_API_KEY"]

async def create_session() -> aiohttp.ClientSession:
    # Create the connector inside a coroutine so it binds to the running event loop
    connector = aiohttp.TCPConnector(
        limit=100,               # Max total connections
        limit_per_host=30,       # Max connections per host
        ttl_dns_cache=300,       # Cache DNS lookups for 5 minutes
        use_dns_cache=True,      # Enable DNS caching
        keepalive_timeout=30,    # Keep idle connections for 30s
        enable_cleanup_closed=True,  # Clean up closed connections
    )
    return aiohttp.ClientSession(
        connector=connector,
        timeout=aiohttp.ClientTimeout(
            total=120,      # Total request timeout
            connect=5,      # Connection establishment timeout
            sock_read=60,   # Socket read timeout
        ),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )

## DNS Caching

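DNS resolution adds 5-50ms whenever a new connection is opened and the lookup is not cached. aiohttp caches lookups inside its connector. httpx has no DNS cache of its own, but a pooled connection resolves its hostname only once, when it is opened.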

```python
# aiohttp has built-in DNS caching via TCPConnector
connector = aiohttp.TCPConnector(
    use_dns_cache=True,
    ttl_dns_cache=300,  # 5-minute cache TTL
)

# httpx has no built-in DNS cache. A hostname is resolved when a
# connection is opened, so reusing pooled connections means DNS is
# looked up once per connection rather than once per request.
```
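The idea behind a TTL-based DNS cache can be sketched with the standard library. This is an illustrative model, not aiohttp's implementation: `TTLDnsCache`, its parameters, and the injected `resolver` and `clock` are hypothetical names chosen so the example runs without network access.

```python
import socket
import time
from typing import Callable, Optional


class TTLDnsCache:
    """Cache hostname lookups for `ttl` seconds (illustrative, not thread-safe)."""

    def __init__(self, ttl: float = 300.0,
                 resolver: Optional[Callable[[str], list]] = None,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._resolver = resolver or self._system_resolve
        self._clock = clock
        self._entries: dict = {}  # host -> (expires_at, addresses)

    @staticmethod
    def _system_resolve(host: str) -> list:
        # Default resolver: ask the OS for the host's addresses on port 443.
        infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
        return sorted({info[4][0] for info in infos})

    def resolve(self, host: str) -> list:
        now = self._clock()
        cached = self._entries.get(host)
        if cached and cached[0] > now:
            return cached[1]  # still fresh: skip the lookup
        addresses = self._resolver(host)
        self._entries[host] = (now + self._ttl, addresses)
        return addresses


lookups = []

def fake_resolver(host: str) -> list:
    lookups.append(host)
    return ["203.0.113.10"]  # documentation-range address, not a real server

cache = TTLDnsCache(ttl=300.0, resolver=fake_resolver)
cache.resolve("api.example.com")
cache.resolve("api.example.com")
print(len(lookups))  # 1
```

The second call is served from the cache, so the resolver runs once. A shorter TTL picks up address changes sooner at the cost of more lookups.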

## Monitoring Connection Pool Health

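In production, monitor your pool to detect exhaustion and connection leaks. httpx has no documented API for pool statistics, so the class below counts in-flight requests as a proxy for pool utilization.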

```python
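import logging
import os

import httpx

logger = logging.getLogger("llm_pool")

class MonitoredLLMClient:
    def __init__(self, max_connections: int = 50):
        self._max = max_connections
        self._active = 0  # In-flight requests, used as a proxy for pool usage
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(max_connections=max_connections),
            # httpx.Timeout needs a default (first argument) unless all four values are set
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            # Assumption: the API key is supplied via an environment variable
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        )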

    async def request(self, messages: list[dict]) -> str:
        self._active += 1
        utilization = self._active / self._max
        if utilization > 0.8:
            logger.warning(
                f"Pool utilization high: {self._active}/{self._max} "
                f"({utilization:.0%})"
            )
        try:
            resp = await self._client.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": messages},
            )
            resp.raise_for_status()  # Surface 4xx/5xx errors instead of a KeyError below
            return resp.json()["choices"][0]["message"]["content"]
        finally:
            self._active -= 1
```

## FAQ

### How many max_connections should I set for LLM API calls?

Match it to your maximum expected concurrency. If your application handles 50 concurrent user requests and each makes 1-2 LLM calls, set max_connections to 50-100. Setting it too high wastes resources; too low causes requests to queue waiting for connections. Monitor pool utilization in production and adjust.

### Should I use HTTP/2 for LLM API calls?

Yes, when the API supports it. HTTP/2 multiplexes multiple requests over a single TCP connection, so concurrent calls can share one handshake instead of each opening its own connection. OpenAI and Anthropic APIs support HTTP/2. Enable it with `http2=True` in httpx, which requires the `h2` package (install it with `pip install "httpx[http2]"`).

### What happens when the connection pool is exhausted?

Requests wait in a queue until a connection becomes available, up to the pool timeout. In httpx, this is the `pool` timeout parameter. If the timeout expires, an `httpx.PoolTimeout` exception is raised. Handle this by either increasing pool size or implementing request queuing with backpressure.
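The second option, queuing with backpressure, can be sketched without any HTTP library. This is a minimal model under stated assumptions: an `asyncio.Semaphore` plays the role of the pool, and `PoolBusy` and `call_with_backpressure` are hypothetical names that mirror how a pool timeout behaves.

```python
import asyncio


class PoolBusy(Exception):
    """Hypothetical error raised when no slot frees up within the timeout."""


async def call_with_backpressure(slots: asyncio.Semaphore, pool_timeout: float,
                                 work_seconds: float) -> str:
    try:
        # Wait for a free slot, but give up after pool_timeout seconds.
        await asyncio.wait_for(slots.acquire(), timeout=pool_timeout)
    except asyncio.TimeoutError:
        raise PoolBusy("no free connection slot") from None
    try:
        await asyncio.sleep(work_seconds)  # placeholder for the LLM call
        return "ok"
    finally:
        slots.release()


async def demo() -> list:
    slots = asyncio.Semaphore(1)  # a "pool" with a single connection
    results = await asyncio.gather(
        call_with_backpressure(slots, pool_timeout=0.05, work_seconds=0.2),
        call_with_backpressure(slots, pool_timeout=0.05, work_seconds=0.2),
        return_exceptions=True,
    )
    return [type(r).__name__ if isinstance(r, Exception) else r for r in results]


print(asyncio.run(demo()))  # ['ok', 'PoolBusy']
```

The second call cannot get a slot within 50ms, so it fails fast instead of piling up behind the first. A caller can catch that error and return a retryable response such as HTTP 503, or shed load upstream.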

---


Source: https://callsphere.ai/blog/connection-pooling-ai-applications-reusing-http-connections-llm-calls
