Learn Agentic AI

Reducing Time-to-First-Token in AI Agents: Connection Reuse, Warm Pools, and Prefetching

Learn how to minimize the delay between a user request and the first visible response from your AI agent by optimizing connections, DNS caching, request pipelining, and warm pool strategies.

What Is Time-to-First-Token and Why It Matters

Time-to-First-Token (TTFT) is the duration between when a user submits a request and when the first token of the AI response becomes visible. In conversational AI agents, TTFT directly shapes the user's perception of speed: a sub-second TTFT feels snappy, while a 5-second TTFT feels broken, even when the total generation time is identical.

Most of the TTFT budget is not spent inside the LLM. It is consumed by network overhead: DNS resolution, TCP handshake, TLS negotiation, and HTTP request serialization. Optimizing these layers can shave 200-800ms off every single request.

Connection Reuse with HTTP Keep-Alive

Every new HTTPS connection to an LLM provider requires a DNS lookup, TCP three-way handshake, and TLS negotiation. On a cold connection to OpenAI or Anthropic, this adds 150-400ms. Connection reuse eliminates this overhead for subsequent requests.

import httpx
import asyncio

# BAD: Creating a new client per request
async def slow_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

# GOOD: Reuse a single client across all requests
class LLMClient:
    def __init__(self):
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
                keepalive_expiry=120,
            ),
            http2=True,
        )

    async def completion(self, prompt: str) -> str:
        response = await self._client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

    async def close(self):
        await self._client.aclose()

The httpx.AsyncClient with http2=True enables multiplexed streams over a single connection, meaning multiple LLM calls share one TLS session.

DNS Caching

DNS resolution adds 20-80ms per cold lookup, and Python does not cache DNS results by default. Keep-alive connections sidestep the problem for reused connections, since a pooled connection is resolved only once; beyond that, you can cache resolver results in-process or at the OS level.

import httpx

# Configure the transport explicitly via the public API; pooled
# connections are resolved once and then reused
transport = httpx.AsyncHTTPTransport(
    retries=2,
    http2=True,
)

client = httpx.AsyncClient(
    transport=transport,
    timeout=httpx.Timeout(30.0, connect=5.0),
)

At the infrastructure level, running a local DNS cache like dnsmasq or using systemd-resolved with caching enabled eliminates repeated lookups entirely.
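In pure Python, the in-process version of the same idea can be sketched by memoizing socket.getaddrinfo, the resolver call Python ultimately makes. This is illustrative only; the fixed TTL and the monkey-patching approach are assumptions, not a production resolver:

```python
import socket
import time

# Keep a reference to the real resolver before patching, so cache
# misses do not recurse into our own wrapper
_orig_getaddrinfo = socket.getaddrinfo
_dns_cache: dict = {}
DNS_TTL_SECONDS = 300  # assumed TTL; real DNS records carry their own

def cached_getaddrinfo(host, port, *args, **kwargs):
    key = (host, port, args, tuple(sorted(kwargs.items())))
    now = time.monotonic()
    hit = _dns_cache.get(key)
    if hit is not None and now - hit[0] < DNS_TTL_SECONDS:
        return hit[1]  # cache hit: no network lookup
    result = _orig_getaddrinfo(host, port, *args, **kwargs)
    _dns_cache[key] = (now, result)
    return result

# Opt in process-wide so every library in the process benefits
socket.getaddrinfo = cached_getaddrinfo
```

The trade-off is staleness: this cache ignores the record's real TTL, whereas the OS-level caches mentioned above respect it, which makes them the safer default.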

Warm Pools: Pre-Establishing Connections

A warm pool pre-establishes connections before any user request arrives. When the first request comes in, the TCP and TLS handshake are already complete.

import asyncio
import httpx

class WarmLLMPool:
    def __init__(self, base_url: str, api_key: str, pool_size: int = 5):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            limits=httpx.Limits(
                max_connections=pool_size,
                max_keepalive_connections=pool_size,
            ),
            http2=True,
            timeout=httpx.Timeout(30.0),
        )

    async def warm_up(self):
        """Pre-establish connections by sending lightweight requests."""
        tasks = [
            self.client.get("/v1/models")
            for _ in range(3)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

    async def complete(self, messages: list[dict]) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        return response.json()["choices"][0]["message"]["content"]

# During application startup (run from an async startup hook)
pool = WarmLLMPool("https://api.openai.com", "sk-...")
await pool.warm_up()

Call warm_up() during your application's startup phase — in FastAPI this goes inside the lifespan handler; in Django, AppConfig.ready() is synchronous, so hand the coroutine to asyncio.run() or your running event loop there.

Request Prefetching for Predictable Workflows

When your agent follows predictable patterns — like always retrieving user context before generating a response — you can prefetch data while the user is still typing.

import asyncio

class PrefetchingAgent:
    def __init__(self, llm_client, user_store):
        self.llm = llm_client
        self.users = user_store
        self._prefetch_cache: dict[str, asyncio.Task] = {}

    async def on_typing_started(self, user_id: str):
        """Trigger prefetch when user starts typing."""
        if user_id not in self._prefetch_cache:
            self._prefetch_cache[user_id] = asyncio.create_task(
                self.users.get_context(user_id)
            )

    async def handle_message(self, user_id: str, message: str):
        # Retrieve prefetched context (already in flight or completed)
        task = self._prefetch_cache.pop(user_id, None)
        if task:
            context = await task
        else:
            context = await self.users.get_context(user_id)

        return await self.llm.completion(
            f"User context: {context}\nUser: {message}"
        )

This pattern overlaps network I/O with user think time, reducing perceived TTFT by the full duration of the prefetch.
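A toy timing demo makes the overlap concrete. All names and delays here are invented: a 300 ms context fetch is kicked off while the "user" spends 300 ms typing:

```python
import asyncio
import time

async def fetch_context(user_id: str) -> str:
    await asyncio.sleep(0.3)              # simulated network I/O
    return f"context-for-{user_id}"

async def with_prefetch() -> float:
    start = time.perf_counter()
    task = asyncio.create_task(fetch_context("u1"))  # on_typing_started
    await asyncio.sleep(0.3)              # user think time elapses
    await task                            # usually already resolved
    return time.perf_counter() - start

elapsed = asyncio.run(with_prefetch())
```

Sequential execution would take roughly 0.6 s; overlapped, the total stays near 0.3 s, because the fetch runs entirely inside the user's think time.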

Measuring TTFT in Practice

Always measure TTFT from the client perspective, not server-side. Use structured logging to track each phase.

import time

async def timed_completion(client, messages):
    t_start = time.perf_counter()
    # client.stream() exposes the body incrementally; a plain
    # client.post() would buffer the entire response before returning
    async with client.stream(
        "POST",
        "/v1/chat/completions",
        json={"model": "gpt-4o", "messages": messages, "stream": True},
    ) as response:
        t_first_byte = time.perf_counter()  # response headers received

        t_first_token = None
        async for _chunk in response.aiter_bytes():
            if t_first_token is None:
                t_first_token = time.perf_counter()

    return {
        "ttfb_ms": (t_first_byte - t_start) * 1000,
        "ttft_ms": ((t_first_token or t_first_byte) - t_start) * 1000,
        "total_ms": (time.perf_counter() - t_start) * 1000,
    }
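Individual timings only become actionable once aggregated. A small stdlib sketch for rolling per-request numbers into percentile summaries (the sample values are invented):

```python
import statistics

def summarize_ttft(samples_ms: list[float]) -> dict[str, float]:
    # statistics.quantiles with n=100 yields the 1st..99th percentiles
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": pct[94],
        "max_ms": max(samples_ms),
    }

print(summarize_ttft([420, 380, 450, 900, 410, 395, 1200, 430]))
```

Track p95 rather than the mean: a handful of cold-connection outliers can leave the average looking healthy while a noticeable share of users wait far longer.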

FAQ

How much latency does connection reuse actually save?

On a typical HTTPS connection to a major LLM provider, the cold connection overhead is 150-400ms (DNS + TCP + TLS). Connection reuse eliminates all of this for subsequent requests. Over a conversation with 10 turns, that saves 1.5-4 seconds of cumulative wait time.

Should I use HTTP/2 for LLM API calls?

Yes. HTTP/2 multiplexes multiple requests over a single TCP connection, which is valuable when your agent makes parallel tool calls or sends multiple completions simultaneously. Libraries like httpx support it natively with http2=True.

What is a good TTFT target for conversational AI agents?

Under 500ms is excellent, under 1 second is acceptable for most applications, and anything over 2 seconds will feel sluggish to users. These targets include network overhead but exclude the actual model inference time at the provider.


#Performance #TTFT #ConnectionPooling #Latency #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

