Learn Agentic AI

Reducing Time-to-First-Token in AI Agents: Connection Reuse, Warm Pools, and Prefetching

Learn how to minimize the delay between a user request and the first visible response from your AI agent by optimizing connections, DNS caching, request pipelining, and warm pool strategies.

What Is Time-to-First-Token and Why It Matters

Time-to-First-Token (TTFT) is the duration between when a user submits a request and when the first token of the AI response becomes visible. In conversational AI agents, TTFT directly shapes the user's perception of speed: a sub-second TTFT feels snappy, while a 5-second TTFT feels broken, even when the total generation time is identical.

Most of the TTFT budget is not spent inside the LLM. It is consumed by network overhead: DNS resolution, TCP handshake, TLS negotiation, and HTTP request serialization. Optimizing these layers can shave 200-800ms off every single request.

Connection Reuse with HTTP Keep-Alive

Every new HTTPS connection to an LLM provider requires a DNS lookup, TCP three-way handshake, and TLS negotiation. On a cold connection to OpenAI or Anthropic, this adds 150-400ms. Connection reuse eliminates this overhead for subsequent requests.

import httpx
import asyncio

# BAD: Creating a new client per request
async def slow_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

# GOOD: Reuse a single client across all requests
class LLMClient:
    def __init__(self):
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
                keepalive_expiry=120,
            ),
            http2=True,
        )

    async def completion(self, prompt: str) -> str:
        response = await self._client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

    async def close(self):
        await self._client.aclose()

The httpx.AsyncClient with http2=True enables multiplexed streams over a single connection, meaning multiple LLM calls share one TLS session.

DNS Caching

DNS resolution adds 20-80ms per cold lookup, and Python does not cache DNS results by default. Keep-alive connections sidestep the problem for reused connections, since a pooled connection is resolved only once; beyond that, you can cache resolver results in-process or at the OS level.

import httpx

# Configure the transport explicitly via the public API; pooled
# connections are resolved once and then reused
transport = httpx.AsyncHTTPTransport(
    retries=2,
    http2=True,
)

client = httpx.AsyncClient(
    transport=transport,
    timeout=httpx.Timeout(30.0, connect=5.0),
)

At the infrastructure level, running a local DNS cache like dnsmasq or using systemd-resolved with caching enabled eliminates repeated lookups entirely.
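In pure Python, the in-process version of the same idea can be sketched by memoizing socket.getaddrinfo, the resolver call Python ultimately makes. This is illustrative only; the fixed TTL and the monkey-patching approach are assumptions, not a production resolver:

```python
import socket
import time

# Keep a reference to the real resolver before patching, so cache
# misses do not recurse into our own wrapper
_orig_getaddrinfo = socket.getaddrinfo
_dns_cache: dict = {}
DNS_TTL_SECONDS = 300  # assumed TTL; real DNS records carry their own

def cached_getaddrinfo(host, port, *args, **kwargs):
    key = (host, port, args, tuple(sorted(kwargs.items())))
    now = time.monotonic()
    hit = _dns_cache.get(key)
    if hit is not None and now - hit[0] < DNS_TTL_SECONDS:
        return hit[1]  # cache hit: no network lookup
    result = _orig_getaddrinfo(host, port, *args, **kwargs)
    _dns_cache[key] = (now, result)
    return result

# Opt in process-wide so every library in the process benefits
socket.getaddrinfo = cached_getaddrinfo
```

The trade-off is staleness: this cache ignores the record's real TTL, whereas the OS-level caches mentioned above respect it, which makes them the safer default.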

Warm Pools: Pre-Establishing Connections

A warm pool pre-establishes connections before any user request arrives. When the first request comes in, the TCP and TLS handshake are already complete.

import asyncio
import httpx

class WarmLLMPool:
    def __init__(self, base_url: str, api_key: str, pool_size: int = 5):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            limits=httpx.Limits(
                max_connections=pool_size,
                max_keepalive_connections=pool_size,
            ),
            http2=True,
            timeout=httpx.Timeout(30.0),
        )

    async def warm_up(self):
        """Pre-establish connections by sending lightweight requests."""
        tasks = [
            self.client.get("/v1/models")
            for _ in range(3)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

    async def complete(self, messages: list[dict]) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        return response.json()["choices"][0]["message"]["content"]

# During application startup (run from an async startup hook)
pool = WarmLLMPool("https://api.openai.com", "sk-...")
await pool.warm_up()

Call warm_up() during your application's startup phase — in FastAPI this goes inside the lifespan handler; in Django, AppConfig.ready() is synchronous, so hand the coroutine to asyncio.run() or your running event loop there.

Request Prefetching for Predictable Workflows

When your agent follows predictable patterns — like always retrieving user context before generating a response — you can prefetch data while the user is still typing.

import asyncio

class PrefetchingAgent:
    def __init__(self, llm_client, user_store):
        self.llm = llm_client
        self.users = user_store
        self._prefetch_cache: dict[str, asyncio.Task] = {}

    async def on_typing_started(self, user_id: str):
        """Trigger prefetch when user starts typing."""
        if user_id not in self._prefetch_cache:
            self._prefetch_cache[user_id] = asyncio.create_task(
                self.users.get_context(user_id)
            )

    async def handle_message(self, user_id: str, message: str):
        # Retrieve prefetched context (already in flight or completed)
        task = self._prefetch_cache.pop(user_id, None)
        if task:
            context = await task
        else:
            context = await self.users.get_context(user_id)

        return await self.llm.completion(
            f"User context: {context}\nUser: {message}"
        )

This pattern overlaps network I/O with user think time, reducing perceived TTFT by the full duration of the prefetch.
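A toy timing demo makes the overlap concrete. All names and delays here are invented: a 300 ms context fetch is kicked off while the "user" spends 300 ms typing:

```python
import asyncio
import time

async def fetch_context(user_id: str) -> str:
    await asyncio.sleep(0.3)              # simulated network I/O
    return f"context-for-{user_id}"

async def with_prefetch() -> float:
    start = time.perf_counter()
    task = asyncio.create_task(fetch_context("u1"))  # on_typing_started
    await asyncio.sleep(0.3)              # user think time elapses
    await task                            # usually already resolved
    return time.perf_counter() - start

elapsed = asyncio.run(with_prefetch())
```

Sequential execution would take roughly 0.6 s; overlapped, the total stays near 0.3 s, because the fetch runs entirely inside the user's think time.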

Measuring TTFT in Practice

Always measure TTFT from the client perspective, not server-side. Use structured logging to track each phase.

import time

async def timed_completion(client, messages):
    t_start = time.perf_counter()
    # client.stream() exposes the body incrementally; a plain
    # client.post() would buffer the entire response before returning
    async with client.stream(
        "POST",
        "/v1/chat/completions",
        json={"model": "gpt-4o", "messages": messages, "stream": True},
    ) as response:
        t_first_byte = time.perf_counter()  # response headers received

        t_first_token = None
        async for _chunk in response.aiter_bytes():
            if t_first_token is None:
                t_first_token = time.perf_counter()

    return {
        "ttfb_ms": (t_first_byte - t_start) * 1000,
        "ttft_ms": ((t_first_token or t_first_byte) - t_start) * 1000,
        "total_ms": (time.perf_counter() - t_start) * 1000,
    }
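Individual timings only become actionable once aggregated. A small stdlib sketch for rolling per-request numbers into percentile summaries (the sample values are invented):

```python
import statistics

def summarize_ttft(samples_ms: list[float]) -> dict[str, float]:
    # statistics.quantiles with n=100 yields the 1st..99th percentiles
    pct = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": pct[94],
        "max_ms": max(samples_ms),
    }

print(summarize_ttft([420, 380, 450, 900, 410, 395, 1200, 430]))
```

Track p95 rather than the mean: a handful of cold-connection outliers can leave the average looking healthy while a noticeable share of users wait far longer.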

FAQ

How much latency does connection reuse actually save?

On a typical HTTPS connection to a major LLM provider, the cold connection overhead is 150-400ms (DNS + TCP + TLS). Connection reuse eliminates all of this for subsequent requests. Over a conversation with 10 turns, that saves 1.5-4 seconds of cumulative wait time.

Should I use HTTP/2 for LLM API calls?

Yes. HTTP/2 multiplexes multiple requests over a single TCP connection, which is valuable when your agent makes parallel tool calls or sends multiple completions simultaneously. Libraries like httpx support it natively with http2=True.

What is a good TTFT target for conversational AI agents?

Under 500ms is excellent, under 1 second is acceptable for most applications, and anything over 2 seconds will feel sluggish to users. These targets include network overhead but exclude the actual model inference time at the provider.


#Performance #TTFT #ConnectionPooling #Latency #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

