Reducing Time-to-First-Token in AI Agents: Connection Reuse, Warm Pools, and Prefetching
Learn how to minimize the delay between a user request and the first visible response from your AI agent by optimizing connections, DNS caching, request pipelining, and warm pool strategies.
What Is Time-to-First-Token and Why It Matters
Time-to-First-Token (TTFT) is the duration between when a user submits a request and when the first token of the AI response becomes visible. In conversational AI agents, TTFT directly shapes user perception of speed. A 2-second TTFT feels snappy. A 5-second TTFT feels broken — even if the total generation time is identical.
Most of the TTFT budget is not spent inside the LLM. It is consumed by network overhead: DNS resolution, TCP handshake, TLS negotiation, and HTTP request serialization. Optimizing these layers can shave 200-800ms off every single request.
Connection Reuse with HTTP Keep-Alive
Every new HTTPS connection to an LLM provider requires a DNS lookup, TCP three-way handshake, and TLS negotiation. On a cold connection to OpenAI or Anthropic, this adds 150-400ms. Connection reuse eliminates this overhead for subsequent requests.
```python
import httpx

# BAD: creating a new client per request forces a fresh DNS lookup,
# TCP handshake, and TLS negotiation every time.
async def slow_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

# GOOD: reuse a single client across all requests
class LLMClient:
    def __init__(self):
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
                keepalive_expiry=120,  # keep idle connections warm for 2 minutes
            ),
            http2=True,  # requires the h2 extra: pip install "httpx[http2]"
        )

    async def completion(self, prompt: str) -> str:
        response = await self._client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

    async def close(self):
        await self._client.aclose()
```
The httpx.AsyncClient with http2=True enables multiplexed streams over a single connection, meaning multiple LLM calls share one TLS session.
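To see why multiplexing matters, consider an agent that fans out several calls at once. The sketch below uses stubs — `asyncio.sleep` stands in for real requests over the shared client — but the concurrency pattern is the same: gather the coroutines and the per-call latencies overlap instead of adding up.

```python
import asyncio
import time

async def call_llm(prompt: str) -> str:
    # Stand-in for a request over the shared client; the sleep
    # models roughly 100 ms of network + inference latency.
    await asyncio.sleep(0.1)
    return f"answer:{prompt}"

async def fan_out(prompts: list[str]) -> tuple[list[str], float]:
    t0 = time.perf_counter()
    # Over HTTP/2 these requests would share one TLS session
    # instead of opening one connection per call.
    answers = await asyncio.gather(*(call_llm(p) for p in prompts))
    return answers, time.perf_counter() - t0

answers, elapsed = asyncio.run(fan_out(["a", "b", "c"]))
```

Three sequential 100 ms calls would take ~300 ms; gathered, they finish in roughly the time of the slowest one.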
DNS Caching
DNS resolution adds 20-80ms per cold lookup, and Python does not cache DNS results by default. httpx has no built-in resolver cache, but connection pooling means a lookup only happens when a brand-new connection is opened; transport-level retries recover transparently when a pooled connection has gone stale.

```python
import httpx

# Pooled connections keep their resolved addresses alive;
# retries reopen a connection transparently if a pooled one went stale.
transport = httpx.AsyncHTTPTransport(retries=2, http2=True)

client = httpx.AsyncClient(
    transport=transport,
    timeout=httpx.Timeout(30.0, connect=5.0),
)
```
At the infrastructure level, running a local DNS cache like dnsmasq or using systemd-resolved with caching enabled eliminates repeated lookups entirely.
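If you do want an in-process resolver cache, a minimal sketch is to memoize `socket.getaddrinfo`. Note the trade-off: `lru_cache` ignores DNS TTLs, so this suits short-lived workers better than long-running servers.

```python
import socket
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_resolve(host: str, port: int = 443):
    # The first call pays the 20-80 ms lookup; repeats are served
    # from the in-process cache. TTLs are ignored, so clear the
    # cache periodically in long-running processes.
    return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)

# Example: cached_resolve("api.openai.com") resolves once per process.
```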
Warm Pools: Pre-Establishing Connections
A warm pool pre-establishes connections before any user request arrives. When the first request comes in, the TCP and TLS handshake are already complete.
```python
import asyncio
import httpx

class WarmLLMPool:
    def __init__(self, base_url: str, api_key: str, pool_size: int = 5):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            limits=httpx.Limits(
                max_connections=pool_size,
                max_keepalive_connections=pool_size,
            ),
            http2=True,
            timeout=httpx.Timeout(30.0),
        )

    async def warm_up(self):
        """Pre-establish connections by sending lightweight requests."""
        tasks = [self.client.get("/v1/models") for _ in range(3)]
        await asyncio.gather(*tasks, return_exceptions=True)

    async def complete(self, messages: list[dict]) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        return response.json()["choices"][0]["message"]["content"]

# During application startup (inside an async startup hook):
pool = WarmLLMPool("https://api.openai.com", "sk-...")
await pool.warm_up()
```
Call warm_up() during your application's startup phase. In FastAPI this goes inside the lifespan handler; in Django, AppConfig.ready() is synchronous and cannot await, so the ASGI lifespan startup event is the safer place to run it.
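Here is a framework-neutral sketch of that startup hook, written as an async context manager. A stubbed pool and a plain dict stand in for WarmLLMPool and the real app object, so the lifecycle runs offline; the names are illustrative.

```python
import asyncio
from contextlib import asynccontextmanager

class StubPool:
    """Stand-in for WarmLLMPool so the lifecycle is runnable offline."""
    def __init__(self):
        self.warmed = False
        self.closed = False

    async def warm_up(self):
        self.warmed = True

    async def aclose(self):
        self.closed = True

@asynccontextmanager
async def lifespan(state: dict):
    # Startup: warm connections before the app accepts traffic.
    state["pool"] = pool = StubPool()
    await pool.warm_up()
    yield
    # Shutdown: release pooled connections cleanly.
    await pool.aclose()

async def main() -> dict:
    state: dict = {}
    async with lifespan(state):
        pass  # the app would serve requests here
    return state

state = asyncio.run(main())
```

FastAPI accepts exactly this shape via `FastAPI(lifespan=...)`, with the app instance passed in place of the dict.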
Request Prefetching for Predictable Workflows
When your agent follows predictable patterns — like always retrieving user context before generating a response — you can prefetch data while the user is still typing.
```python
import asyncio

class PrefetchingAgent:
    def __init__(self, llm_client, user_store):
        self.llm = llm_client
        self.users = user_store
        self._prefetch_cache: dict[str, asyncio.Task] = {}

    async def on_typing_started(self, user_id: str):
        """Trigger prefetch when user starts typing."""
        if user_id not in self._prefetch_cache:
            self._prefetch_cache[user_id] = asyncio.create_task(
                self.users.get_context(user_id)
            )

    async def handle_message(self, user_id: str, message: str):
        # Retrieve prefetched context (already in flight or completed)
        task = self._prefetch_cache.pop(user_id, None)
        if task:
            context = await task
        else:
            context = await self.users.get_context(user_id)
        return await self.llm.completion(
            f"User context: {context}\nUser: {message}"
        )
```
This pattern overlaps network I/O with user think time, reducing perceived TTFT by the full duration of the prefetch.
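A runnable toy version of that overlap, with `asyncio.sleep` standing in for both the context fetch and the user's think time:

```python
import asyncio
import time

async def fetch_context(user_id: str) -> str:
    await asyncio.sleep(0.2)  # simulated 200 ms context lookup
    return f"context:{user_id}"

async def main() -> float:
    # Kick off the fetch the moment the user starts typing...
    task = asyncio.create_task(fetch_context("u1"))
    await asyncio.sleep(0.2)  # ...and let think time cover its cost.
    t0 = time.perf_counter()
    await task  # usually already complete: near-zero wait here
    return (time.perf_counter() - t0) * 1000

waited_ms = asyncio.run(main())
```

Because the 200 ms fetch runs during the 200 ms of simulated think time, the wait at `await task` collapses to nearly zero.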
Measuring TTFT in Practice
Always measure TTFT from the client perspective, not server-side. Use structured logging to track each phase.
```python
import time

async def timed_completion(client, messages):
    t_start = time.perf_counter()
    t_first_token = None
    chunks = []
    # Stream the response so bytes are observed as they arrive;
    # a plain .post() would buffer the whole body before returning.
    async with client.stream(
        "POST",
        "/v1/chat/completions",
        json={"model": "gpt-4o", "messages": messages, "stream": True},
    ) as response:
        t_first_byte = time.perf_counter()
        async for chunk in response.aiter_bytes():
            if t_first_token is None:
                t_first_token = time.perf_counter()
            chunks.append(chunk)
    return {
        "ttfb_ms": (t_first_byte - t_start) * 1000,
        "ttft_ms": (t_first_token - t_start) * 1000,
        "total_ms": (time.perf_counter() - t_start) * 1000,
    }
```
FAQ
How much latency does connection reuse actually save?
On a typical HTTPS connection to a major LLM provider, the cold connection overhead is 150-400ms (DNS + TCP + TLS). Connection reuse eliminates all of this for subsequent requests. Over a conversation with 10 turns, that saves 1.5-4 seconds of cumulative wait time.
Should I use HTTP/2 for LLM API calls?
Yes. HTTP/2 multiplexes multiple requests over a single TCP connection, which is valuable when your agent makes parallel tool calls or sends multiple completions simultaneously. Libraries like httpx support it natively with http2=True.
What is a good TTFT target for conversational AI agents?
Under 500ms is excellent, under 1 second is acceptable for most applications, and anything over 2 seconds will feel sluggish to users. These targets describe the full client-observed delay, so they include both network overhead and the provider's prompt-processing time before the first token streams back.
Written by
CallSphere Team