Agent Latency Budgets: How to Hit Sub-Second Decisions
Sub-second agent decisions need explicit budgets at every step. Here are the 2026 latency-engineering patterns from real production deployments.
When Latency Becomes a Hard Constraint
Background agents have minutes to think; voice agents have hundreds of milliseconds. Sub-second agent decisions are not solved with one trick; they are solved with explicit budgets at every step. This piece walks through the latency-budgeting discipline.
The Total Budget
flowchart LR
User[User waits] --> Total[500ms total budget]
Total --> Net[Network: 50ms]
Total --> Th[Think: 200ms]
Total --> Tool[Tool calls: 150ms]
Total --> Resp[Respond: 100ms]
For a 500ms voice-agent budget, each component must fit within its slice. Blow through one and you exceed the total.
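A minimal sketch of what an explicit budget can look like in code; the component names and numbers mirror the diagram above and are illustrative only.

```python
# Minimal sketch: express the per-step budget explicitly and fail fast
# if the components no longer fit the total. Numbers mirror the diagram
# above and are illustrative, not prescriptive.
from dataclasses import dataclass, field

@dataclass
class LatencyBudget:
    total_ms: int = 500
    components_ms: dict = field(default_factory=lambda: {
        "network": 50,
        "think": 200,
        "tools": 150,
        "respond": 100,
    })

    def validate(self) -> None:
        spent = sum(self.components_ms.values())
        if spent > self.total_ms:
            raise ValueError(f"budget overrun: {spent}ms allocated for a {self.total_ms}ms total")

LatencyBudget().validate()  # raises as soon as one component grows past its slice
```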
Think-Time Budget
The LLM forward pass dominates think time. Patterns to keep it short:
- Use the smallest model that meets quality: per-tier routing puts the cheap model in front
- Cache aggressively: prompt caching cuts most of the prefill cost
- Limit output length: each output token is sequential
- Use streaming for perceived speed: TTFB matters more than total latency
For agentic systems with multiple LLM calls per turn, the per-call budget is at most the total budget divided by the call count. A two-LLM-call agent with a 500ms total has at most 250ms per LLM call, before network and tool time, which is barely enough on frontier models without caching.
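One way to make that per-call slice concrete is a hard deadline on each call. The sketch below assumes an async client; call_llm is a placeholder for whatever SDK you actually use.

```python
# Sketch: enforce a per-call think budget with a hard deadline.
# call_llm is a stand-in for your actual (async) LLM client call.
import asyncio

async def call_llm(prompt: str) -> str:  # placeholder
    await asyncio.sleep(0.05)
    return "ok"

async def call_with_deadline(prompt: str, budget_ms: float) -> str | None:
    try:
        return await asyncio.wait_for(call_llm(prompt), timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        return None  # fall back: shorter prompt, smaller model, or cached answer

async def turn(total_budget_ms: float = 500, llm_calls: int = 2) -> None:
    per_call_ms = total_budget_ms / llm_calls  # 250ms each, before network and tools
    for step in ("plan", "answer"):
        result = await call_with_deadline(step, per_call_ms)
        print(step, "->", result)

asyncio.run(turn())
```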
Tool-Call Budget
Tool calls add network and database latency. Patterns:
- Parallelize independent tool calls: do not serialize when not needed
- Pre-fetch likely-needed data: speculatively call tools the agent is likely to want
- Cache hot data: customer records and product catalogs change slowly
- Co-locate tool servers: same region, same VPC
For voice agents, tool calls during a conversation should typically complete in under 100ms. Anything slower is pushed to background or hidden behind small-talk.
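A sketch of the parallelization pattern, with a shared deadline in the spirit of the 100ms guideline; both tool functions are hypothetical stand-ins for real backend calls.

```python
# Sketch: run independent tool calls concurrently under one shared deadline.
# The two tool functions are hypothetical stand-ins for real backend calls.
import asyncio

async def fetch_customer_record(customer_id: str) -> dict:
    await asyncio.sleep(0.04)  # pretend: 40ms database lookup
    return {"id": customer_id, "tier": "gold"}

async def fetch_open_orders(customer_id: str) -> list:
    await asyncio.sleep(0.06)  # pretend: 60ms API call
    return [{"order": 123}]

async def gather_context(customer_id: str, budget_ms: float = 100) -> list:
    # Independent calls run in parallel: wall time is max(40, 60), not 40 + 60.
    return await asyncio.wait_for(
        asyncio.gather(fetch_customer_record(customer_id), fetch_open_orders(customer_id)),
        timeout=budget_ms / 1000,
    )

record, orders = asyncio.run(gather_context("c-42"))
print(record, orders)
```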
Network Budget
Wire time is real. Patterns:
- Region pinning: route the user to the same region as the inference endpoint
- Connection pooling: reuse TCP/TLS connections
- HTTP/2 or gRPC: between agent and tool servers
- Edge ingress: caller hits the closest edge POP, then proxy to inference
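A sketch of the pooling and HTTP/2 patterns above, assuming httpx installed with its h2 extra; the internal tool URL is made up.

```python
# Sketch: one long-lived HTTP/2 client with pooled connections, rather than
# a new connection (DNS + TCP + TLS handshake) per tool call. Assumes httpx
# with its h2 extra; the tool URL is illustrative.
import httpx

# Create once at startup and reuse for every request: the pool keeps
# TCP/TLS connections warm, so only the first call pays handshake cost.
client = httpx.AsyncClient(
    http2=True,
    timeout=httpx.Timeout(0.15),  # 150ms hard cap per tool call
    limits=httpx.Limits(max_keepalive_connections=20),
)

async def call_tool(path: str, payload: dict) -> dict:
    resp = await client.post(f"https://tools.internal.example{path}", json=payload)
    resp.raise_for_status()
    return resp.json()
```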
A Concrete Voice Agent Latency Map
For CallSphere's healthcare voice agent in 2026:
flowchart TB
Mic[Mic audio] --> VAD[VAD: 100ms]
VAD --> Stream[Stream to OpenAI: 30ms]
Stream --> ASR[ASR + LLM forward: 250ms]
ASR --> Tool[Tool call to backend: 80ms]
Tool --> LLM2[LLM continuation: 100ms]
LLM2 --> TTS[TTS streaming: starts at 30ms]
TTS --> Spk[Speaker]
The stages overlap through streaming, so first audio arrives well before the sequential sum: total p50 is about 400ms, total p95 about 580ms, within the 500ms target most of the time.
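To keep a map like this honest, time each stage in code rather than on a whiteboard. A minimal sketch, with sleeps standing in for real work:

```python
# Sketch: time each pipeline stage so the latency map above can be measured
# rather than estimated. Stage names mirror the diagram; the sleeps stand in
# for real work.
import time
from contextlib import contextmanager

stage_timings_ms: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings_ms[name] = (time.perf_counter() - start) * 1000

with stage("vad"):
    time.sleep(0.01)   # stand-in for voice activity detection
with stage("asr_llm"):
    time.sleep(0.02)   # stand-in for ASR + LLM forward pass
with stage("tool"):
    time.sleep(0.005)  # stand-in for the backend tool call

print(stage_timings_ms)  # e.g. {'vad': 10.2, 'asr_llm': 20.4, 'tool': 5.1}
```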
Hidden Latency Sources
Non-obvious places latency hides:
- DNS resolution: cache or skip
- TLS handshake: connection pool
- Cold container starts: pre-warm pool
- Garbage collection in long-running processes: monitor and tune
- Database connection acquisition: warm pool
- Synchronous logging: log async to a buffer
- Serialization of large JSON: use protobuf or msgpack at hot paths
A 500ms-target system often has a 200ms surprise hiding in one of these.
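Synchronous logging is the easiest of these to fix. A sketch using the standard library's queue handler, so the request path only enqueues and a background thread does the I/O:

```python
# Sketch: move logging off the hot path with the stdlib queue handler.
# The request handler only enqueues records; a background thread does the writes.
import logging
import logging.handlers
import queue

log_queue: queue.SimpleQueue = queue.SimpleQueue()
listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
listener.start()  # background thread owns the (slow) write

logger = logging.getLogger("agent")
logger.addHandler(logging.handlers.QueueHandler(log_queue))
logger.setLevel(logging.INFO)

logger.info("tool call finished in %dms", 82)  # returns immediately
listener.stop()
```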
Streaming Hides Latency
The single biggest perceived-speed gain in 2026: streaming. The user does not wait 1500ms for a complete answer; they hear the first audio in 300ms and the rest while they listen. End-to-end latency may be similar; perceived latency is much lower.
The patterns that exploit streaming:
- LLM streams tokens
- TTS streams audio chunks
- Frontend renders progressively
- Tool calls happen mid-utterance where possible
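A sketch of measuring time-to-first-token separately from total latency, assuming the OpenAI Python SDK's streaming chat interface; the model name and token cap are illustrative.

```python
# Sketch: measure time-to-first-token separately from total latency when
# streaming. Assumes the OpenAI Python SDK; the model name is illustrative.
import time
from openai import OpenAI

client = OpenAI()

def stream_reply(prompt: str) -> str:
    start = time.perf_counter()
    first_token_ms = None
    parts = []
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=120,  # each output token is sequential: cap the tail
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        if delta and first_token_ms is None:
            first_token_ms = (time.perf_counter() - start) * 1000
        parts.append(delta)  # in a voice agent, forward this chunk to TTS now
    total_ms = (time.perf_counter() - start) * 1000
    print(f"first token: {first_token_ms:.0f}ms, total: {total_ms:.0f}ms")
    return "".join(parts)
```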
Latency vs Quality
flowchart LR
Speed[Faster] --> Q1[Smaller model]
Speed --> Q2[Less context]
Speed --> Q3[Less reasoning]
Quality[Better] --> Q4[Larger model]
Quality --> Q5[More context]
Quality --> Q6[Reasoning mode]
Sub-second decisions cost some quality. The right answer is per-task: critical decisions get the latency budget they need; bulk decisions get the speed.
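A sketch of per-task routing; the tier names, model identifiers, and budgets are placeholders, not recommendations.

```python
# Sketch: route each task to the cheapest model that meets its quality bar.
# Tier names and model identifiers are placeholders, not recommendations.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    think_budget_ms: int

ROUTES = {
    "bulk":     Route(model="small-fast-model", think_budget_ms=150),
    "standard": Route(model="mid-tier-model",   think_budget_ms=300),
    "critical": Route(model="frontier-model",   think_budget_ms=2000),  # quality over speed
}

def pick_route(task_tier: str) -> Route:
    # Default to the fast path; only critical decisions buy extra think time.
    return ROUTES.get(task_tier, ROUTES["bulk"])

print(pick_route("critical"))
```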
Measuring Latency Honestly
Three rules:
- p95 and p99 matter: averages hide tail issues
- End-to-end matters: not just the LLM call
- Per-tier breakdown: latency by tool, by region, by model
Logs without these dimensions cannot answer "why is this slow."
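A sketch of the kind of breakdown that can answer that question: percentiles per tool, region, and model computed from raw spans, here with a tiny synthetic sample.

```python
# Sketch: honest latency reporting from raw spans, with percentiles per
# (tool, region, model) rather than a single global average.
from collections import defaultdict
from statistics import quantiles

# Each span is (tool, region, model, latency_ms); a tiny synthetic sample.
spans = [
    ("crm_lookup", "us-east", "small", 42), ("crm_lookup", "us-east", "small", 55),
    ("crm_lookup", "us-east", "small", 61), ("crm_lookup", "us-east", "small", 240),
    ("calendar",   "eu-west", "small", 95), ("calendar",   "eu-west", "small", 110),
    ("calendar",   "eu-west", "small", 130), ("calendar",  "eu-west", "small", 520),
]

by_key = defaultdict(list)
for tool, region, model, ms in spans:
    by_key[(tool, region, model)].append(ms)

for key, values in by_key.items():
    cuts = quantiles(values, n=100)  # 99 cut points
    p95, p99 = cuts[94], cuts[98]
    print(key, f"p95={p95:.0f}ms p99={p99:.0f}ms")
```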
The Fastest Practical Voice Agent in 2026
Optimized for sub-300ms first-audio:
- Native S2S model (no separate ASR + TTS)
- Pre-warmed connection
- Edge ingress
- Single-region pinned
- Aggressive prompt caching
- No backend tool calls in the hot path (deferred to background)
This is achievable. Most teams do not need it; for the ones that do, the patterns are known.
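A sketch of the "no tool calls in the hot path" item: non-blocking work is handed to a background task so the reply is not held up; write_call_notes is a hypothetical stand-in for a slow backend call.

```python
# Sketch: keep the hot path tool-free by deferring non-blocking work.
# write_call_notes is a hypothetical stand-in for a slow backend write.
import asyncio

_background_tasks: set[asyncio.Task] = set()

async def write_call_notes(transcript: str) -> None:
    await asyncio.sleep(0.3)  # pretend: 300ms CRM write, far too slow for the hot path

async def respond(transcript: str) -> str:
    # Fire and forget: the reply goes out now; the CRM write finishes later.
    task = asyncio.create_task(write_call_notes(transcript))
    _background_tasks.add(task)                        # keep a reference so it isn't GC'd
    task.add_done_callback(_background_tasks.discard)
    return "You're booked for Tuesday at 10."          # produced without waiting on tools

async def main() -> None:
    print(await respond("book me Tuesday"))
    await asyncio.gather(*_background_tasks)           # in a demo, let background work drain

asyncio.run(main())
```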
Sources
- "LiveKit voice agent latency engineering" — https://docs.livekit.io
- OpenAI Realtime API documentation — https://platform.openai.com/docs/guides/realtime
- "Streaming UI patterns" Vercel — https://vercel.com/blog
- "Latency-quality tradeoff in LLMs" — https://arxiv.org
- Pipecat framework — https://www.pipecat.ai