Agent Latency Budgets: How to Hit Sub-Second Decisions
By Sagar Shankaran, Founder of CallSphere
Sub-second agent decisions need explicit budgets at every step. These are the 2026 latency-engineering patterns drawn from real production deployments.
When Latency Becomes a Hard Constraint
Background agents have minutes to think; voice agents have hundreds of milliseconds. Sub-second agent decisions are not solved with one trick; they are solved with explicit budgets at every step. This piece walks through the latency-budgeting discipline.
The Total Budget
```mermaid
flowchart LR
    User[User waits] --> Total[500ms total budget]
    Total --> Net[Network: 50ms]
    Total --> Th[Think: 200ms]
    Total --> Tool[Tool calls: 150ms]
    Total --> Resp[Respond: 100ms]
```
For a 500ms voice-agent budget, the four components (50 + 200 + 150 + 100ms) must each fit. An overrun in any one pushes the whole turn past the total unless another stage gives the time back.
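One way to keep the budget honest is to write it down as data and check measured timings against it. The sketch below is a minimal illustration, not CallSphere's implementation: the stage names mirror the diagram, while `over_budget` and the sample `turn` measurements are hypothetical.

```python
# Stage budgets in milliseconds, copied from the 500ms breakdown above.
BUDGET_MS = {"network": 50, "think": 200, "tools": 150, "respond": 100}
TOTAL_BUDGET_MS = 500

# The per-stage budgets must add up to the total, or the plan is already broken.
assert sum(BUDGET_MS.values()) == TOTAL_BUDGET_MS


def over_budget(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return the overshoot in ms for every stage that exceeded its budget."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0.0) > limit
    }


# Hypothetical measurements for one turn (not real production data).
turn = {"network": 42.0, "think": 260.0, "tools": 120.0, "respond": 95.0}
print(over_budget(turn))    # {'think': 60.0}
print(sum(turn.values()))   # 517.0
```

Here three stages come in under budget, yet the 60ms overrun in think time still pushes the turn to 517ms.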
Think-Time Budget
The LLM forward pass dominates think time. Patterns to keep it short:
- Use the smallest model that meets quality: per-tier routing puts the cheap model in front
- Cache aggressively: prompt caching cuts most of the prefill cost
- Limit output length: each output token is sequential
- Use streaming for perceived speed: time to first byte (TTFB) matters more to the listener than total latency
For agentic systems with multiple sequential LLM calls per turn, the per-call budget is the think-time budget divided by the call count. With the 200ms think budget above, a two-call agent gets 100ms per call. Even if the entire 500ms went to the LLM, each call would get only 250ms, which is barely enough on frontier models without caching.
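The division can be done against the time actually remaining rather than the nominal budget, so a slow first call shrinks what the second call is allowed. This is a rough sketch under that assumption; the `Deadline` class and its method names are made up for illustration.

```python
import time


class Deadline:
    """Tracks how much of a fixed latency budget is left for one turn."""

    def __init__(self, budget_ms: float, clock=time.monotonic):
        self._clock = clock  # injectable so the class can be tested without sleeping
        self._expires_at = clock() + budget_ms / 1000.0

    def remaining_ms(self) -> float:
        """Milliseconds left before the deadline, never negative."""
        return max(0.0, (self._expires_at - self._clock()) * 1000.0)

    def per_call_ms(self, calls_left: int) -> float:
        """Split whatever is left evenly across the sequential calls still to run."""
        if calls_left < 1:
            raise ValueError("calls_left must be at least 1")
        return self.remaining_ms() / calls_left


# Start a 200ms think budget for a two-call agent.
think = Deadline(budget_ms=200)
first_call_budget = think.per_call_ms(calls_left=2)   # roughly 100ms
# ... run the first LLM call with that timeout, then:
second_call_budget = think.per_call_ms(calls_left=1)  # whatever is left
```

Passing the per-call figure as the request timeout turns the budget into something the system enforces rather than something it merely reports.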
Tool-Call Budget
Tool calls add network and database latency. Patterns:
- Parallelize independent tool calls: do not serialize when not needed
- Pre-fetch likely-needed data: speculatively call tools the agent is likely to want
- Cache hot data: customer records, product catalogs change slowly
- Co-locate tool servers: same region, same VPC
For voice agents, tool calls made during a conversation should typically complete in under 100ms. Anything slower should be pushed to the background or hidden behind small talk.
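Parallelizing independent calls and capping each one at its budget can be sketched with `asyncio.gather` and `asyncio.wait_for`. The two tool functions below are hypothetical stand-ins that only sleep; the 100ms cap follows the figure above.

```python
import asyncio


async def call_with_budget(coro, budget_ms: float, fallback=None):
    """Await one tool call, returning `fallback` if it exceeds its budget."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_ms / 1000.0)
    except asyncio.TimeoutError:
        return fallback


async def fetch_customer(customer_id: str) -> dict:
    await asyncio.sleep(0.02)  # stand-in for a real backend lookup
    return {"id": customer_id, "name": "Example Patient"}


async def fetch_open_slots(clinic_id: str) -> list[str]:
    await asyncio.sleep(0.5)  # deliberately slower than the budget
    return ["09:00", "09:30"]


async def gather_tools() -> list:
    # Independent calls run concurrently, so wall time is bounded by the
    # slowest call (capped at its budget), not by the sum of both.
    return await asyncio.gather(
        call_with_budget(fetch_customer("c-1"), budget_ms=100),
        call_with_budget(fetch_open_slots("clinic-7"), budget_ms=100, fallback=[]),
    )


if __name__ == "__main__":
    print(asyncio.run(gather_tools()))
```

The slow slot lookup is cancelled at 100ms and replaced by an empty list, which lets the agent keep talking and fetch the slots in the background instead.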
Network Budget
Wire time is real. Patterns:
- Region pinning: route the user to the same region as the inference endpoint
- Connection pooling: reuse TCP/TLS connections
- HTTP/2 or gRPC: between agent and tool servers
- Edge ingress: the caller hits the closest edge POP, which then proxies to inference
A Concrete Voice Agent Latency Map
For CallSphere's healthcare voice agent in 2026:
```mermaid
flowchart TB
    Mic[Mic audio] --> VAD[VAD: 100ms]
    VAD --> Stream[Stream to OpenAI: 30ms]
    Stream --> ASR[ASR + LLM forward: 250ms]
    ASR --> Tool[Tool call to backend: 80ms]
    Tool --> LLM2[LLM continuation: 100ms]
    LLM2 --> TTS[TTS streaming: starts at 30ms]
    TTS --> Spk[Speaker]
```
Total p50 is about 400ms to first audio and total p95 is about 580ms. The median turn lands inside the 500ms target; the tail does not.
Hidden Latency Sources
Non-obvious places latency hides:
- DNS resolution: cache or skip
- TLS handshake: connection pool
- Cold container starts: pre-warm pool
- Garbage collection in long-running processes: monitor and tune
- Database connection acquisition: warm pool
- Synchronous logging: log async to a buffer
- Serialization of large JSON: use protobuf or msgpack at hot paths
A 500ms-target system often has a 200ms surprise hiding in one of these.
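Of the items above, synchronous logging is one of the easier ones to fix with the standard library alone. The following sketch uses `logging.handlers.QueueHandler` and `QueueListener`; the `ListHandler` sink and the logger name are invented for the example, and a real sink would write to a file or the network.

```python
import logging
import logging.handlers
import queue


def start_async_logging(logger: logging.Logger, sink: logging.Handler):
    """Route `logger` through an in-memory queue drained by a background thread."""
    log_queue = queue.Queue(-1)  # unbounded, so the hot path never blocks on put
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    return listener


class ListHandler(logging.Handler):
    """Toy sink that keeps messages in memory instead of doing real I/O."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("agent.hotpath")
log.setLevel(logging.INFO)
log.propagate = False  # keep the example's output away from the root logger
sink = ListHandler()
listener = start_async_logging(log, sink)

log.info("tool_call finished in %d ms", 80)  # only enqueues the record here
listener.stop()  # drains queued records and joins the background thread
print(sink.messages)
```

The request thread pays only for an in-memory enqueue; the slow write happens on the listener thread.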
Streaming Hides Latency
The single biggest perceived-speed gain in 2026: streaming. The user does not wait 1500ms for a complete answer; they hear the first audio in 300ms and the rest while they listen. End-to-end latency may be similar; perceived latency is much lower.
The patterns that exploit streaming:
- LLM streams tokens
- TTS streams audio chunks
- Frontend renders progressively
- Tool calls happen mid-utterance where possible
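The hand-off between a streaming LLM and streaming TTS can be illustrated by grouping tokens into sentences as they arrive, so synthesis starts on the first sentence instead of the full reply. This is a deliberately naive sketch: `sentence_chunks` is a hypothetical helper, and its punctuation rule would mis-split decimals and abbreviations.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences so TTS can start on the first one."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no end punctuation


# Stand-in for a token stream; a real one would arrive over the network.
tokens = ["Sure", ",", " I", " can", " help", ".", " What", " day", " works", "?", " Thanks"]
for chunk in sentence_chunks(tokens):
    print(chunk)  # each chunk could be sent to TTS as soon as it is complete
```

Because the function is a generator, the first sentence is available after six tokens rather than after the whole response.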
Latency vs Quality
```mermaid
flowchart LR
    Speed[Faster] --> Q1[Smaller model]
    Speed --> Q2[Less context]
    Speed --> Q3[Less reasoning]
    Quality[Better] --> Q4[Larger model]
    Quality --> Q5[More context]
    Quality --> Q6[Reasoning mode]
```
Sub-second decisions cost some quality. The right answer is per-task: critical decisions get the latency budget they need; bulk decisions get the speed.
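A per-task policy can be as simple as picking the best model tier that fits the latency a task is allowed. The tier names, latencies, and quality ranks below are illustrative assumptions, not measurements or real model IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str           # hypothetical tier label, not a real model ID
    typical_ms: float   # assumed typical latency; measure your own
    quality_rank: int   # higher means better answers


# Illustrative numbers only.
TIERS = [
    ModelTier("small", typical_ms=120, quality_rank=1),
    ModelTier("medium", typical_ms=350, quality_rank=2),
    ModelTier("large-reasoning", typical_ms=2500, quality_rank=3),
]


def pick_tier(latency_budget_ms: float) -> ModelTier:
    """Choose the highest-quality tier whose typical latency fits the budget."""
    fitting = [t for t in TIERS if t.typical_ms <= latency_budget_ms]
    if not fitting:
        # Nothing fits: fall back to the fastest tier rather than fail.
        return min(TIERS, key=lambda t: t.typical_ms)
    return max(fitting, key=lambda t: t.quality_rank)


print(pick_tier(200).name)    # small
print(pick_tier(5000).name)   # large-reasoning
```

A voice turn with a 200ms think budget gets the small tier, while a background decision that can wait five seconds gets the reasoning tier.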
Measuring Latency Honestly
Three rules:
- p95 and p99 matter: averages hide tail issues
- End-to-end matters: not just the LLM call
- Per-tier breakdown: latency by tool, by region, by model
Logs without these dimensions cannot answer "why is this slow."
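The three rules can be sketched as a nearest-rank percentile plus a group-by over one logged dimension. The record fields (`region`, `e2e_ms`) and the sample data are made up for illustration.

```python
import math
from collections import defaultdict


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breakdown(records: list[dict], dimension: str) -> dict[str, dict[str, float]]:
    """Group end-to-end latencies by one dimension and report p50/p95/p99."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[dimension]].append(rec["e2e_ms"])
    return {
        key: {f"p{p}": percentile(vals, p) for p in (50, 95, 99)}
        for key, vals in groups.items()
    }


# Made-up records: 18 fast turns and 2 slow ones in a single region.
records = [{"region": "us-east", "e2e_ms": 380.0}] * 18 + [
    {"region": "us-east", "e2e_ms": 1400.0}
] * 2

mean = sum(r["e2e_ms"] for r in records) / len(records)
print(mean)                          # 482.0
print(breakdown(records, "region"))  # {'us-east': {'p50': 380.0, 'p95': 1400.0, 'p99': 1400.0}}
```

The mean of 482ms sits under a 500ms target, while p95 shows that one turn in ten takes 1400ms. Swapping `"region"` for a tool or model field gives the other breakdowns.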
The Fastest Practical Voice Agent in 2026
Optimized for sub-300ms first-audio:
- Native S2S model (no separate ASR + TTS)
- Pre-warmed connection
- Edge ingress
- Single-region pinned
- Aggressive prompt caching
- No backend tool calls in the hot path (deferred to background)
This is achievable. Most teams do not need it; for the ones that do, the patterns are known.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.