---
title: "Agent Latency Budgets: How to Hit Sub-Second Decisions"
description: "Sub-second agent decisions need explicit budgets at every step. The 2026 latency-engineering patterns from real production deployments."
canonical: https://callsphere.ai/blog/agent-latency-budgets-sub-second-decisions-2026
category: "Agentic AI"
tags: ["Latency", "Performance", "Agent Design", "Real-Time AI"]
author: "CallSphere Team"
published: 2026-04-25T00:00:00.000Z
updated: 2026-05-05T10:15:28.426Z
---

# Agent Latency Budgets: How to Hit Sub-Second Decisions

> Sub-second agent decisions need explicit budgets at every step. The 2026 latency-engineering patterns from real production deployments.

## When Latency Becomes a Hard Constraint

Background agents have minutes to think; voice agents have hundreds of milliseconds. Sub-second agent decisions are not solved with one trick; they are solved with explicit budgets at every step. This piece walks through the latency-budgeting discipline.

## The Total Budget

```mermaid
flowchart LR
    User[User waits] --> Total[500ms total budget]
    Total --> Net[Network: 50ms]
    Total --> Th[Think: 200ms]
    Total --> Tool[Tool calls: 150ms]
    Total --> Resp[Respond: 100ms]
```

For a 500ms voice-agent budget, every component must fit within its slice. Overspend one without clawing time back from another and you exceed the total.
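
One way to keep the budget honest is to make it executable. A minimal sketch (names are illustrative, not from any framework) that fails the moment allocations exceed the total:

```python
# A minimal sketch of an executable budget; names are illustrative, not
# from any framework. It fails the moment allocations exceed the total.
from dataclasses import dataclass, field

@dataclass
class LatencyBudget:
    total_ms: float
    stages: dict[str, float] = field(default_factory=dict)

    def allocate(self, name: str, ms: float) -> None:
        self.stages[name] = ms
        spent = sum(self.stages.values())
        if spent > self.total_ms:
            raise ValueError(f"budget blown: {spent:.0f}ms of {self.total_ms:.0f}ms")

budget = LatencyBudget(total_ms=500)
for name, ms in [("network", 50), ("think", 200), ("tools", 150), ("respond", 100)]:
    budget.allocate(name, ms)  # exactly 500ms: any increase raises
```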

## Think-Time Budget

The LLM forward pass dominates think time. Patterns to keep it short:

- **Use the smallest model that meets quality**: per-tier routing puts the cheap model in front
- **Cache aggressively**: prompt caching cuts most of the prefill cost
- **Limit output length**: each output token is sequential
- **Use streaming for perceived speed**: TTFB matters more than total latency

For agentic systems with multiple LLM calls per turn, the per-call budget is the total budget divided by call count. A two-LLM-call agent with 500ms total has 250ms per LLM call — barely enough on frontier models without caching.
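
A hedged sketch of what the routing and output-cap patterns look like in code, assuming the OpenAI Python SDK; the tier table and model names are placeholders, not recommendations:

```python
# Hedged sketch, assuming the OpenAI Python SDK. The tier table and model
# names are placeholders, not recommendations from this post.
from openai import OpenAI

client = OpenAI()

MODEL_BY_TIER = {
    "routine": "gpt-4o-mini",  # smallest model that meets quality: front-line traffic
    "complex": "gpt-4o",       # reserved for turns that earn the extra latency
}

PER_CALL_BUDGET_S = 0.500 / 2  # two LLM calls per turn against a 500ms total

def think(prompt: str, tier: str):
    # Output tokens decode sequentially, so max_tokens bounds decode time;
    # stream=True gets the first token out while the rest generates.
    return client.chat.completions.create(
        model=MODEL_BY_TIER[tier],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
        stream=True,
        timeout=PER_CALL_BUDGET_S,  # fail the call rather than blow the turn
    )
```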

## Tool-Call Budget

Tool calls add network and database latency. Patterns:

- **Parallelize independent tool calls**: do not serialize calls that have no data dependency
- **Pre-fetch likely-needed data**: speculatively call tools the agent is likely to want
- **Cache hot data**: customer records, product catalogs change slowly
- **Co-locate tool servers**: same region, same VPC

For voice agents, tool calls during a conversation should typically complete in under 100ms. Anything slower is pushed to the background or hidden behind small talk.
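
A minimal sketch of that parallelize-and-defer pattern with asyncio; the tool functions are hypothetical stand-ins:

```python
# Sketch: start independent tool calls together, wait at most the 100ms
# budget, let anything slower keep running. Tool functions are stand-ins.
import asyncio

TOOL_BUDGET_S = 0.100  # the 100ms in-conversation tool budget

async def fetch_record(customer_id: str) -> dict:
    await asyncio.sleep(0.03)  # stand-in for a fast backend call
    return {"id": customer_id}

async def fetch_schedule(customer_id: str) -> dict:
    await asyncio.sleep(0.25)  # stand-in for a slow one that misses budget
    return {"slots": []}

async def call_tools(customer_id: str) -> dict:
    tasks = {
        asyncio.create_task(fetch_record(customer_id)): "record",
        asyncio.create_task(fetch_schedule(customer_id)): "schedule",
    }
    # wait() with a timeout does not cancel stragglers: in a live agent the
    # loop stays up, so slow calls finish in the background for a later turn.
    done, _pending = await asyncio.wait(tasks.keys(), timeout=TOOL_BUDGET_S)
    return {tasks[t]: t.result() for t in done}

print(asyncio.run(call_tools("c-42")))  # only the fast call made the cut
```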

## Network Budget

Wire time is real. Patterns:

- **Region pinning**: route the user to the same region as the inference endpoint
- **Connection pooling**: reuse TCP/TLS connections
- **HTTP/2 or gRPC**: between agent and tool servers
- **Edge ingress**: caller hits the closest edge POP, then proxy to inference
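
As a sketch, one long-lived httpx client per tool server covers the pooling and HTTP/2 points; the base URL and timeout values are illustrative:

```python
# Sketch: one long-lived client per tool server, so TCP/TLS setup is paid
# once, not per call. http2=True needs the h2 extra: pip install httpx[http2]
import httpx

tool_client = httpx.AsyncClient(
    base_url="https://tools.internal.example",  # hypothetical in-VPC tool server
    http2=True,
    timeout=httpx.Timeout(0.5, connect=0.1),    # 100ms connect cap, 500ms total
    limits=httpx.Limits(max_keepalive_connections=20),
)

async def call_tool(path: str, payload: dict) -> dict:
    resp = await tool_client.post(path, json=payload)  # reuses a pooled connection
    resp.raise_for_status()
    return resp.json()
```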

## A Concrete Voice Agent Latency Map

For CallSphere's healthcare voice agent in 2026:

```mermaid
flowchart TB
    Mic[Mic audio] --> VAD[VAD: 100ms]
    VAD --> Stream[Stream to OpenAI: 30ms]
    Stream --> ASR[ASR + LLM forward: 250ms]
    ASR --> Tool[Tool call to backend: 80ms]
    Tool --> LLM2[LLM continuation: 100ms]
    LLM2 --> TTS[TTS streaming: starts at 30ms]
    TTS --> Spk[Speaker]
```

Total p50 first-audio: about 400ms (the stages overlap through streaming, so the critical path is shorter than the sum of the boxes). Total p95: about 580ms. Within the 500ms target most of the time; the tail runs over.
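
Numbers like these should come from measurement, not guesswork. A minimal per-stage timer sketch, with illustrative names:

```python
# Minimal per-stage timer; names are illustrative. Each turn fills a dict
# that can be logged and aggregated into the p50/p95 figures above.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000  # ms

with stage("tool"):
    time.sleep(0.08)  # stand-in for the 80ms backend call
print(timings)
```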

## Hidden Latency Sources

Non-obvious places latency hides:

- **DNS resolution**: cache or skip
- **TLS handshake**: connection pool
- **Cold container starts**: pre-warm pool
- **Garbage collection in long-running processes**: monitor and tune
- **Database connection acquisition**: warm pool
- **Synchronous logging**: log async to a buffer
- **Serialization of large JSON**: use protobuf or msgpack on hot paths

A 500ms-target system often has a 200ms surprise hiding in one of these.
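
The synchronous-logging fix, for example, is a few lines with the standard library's QueueHandler and QueueListener, sketched here:

```python
# Sketch of the async-logging fix with the standard library: handlers run
# on a listener thread, so the hot path only enqueues and returns.
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

log_queue: queue.Queue = queue.Queue(-1)  # unbounded buffer
listener = QueueListener(log_queue, logging.FileHandler("agent.log"))
listener.start()  # drains the queue off the request thread

logger = logging.getLogger("agent")
logger.setLevel(logging.INFO)
logger.addHandler(QueueHandler(log_queue))  # hot path: enqueue, no disk I/O

logger.info("turn complete")  # microseconds, not milliseconds
# call listener.stop() at shutdown to flush remaining records
```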

## Streaming Hides Latency

The single biggest perceived-speed gain in 2026: streaming. The user does not wait 1500ms for a complete answer; they hear the first audio in 300ms and the rest while they listen. End-to-end latency may be similar; perceived latency is much lower.

The patterns that exploit streaming:

- LLM streams tokens
- TTS streams audio chunks
- Frontend renders progressively
- Tool calls happen mid-utterance where possible
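
Measuring both numbers is cheap. A sketch assuming the OpenAI Python SDK (the model name is a placeholder) that reports time-to-first-token next to total time:

```python
# Sketch, assuming the OpenAI Python SDK; the model name is a placeholder.
# Reports time-to-first-token (what the user feels) next to total time.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_ms = None
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Confirm the 3pm appointment."}],
    stream=True,
)
for chunk in stream:
    if first_token_ms is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_ms = (time.perf_counter() - start) * 1000
total_ms = (time.perf_counter() - start) * 1000
print(f"first token: {first_token_ms:.0f}ms, full answer: {total_ms:.0f}ms")
```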

## Latency vs Quality

```mermaid
flowchart LR
    Speed[Faster] --> Q1[Smaller model]
    Speed --> Q2[Less context]
    Speed --> Q3[Less reasoning]
    Quality[Better] --> Q4[Larger model]
    Quality --> Q5[More context]
    Quality --> Q6[Reasoning mode]
```

Sub-second decisions cost some quality. The right answer is per-task: critical decisions get the latency budget they need; bulk decisions get the speed.
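
One way to encode that per-task answer is a small policy table; the task names and values here are made up for illustration:

```python
# Illustrative policy table: each decision class carries its own model,
# budget, and reasoning setting. Task names and values are made up.
POLICY = {
    "confirm_slot":    {"model": "small-fast",    "budget_ms": 300,  "reasoning": False},
    "triage_symptoms": {"model": "large-careful", "budget_ms": 2000, "reasoning": True},
}

def policy_for(task: str) -> dict:
    # Critical decisions buy quality with latency; bulk decisions buy speed.
    return POLICY[task]
```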

## Measuring Latency Honestly

Three rules:

- **p95 and p99 matter**: averages hide tail issues
- **End-to-end matters**: not just the LLM call
- **Per-tier breakdown**: latency by tool, by region, by model

Logs without these dimensions cannot answer "why is this slow."
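
A sketch of all three rules with the standard library: record raw end-to-end samples tagged by (tool, region, model) and report tail percentiles, never averages:

```python
# Sketch: raw samples tagged by (tool, region, model), tail percentiles
# computed with the standard library. Averages would hide exactly this tail.
import statistics
from collections import defaultdict

samples: dict[tuple[str, str, str], list[float]] = defaultdict(list)

def record(tool: str, region: str, model: str, ms: float) -> None:
    samples[(tool, region, model)].append(ms)

def report() -> None:
    for key, values in samples.items():
        cuts = statistics.quantiles(values, n=100)  # 99 cut points
        print(f"{key}: p95={cuts[94]:.0f}ms p99={cuts[98]:.0f}ms")
```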

## The Fastest Practical Voice Agent in 2026

Optimized for sub-300ms first-audio:

- Native S2S model (no separate ASR + TTS)
- Pre-warmed connection
- Edge ingress
- Single-region pinned
- Aggressive prompt caching
- No backend tool calls in the hot path (deferred to background)

This is achievable. Most teams do not need it; for the ones that do, the patterns are known.

## Sources

- "LiveKit voice agent latency engineering" — [https://docs.livekit.io](https://docs.livekit.io)
- OpenAI Realtime API documentation — [https://platform.openai.com/docs/guides/realtime](https://platform.openai.com/docs/guides/realtime)
- "Streaming UI patterns" Vercel — [https://vercel.com/blog](https://vercel.com/blog)
- "Latency-quality tradeoff in LLMs" — [https://arxiv.org](https://arxiv.org)
- Pipecat framework — [https://www.pipecat.ai](https://www.pipecat.ai)

