---
title: "Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)"
description: "A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times."
canonical: https://callsphere.ai/blog/voice-ai-latency-sub-second-why-matters
category: "Technical Guides"
tags: ["AI Voice Agent", "Technical Guide", "Latency", "Performance", "OpenAI", "Optimization", "Realtime"]
author: "CallSphere Team"
published: 2026-04-08T00:00:00.000Z
updated: 2026-05-06T01:02:47.352Z
---

# Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)

> A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times.

## The conversational cliff

Humans expect a reply within roughly 500-700ms in natural conversation. Push past one second and callers feel like they are talking to a computer. Push past two seconds and they start talking over the agent and abandoning the call. Latency is not a nice-to-have in voice AI; it is the single biggest determinant of whether the conversation feels real.

This post walks through the full latency budget for a modern voice agent and the techniques that get you reliably under one second.

```
total = network + vad + stt + llm_first_token + llm_reasoning + tts_first_frame + playback
```

## Architecture overview

```
caller                                           time budget
  │
  ├─► network_in          ─────►  40ms
  ├─► VAD decision        ─────► 150ms
  ├─► STT partial         ─────► 150ms (overlaps VAD)
  ├─► LLM first token     ─────► 250ms
  ├─► LLM finish          ─────► 150ms (streams during TTS)
  ├─► TTS first audio     ─────► 120ms
  ├─► network_out         ─────►  40ms
  └─► speaker             ─────►
                             ─────────
                   total  →   ~750ms
```
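
Note how the overlaps are counted: STT partials run during the VAD window, but the final transcript only lands once end-of-speech is confirmed, so that trailing ~150ms still sits on the critical path, while the LLM's remaining tokens hide entirely under TTS playback. A quick sanity check of the arithmetic (the stage names and the trailing-STT reading are assumptions drawn from the diagram above):

```python
# Serial critical path: overlapped work contributes only the time it
# extends past the stage it overlaps with.
budget_ms = {
    "network_in": 40,
    "vad_decision": 150,
    "stt_final_after_vad": 150,  # partials overlap VAD; the final trails it
    "llm_first_token": 250,
    "tts_first_frame": 120,      # remaining LLM tokens stream underneath
    "network_out": 40,
}
print(sum(budget_ms.values()), "ms")  # -> 750 ms
```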

## Prerequisites

- A working voice agent pipeline.
- An OpenTelemetry tracing backend (Honeycomb, Tempo, Jaeger).
- The ability to measure wall-clock times at every boundary.

## Step-by-step walkthrough

### 1. Instrument every stage with spans

```python
from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

async def handle_turn(audio_in):
    # One parent span per conversational turn; child spans mark each stage
    # boundary so the backend can show exactly where the budget goes.
    with tracer.start_as_current_span("turn"):
        with tracer.start_as_current_span("vad"):
            ...  # VAD decision: has the caller stopped speaking?
        with tracer.start_as_current_span("stt"):
            ...  # final transcript (partials stream earlier)
        with tracer.start_as_current_span("llm_first_token"):
            ...  # time to first streamed token
        with tracer.start_as_current_span("tts_first_frame"):
            ...  # time to first synthesized audio frame
```
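
These spans only help once they are exported. A minimal exporter setup, assuming the standard OTLP packages (`opentelemetry-sdk` plus `opentelemetry-exporter-otlp`) and a collector at the default gRPC port; swap the endpoint for your Honeycomb, Tempo, or Jaeger intake:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# BatchSpanProcessor ships spans off the hot path; SimpleSpanProcessor
# would block on every span end, which is exactly the kind of hidden
# latency this post is about.
provider = TracerProvider(resource=Resource.create({"service.name": "voice-agent"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```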

### 2. Stream everything

Never wait for a stage to finish before starting the next. STT should emit partials, the LLM should stream tokens, TTS should stream audio frames. The end-of-turn signal is the only blocking event.

```mermaid
flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching
vLLM scheduler"]
    PREF{"Prefill or
decode?"}
    PRE["Prefill phase
parallel attention"]
    DEC["Decode phase
token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling
top-p, temp"]
    STREAM["Stream tokens
to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
```
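
A minimal sketch of the overlap in Python, assuming hypothetical `llm_stream` (an async iterator of text tokens), `tts_stream` (text in, async audio frames out), and `play_frame` helpers — a producer task keeps pulling LLM tokens while finished sentences are already being synthesized and played:

```python
import asyncio

async def speak_streaming(llm_stream, tts_stream, play_frame):
    sentences: asyncio.Queue = asyncio.Queue()

    async def produce():
        # Accumulate tokens and hand off each completed sentence
        # immediately, so TTS starts long before the LLM finishes.
        buffer = ""
        async for token in llm_stream:
            buffer += token
            if buffer.endswith((".", "!", "?")):
                await sentences.put(buffer)
                buffer = ""
        if buffer:
            await sentences.put(buffer)  # trailing partial sentence
        await sentences.put(None)        # end-of-turn sentinel

    producer = asyncio.create_task(produce())
    while (sentence := await sentences.get()) is not None:
        async for frame in tts_stream(sentence):
            await play_frame(frame)
    await producer
```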

### 3. Collapse the pipeline

The OpenAI Realtime API removes three network hops by doing STT, LLM, and TTS in one WebSocket. That alone saves 200-400ms versus a DIY stack of Whisper + GPT + ElevenLabs as separate HTTP calls.

```typescript
// Configure the session once at connect time: server-side VAD closes the
// turn after 400ms of silence, and raw PCM16 avoids codec round-trips.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
  },
}));
```

### 4. Prewarm everything

At call setup, open the Realtime WebSocket before the caller says "hello". The TLS handshake and model load dominate first-turn latency otherwise.

```python
# open_realtime_session and sessions are app-specific; the point is that
# the handshake happens while the phone is still ringing.
async def on_incoming_ring(call_sid: str):
    session = await open_realtime_session()  # TLS + handshake now, not mid-call
    sessions[call_sid] = session             # ready before the caller speaks
```

### 5. Keep tool calls off the hot path when possible

If a tool call takes >300ms, the agent should speak a filler ("let me pull that up") and stream it while the tool runs. The Realtime API makes this easy with `response.create` plus an instructions override.
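
A sketch of that pattern over the Realtime WebSocket, assuming a `websockets`-style `ws` and a placeholder `run_tool` executor — the event shapes follow the Realtime API, but check them against the current event reference:

```python
import json

async def call_tool_with_filler(ws, call_id: str, tool_name: str, args: dict):
    # Out-of-band response: the instructions here override the session
    # instructions for this one utterance, so the agent speaks a filler
    # immediately instead of sitting silent while the tool runs.
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {"instructions": "Briefly tell the caller you are looking that up."},
    }))
    result = await run_tool(tool_name, args)  # placeholder: your tool executor
    # Hand the tool output back to the model, then ask for the real answer.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))
```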

### 6. Measure p50, p95, and p99 separately

Average latency hides the calls that feel broken. Track percentiles per stage and alert on p95.
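
For a quick sanity check outside the tracing backend, a minimal sketch with `numpy` over collected per-stage durations (the sample numbers are illustrative):

```python
import numpy as np

def latency_report(durations_ms: dict[str, list[float]]) -> None:
    # p50 tells you the typical call; p95 and p99 tell you how many
    # callers hit the conversational cliff.
    for stage, samples in durations_ms.items():
        p50, p95, p99 = np.percentile(samples, [50, 95, 99])
        print(f"{stage:>16}: p50={p50:5.0f}ms  p95={p95:5.0f}ms  p99={p99:5.0f}ms")

latency_report({
    "llm_first_token": [210, 240, 260, 250, 930],  # one slow outlier
    "tts_first_frame": [110, 120, 130, 115, 125],
})
```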

## Production considerations

- **Geography**: keep the edge, the model, and the carrier in the same region. Cross-region adds 60-150ms.
- **Cold starts**: if you run on serverless, warm pools are mandatory.
- **Network path**: use private connectivity to your carrier if they offer it.
- **GC pauses**: Node and Python both have them; profile under load.
- **Audio codec conversion**: each resample costs 5-15ms. Do it once per direction.
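
On the codec point: a sketch of a one-shot telephony-to-wideband resample with `scipy.signal.resample_poly`, done once on ingress and shared by every downstream consumer (the 8kHz→24kHz rates match a typical Twilio-to-Realtime path, but treat them as assumptions):

```python
import numpy as np
from scipy.signal import resample_poly

def upsample_8k_to_24k(pcm16: bytes) -> bytes:
    # Resample once on ingress; converting per-consumer multiplies
    # the 5-15ms cost by every reader of the stream.
    samples = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32)
    resampled = resample_poly(samples, up=3, down=1)  # 8kHz * 3 = 24kHz
    return np.clip(resampled, -32768, 32767).astype(np.int16).tobytes()
```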

## CallSphere's real implementation

CallSphere targets and maintains sub-one-second end-to-end response time across every production vertical. The voice plane runs on the OpenAI Realtime API with `gpt-4o-realtime-preview-2025-06-03`, PCM16 at 24kHz, and server VAD — a single WebSocket per call, pre-warmed at ring, terminated at a FastAPI edge co-located with Twilio's media region.

The multi-agent topologies — 14 tools for healthcare, 10 for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and the 5-specialist ElevenLabs sales pod — are all orchestrated through the OpenAI Agents SDK. Handoffs between agents reuse the same session so there is no TLS renegotiation mid-call, and post-call analytics from a GPT-4o-mini pipeline run asynchronously so they never contend with the hot audio path. CallSphere supports 57+ languages within the same latency budget.

## Common pitfalls

- **Buffering audio for "smoothing"**: it adds latency for negligible quality gain.
- **Running STT in a separate HTTP request**: you lose streaming.
- **Serial tool calls**: parallelize them when the arguments are independent.
- **Logging in the hot path**: async log emit, never block.
- **Ignoring p99**: if the slowest 1% of calls feel broken, that is 1% of callers primed to churn.

## FAQ

### What is a realistic target?

Under 1 second at p50, under 1.4 seconds at p95.

### Does the LLM model size matter?

Yes, but less than you think. The Realtime API's gpt-4o variant is already tuned for low first-token latency.

### How much does TLS handshake cost?

40-120ms the first time, free on reuse.

### Is WebRTC faster than Twilio Media Streams?

Marginally, because WebRTC uses UDP. Twilio over WebSocket is still plenty fast for production.

### Can I reduce latency by running a local model?

Only if your local model beats the Realtime API end-to-end, which is rarely true today.

## Next steps

Want to measure latency on your current stack? [Book a demo](https://callsphere.tech/contact) to see how CallSphere hits sub-second on live traffic, read the [technology page](https://callsphere.tech/technology), or compare [pricing](https://callsphere.tech/pricing).

#CallSphere #Latency #VoiceAI #Performance #OpenAIRealtime #Observability #AIVoiceAgents

