AI Engineering

The Latency Budget for AI Voice Agents Across PSTN in 2026

Where every millisecond goes between caller and AI: PSTN, carrier, STT, LLM, TTS, and back. The component-level targets that ship in 2026 and how to hit them.

Humans expect a reply within roughly 500 to 700 ms in natural conversation. Anything past one second feels artificial; past two seconds the caller starts talking over the agent. The 2026 latency budget for an AI phone agent is unforgiving and the math is well understood.

Background: the 2026 latency picture

flowchart TD
  Out[Outbound campaign] --> Twilio[Twilio Voice API]
  Twilio --> STIR[STIR/SHAKEN attestation]
  STIR --> Carrier[Originating carrier]
  Carrier --> Term[Terminating carrier]
  Term --> Recipient[Recipient phone]
  Recipient --> Webhook[/voice webhook/]
  Webhook --> Agent[AI sales agent]
CallSphere reference architecture

Twilio published explicit November 2025 targets that the industry has converged on:

  • Mouth-to-ear turn gap (what the human perceives): target 1,115 ms, upper limit 1,400 ms.
  • Platform turn gap (internal processing only): target 885 ms, upper limit 1,100 ms.
  • STT first transcript: target 350 ms, upper limit 500 ms.
  • LLM time-to-first-token: target 375 ms, upper limit 750 ms.
  • TTS first byte: target 100 ms, upper limit 250 ms.

Twilio's ConversationRelay reports latency under 500 ms at p50 and under 725 ms at p95.
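To make the targets above actionable, they can be encoded as a lookup table and used to grade each measured turn. The following is a minimal sketch: the metric keys, the `Budget` type, and the three grade labels are illustrative names, not part of any Twilio API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    target_ms: int
    upper_ms: int


# Targets and upper limits copied from the list above.
# The dictionary keys are hypothetical names chosen for this sketch.
BUDGETS = {
    "mouth_to_ear": Budget(1115, 1400),
    "platform_turn_gap": Budget(885, 1100),
    "stt_first_transcript": Budget(350, 500),
    "llm_ttft": Budget(375, 750),
    "tts_first_byte": Budget(100, 250),
}


def grade(metric: str, measured_ms: float) -> str:
    """Classify one measurement against its target and upper limit."""
    budget = BUDGETS[metric]
    if measured_ms <= budget.target_ms:
        return "on_target"
    if measured_ms <= budget.upper_ms:
        return "within_limit"
    return "over_limit"


def grade_turn(measurements: dict) -> dict:
    """Grade every metric recorded for a single conversational turn."""
    return {name: grade(name, value) for name, value in measurements.items()}


if __name__ == "__main__":
    example = {"stt_first_transcript": 320, "llm_ttft": 600, "tts_first_byte": 300}
    print(grade_turn(example))
```

Grading per component rather than only end to end shows which stage consumed the budget when a turn runs long.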

A cascaded agent (STT → LLM → TTS) requires at least ten network traversals to produce a single response: two voice legs over the public network, eight inter-service handoffs. Network transmission contributes 40 to 70 ms; orchestration adds the largest delays at roughly 350 ms.

PSTN itself adds about 500 ms of fixed latency across the call path on a typical North American route, leaving only a few hundred milliseconds of the mouth-to-ear budget for AI processing. That is why speech-to-speech architectures (OpenAI Realtime over SIP-direct) win for natural conversation: they collapse multiple hops into one model.

How VoIP and SIP work for this use case

The components on the wire, with realistic 2026 contributions:

  1. Caller's phone to originating carrier: 50 to 100 ms.
  2. Originating carrier to terminating carrier (Twilio/Telnyx): 50 to 200 ms (lower on Telnyx private backbone).
  3. Edge SBC at terminating carrier: 5 to 20 ms.
  4. Carrier to AI service (SIP-direct or WebSocket bridge): 5 to 50 ms.
  5. STT (if cascaded): 150 to 500 ms.
  6. End-of-turn detection: 100 to 400 ms (often the long pole).
  7. LLM time-to-first-token: 350 to 1000 ms.
  8. TTS first byte: 75 to 250 ms.
  9. Return path through carrier: 50 to 200 ms.
  10. Caller's phone: 30 to 80 ms.

A speech-to-speech model (OpenAI Realtime) collapses steps 5 through 8 into a single model with a time to first byte (TTFB) under 500 ms. That is why those architectures are taking over.
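The arithmetic behind that claim can be checked by summing the ranges in the list. This sketch assumes the hops run strictly one after another, which overstates the total wherever stages stream and overlap; the dictionary keys are illustrative names, and the 500 ms speech-to-speech figure is used only as an upper bound because the text gives no lower one.

```python
# (low_ms, high_ms) per hop, copied from the numbered list above.
NETWORK_HOPS = {
    "caller_to_originating_carrier": (50, 100),
    "originating_to_terminating_carrier": (50, 200),
    "edge_sbc": (5, 20),
    "carrier_to_ai_service": (5, 50),
    "return_path_through_carrier": (50, 200),
    "caller_phone": (30, 80),
}

# Steps 5 through 8: the stages a speech-to-speech model replaces.
CASCADED_AI = {
    "stt": (150, 500),
    "end_of_turn_detection": (100, 400),
    "llm_ttft": (350, 1000),
    "tts_first_byte": (75, 250),
}

SPEECH_TO_SPEECH_TTFB_MAX_MS = 500  # "under 500 ms" per the text


def total(ranges: dict) -> tuple:
    """Sum the low ends and the high ends of a set of (low, high) ranges."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


network_low, network_high = total(NETWORK_HOPS)
ai_low, ai_high = total(CASCADED_AI)

cascaded_range = (network_low + ai_low, network_high + ai_high)
speech_to_speech_worst_case = network_high + SPEECH_TO_SPEECH_TTFB_MAX_MS

if __name__ == "__main__":
    print("network only:", (network_low, network_high))
    print("cascaded AI stages:", (ai_low, ai_high))
    print("cascaded end to end:", cascaded_range)
    print("speech-to-speech worst case:", speech_to_speech_worst_case)
```

Under these assumptions the cascaded path spans 865 to 2,800 ms, while the speech-to-speech worst case is 1,150 ms, which sits inside the 1,400 ms mouth-to-ear upper limit.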

CallSphere implementation

CallSphere targets sub-1-second mouth-to-ear latency at p50 across all six verticals on Twilio. Healthcare AI, a FastAPI service on port 8084 connected to OpenAI Realtime, hits this comfortably with the SIP-direct pattern. Sales Calling AI, which runs five concurrent outbound calls on Twilio, measures slightly higher because outbound dial setup adds initial overhead. After-Hours AI places a Twilio call and sends an SMS simultaneously, with a 120-second timeout; because the SMS runs in parallel, the voice path still follows the standard 1-second target.

The 37 agents across 90+ tools and 115+ database tables, HIPAA and SOC 2 controls, and pricing of $149/$499/$1499 for 1/3/10 numbers do not change based on latency tier; latency is a quality metric we monitor per call. The 14-day trial lets prospects compare CallSphere's measured latency against their existing IVR or human answering service.

Build and integration steps

  1. Instrument every hop. Log timestamps at: SIP INVITE received, STT first transcript, LLM first token, TTS first byte, audio first frame to caller.
  2. Compute mouth-to-ear turn gap as the metric of record; track p50, p95, p99.
  3. Pick a carrier with a low-latency private backbone where regional latency matters (Telnyx for many regions, Twilio for global breadth).
  4. Use SIP-direct to OpenAI Realtime where possible to collapse STT/LLM/TTS into one model.
  5. Co-locate your tool servers in the same region as your AI provider.
  6. Tune end-of-turn detection conservatively: false positives waste model time, false negatives slow turn-taking.
  7. Use Opus where the leg supports it; G.711 where it doesn't.
  8. Alert on p95 mouth-to-ear above 1,400 ms; investigate every breach.
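Steps 1, 2, and 8 can be sketched as a small aggregation over per-turn timestamps. The field names below are hypothetical; in particular, `user_speech_end_ms` stands for the platform's best estimate of when the caller stopped talking, which the list above does not define. Timestamps taken inside the platform measure something closer to the platform turn gap, so a true mouth-to-ear figure would also need an estimate of the carrier legs.

```python
import math

P95_ALERT_MS = 1400  # mouth-to-ear upper limit from step 8


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def turn_gap_ms(turn):
    """Gap between the caller finishing and the first audio frame going out.

    Both keys are hypothetical field names for this sketch.
    """
    return turn["audio_first_frame_ms"] - turn["user_speech_end_ms"]


def summarize(turns):
    """Return p50/p95/p99 turn gaps and whether the p95 alert should fire."""
    gaps = [turn_gap_ms(t) for t in turns]
    stats = {p: percentile(gaps, p) for p in (50, 95, 99)}
    stats["alert"] = stats[95] > P95_ALERT_MS
    return stats


turns = [
    {"user_speech_end_ms": 0, "audio_first_frame_ms": 900},
    {"user_speech_end_ms": 5000, "audio_first_frame_ms": 6000},
    {"user_speech_end_ms": 9000, "audio_first_frame_ms": 10100},
    {"user_speech_end_ms": 15000, "audio_first_frame_ms": 16500},
]

if __name__ == "__main__":
    print(summarize(turns))
```

Nearest-rank percentiles are used here because they always return an observed sample, which makes an alerting breach easy to trace back to a specific call.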

Code or config snippet

<!-- TwiML: outbound call with status callback for latency telemetry -->
<Response>
  <Dial
    callerId="+15555550100"
    answerOnBridge="true">
    <Number
      statusCallback="https://api.callsphere.ai/twilio/dial-status"
      statusCallbackEvent="initiated ringing answered completed"
      statusCallbackMethod="POST">
      +15555550199
    </Number>
  </Dial>
</Response>
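On the receiving side, the status callbacks can be turned into a dial-setup measurement. This is a sketch under stated assumptions: it expects each callback body to be form-encoded with a `CallStatus` field and a `Timestamp` field in RFC 2822 format, and it assumes the answered event arrives with `CallStatus` set to `in-progress`. Verify those details against the Twilio status callback documentation before relying on them. The function names are illustrative, and the sample bodies are constructed by hand rather than captured from a real call.

```python
from email.utils import parsedate_to_datetime
from urllib.parse import parse_qs, urlencode


def parse_status_callback(body: str):
    """Extract (status, timestamp) from one form-encoded callback body."""
    form = parse_qs(body)
    status = form["CallStatus"][0]
    timestamp = parsedate_to_datetime(form["Timestamp"][0])
    return status, timestamp


def dial_setup_seconds(bodies):
    """Seconds between the first 'initiated' and first 'in-progress' event."""
    first_seen = {}
    for body in bodies:
        status, timestamp = parse_status_callback(body)
        first_seen.setdefault(status, timestamp)  # keep the earliest per status
    delta = first_seen["in-progress"] - first_seen["initiated"]
    return delta.total_seconds()


# Hand-built sample bodies (assumed shape, not real Twilio traffic).
sample_bodies = [
    urlencode({"CallStatus": "initiated",
               "Timestamp": "Tue, 06 Jan 2026 17:00:00 +0000"}),
    urlencode({"CallStatus": "ringing",
               "Timestamp": "Tue, 06 Jan 2026 17:00:02 +0000"}),
    urlencode({"CallStatus": "in-progress",
               "Timestamp": "Tue, 06 Jan 2026 17:00:07 +0000"}),
]

if __name__ == "__main__":
    print(dial_setup_seconds(sample_bodies))
```

Dial setup time is separate from the per-turn latency budget, but tracking it explains why outbound calls feel slower at the start.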

FAQ

What's the single biggest latency win in 2026? Switching from a cascaded STT → LLM → TTS pipeline to a speech-to-speech model (OpenAI Realtime).

Does carrier choice really matter? Yes, especially at p95 and in regions where the carrier's path differs significantly from your provider's path. Telnyx's private backbone matters most outside the major US metros.

What's the floor for human-feel latency? About 500 to 700 ms mouth-to-ear. Below that, the human experience improves only marginally.

Can I get under 500 ms? It is possible with an end-to-end speech-to-speech model on optimized infrastructure, but the PSTN path by itself accounts for about 500 ms. WebRTC paths can go lower.

What's the most overlooked optimization? End-of-turn detection. Tuning it carefully shaves 100+ ms off perceived latency.
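To see why end-of-turn tuning trades latency against interruptions, consider the simplest possible endpointer: declare the turn over after a fixed run of silent frames. This is an illustrative sketch, not how any particular vendor detects turn ends; the 20 ms frame size, the thresholds, and the function name are assumptions.

```python
def end_of_turn_index(frame_is_speech, frame_ms=20, silence_ms=300):
    """Return the frame index where end of turn is declared, or None.

    The turn ends once `silence_ms` of consecutive non-speech frames
    follow at least one speech frame.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# A caller who speaks for 200 ms and then stops.
clean_stop = [True] * 10 + [False] * 30

# A caller who pauses for 240 ms mid-sentence and then continues.
mid_sentence_pause = [True] * 10 + [False] * 12 + [True] * 10 + [False] * 30

if __name__ == "__main__":
    for threshold in (200, 300):
        print(threshold,
              end_of_turn_index(clean_stop, silence_ms=threshold),
              end_of_turn_index(mid_sentence_pause, silence_ms=threshold))
```

With the 200 ms threshold the clean stop is detected five frames (100 ms) sooner than with 300 ms, but the same threshold fires during the mid-sentence pause and cuts the caller off. The silence threshold is added directly to every turn's latency, which is why it is often the long pole.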

Start a 14-day trial and measure CallSphere's latency on your own calls, see pricing, or compare with the Twilio integration.
