By Sagar Shankaran, Founder of CallSphere
Where every millisecond goes between caller and AI: PSTN, carrier, STT, LLM, TTS, and back. The component-level targets that ship in 2026 and how to hit them.
Key takeaways
Humans expect a reply within roughly 500 to 700 ms in natural conversation. Anything past one second feels artificial; past two seconds the caller starts talking over the agent. The 2026 latency budget for an AI phone agent is unforgiving and the math is well understood.
flowchart TD
Out[Outbound campaign] --> Twilio[Twilio Voice API]
Twilio --> STIR[STIR/SHAKEN attestation]
STIR --> Carrier[Originating carrier]
Carrier --> Term[Terminating carrier]
Term --> Recipient[Recipient phone]
Recipient --> Webhook[/voice webhook/]
Webhook --> Agent[AI sales agent]
Twilio published explicit November 2025 targets that the industry has converged on:
ConversationRelay reports <0.5 sec p50 and <0.725 sec p95.
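A quick way to hold your own calls to those two numbers is to compute p50 and p95 from per-turn measurements. The sketch below is a minimal illustration, not Twilio's method: the sample latencies are made up, and the nearest-rank percentile is just one common definition.

```python
import math

# Targets quoted above for ConversationRelay, in milliseconds.
P50_TARGET_MS = 500
P95_TARGET_MS = 725


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(samples_ms):
    """Compare measured p50/p95 against the quoted targets."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    ok = p50 < P50_TARGET_MS and p95 < P95_TARGET_MS
    return {"p50_ms": p50, "p95_ms": p95, "ok": ok}


# Hypothetical per-turn latencies from one test call (not real measurements).
turn_latencies_ms = [410, 430, 455, 470, 480, 495, 520, 560, 610, 700]
print(meets_targets(turn_latencies_ms))
# {'p50_ms': 480, 'p95_ms': 700, 'ok': True}
```

With only a handful of turns the p95 is effectively the worst turn, so a real check needs far more samples than this.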
A cascaded agent (STT → LLM → TTS) requires at least ten network traversals to produce a single response: two voice legs over the public network, eight inter-service handoffs. Network transmission contributes 40 to 70 ms; orchestration adds the largest delays at roughly 350 ms.
PSTN itself adds about 500 ms of fixed latency across the call path on a typical North American route, leaving only a few hundred milliseconds for the AI processing budget. That is why the speech-to-speech architectures (OpenAI Realtime SIP-direct) win for natural conversation: they collapse multiple hops into one model.
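The arithmetic behind that squeeze is easy to check. The sketch below is a back-of-envelope calculation only: it takes the one-second threshold as the target and treats the PSTN, network, and orchestration figures quoted in this article as simply additive, which is an assumption rather than a measured breakdown.

```python
def remaining_budget_ms(target_ms, fixed_costs_ms):
    """Milliseconds left for model inference after subtracting fixed costs."""
    return target_ms - sum(fixed_costs_ms.values())


TARGET_MS = 1000  # past one second the reply starts to feel artificial

# Figures quoted in this article; summing them is an assumption.
pstn_only = {"pstn": 500}
cascaded = {"pstn": 500, "network": 70, "orchestration": 350}

print(remaining_budget_ms(TARGET_MS, pstn_only))  # 500
print(remaining_budget_ms(TARGET_MS, cascaded))   # 80
```

Under these assumptions a cascaded pipeline has well under 100 ms left for STT, LLM, and TTS inference combined, which is the pressure that pushes designs toward fewer hops.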
The components on the wire, with realistic 2026 contributions:
A speech-to-speech model (OpenAI Realtime) collapses 5 through 8 into one model with TTFB under 500 ms. That is why those architectures are taking over.
CallSphere targets sub-1-second mouth-to-ear latency at p50 across all six verticals on Twilio. Healthcare AI, running on FastAPI :8084 to OpenAI Realtime, hits this comfortably with the SIP-direct pattern. Sales Calling AI, with five concurrent outbound calls on Twilio, runs slightly higher because outbound dial setup adds initial overhead. After-Hours AI, which places a Twilio call and sends an SMS simultaneously with a 120-second timeout, treats latency differently: the SMS runs in parallel, so the voice path still follows the standard 1-second target.
The 37 agents across 90+ tools and 115+ database tables, HIPAA and SOC 2 controls, and pricing of $149/$499/$1499 for 1/3/10 numbers do not change based on latency tier; latency is a quality metric we monitor per call. The 14-day trial lets prospects compare CallSphere's measured latency against their existing IVR or human answering service.
<!-- TwiML: outbound call with status callback for latency telemetry -->
<Response>
  <Dial
      callerId="+15555550100"
      answerOnBridge="true">
    <Number
        statusCallback="https://api.callsphere.ai/twilio/dial-status"
        statusCallbackEvent="initiated ringing answered completed"
        statusCallbackMethod="POST">
      +15555550199
    </Number>
  </Dial>
</Response>
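On the receiving end, the status callbacks can be turned into a dial-setup measurement. The sketch below is a hedged illustration, not CallSphere's handler: `dial_setup_ms` is a hypothetical helper, it assumes each POST carries `CallStatus` and an RFC 2822 `Timestamp` field, and it assumes the `answered` event arrives with `CallStatus` set to `in-progress`. Verify both against the payloads you actually receive.

```python
from email.utils import parsedate_to_datetime


def dial_setup_ms(callbacks):
    """Milliseconds between the 'initiated' and 'in-progress' callbacks for one call leg.

    `callbacks` is a list of dicts holding the form fields of each status POST.
    Returns None if either status has not been seen yet.
    """
    seen = {}
    for form in callbacks:
        # Keep the first timestamp observed for each status.
        seen.setdefault(form["CallStatus"], parsedate_to_datetime(form["Timestamp"]))
    if "initiated" not in seen or "in-progress" not in seen:
        return None
    delta = seen["in-progress"] - seen["initiated"]
    return int(delta.total_seconds() * 1000)


# Hypothetical payloads for a single outbound leg (illustrative values only).
callbacks = [
    {"CallStatus": "initiated", "Timestamp": "Tue, 04 Nov 2025 18:00:00 +0000"},
    {"CallStatus": "ringing", "Timestamp": "Tue, 04 Nov 2025 18:00:01 +0000"},
    {"CallStatus": "in-progress", "Timestamp": "Tue, 04 Nov 2025 18:00:04 +0000"},
]
print(dial_setup_ms(callbacks))  # 4000
```

Timestamps in this format have one-second resolution, so this captures dial setup overhead only. Per-turn conversational latency has to be measured inside the media path.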
What's the single biggest latency win in 2026? Switching from a cascaded STT → LLM → TTS pipeline to a speech-to-speech model (OpenAI Realtime).
Does carrier choice really matter? Yes, especially at p95 and in regions where the carrier's path differs significantly from your provider's path. Telnyx's private backbone matters most outside the major US metros.
What's the floor for human-feel latency? About 500 to 700 ms mouth-to-ear. Below that, the human experience improves only marginally.
Can I get under 500 ms? It is possible with an end-to-end speech-to-speech model on optimized infrastructure, but the PSTN alone contributes about 500 ms. WebRTC paths can go lower.
What's the most overlooked optimization? End-of-turn detection. Tuning it carefully shaves 100+ ms off perceived latency.
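To see where that 100 ms comes from, consider the simplest form of end-of-turn detection: a trailing-silence timer. The sketch below is an assumption about one basic approach, not a description of any vendor's detector. The frame size and thresholds are hypothetical, and production systems may use model-based turn detection instead.

```python
def end_of_turn_ms(frames, frame_ms, silence_ms):
    """Return the time (ms) at which the turn is declared over, or None.

    frames: sequence of booleans, True where the frame contains speech.
    The turn ends once `silence_ms` of continuous silence follows some speech.
    """
    needed = -(-silence_ms // frame_ms)  # ceiling division: frames of silence required
    heard_speech = False
    quiet = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            quiet = 0
        elif heard_speech:
            quiet += 1
            if quiet >= needed:
                return (i + 1) * frame_ms
    return None


# One second of speech followed by 800 ms of silence, in 20 ms frames.
frames = [True] * 50 + [False] * 40

slow = end_of_turn_ms(frames, frame_ms=20, silence_ms=600)
fast = end_of_turn_ms(frames, frame_ms=20, silence_ms=500)
print(slow, fast, slow - fast)  # 1600 1500 100
```

Every millisecond removed from the silence threshold comes straight off perceived latency, but a threshold that is too short cuts callers off mid-pause, so it has to be tuned against real call audio.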
Start a 14-day trial and measure CallSphere's latency on your own calls, see pricing, or compare with the Twilio integration.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.