By Sagar Shankaran, Founder of CallSphere
Twilio in front, WebSocket bridge in the middle, OpenAI Realtime at the end. The pattern that dominates AI voice production in 2026, explained from SIP signaling through to model output.
Every production AI voice deployment we have audited in 2026 looks roughly the same: a SIP/PSTN provider in front, a WebSocket bridge in the middle, OpenAI Realtime (or Gemini Live or another speech-to-speech model) at the end. The most common front is Twilio. Here is what that pattern actually looks like under the hood.
flowchart LR
UA[SIP UA] -- REGISTER --> Reg[Registrar]
UA -- INVITE --> Proxy[SIP Proxy]
Proxy --> Dispatcher[Kamailio dispatcher]
Dispatcher --> Worker1[FreeSWITCH worker]
Dispatcher --> Worker2[FreeSWITCH worker]
Worker1 --> AI[(AI agent)]
Worker2 --> AI
Twilio Programmable Voice acts as the SIP/PSTN edge for AI voice in 2026 because it solves the problems most AI builders do not want to solve: SIP trunks, STIR/SHAKEN attestation, codec interop, A2P 10DLC for SMS follow-ups, dial peer management, and the carrier relationship spider's web. You hand Twilio a phone number, point a TwiML application or webhook at it, and Twilio forwards the audio over Media Streams (a WebSocket protocol) or via a SIP trunk to your bridge.
The "bridge" is a small server that does the codec adaptation (PCMU 8 kHz mu-law to 16-bit linear PCM at the rate the model expects, which is 24 kHz for OpenAI Realtime's pcm16 format), the protocol adaptation (Twilio Media Streams JSON envelope to OpenAI Realtime audio events), and the orchestration (tool calls, transfers, recording). At Twilio Signal 2026, Twilio launched Conversation Relay, which bundles parts of this bridge as a managed service, but the DIY pattern is still common.
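The codec adaptation step can be sketched in pure Python. This is a minimal illustration, not CallSphere's implementation: it uses the standard G.711 mu-law expansion and compression, plus a deliberately crude resampler (linear interpolation up, block averaging down). The function names and the default factor of 3 (8 kHz to 24 kHz) are assumptions; pass 2 for a 16 kHz target. A production bridge would more likely use a proper polyphase filter.

```python
import struct

MULAW_BIAS = 0x84    # 132, the standard G.711 mu-law bias
MULAW_CLIP = 32635   # largest magnitude that still fits after adding the bias


def mulaw_decode_sample(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    byte = ~byte & 0xFF
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if byte & 0x80 else magnitude


def mulaw_encode_sample(sample: int) -> int:
    """Compress one signed 16-bit sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), MULAW_CLIP) + MULAW_BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def upsample_linear(samples: list[int], factor: int) -> list[int]:
    """Crude upsampler: linear interpolation between neighbouring samples."""
    out = []
    for i, cur in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        for k in range(factor):
            out.append(cur + (nxt - cur) * k // factor)
    return out


def downsample_average(samples: list[int], factor: int) -> list[int]:
    """Crude downsampler: average each block of `factor` samples."""
    return [
        sum(samples[i:i + factor]) // factor
        for i in range(0, len(samples) - factor + 1, factor)
    ]


def mulaw_to_pcm16(mulaw: bytes, factor: int = 3) -> bytes:
    """8 kHz mu-law bytes -> little-endian PCM16 at 8 kHz * factor."""
    samples = upsample_linear([mulaw_decode_sample(b) for b in mulaw], factor)
    return struct.pack(f"<{len(samples)}h", *samples)


def pcm16_to_mulaw(pcm: bytes, factor: int = 3) -> bytes:
    """Little-endian PCM16 at 8 kHz * factor -> 8 kHz mu-law bytes."""
    count = len(pcm) // 2
    samples = list(struct.unpack(f"<{count}h", pcm[:count * 2]))
    return bytes(mulaw_encode_sample(s) for s in downsample_average(samples, factor))
```

A 20 ms Twilio frame is 160 mu-law bytes; with the default factor it becomes 960 bytes of PCM16 going toward the model.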
A typical inbound flow:
[Caller PSTN] -> [Twilio edge SBC] -> webhook to your TwiML endpoint
TwiML returns:
<Response>
  <Connect>
    <Stream url="wss://bridge.callsphere.ai/realtime" track="inbound_track">
      <Parameter name="agent_id" value="healthcare-front-desk"/>
      <Parameter name="from" value="+19175551212"/>
    </Stream>
  </Connect>
</Response>
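If the webhook builds this TwiML per call, generating it with an XML library rather than string formatting keeps attribute values correctly escaped. The sketch below uses only the standard library; the function name `build_stream_twiml` is hypothetical, and the web handler that would return this string is not shown.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(stream_url: str, params: dict[str, str]) -> str:
    """Build a <Connect><Stream> TwiML document with custom parameters."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", url=stream_url, track="inbound_track")
    for name, value in params.items():
        # Each Parameter is passed to the bridge with the stream's start event
        ET.SubElement(stream, "Parameter", name=name, value=value)
    return ET.tostring(response, encoding="unicode")


twiml = build_stream_twiml(
    "wss://bridge.callsphere.ai/realtime",
    {"agent_id": "healthcare-front-desk", "from": "+19175551212"},
)
```

The resulting string is the body the TwiML endpoint would return with an XML content type.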
Twilio opens a WSS connection to your bridge. The bridge decodes the 8 kHz mu-law (PCMU) audio and upsamples it to PCM16 at 24 kHz, the rate OpenAI Realtime's pcm16 format expects, then feeds it to OpenAI Realtime as input_audio_buffer.append events. Realtime transcribes, runs the LLM, and streams TTS back. Your bridge takes Realtime's PCM 24 kHz output, downsamples to 8 kHz, mu-law encodes, and sends it back through Twilio Media Streams as media events with a base64 payload.
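# FastAPI bridge sketch
import asyncio
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()

# connect_openai_realtime, mulaw_to_pcm16 and pcm16_to_mulaw are helpers defined
# elsewhere; the two codec helpers are expected to resample as well as transcode.

@app.websocket("/realtime")
async def twilio_bridge(ws: WebSocket):
    await ws.accept()
    openai_ws = await connect_openai_realtime(model="gpt-realtime-2026")
    stream_sid = None  # assigned when Twilio sends its "start" event

    async def twilio_to_openai():
        nonlocal stream_sid
        async for msg in ws.iter_text():
            evt = json.loads(msg)
            if evt["event"] == "start":
                stream_sid = evt["start"]["streamSid"]
            elif evt["event"] == "media":
                # Caller audio: mu-law from Twilio -> PCM16 for Realtime
                pcm = mulaw_to_pcm16(base64.b64decode(evt["media"]["payload"]))
                await openai_ws.send_json({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm).decode()
                })

    async def openai_to_twilio():
        async for msg in openai_ws.iter_text():
            evt = json.loads(msg)
            if evt["type"] == "response.audio.delta" and stream_sid:
                # Agent audio: PCM16 from Realtime -> mu-law for Twilio
                pcm = base64.b64decode(evt["delta"])
                mulaw = pcm16_to_mulaw(pcm)
                await ws.send_json({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": base64.b64encode(mulaw).decode()}
                })

    # Pump both directions concurrently
    await asyncio.gather(twilio_to_openai(), openai_to_twilio())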
The pattern is mechanically simple but the production hardening is where the work lives: backpressure, reconnect, partial-frame handling, tool calls during a response, transfer triggers.
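Backpressure is the easiest of those items to illustrate. One possible policy, and it is only an assumption here rather than something the pattern prescribes, is to buffer outbound audio in a bounded queue and drop the oldest frame when the consumer falls behind, so latency stays capped instead of growing without bound. `DropOldestQueue` is a hypothetical helper name.

```python
import asyncio


class DropOldestQueue:
    """Bounded frame buffer that discards the oldest frame when full."""

    def __init__(self, max_frames: int):
        self._queue = asyncio.Queue(maxsize=max_frames)
        self.dropped = 0  # number of frames discarded so far

    def put(self, frame: bytes) -> None:
        if self._queue.full():
            self._queue.get_nowait()  # evict the stalest frame
            self.dropped += 1
        self._queue.put_nowait(frame)

    async def get(self) -> bytes:
        return await self._queue.get()


async def demo():
    queue = DropOldestQueue(max_frames=3)
    for i in range(5):               # producer outruns the consumer
        queue.put(bytes([i]))
    frames = [await queue.get() for _ in range(3)]
    return frames, queue.dropped


frames, dropped = asyncio.run(demo())
print(frames, dropped)  # [b'\x02', b'\x03', b'\x04'] 2
```

Dropping old audio trades a brief glitch for bounded delay; whether that is the right trade depends on the call type, and exporting the `dropped` counter as a metric makes the trade visible.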
CallSphere terminates every leg on Twilio Programmable Voice across all six verticals. Healthcare AI runs on FastAPI on port 8084 with the exact bridge pattern above to OpenAI Realtime; Real Estate AI, Salon AI, IT Helpdesk AI, and After-Hours AI use the same template with vertical-specific prompts and tools. Sales Calling AI fires up to 5 concurrent outbound calls per tenant from a worker pool that originates each call via the Twilio API and connects the answered call to the same bridge endpoint. After-Hours AI places a simultaneous Twilio call and SMS to on-call staff with a 120-second timeout, a separate code path from the bridge that handles dial+SMS race conditions. The gateway pattern is uniform across 37 agents, 90+ tools, and 115+ DB tables, with HIPAA and SOC 2 alignment and per-vertical tools and prompts layered on top. Pricing is $149/$499/$1499 with a 14-day trial and a 22% affiliate program.
Return <Connect><Stream/></Connect> pointing to your bridge WSS endpoint.
Handle start, media, stop events from Twilio and translate to OpenAI Realtime.
Transfer via <Dial><Conference/> with transcript handoff.
Why not use Twilio Conversation Relay directly? For simpler use cases, do. For complex orchestration with custom tools or transfer logic, the DIY bridge gives more control.
What about Telnyx or Bandwidth? Same pattern, different vendor; Telnyx Voice SDK and Bandwidth Streaming work analogously. Twilio dominates AI voice in 2026 by ecosystem maturity.
Does this work for outbound dialing?
Yes. Originate via Twilio API (/Calls endpoint) with TwiML pointing to the same Stream URL; the bridge handles answer detection.
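As a rough sketch of that origination step, the request to Twilio's Calls resource can be built with the standard library alone. The helper name and the credential and number values in the usage note are placeholders, not real values; the request is only constructed here, not sent, and most teams would use Twilio's official helper library instead.

```python
import base64
import urllib.parse
import urllib.request


def build_outbound_call_request(
    account_sid: str, auth_token: str, to: str, from_: str, twiml: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to Twilio's Calls resource."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    # Form-encoded body; Twiml carries the same <Connect><Stream> document
    # used for inbound calls.
    body = urllib.parse.urlencode({"To": to, "From": from_, "Twiml": twiml}).encode()
    # Twilio uses HTTP Basic auth with the account SID and auth token.
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Basic {credentials}"},
    )
```

Calling `build_outbound_call_request("ACxxxx", "token", "+19175551212", "+12125550100", twiml)` yields a request that `urllib.request.urlopen` could send; once the callee answers, Twilio executes the TwiML and opens the stream to the bridge exactly as in the inbound case.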
What is the latency budget? Network to Twilio: 20-50 ms. Twilio to bridge: 30-80 ms (region dependent). Bridge to OpenAI: 30-80 ms. Realtime first-token: 200-400 ms. TTS synthesis: 50-150 ms. Bridge back to caller: same as inbound. Total turn-taking: 600-1500 ms typical.
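To see how the itemized hops add up, the ranges can be summed directly. This sketch assumes "same as inbound" means the return path repeats the Twilio-to-bridge and caller-to-Twilio legs; that reading and the dictionary key names are assumptions.

```python
# Per-hop latency ranges in milliseconds, taken from the budget above
BUDGET_MS = {
    "caller_to_twilio": (20, 50),
    "twilio_to_bridge": (30, 80),
    "bridge_to_openai": (30, 80),
    "realtime_first_token": (200, 400),
    "tts_synthesis": (50, 150),
}
# Assumed return path: bridge -> Twilio -> caller mirrors the inbound legs
RETURN_LEGS = ("twilio_to_bridge", "caller_to_twilio")


def total_budget(budget, return_legs):
    """Sum best-case and worst-case latency across all hops."""
    low = sum(lo for lo, _ in budget.values()) + sum(budget[leg][0] for leg in return_legs)
    high = sum(hi for _, hi in budget.values()) + sum(budget[leg][1] for leg in return_legs)
    return low, high


print(total_budget(BUDGET_MS, RETURN_LEGS))  # (380, 890)
```

Under that reading the itemized hops sum to 380-890 ms, so the 600-1500 ms typical figure leaves headroom for components the list does not itemize.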
How do I scale the bridge? Horizontally: each call is one WebSocket pair. Standard FastAPI/uvicorn behind a load balancer handles thousands of concurrent calls per node before hitting resource limits.
Start a 14-day trial on the Twilio + Realtime stack, see pricing for $149/$499/$1499 tiers, or contact us about production AI voice deployments.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.