Every production AI voice deployment we have audited in 2026 looks roughly the same: a SIP/PSTN provider in front, a WebSocket bridge in the middle, OpenAI Realtime (or Gemini Live or another speech-to-speech model) at the end. The most common front is Twilio. Here is what that pattern actually looks like under the hood.

Background

flowchart LR
  UA[SIP UA] -- REGISTER --> Reg[Registrar]
  UA -- INVITE --> Proxy[SIP Proxy]
  Proxy --> Dispatcher[Kamailio dispatcher]
  Dispatcher --> Worker1[FreeSWITCH worker]
  Dispatcher --> Worker2[FreeSWITCH worker]
  Worker1 --> AI[(AI agent)]
  Worker2 --> AI

CallSphere reference architecture

Twilio Programmable Voice acts as the SIP/PSTN edge for AI voice in 2026 because it solves the problems most AI builders do not want to solve: SIP trunks, STIR/SHAKEN attestation, codec interop, A2P 10DLC for SMS follow-ups, dial peer management, and the tangle of carrier relationships. You hand Twilio a phone number, point a TwiML application or webhook at it, and Twilio forwards the audio to your bridge over Media Streams (a WebSocket protocol) or via a SIP trunk.

The "bridge" is a small server that does the codec adaptation (8 kHz G.711 mu-law from Twilio to linear PCM16 at the sample rate the model expects, and back), the protocol adaptation (Twilio Media Streams JSON envelope to OpenAI Realtime audio events), and the orchestration (tool calls, transfers, recording). At Twilio Signal 2026, Twilio launched Conversation Relay, which bundles parts of this bridge as a managed service, but the DIY pattern is still common.
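To make the codec adaptation concrete, here is a minimal pure-Python sketch of G.711 mu-law companding, the transcoding half of that step. The names `decode_mulaw` and `encode_mulaw` are ours, not from any library, and the sketch leaves the sample rate untouched; resampling is a separate step.

```python
import struct

BIAS = 0x84    # 132, added to the magnitude before the segment is found
CLIP = 32635   # largest magnitude that still fits in 16 bits after adding BIAS


def _decode_byte(mu: int) -> int:
    """Expand one mu-law byte to a signed 16-bit sample."""
    mu = ~mu & 0xFF  # mu-law bytes are transmitted bit-inverted
    magnitude = (((mu & 0x0F) << 3) + BIAS) << ((mu & 0x70) >> 4)
    return BIAS - magnitude if mu & 0x80 else magnitude - BIAS


# 256-entry lookup table, so decoding a frame is one index per byte.
DECODE_TABLE = [_decode_byte(b) for b in range(256)]


def encode_sample(sample: int) -> int:
    """Compress one signed 16-bit sample to a mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = (magnitude >> 7).bit_length() - 1  # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def decode_mulaw(payload: bytes) -> bytes:
    """mu-law bytes -> little-endian PCM16 bytes (sample rate unchanged)."""
    return struct.pack(f"<{len(payload)}h", *(DECODE_TABLE[b] for b in payload))


def encode_mulaw(pcm: bytes) -> bytes:
    """Little-endian PCM16 bytes -> mu-law bytes (sample rate unchanged)."""
    count = len(pcm) // 2  # a trailing odd byte is ignored
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return bytes(encode_sample(s) for s in samples)
```

A 20 ms Twilio frame is 160 mu-law bytes, so `decode_mulaw` returns 320 bytes of PCM16 for it. A production bridge would more likely use a vectorized or native implementation; the table-driven decode here is only meant to show what the step does.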

Technical deep-dive

A typical inbound flow:

[Caller PSTN] -> [Twilio edge SBC] -> webhook to your TwiML endpoint
TwiML returns:
  <Response>
    <Connect>
      <Stream url="wss://bridge.callsphere.ai/realtime" track="inbound_track">
        <Parameter name="agent_id" value="healthcare-front-desk"/>
        <Parameter name="from" value="+19175551212"/>
      </Stream>
    </Connect>
  </Response>
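Because the from parameter changes per call, the TwiML has to be generated per request rather than served as a static file. A minimal sketch using only the standard library follows; `build_stream_twiml` and the example URL are hypothetical names chosen for illustration, and the element and attribute names are taken from the TwiML shown above.

```python
import xml.etree.ElementTree as ET


def build_stream_twiml(bridge_url: str, parameters: dict) -> str:
    """Return a <Response><Connect><Stream> document as a string."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(
        connect, "Stream", {"url": bridge_url, "track": "inbound_track"}
    )
    # Each custom parameter becomes a <Parameter name=... value=...> child.
    for name, value in parameters.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


twiml = build_stream_twiml(
    "wss://bridge.example.com/realtime",  # placeholder URL
    {"agent_id": "healthcare-front-desk", "from": "+19175551212"},
)
```

Building the document with ElementTree rather than string formatting means attribute values are escaped automatically, which matters once parameter values come from caller-controlled data.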

Twilio opens a WSS connection to your bridge. The bridge decodes the 8 kHz mu-law (PCMU) audio to PCM16, upsamples it to the 24 kHz that Realtime's pcm16 format uses, and feeds it to OpenAI Realtime as input_audio_buffer.append events. Realtime transcribes, runs the LLM, and streams TTS back. Your bridge takes Realtime's 24 kHz PCM output, downsamples it to 8 kHz, mu-law encodes it, and sends it back through Twilio Media Streams as media events with a base64 payload.
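The sample-rate conversion in that flow can be sketched as a windowed-sinc low-pass filter followed by decimation (24 kHz to 8 kHz is a factor of 3), with the mirror-image zero-stuffing step for upsampling. This is an illustrative pure-Python version under our own assumptions: the function names, the Hamming window, and the 49-tap length are choices made for the example, not values from the article.

```python
import math


def lowpass_taps(factor: int, half_width: int = 24) -> list:
    """Hamming-windowed sinc low-pass, cutoff 1/(2*factor) cycles/sample, DC gain 1."""
    taps = []
    for i in range(-half_width, half_width + 1):
        x = i / factor
        sinc = 1.0 if i == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.54 + 0.46 * math.cos(math.pi * i / half_width)
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_same(samples, taps):
    """Filter with zero padding; the output has the same length as the input."""
    half = len(taps) // 2
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = n + k - half  # taps are symmetric, so no flip is needed
            if 0 <= j < len(samples):
                acc += tap * samples[j]
        out.append(acc)
    return out


def clamp16(value: float) -> int:
    """Round and clip to the signed 16-bit range."""
    return max(-32768, min(32767, round(value)))


def downsample(samples, factor: int):
    """Low-pass first to avoid aliasing, then keep every factor-th sample."""
    filtered = fir_same(samples, lowpass_taps(factor))
    return [clamp16(v) for v in filtered[::factor]]


def upsample(samples, factor: int):
    """Insert zeros between samples, then low-pass to remove the images."""
    stuffed = [0] * (len(samples) * factor)
    stuffed[::factor] = samples
    taps = [t * factor for t in lowpass_taps(factor)]  # restore amplitude
    return [clamp16(v) for v in fir_same(stuffed, taps)]
```

Two caveats apply. The filter is applied per buffer with zero padding, so a streaming bridge would need to carry filter state across frames to avoid discontinuities at frame edges. Pure-Python loops are also too slow for many concurrent calls; a NumPy, SciPy, or native resampler is the realistic choice.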

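# FastAPI bridge sketch
import asyncio
import base64
import json

from fastapi import FastAPI, WebSocket

app = FastAPI()

# connect_openai_realtime(), mulaw_to_pcm16() and pcm16_to_mulaw() are helpers
# not shown here. The two codec helpers are expected to resample as well as
# transcode (8 kHz mu-law from Twilio, 24 kHz PCM out of Realtime).


@app.websocket("/realtime")
async def twilio_bridge(ws: WebSocket):
    await ws.accept()
    openai_ws = await connect_openai_realtime(model="gpt-realtime-2026")
    stream_sid = None  # Twilio sends this in its "start" event

    async def twilio_to_openai():
        nonlocal stream_sid
        async for msg in ws.iter_text():
            evt = json.loads(msg)
            if evt["event"] == "start":
                # Outbound media events must echo this streamSid.
                stream_sid = evt["start"]["streamSid"]
            elif evt["event"] == "media":
                pcm = mulaw_to_pcm16(base64.b64decode(evt["media"]["payload"]))
                await openai_ws.send_json({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(pcm).decode()
                })

    async def openai_to_twilio():
        async for msg in openai_ws.iter_text():
            evt = json.loads(msg)
            # Skip audio that arrives before Twilio's "start" event.
            if evt["type"] == "response.audio.delta" and stream_sid:
                pcm = base64.b64decode(evt["delta"])
                mulaw = pcm16_to_mulaw(pcm)
                await ws.send_json({
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": base64.b64encode(mulaw).decode()}
                })

    # Run both directions concurrently for the life of the call.
    await asyncio.gather(twilio_to_openai(), openai_to_twilio())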

The pattern is mechanically simple but the production hardening is where the work lives: backpressure, reconnect, partial-frame handling, tool calls during a response, transfer triggers.
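Backpressure is the easiest of these to illustrate. One common policy for real-time audio, assumed here rather than taken from the article, is a bounded queue that discards the oldest frame when the consumer falls behind, on the reasoning that stale audio is worse than a short gap. `DropOldestQueue` is a hypothetical name.

```python
import asyncio


class DropOldestQueue:
    """Bounded frame queue that discards the oldest frame when full."""

    def __init__(self, max_frames: int) -> None:
        self._queue = asyncio.Queue(maxsize=max_frames)
        self.dropped = 0  # worth exporting as a metric

    def put(self, frame: bytes) -> None:
        if self._queue.full():
            self._queue.get_nowait()  # throw away the stalest audio
            self.dropped += 1
        self._queue.put_nowait(frame)

    async def get(self) -> bytes:
        return await self._queue.get()


async def demo():
    queue = DropOldestQueue(max_frames=3)
    for i in range(5):  # the producer outruns the consumer by two frames
        queue.put(bytes([i]))
    frames = [await queue.get() for _ in range(3)]
    return frames, queue.dropped


frames, dropped = asyncio.run(demo())
```

In a bridge, each direction's reader would `put` frames and a separate sender task would `get` them, so a slow socket on one side cannot grow memory without bound. Whether dropping, blocking, or closing the call is the right response depends on the deployment.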

CallSphere implementation

CallSphere terminates every call leg on Twilio Programmable Voice across all six verticals. Healthcare AI runs on FastAPI on port 8084 and uses the bridge pattern above to reach OpenAI Realtime; Real Estate AI, Salon AI, IT Helpdesk AI, and After-Hours AI use the same template with vertical-specific prompts and tools. Sales Calling AI places up to 5 concurrent outbound calls per tenant from a worker pool that originates each call via the Twilio API and connects the answered call to the same bridge endpoint. After-Hours AI uses a simultaneous Twilio call and SMS to on-call staff with a 120-second timeout; this is a separate code path from the bridge and handles the race between the dial and the SMS. The platform spans 37 agents, 90+ tools, and 115+ DB tables with HIPAA and SOC 2 alignment (pricing is $149/$499/$1499 with a 14-day trial and a 22% affiliate program), and the gateway pattern stays uniform throughout, with per-vertical tools and prompts layered on top.
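A per-tenant cap like the 5 concurrent outbound calls mentioned above can be sketched with one semaphore per tenant. This is not CallSphere's code: `TenantLimiter`, `fake_call`, and the tenant ID are hypothetical, and the sleep stands in for the Twilio API request and the call itself.

```python
import asyncio

MAX_CONCURRENT_PER_TENANT = 5  # figure taken from the text above


class TenantLimiter:
    """Caps how many calls a single tenant may have in flight at once."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._semaphores = {}

    def _semaphore(self, tenant_id: str) -> asyncio.Semaphore:
        # Create each tenant's semaphore lazily, on first use.
        if tenant_id not in self._semaphores:
            self._semaphores[tenant_id] = asyncio.Semaphore(self._limit)
        return self._semaphores[tenant_id]

    async def run(self, tenant_id: str, place_call):
        async with self._semaphore(tenant_id):
            return await place_call()


async def demo() -> int:
    limiter = TenantLimiter(MAX_CONCURRENT_PER_TENANT)
    active = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # placeholder for dialing and the call
        active -= 1

    # Queue 12 calls for one tenant; only 5 may run at the same time.
    await asyncio.gather(*(limiter.run("tenant-a", fake_call) for _ in range(12)))
    return peak


peak_concurrency = asyncio.run(demo())
```

A semaphore only limits calls within one process. A worker pool spread across several nodes would need a shared counter, for example in a database or cache, to enforce the same cap.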

Implementation steps

1. Buy or port a phone number through Twilio; complete the Trust Hub Customer Profile and the SHAKEN/STIR Trust Product for Level A attestation.
2. Configure a TwiML App (or webhook) that returns <Connect><Stream/></Connect> pointing to your bridge's WSS endpoint.
3. Stand up the bridge on FastAPI or equivalent; handle the start, media, and stop events from Twilio and translate them to OpenAI Realtime events.
4. Implement codec adaptation: mu-law to PCM16 inbound, PCM16 to mu-law outbound; convert sample rates with a sinc filter.
5. Wire OpenAI Realtime tools to your domain APIs (book appointment, look up account, transfer call).
6. Implement transfer via TwiML <Dial><Conference> with transcript handoff.
7. Add observability: log call SID, stream SID, latency to first audio out, ASR turn timing, and tool calls per turn.
8. Stress test with SIPp or Cekura before rollout; aim for at least 2x your peak concurrent call volume.
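The step about handling the start, media, and stop events can be sketched as a small per-call state object. `StreamSession` is a hypothetical name; the JSON field names follow Twilio's Media Streams messages as we understand them, so check them against the current Twilio documentation before relying on them.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StreamSession:
    """Tracks one Twilio Media Streams connection."""

    stream_sid: Optional[str] = None
    call_sid: Optional[str] = None
    parameters: dict = field(default_factory=dict)
    audio_bytes: int = 0
    stopped: bool = False

    def handle(self, raw: str) -> Optional[bytes]:
        """Apply one inbound message; return mu-law audio for media events."""
        evt = json.loads(raw)
        kind = evt.get("event")
        if kind == "start":
            start = evt["start"]
            self.stream_sid = start["streamSid"]
            self.call_sid = start["callSid"]
            # <Parameter> values from the TwiML arrive here.
            self.parameters = dict(start.get("customParameters", {}))
        elif kind == "media":
            audio = base64.b64decode(evt["media"]["payload"])
            self.audio_bytes += len(audio)
            return audio
        elif kind == "stop":
            self.stopped = True
        return None  # other events (connected, mark, dtmf) are ignored here

    def outbound_media(self, mulaw: bytes) -> str:
        """Wrap mu-law audio in the envelope Twilio expects from the bridge."""
        if self.stream_sid is None:
            raise RuntimeError("no start event received yet")
        return json.dumps({
            "event": "media",
            "streamSid": self.stream_sid,
            "media": {"payload": base64.b64encode(mulaw).decode()},
        })
```

Keeping the call SID and stream SID on one object also gives the observability step a natural place to attach timing fields such as time to first audio out.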

FAQ

Why not use Twilio Conversation Relay directly? For simpler use cases, do. For complex orchestration with custom tools or transfer logic, the DIY bridge gives more control.

What about Telnyx or Bandwidth? Same pattern, different vendor; Telnyx Voice SDK and Bandwidth Streaming work analogously. Twilio dominates AI voice in 2026 by ecosystem maturity.

Does this work for outbound dialing? Yes. Originate via Twilio API (/Calls endpoint) with TwiML pointing to the same Stream URL; the bridge handles answer detection.
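As a rough illustration of the originate step, the sketch below only builds the URL and form body for a request to the Calls endpoint; it does not send anything. The account SID, phone numbers, and bridge URL are placeholders, and `build_call_request` is a hypothetical helper. A real request also needs HTTP Basic auth with your Twilio credentials, or Twilio's own client library can be used instead.

```python
from urllib.parse import urlencode


def build_call_request(account_sid: str, to: str, from_: str, twiml: str):
    """Return (url, form_body) for a POST to Twilio's Calls endpoint."""
    url = f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}/Calls.json"
    # Inline TwiML goes in the Twiml field; a Url field could be used instead.
    body = urlencode({"To": to, "From": from_, "Twiml": twiml}).encode("ascii")
    return url, body


outbound_twiml = (
    "<Response><Connect>"
    '<Stream url="wss://bridge.example.com/realtime"/>'
    "</Connect></Response>"
)
url, body = build_call_request(
    "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",  # placeholder account SID
    to="+19175551212",
    from_="+12125550100",  # placeholder caller ID
    twiml=outbound_twiml,
)
```

Because the TwiML points at the same Stream URL as the inbound flow, the bridge code does not need to know which direction the call was placed in.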

What is the latency budget? Network to Twilio: 20-50 ms. Twilio to bridge: 30-80 ms (region dependent). Bridge to OpenAI: 30-80 ms. Realtime first-token: 200-400 ms. TTS synthesis: 50-150 ms. Bridge back to caller: same as inbound. Total turn-taking: 600-1500 ms typical.
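The budget can be added up directly. In the sketch below, the per-hop ranges come from the answer above, while the choice of which hops make up the return leg (bridge to Twilio, then Twilio to caller) is our assumption.

```python
# Per-hop latency ranges from the answer above, as (low_ms, high_ms).
BUDGET_MS = {
    "caller_to_twilio": (20, 50),
    "twilio_to_bridge": (30, 80),
    "bridge_to_openai": (30, 80),
    "realtime_first_token": (200, 400),
    "tts_synthesis": (50, 150),
}

# Assumption: the return leg repeats the bridge -> Twilio -> caller hops.
RETURN_HOPS = ("twilio_to_bridge", "caller_to_twilio")


def total_budget(budget, return_hops):
    """Sum the low and high ends of every hop, plus the return leg."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    low += sum(budget[hop][0] for hop in return_hops)
    high += sum(budget[hop][1] for hop in return_hops)
    return low, high


itemized_total = total_budget(BUDGET_MS, RETURN_HOPS)  # (380, 890)
```

Under that assumption the itemized hops sum to 380-890 ms, below the 600-1500 ms quoted as typical. The difference presumably covers delays the list does not itemize, such as end-of-speech detection and jitter buffering; that reading is ours, not a figure from the source.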

How do I scale the bridge? Horizontally: each call is one WebSocket pair. A standard FastAPI/uvicorn deployment behind a load balancer can reach thousands of concurrent calls per node before hitting resource limits.

Twilio Programmable Voice as a SIP Gateway for AI in 2026: The Default Production Pattern
