Twilio Media Streams to OpenAI Realtime: A WebSocket Bridge
How to bridge a Twilio Media Streams WebSocket to OpenAI Realtime in production: codec conversion, interruption handling, and the timeouts that actually matter.
Phone calls in 2026 still ride on G.711 µ-law at 8 kHz. OpenAI wants 16-bit PCM at 24 kHz. The bridge between them is the most expensive 200 lines of code in your stack.
What problem does the bridge solve?
```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · µ-law 8 kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24 kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

It solves the impedance mismatch between PSTN telephony and modern AI APIs. Twilio Media Streams hands you an inbound WebSocket carrying base64-encoded µ-law frames every 20 ms. OpenAI Realtime expects a different WebSocket carrying base64-encoded PCM16 frames at a different sample rate, with a different event schema. Neither side knows the other exists.
The bridge is a small server process that opens both connections, transcodes audio in both directions, translates events, handles interruption, and disappears when the call ends. Get it wrong and you get either silence, echo, double-talk, or a 500 ms dead zone every time the user interrupts the agent.
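To make the mismatch concrete, here is roughly what one frame on each side looks like. The SIDs and payloads are illustrative and the Twilio fields abridged:

```js
// One inbound Twilio frame, arriving every 20 ms (payload: base64 µ-law @ 8 kHz).
const twilioMediaEvent = {
  event: "media",
  streamSid: "MZ0123456789abcdef0123456789abcdef", // illustrative SID
  media: { track: "inbound", timestamp: "840", payload: "f39/f38..." },
};

// The same audio re-encoded for OpenAI (audio: base64 PCM16 @ 24 kHz).
const openaiAppendEvent = {
  type: "input_audio_buffer.append",
  audio: "AAABAAIA...",
};
```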
How does the bridge actually run?
A typical production bridge is a Node or FastAPI process listening on a public WebSocket URL. Twilio's TwiML `<Connect><Stream>` verb tells the carrier to open a WebSocket to that URL when a call comes in. On accept, the bridge opens a second WebSocket to OpenAI Realtime. From there:
1. Inbound: `media` events from Twilio arrive base64-encoded as 8 kHz µ-law. Decode to PCM16, upsample to 24 kHz, base64-encode, and forward as `input_audio_buffer.append` to OpenAI.
2. Outbound: `response.audio.delta` events from OpenAI arrive as 24 kHz PCM16. Downsample to 8 kHz, encode to µ-law, base64-encode, and send to Twilio with the original `streamSid`.
3. Interruption: `input_audio_buffer.speech_started` from OpenAI means the user just interrupted — fire `response.cancel` upstream and a Twilio `clear` event downstream so the in-flight TTS audio drains immediately.
Skip step 3 and the agent will keep talking over the user for 600–1200 ms.
CallSphere's implementation
The CallSphere Sales Calling and After-hours agents both run on this exact pattern. The bridge is a Node.js process supervised by PM2 with Socket.IO carrying live state to the agent dashboard. When a call lands:
- TwiML routes the carrier WebSocket to our bridge service.
- The bridge opens an OpenAI Realtime WebSocket and attaches the per-vertical system prompt (we maintain prompts for Healthcare, Real Estate, Sales, Behavioral Health, Salons, and Auto).
- Audit events stream to Postgres in real time so the dashboard updates within 80 ms.
- On call end, the bridge closes both sockets, persists the final transcript, and triggers webhooks that fire CRM updates.
This is how we route inbound PSTN traffic to the same OpenAI Realtime model that powers our Healthcare agent.
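The `SESSION` object referenced in the bridge code below is where that per-vertical system prompt gets attached. A minimal sketch, assuming the beta Realtime session fields; the prompt constant, voice choice, and transcription model are illustrative:

```js
// Illustrative session config; field names follow the beta OpenAI Realtime API.
const SESSION = {
  modalities: ["text", "audio"],
  instructions: HEALTHCARE_PROMPT,        // hypothetical per-vertical prompt, loaded elsewhere
  voice: "alloy",
  input_audio_format: "pcm16",            // what the bridge sends after upsampling
  output_audio_format: "pcm16",           // what response.audio.delta carries back
  turn_detection: { type: "server_vad" }, // makes OpenAI emit speech_started on barge-in
  input_audio_transcription: { model: "whisper-1" }, // enables user-side transcripts
};
```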
Code: the bridge in ~30 lines
```js
// Twilio -> OpenAI: decode µ-law, upsample, forward.
ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.event === "start") {
    // The start frame carries the streamSid we must echo on every outbound frame.
    streamSid = evt.start.streamSid;
    openai.send(JSON.stringify({ type: "session.update", session: SESSION }));
  }
  if (evt.event === "media") {
    // 8 kHz µ-law in, 24 kHz PCM16 out; helper sketches follow below.
    const pcm24 = upsample8to24(muLawDecode(Buffer.from(evt.media.payload, "base64")));
    openai.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: pcm24.toString("base64"),
    }));
  }
});

// OpenAI -> Twilio: downsample, µ-law encode, return.
openai.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    const mu = muLawEncode(downsample24to8(Buffer.from(evt.delta, "base64")));
    ws.send(JSON.stringify({ event: "media", streamSid, media: { payload: mu.toString("base64") } }));
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    // Barge-in: cancel the in-flight response and flush Twilio's playout buffer.
    openai.send(JSON.stringify({ type: "response.cancel" }));
    ws.send(JSON.stringify({ event: "clear", streamSid }));
  }
});
```
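The snippet leans on four helpers that are easy to get wrong. A minimal sketch, assuming Buffers of little-endian PCM16: the codec is the standard G.711 µ-law algorithm, and the resamplers use linear interpolation and averaging, which is good enough for bring-up; the build steps below rightly call for a polyphase filter in production.

```js
// G.711 µ-law byte -> PCM16 (little-endian) Buffer.
function muLawDecode(mu) {
  const out = Buffer.alloc(mu.length * 2);
  for (let i = 0; i < mu.length; i++) {
    const u = ~mu[i] & 0xff;
    const t = (((u & 0x0f) << 3) + 0x84) << ((u & 0x70) >> 4);
    out.writeInt16LE(u & 0x80 ? 0x84 - t : t - 0x84, i * 2);
  }
  return out;
}

// PCM16 Buffer -> G.711 µ-law bytes.
function muLawEncode(pcm) {
  const out = Buffer.alloc(pcm.length / 2);
  for (let i = 0; i < out.length; i++) {
    let s = pcm.readInt16LE(i * 2);
    const sign = s < 0 ? 0x80 : 0;
    s = Math.min(Math.abs(s), 32635) + 0x84;      // clip, then bias
    const exp = 31 - Math.clz32(s) - 7;           // segment number 0..7
    out[i] = ~(sign | (exp << 4) | ((s >> (exp + 3)) & 0x0f)) & 0xff;
  }
  return out;
}

// 8 kHz -> 24 kHz: each input sample becomes three, linearly interpolated.
function upsample8to24(pcm8) {
  const n = pcm8.length / 2;
  const out = Buffer.alloc(n * 6);
  for (let i = 0; i < n; i++) {
    const a = pcm8.readInt16LE(i * 2);
    const b = i + 1 < n ? pcm8.readInt16LE((i + 1) * 2) : a;
    out.writeInt16LE(a, i * 6);
    out.writeInt16LE(Math.round(a + (b - a) / 3), i * 6 + 2);
    out.writeInt16LE(Math.round(a + (2 * (b - a)) / 3), i * 6 + 4);
  }
  return out;
}

// 24 kHz -> 8 kHz: average each group of three samples (crude low-pass).
function downsample24to8(pcm24) {
  const n = Math.floor(pcm24.length / 6);
  const out = Buffer.alloc(n * 2);
  for (let i = 0; i < n; i++) {
    const sum = pcm24.readInt16LE(i * 6) + pcm24.readInt16LE(i * 6 + 2) + pcm24.readInt16LE(i * 6 + 4);
    out.writeInt16LE(Math.round(sum / 3), i * 2);
  }
  return out;
}
```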
Build steps
1. Provision a Twilio number and point its Voice webhook at a TwiML endpoint that returns `<Connect><Stream url="wss://your-bridge/twilio">` (sketched just below).
2. Stand up a WebSocket server (Node `ws`, FastAPI WebSocket, or Bun) on a publicly reachable HTTPS endpoint.
3. Implement µ-law ↔ PCM16 transcoding in pure code; the µ-law table is 256 entries and never changes.
4. Build the resampler at 8 kHz ↔ 24 kHz. Use a polyphase filter; nearest-neighbor interpolation will sound robotic.
5. Wire interruption: handle `speech_started` upstream and emit Twilio `clear` events to drop queued audio.
6. Add a 30-second idle ping to both sockets and tear down the bridge on `stop` events.
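Step 1 in practice: a tiny handler that returns the TwiML. A sketch, assuming Express; the path, hostname, and port are placeholders:

```js
import express from "express";

const app = express();

// Twilio POSTs here when a call arrives; the returned TwiML tells the
// carrier to open a Media Streams WebSocket to the bridge.
app.post("/voice", (req, res) => {
  res.type("text/xml").send(`<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-bridge.example.com/twilio" />
  </Connect>
</Response>`);
});

app.listen(8084);
```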
FAQ
Why does my agent stutter for the first second? The Twilio `start` event arrives before audio. If you forward it as audio, the buffer fills with zeros. Wait for the first `media` event before opening the OpenAI session.
How do I handle DTMF? Twilio sends DTMF as separate `dtmf` events on the same WebSocket. Translate them to text and inject into the OpenAI conversation as a user message, as sketched below.
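A sketch of that translation inside the Twilio message handler, assuming the beta Realtime event names; the injected phrasing is up to you:

```js
if (evt.event === "dtmf") {
  // Surface the keypress to the model as a user turn, then request a reply.
  openai.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: `Caller pressed ${evt.dtmf.digit}` }],
    },
  }));
  openai.send(JSON.stringify({ type: "response.create" }));
}
```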
Do I need to handle reconnection? The Twilio Media Stream cannot reconnect mid-call — Twilio simply closes. The OpenAI side can reconnect, but you lose context. Plan to fail closed.
What is the latency budget? Mic-to-mic latency under 1.2 s feels natural. We see 850–950 ms on a same-region bridge with all six steps optimized.
Can I add transcription mid-call? Yes — OpenAI Realtime emits transcript events automatically; persist them to Postgres on the way through.
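For example, with node-postgres; the table shape and pool are illustrative, and the event names assume the beta API (user-side transcripts require `input_audio_transcription` on the session):

```js
// A second listener alongside the bridge handler; EventEmitters allow several.
// Assistant transcript lands once per response; user transcript arrives
// asynchronously after each completed utterance.
openai.on("message", async (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio_transcript.done") {
    await pool.query(
      "INSERT INTO call_transcripts (stream_sid, role, text) VALUES ($1, $2, $3)",
      [streamSid, "assistant", evt.transcript]
    );
  }
  if (evt.type === "conversation.item.input_audio_transcription.completed") {
    await pool.query(
      "INSERT INTO call_transcripts (stream_sid, role, text) VALUES ($1, $2, $3)",
      [streamSid, "user", evt.transcript]
    );
  }
});
```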
CallSphere serves six verticals with PSTN bridges identical to this. Try the 14-day free trial or book a demo.