By Sagar Shankaran, Founder of CallSphere
How to bridge a Twilio Media Streams WebSocket to OpenAI Realtime in production: codec conversion, interruption handling, and the timeouts that actually matter.
Key takeaways
Phone calls in 2026 still ride on G.711 µ-law at 8 kHz. OpenAI Realtime's default audio format is 16-bit PCM at 24 kHz. The bridge between them is the most expensive 200 lines of code in your stack.
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]

The bridge solves the impedance mismatch between PSTN telephony and modern AI APIs. Twilio Media Streams hands you an inbound WebSocket carrying base64-encoded µ-law frames every 20 ms. OpenAI Realtime expects a different WebSocket carrying base64-encoded PCM16 frames at a different sample rate, with a different event schema. Neither side knows the other exists.
The bridge is a small server process that opens both connections, transcodes audio in both directions, translates events, handles interruption, and disappears when the call ends. Get it wrong and you get silence, echo, double-talk, or a 500 ms dead zone every time the user interrupts the agent.
A typical production bridge is a Node or FastAPI process listening on a public WebSocket URL. Twilio's TwiML <Connect><Stream> verb tells the carrier to open a WebSocket to that URL when a call comes in. On accept, the bridge opens a second WebSocket to OpenAI Realtime. From there:
1. media events from Twilio arrive base64-encoded as 8 kHz µ-law. Decode to PCM16, upsample to 24 kHz, base64-encode, and forward as input_audio_buffer.append to OpenAI.
2. response.audio.delta events from OpenAI arrive as 24 kHz PCM16. Downsample to 8 kHz, encode to µ-law, base64-encode, and send to Twilio with the original streamSid.
3. input_audio_buffer.speech_started from OpenAI means the user just interrupted. Fire response.cancel upstream and a Twilio clear event downstream so the in-flight TTS audio drains immediately.

Skip step 3 and the agent will keep talking over the user for 600–1200 ms.
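Steps 1 and 2 both hinge on G.711 µ-law conversion. The following is a minimal sketch of that codec, assuming mono PCM16 in little-endian byte order and using the standard bias-and-segment formulation of G.711. The function names muLawDecode and muLawEncode match the helpers the article's handler calls; the per-sample helpers and the constants are illustrative names, and this is not necessarily the implementation CallSphere runs.

```javascript
// Assumption: PCM16 is mono, little-endian. Names other than muLawDecode and
// muLawEncode are illustrative and do not come from the article.
const BIAS = 0x84;  // 132, the G.711 µ-law bias
const CLIP = 32635; // largest magnitude that still fits after adding BIAS

// Expand one µ-law byte to a signed 16-bit linear sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff; // µ-law bytes are stored bit-inverted
  const t = (((u & 0x0f) << 3) + BIAS) << ((u & 0x70) >> 4);
  return (u & 0x80) ? BIAS - t : t - BIAS;
}

// Compress one signed 16-bit linear sample to a µ-law byte.
function muLawEncodeSample(sample) {
  const sign = sample < 0 ? 0x80 : 0;
  const mag = Math.min(Math.abs(sample), CLIP) + BIAS;
  // The exponent is the position of the highest set bit, counted from bit 7.
  let exponent = 7;
  for (let mask = 0x4000; (mag & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (mag >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Buffer of µ-law bytes -> Buffer of PCM16 LE samples (twice as long).
function muLawDecode(muLaw) {
  const pcm = Buffer.alloc(muLaw.length * 2);
  for (let i = 0; i < muLaw.length; i++) {
    pcm.writeInt16LE(muLawDecodeSample(muLaw[i]), i * 2);
  }
  return pcm;
}

// Buffer of PCM16 LE samples -> Buffer of µ-law bytes (half as long).
function muLawEncode(pcm) {
  const n = pcm.length >> 1;
  const out = Buffer.alloc(n);
  for (let i = 0; i < n; i++) {
    out[i] = muLawEncodeSample(pcm.readInt16LE(i * 2));
  }
  return out;
}
```

A 20 ms Twilio frame is 160 µ-law bytes, so muLawDecode returns 320 bytes of 8 kHz PCM16 for it. The codec is lossy by design: decoding then re-encoding a byte returns the same byte (apart from the redundant "negative zero" code 0x7F), but encoding then decoding a sample only returns an approximation.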
The CallSphere Sales Calling and After-hours agents both run on this exact pattern. The bridge is a Node.js process supervised by PM2 with Socket.IO carrying live state to the agent dashboard. When a call lands:
This is how we route inbound PSTN traffic to the same OpenAI Realtime model that powers our Healthcare agent.
// `ws` is the Twilio Media Streams socket; `openai` is the OpenAI Realtime socket.
// SESSION and the codec helpers (muLawDecode, upsample8to24, ...) are assumed to be in scope.
let streamSid = null;

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.event === "start") {
    // Remember the stream ID; every message sent back to Twilio carries it.
    streamSid = evt.start.streamSid;
    openai.send(JSON.stringify({ type: "session.update", session: SESSION }));
  }
  if (evt.event === "media") {
    // Twilio -> OpenAI: µ-law 8 kHz -> PCM16 24 kHz.
    const pcm24 = upsample8to24(muLawDecode(Buffer.from(evt.media.payload, "base64")));
    openai.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: pcm24.toString("base64"),
    }));
  }
});

openai.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    // OpenAI -> Twilio: PCM16 24 kHz -> µ-law 8 kHz.
    const mu = muLawEncode(downsample24to8(Buffer.from(evt.delta, "base64")));
    ws.send(JSON.stringify({ event: "media", streamSid, media: { payload: mu.toString("base64") } }));
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    // Barge-in: cancel generation upstream and flush Twilio's queued audio.
    openai.send(JSON.stringify({ type: "response.cancel" }));
    ws.send(JSON.stringify({ event: "clear", streamSid }));
  }
});
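The handler above calls upsample8to24 and downsample24to8 without showing them. Here is a deliberately naive sketch of both, assuming mono little-endian PCM16 and the fixed 3:1 ratio between 24 kHz and 8 kHz. Linear interpolation and three-sample averaging are stand-ins chosen for brevity, not a claim about what any production bridge uses. A real deployment would more likely carry filter state across frames and use a proper low-pass filter.

```javascript
// Assumptions: mono PCM16 LE buffers, exact 3:1 rate ratio. Each call treats
// its buffer in isolation, so there is a small discontinuity at frame edges.

// 8 kHz -> 24 kHz: emit 3 output samples per input sample, linearly
// interpolating toward the next input sample.
function upsample8to24(pcm8) {
  const n = pcm8.length >> 1;
  const out = Buffer.alloc(n * 3 * 2);
  for (let i = 0; i < n; i++) {
    const a = pcm8.readInt16LE(i * 2);
    // At the end of the buffer there is no next sample, so hold the last one.
    const b = i + 1 < n ? pcm8.readInt16LE((i + 1) * 2) : a;
    for (let k = 0; k < 3; k++) {
      out.writeInt16LE(Math.round(a + ((b - a) * k) / 3), (i * 3 + k) * 2);
    }
  }
  return out;
}

// 24 kHz -> 8 kHz: average each group of 3 samples (a crude low-pass filter)
// and keep one value per group. Trailing samples that do not fill a complete
// group of 3 are dropped.
function downsample24to8(pcm24) {
  const n = Math.floor(pcm24.length / 6);
  const out = Buffer.alloc(n * 2);
  for (let i = 0; i < n; i++) {
    const sum =
      pcm24.readInt16LE(i * 6) +
      pcm24.readInt16LE(i * 6 + 2) +
      pcm24.readInt16LE(i * 6 + 4);
    out.writeInt16LE(Math.round(sum / 3), i * 2);
  }
  return out;
}
```

With these, a decoded 20 ms Twilio frame (320 bytes of 8 kHz PCM16) becomes 960 bytes at 24 kHz. Because downsample24to8 drops leftover samples, a bridge that receives deltas whose length is not a multiple of three samples would need to buffer the remainder until the next delta arrives.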
[…] <Connect><Stream url="wss://your-bridge/twilio">.
[…] ws, FastAPI WebSocket, or Bun) on a publicly reachable HTTPS endpoint.
[…] speech_started upstream and emit Twilio clear events to drop queued audio.
[…] stop events.

Why does my agent stutter for the first second? The Twilio start event arrives before audio. If you forward it as audio, the buffer fills with zeros. Wait for the first media event before opening the OpenAI session.
How do I handle DTMF? Twilio sends DTMF as separate dtmf events on the same WebSocket. Translate them to text and inject into the OpenAI conversation as a user message.
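One way to do that translation is sketched below, assuming Twilio's dtmf message carries the key in dtmf.digit and that the OpenAI side accepts a conversation.item.create event with an input_text part. The function name and the wording of the injected sentence are hypothetical.

```javascript
// Turn one Twilio "dtmf" message into an OpenAI Realtime client event.
// Returns null for any other Twilio message so callers can ignore it.
// Assumption: the sentence below is illustrative; phrase it for your own prompt.
function dtmfToConversationItem(twilioEvt) {
  if (twilioEvt.event !== "dtmf" || !twilioEvt.dtmf) return null;
  const digit = twilioEvt.dtmf.digit;
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [
        { type: "input_text", text: `The caller pressed ${digit} on the keypad.` },
      ],
    },
  };
}
```

Inside the Twilio message handler you would send the result with openai.send(JSON.stringify(item)). If the agent should react to the key press right away, following it with a response.create event is one option.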
Do I need to handle reconnection? The Twilio Media Stream cannot reconnect mid-call — Twilio simply closes. The OpenAI side can reconnect, but you lose context. Plan to fail closed.
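"Fail closed" can be as simple as tying the two sockets' lifetimes together. The sketch below assumes both sockets expose on(event, handler) and close(), as sockets from the ws package do. The function name, the reason strings, and the onEnd callback are hypothetical.

```javascript
// Tear down both legs of the bridge as soon as either one closes or errors.
// Assumption: sockets expose on(event, handler) and close().
function failClosed(twilioWs, openaiWs, onEnd = () => {}) {
  let ended = false;
  const end = (reason) => {
    if (ended) return; // closing one socket fires the other's handler; run once
    ended = true;
    for (const sock of [twilioWs, openaiWs]) {
      try {
        sock.close();
      } catch {
        // Socket was already closed or never opened; nothing left to do.
      }
    }
    onEnd(reason);
  };
  twilioWs.on("close", () => end("twilio_closed"));
  twilioWs.on("error", () => end("twilio_error"));
  openaiWs.on("close", () => end("openai_closed"));
  openaiWs.on("error", () => end("openai_error"));
}
```

Calling failClosed(ws, openai, (reason) => console.log("call ended:", reason)) right after both sockets are created gives you a single place to log why a call ended and to release any per-call state.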
What is the latency budget? Mic-to-mic latency under 1.2 s feels natural. We see 850–950 ms on a same-region bridge with all six steps optimized.
Can I add transcription mid-call? Yes — OpenAI Realtime emits transcript events automatically; persist them to Postgres on the way through.
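A small mapper keeps the persistence code separate from the socket handler. This sketch assumes the event names that go with the response.audio.delta naming used in this article, which newer API versions may rename. The row shape and the function name are hypothetical, and the Postgres write itself is left to the caller. As far as we know, caller-side transcripts only arrive when input audio transcription is enabled in the session configuration.

```javascript
// Map an OpenAI Realtime server event to a transcript row, or null if the
// event carries no finished transcript.
// Assumptions: event names follow the same API version as response.audio.delta;
// the row fields (callSid, speaker, itemId, text) are illustrative.
function transcriptRow(evt, callSid) {
  switch (evt.type) {
    case "conversation.item.input_audio_transcription.completed":
      return { callSid, speaker: "caller", itemId: evt.item_id, text: evt.transcript };
    case "response.audio_transcript.done":
      return { callSid, speaker: "agent", itemId: evt.item_id, text: evt.transcript };
    default:
      return null;
  }
}
```

In the OpenAI message handler, call transcriptRow(evt, callSid) and hand any non-null result to your own insert function, ideally without awaiting it so a slow database never delays audio.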
CallSphere serves six verticals with PSTN bridges identical to this. Try the 14-day free trial or book a demo.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.