AI Voice Agents

Twilio Media Streams to OpenAI Realtime: A WebSocket Bridge

How to bridge a Twilio Media Streams WebSocket to OpenAI Realtime in production: codec conversion, interruption handling, and the timeouts that actually matter.

Phone calls in 2026 still ride on G.711 µ-law at 8 kHz. OpenAI wants 16-bit PCM at 24 kHz. The bridge between them is the most expensive 200 lines of code in your stack.

What problem does the bridge solve?

flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
CallSphere reference architecture

It solves the impedance mismatch between PSTN telephony and modern AI APIs. Twilio Media Streams hands you an inbound WebSocket carrying base64-encoded µ-law frames every 20 ms. OpenAI Realtime expects a different WebSocket carrying base64-encoded PCM16 frames at a different sample rate, with a different event schema. Neither side knows the other exists.

The bridge is a small server process that opens both connections, transcodes audio in both directions, translates events, handles interruption, and disappears when the call ends. Get it wrong and you get either silence, echo, double-talk, or a 500 ms dead zone every time the user interrupts the agent.

How does the bridge actually run?

A typical production bridge is a Node or FastAPI process listening on a public WebSocket URL. Twilio's TwiML <Connect><Stream> verb tells the carrier to open a WebSocket to that URL when a call comes in. On accept, the bridge opens a second WebSocket to OpenAI Realtime. From there:

  1. Inbound media events from Twilio arrive base64-encoded as 8 kHz µ-law. Decode to PCM16, upsample to 24 kHz, base64-encode, and forward as input_audio_buffer.append to OpenAI.
  2. Outbound response.audio.delta events from OpenAI arrive as 24 kHz PCM16. Downsample to 8 kHz, encode to µ-law, base64-encode, and send to Twilio with the original streamSid.
  3. input_audio_buffer.speech_started from OpenAI means the user just interrupted — fire response.cancel upstream and a Twilio clear event downstream so the in-flight TTS audio drains immediately.

Skip step 3 and the agent will keep talking over the user for 600–1200 ms.
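For reference, the TwiML handoff that starts this flow is a single <Connect><Stream> response — the URL below is a placeholder for your own bridge endpoint:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-bridge.example.com/twilio" />
  </Connect>
</Response>
```

Twilio opens the carrier WebSocket to that URL the moment the call connects; everything after that is the three-step loop above.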

CallSphere's implementation

The CallSphere Sales Calling and After-hours agents both run on this exact pattern. The bridge is a Node.js process supervised by PM2 with Socket.IO carrying live state to the agent dashboard. When a call lands:

  • TwiML routes the carrier WebSocket to our bridge service.
  • The bridge opens an OpenAI Realtime WebSocket and attaches the per-vertical system prompt (we maintain prompts for Healthcare, Real Estate, Sales, Behavioral Health, Salons, and Auto).
  • Audit events stream to Postgres in real time so the dashboard updates within 80 ms.
  • On call end, the bridge closes both sockets, persists the final transcript, and triggers webhooks that fire CRM updates.

This is how we route inbound PSTN traffic to the same OpenAI Realtime model that powers our Healthcare agent.

Code: the core bridge loop

let streamSid; // captured from Twilio's start event, echoed on every outbound frame

// Twilio -> OpenAI: decode µ-law, upsample to 24 kHz, forward as input audio
ws.on("message", async (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.event === "start") {
    streamSid = evt.start.streamSid;
    openai.send(JSON.stringify({ type: "session.update", session: SESSION }));
  }
  if (evt.event === "media") {
    const pcm24 = upsample8to24(muLawDecode(Buffer.from(evt.media.payload, "base64")));
    openai.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: pcm24.toString("base64"),
    }));
  }
});

// OpenAI -> Twilio: downsample to 8 kHz, re-encode as µ-law; on barge-in,
// cancel the in-flight response and flush Twilio's playback buffer
openai.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    const mu = muLawEncode(downsample24to8(Buffer.from(evt.delta, "base64")));
    ws.send(JSON.stringify({ event: "media", streamSid, media: { payload: mu.toString("base64") } }));
  }
  if (evt.type === "input_audio_buffer.speech_started") {
    openai.send(JSON.stringify({ type: "response.cancel" }));
    ws.send(JSON.stringify({ event: "clear", streamSid }));
  }
});

Build steps

  1. Provision a Twilio number and point its Voice webhook at a TwiML endpoint that returns <Connect><Stream url="wss://your-bridge/twilio">.
  2. Stand up a WebSocket server (Node ws, FastAPI WebSocket, or Bun) on a publicly reachable HTTPS endpoint.
  3. Implement µ-law ↔ PCM16 transcoding in pure code; the µ-law table is 256 entries and never changes.
  4. Build the resampler at 8 kHz ↔ 24 kHz. Use a polyphase filter; nearest-neighbor will sound robotic.
  5. Wire interruption: handle speech_started upstream and emit Twilio clear events to drop queued audio.
  6. Add a 30-second idle ping to both sockets and tear down the bridge on stop events.
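The transcoding in step 3 can be sketched with the standard G.711 bias/segment arithmetic. Function names here are illustrative; a real bridge precomputes the 256-entry decode table from this and indexes into it per byte:

```javascript
// Per-sample G.711 µ-law codec using the standard bias/segment arithmetic.
const BIAS = 0x84;   // 132, the µ-law encoding bias
const CLIP = 32635;  // max magnitude before the biased value overflows 15 bits

function muLawEncodeSample(pcm) {
  const sign = pcm < 0 ? 0x80 : 0;
  const mag = Math.min(Math.abs(pcm), CLIP) + BIAS;
  let exponent = 7;                       // segment = position of highest set bit
  for (let mask = 0x4000; (mag & mask) === 0 && exponent > 0; mask >>= 1) exponent--;
  const mantissa = (mag >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // µ-law bytes are inverted
}

function muLawDecodeSample(mu) {
  mu = ~mu & 0xff;
  const sign = mu & 0x80;
  const exponent = (mu >> 4) & 0x07;
  const mantissa = mu & 0x0f;
  const mag = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -mag : mag;
}
```

Round-tripping loses precision by design — µ-law is logarithmic, so quiet samples survive nearly intact while loud ones are quantized coarsely.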

FAQ

Why does my agent stutter for the first second? The Twilio start event arrives before audio. If you forward it as audio, the buffer fills with zeros. Wait for the first media event before opening the OpenAI session.
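A minimal gate for that looks like the sketch below — `makeMediaGate` and `startOpenAISession` are illustrative names, not part of either API:

```javascript
// Defer streaming to OpenAI until Twilio delivers its first real audio frame.
function makeMediaGate(startOpenAISession) {
  let hasAudio = false;
  return function onTwilioEvent(evt) {
    if (evt.event === "media" && !hasAudio) {
      hasAudio = true;          // first real frame: now open the OpenAI session
      startOpenAISession();
    }
    return hasAudio;            // callers forward audio only once this is true
  };
}
```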


How do I handle DTMF? Twilio sends DTMF as separate dtmf events on the same WebSocket. Translate them to text and inject into the OpenAI conversation as a user message.
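One way to do that translation — `dtmfToUserMessage` is a hypothetical helper, and the item shape follows the Realtime conversation.item.create schema:

```javascript
// Translate a Twilio `dtmf` event into an OpenAI Realtime user message.
function dtmfToUserMessage(evt) {
  const digit = evt.dtmf.digit; // Twilio delivers one digit per dtmf event
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: `The caller pressed ${digit} on the keypad.` }],
    },
  };
}

// Inside the Twilio ws handler:
// if (evt.event === "dtmf") openai.send(JSON.stringify(dtmfToUserMessage(evt)));
```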

Do I need to handle reconnection? The Twilio Media Stream cannot reconnect mid-call — Twilio simply closes. The OpenAI side can reconnect, but you lose context. Plan to fail closed.

What is the latency budget? Mic-to-mic latency under 1.2 s feels natural. We see 850–950 ms on a same-region bridge with all six steps optimized.

Can I add transcription mid-call? Yes — OpenAI Realtime emits transcript events automatically; persist them to Postgres on the way through.
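A sketch of that persistence path, assuming the server-side transcription event names from the Realtime schema — `persistTranscript` is a stand-in for your Postgres write:

```javascript
// Collect user and agent transcripts from OpenAI Realtime events.
function makeTranscriptCollector(persistTranscript) {
  return function onRealtimeEvent(evt) {
    if (evt.type === "conversation.item.input_audio_transcription.completed") {
      persistTranscript({ role: "user", text: evt.transcript });      // caller speech
    }
    if (evt.type === "response.audio_transcript.done") {
      persistTranscript({ role: "assistant", text: evt.transcript }); // agent speech
    }
  };
}
```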

CallSphere serves six verticals with PSTN bridges identical to this. Try the 14-day free trial or book a demo.



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available — no signup required.