By Sagar Shankaran, Founder of CallSphere
Wire Twilio Media Streams to OpenAI Realtime in under 50 lines of Node.js. Real working code, mu-law to PCM16 transcoding, server VAD, barge-in, and production tips.
Key takeaways
TL;DR — A Twilio inbound number, a Node.js WebSocket bridge, and the
gpt-4o-realtime-preview-2025-06-03model are all you need for a sub-800ms voice agent. The whole bridge fits in 50 lines if you keep it tight.
A working inbound voice agent: a caller dials your Twilio number, Twilio opens a bidirectional Media Stream to your Node.js server, and your server pipes audio to OpenAI Realtime and back. You'll hear the model speak with natural turn-taking, barge-in interruption, and server-side voice activity detection. Total round-trip latency lands between 600ms and 900ms on a US east-coast box.
gpt-4o-realtime-preview-2025-06-03).cloudflared tunnel for dev).npm install ws express — that's it for deps.sequenceDiagram
participant C as Caller (PSTN)
participant T as Twilio
participant B as Bridge (Node.js)
participant O as OpenAI Realtime
C->>T: Dials number
T->>B: HTTP POST /incoming (TwiML)
B-->>T: <Connect><Stream wss://.../media>
T->>B: WS open + start event
T->>B: media frames (mu-law 8k)
B->>O: input_audio_buffer.append (g711_ulaw)
O-->>B: response.audio.delta (g711_ulaw)
B-->>T: media event (base64 mu-law)
T-->>C: speaks audio
When Twilio receives a call, it hits your webhook for TwiML. Return a <Connect><Stream> pointing at your WebSocket:
```xml
<Connect> is bidirectional; <Start> is one-way (caller-to-server only). You almost always want <Connect> for AI agents.
```js import express from "express"; import { WebSocketServer } from "ws"; import http from "http";
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
const app = express();
app.post("/incoming", (_, res) => {
res.type("text/xml").send(`
For each Twilio WebSocket, open a paired OpenAI WebSocket. Use the g711_ulaw audio format on both sides — OpenAI accepts mu-law natively, so no transcoding required.
```js import WebSocket from "ws"; const URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03";
wss.on("connection", (twilio) => { let streamSid = null; const ai = new WebSocket(URL, { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, "OpenAI-Beta": "realtime=v1", }, });
ai.on("open", () => ai.send(JSON.stringify({ type: "session.update", session: { voice: "alloy", input_audio_format: "g711_ulaw", output_audio_format: "g711_ulaw", turn_detection: { type: "server_vad", threshold: 0.5 }, instructions: "You are CallSphere, a friendly receptionist. Keep replies under 2 sentences." } }))); ```
```js twilio.on("message", (raw) => { const m = JSON.parse(raw.toString()); if (m.event === "start") streamSid = m.start.streamSid; if (m.event === "media" && ai.readyState === 1) { ai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: m.media.payload, // already base64 mu-law })); } if (m.event === "stop") ai.close(); }); ```
```js ai.on("message", (raw) => { const e = JSON.parse(raw.toString()); if (e.type === "response.audio.delta" && streamSid) { twilio.send(JSON.stringify({ event: "media", streamSid, media: { payload: e.delta }, // base64 mu-law from OpenAI })); } if (e.type === "input_audio_buffer.speech_started") { // Caller started talking — clear Twilio buffer for true barge-in twilio.send(JSON.stringify({ event: "clear", streamSid })); } }); }); ```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Expose port 8080 with cloudflared tunnel --url http://localhost:8080, paste the URL into your Twilio number's Voice config (HTTP POST to /incoming), and dial. You should hear "Hello, this is CallSphere" within a second. Interrupt the model — it should stop instantly because of the clear event in Step 5.
pcm16 on either side means double transcoding. Use g711_ulaw end-to-end with Twilio.streamSid on outbound media: Twilio silently drops it. Capture the value from the start event.clear event, the model keeps talking over the caller. Always wire speech_started.CallSphere's Healthcare receptionist runs the same pattern but at PCM16 24kHz with server VAD threshold 0.55, plus a transcript sidecar that writes every user/assistant turn to Postgres for post-call analytics (sentiment –1.0 to 1.0, lead score 0–100). The Real Estate OneRoof agent uses the OpenAI Agents SDK with a Go gateway and NATS for fan-out. Across 37 production agents, 90+ tools, and 115+ DB tables, this Twilio + Realtime path is the inbound default. Try it on the 14-day trial or see the demo.
Does OpenAI Realtime accept mu-law? Yes — set input_audio_format and output_audio_format to g711_ulaw to skip transcoding entirely.
What's the max call length? OpenAI Realtime sessions cap around 30 minutes by default. For longer calls, persist transcript state and re-open a session.
How do I add tools (booking, lookup)? Add a tools array to session.update with JSON Schema; handle response.function_call_arguments.done to execute and reply with a tool result.
Latency too high? Pin Twilio region close to your bridge, use a server in us-east-1, and avoid streaming through a CDN.
Mu-law or PCM16? Mu-law is fine for telephony fidelity. Use PCM16 24kHz only when the audio path is browser → server → OpenAI.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
A VoIP telephone number is a phone number that routes calls over the internet instead of copper lines. Learn what a VoIP number is, how to get one, what it costs, and how to pair it with an AI voice agent in 2026.
How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.
Haystack 2.7's Agent component plus an Ollama-served Llama 3.2 gives you tool-calling RAG with citations. Here's a complete pipeline against your own document store.
Run STT, LLM, and TTS entirely on Cloudflare's edge — no OpenAI, no ElevenLabs. Real working code with Whisper, Llama 3.3 70B, and Deepgram Aura.
On May 4 2026 OpenAI published its Realtime stack rebuild — split-relay plus transceiver edge. Here is what changed and what it means for production voice agents.
Version your prompts in git, run a 50-case eval suite on every PR, block merges below threshold, and ship a new agent prompt with confidence — full GitHub Actions tutorial.
© 2026 CallSphere LLC. All rights reserved.
Watch how CallSphere handles real customer calls, schedules appointments, and processes payments — live.
Try Live DemoBook a DemoCalculate Your ROI