How to Build a Voice AI Agent in 50 Lines: Twilio + OpenAI Realtime
Wire Twilio Media Streams to OpenAI Realtime in under 50 lines of Node.js. Real working code, g711_ulaw passthrough (no mu-law to PCM16 transcoding), server VAD, barge-in, and production tips.
TL;DR — A Twilio inbound number, a Node.js WebSocket bridge, and the gpt-4o-realtime-preview-2025-06-03 model are all you need for a sub-800ms voice agent. The whole bridge fits in 50 lines if you keep it tight.
What you'll build
A working inbound voice agent: a caller dials your Twilio number, Twilio opens a bidirectional Media Stream to your Node.js server, and your server pipes audio to OpenAI Realtime and back. You'll hear the model speak with natural turn-taking, barge-in interruption, and server-side voice activity detection. Total round-trip latency lands between 600ms and 900ms on a US east-coast box.
Prerequisites
- Twilio account with one purchased phone number ($1.15/mo).
- OpenAI API key with Realtime access (gpt-4o-realtime-preview-2025-06-03).
- Node.js 20+ and a public HTTPS endpoint (use cloudflared tunnel for dev).
- npm install ws express — that's it for deps.
- Familiarity with mu-law 8kHz audio (Twilio) vs PCM16 24kHz (OpenAI).
Architecture
```mermaid
sequenceDiagram
    participant C as Caller (PSTN)
    participant T as Twilio
    participant B as Bridge (Node.js)
    participant O as OpenAI Realtime
    C->>T: Dials number
    T->>B: HTTP POST /incoming (TwiML)
    B-->>T: <Connect><Stream wss://.../media>
    T->>B: WS open + start event
    T->>B: media frames (mu-law 8k)
    B->>O: input_audio_buffer.append (g711_ulaw)
    O-->>B: response.audio.delta (g711_ulaw)
    B-->>T: media event (base64 mu-law)
    T-->>C: speaks audio
```
Step 1 — TwiML to start the stream
When Twilio receives a call, it hits your webhook for TwiML. Return a <Connect><Stream> pointing at your WebSocket:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-domain.example/media" />
  </Connect>
</Response>
```
<Connect> is bidirectional; <Start> is one-way (caller-to-server only). You almost always want <Connect> for AI agents.
Step 2 — Boot an Express + ws server
```js
import express from "express";
import http from "http";
import { WebSocketServer } from "ws";

const app = express();

// Twilio hits this webhook on an inbound call; answer with TwiML.
// PUBLIC_HOST is an assumed env var holding your tunnel hostname (e.g. from cloudflared).
app.post("/incoming", (_, res) => {
  res.type("text/xml").send(`<Response>
  <Connect><Stream url="wss://${process.env.PUBLIC_HOST}/media" /></Connect>
</Response>`);
});

const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: "/media" });
server.listen(8080);
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Connect to OpenAI Realtime per call
For each Twilio WebSocket, open a paired OpenAI WebSocket. Use the g711_ulaw audio format on both sides — OpenAI accepts mu-law natively, so no transcoding required.
```js
import WebSocket from "ws";

const URL =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03";

wss.on("connection", (twilio) => {
  let streamSid = null;

  const ai = new WebSocket(URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });

  // Configure the session as soon as the OpenAI socket opens
  ai.on("open", () =>
    ai.send(JSON.stringify({
      type: "session.update",
      session: {
        voice: "alloy",
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        turn_detection: { type: "server_vad", threshold: 0.5 },
        instructions:
          "You are CallSphere, a friendly receptionist. Keep replies under 2 sentences.",
      },
    })),
  );
```
Step 4 — Pipe Twilio audio into OpenAI
```js
  twilio.on("message", (raw) => {
    const m = JSON.parse(raw.toString());
    if (m.event === "start") streamSid = m.start.streamSid;
    if (m.event === "media" && ai.readyState === WebSocket.OPEN) {
      ai.send(JSON.stringify({
        type: "input_audio_buffer.append",
        audio: m.media.payload, // already base64 mu-law
      }));
    }
    if (m.event === "stop") ai.close();
  });
```
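A detail that trips people up: because the session uses server_vad, OpenAI commits the input buffer and starts a response automatically when the caller stops talking — which is why Step 4 never calls commit. If you turn detection off (push-to-talk style), you end the turn yourself; a sketch:

```js
// Only needed when session.turn_detection is null (manual turn-taking):
ai.send(JSON.stringify({ type: "input_audio_buffer.commit" })); // close the caller's turn
ai.send(JSON.stringify({ type: "response.create" }));           // ask the model to respond
```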
Step 5 — Pipe OpenAI audio back to Twilio
```js
  ai.on("message", (raw) => {
    const e = JSON.parse(raw.toString());
    if (e.type === "response.audio.delta" && streamSid) {
      twilio.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: e.delta }, // base64 mu-law from OpenAI
      }));
    }
    if (e.type === "input_audio_buffer.speech_started") {
      // Caller started talking — clear Twilio's buffer for true barge-in
      twilio.send(JSON.stringify({ event: "clear", streamSid }));
    }
  });
});
```
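The clear event stops playback on the call, but OpenAI may still be streaming the rest of its answer. A common production refinement is to also cancel the in-flight response so you aren't buffering (and paying for) audio you'll never play — a sketch using the Realtime response.cancel event, extending the speech_started branch above:

```js
if (e.type === "input_audio_buffer.speech_started") {
  twilio.send(JSON.stringify({ event: "clear", streamSid })); // stop Twilio playback
  ai.send(JSON.stringify({ type: "response.cancel" }));       // stop generation upstream
}
```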
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Test with a real call
Expose port 8080 with cloudflared tunnel --url http://localhost:8080, paste the URL into your Twilio number's Voice config (HTTP POST to /incoming), and dial. You should hear "Hello, this is CallSphere" within a second. Interrupt the model — it should stop instantly because of the clear event in Step 5.
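If the call connects but stays silent, check the webhook first. A quick sanity check from Node, assuming the bridge is on port 8080 as above:

```js
// Expect TwiML containing <Connect><Stream> back from your own server
fetch("http://localhost:8080/incoming", { method: "POST" })
  .then((r) => r.text())
  .then(console.log);
```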
Common pitfalls
- Wrong audio format: defaulting to pcm16 on either side means double transcoding. Use g711_ulaw end-to-end with Twilio (if you genuinely need PCM16, see the sketch after this list).
- No streamSid on outbound media: Twilio silently drops it. Capture the value from the start event.
- No barge-in: without the clear event, the model keeps talking over the caller. Always wire speech_started.
- One OpenAI socket for many calls: each concurrent call needs its own WS — Realtime sessions are per-conversation.
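For the cases where PCM16 is unavoidable (for example, the 24kHz browser path mentioned in the FAQ), the decode step is small. A minimal sketch of G.711 mu-law to PCM16 with naive 8kHz→24kHz upsampling — fine for experiments, but use a real resampler in production:

```js
// Decode one G.711 mu-law byte to a signed 16-bit PCM sample
function ulawByteToPcm16(b) {
  const u = ~b & 0xff;              // mu-law bytes are stored bit-inverted
  let t = ((u & 0x0f) << 3) + 0x84; // mantissa plus bias
  t <<= (u & 0x70) >> 4;            // apply exponent
  return u & 0x80 ? 0x84 - t : t - 0x84;
}

// Base64 mu-law frame (8kHz) -> base64 PCM16 at 24kHz.
// Naive upsampling repeats each sample 3x: crude but serviceable.
function ulawFrameToPcm24k(base64) {
  const mulaw = Buffer.from(base64, "base64");
  const pcm = Buffer.alloc(mulaw.length * 3 * 2);
  for (let i = 0; i < mulaw.length; i++) {
    const s = ulawByteToPcm16(mulaw[i]);
    for (let j = 0; j < 3; j++) pcm.writeInt16LE(s, (i * 3 + j) * 2);
  }
  return pcm.toString("base64");
}
```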
How CallSphere does this in production
CallSphere's Healthcare receptionist runs the same pattern but at PCM16 24kHz with server VAD threshold 0.55, plus a transcript sidecar that writes every user/assistant turn to Postgres for post-call analytics (sentiment –1.0 to 1.0, lead score 0–100). The Real Estate OneRoof agent uses the OpenAI Agents SDK with a Go gateway and NATS for fan-out. Across 37 production agents, 90+ tools, and 115+ DB tables, this Twilio + Realtime path is the inbound default. Try it on the 14-day trial or see the demo.
FAQ
Does OpenAI Realtime accept mu-law? Yes — set input_audio_format and output_audio_format to g711_ulaw to skip transcoding entirely.
What's the max call length? OpenAI Realtime sessions cap around 30 minutes by default. For longer calls, persist transcript state and re-open a session.
How do I add tools (booking, lookup)? Add a tools array to session.update with JSON Schema; handle response.function_call_arguments.done to execute and reply with a tool result. A sketch follows after this FAQ.
Latency too high? Pin Twilio region close to your bridge, use a server in us-east-1, and avoid streaming through a CDN.
Mu-law or PCM16? Mu-law is fine for telephony fidelity. Use PCM16 24kHz only when the audio path is browser → server → OpenAI.
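To make the tools answer concrete, here's a sketch with a hypothetical book_appointment function (the name and schema are illustrative, not part of the tutorial): register it in session.update, then hand its output back to the model when the arguments arrive:

```js
// 1) Declare the tool inside the session.update payload from Step 3:
const session = {
  tools: [{
    type: "function",
    name: "book_appointment", // hypothetical example tool
    description: "Book an appointment slot for the caller",
    parameters: {
      type: "object",
      properties: { date: { type: "string" }, time: { type: "string" } },
      required: ["date", "time"],
    },
  }],
  tool_choice: "auto",
};

// 2) Execute the call and return the result so the model can speak it:
ai.on("message", (raw) => {
  const e = JSON.parse(raw.toString());
  if (e.type === "response.function_call_arguments.done") {
    const args = JSON.parse(e.arguments);
    // ...run your booking logic with args.date / args.time, then:
    ai.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: e.call_id,
        output: JSON.stringify({ ok: true, slot: `${args.date} ${args.time}` }),
      },
    }));
    ai.send(JSON.stringify({ type: "response.create" })); // have the model announce the result
  }
});
```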
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.