How to Build a Voice AI Agent in 50 Lines: Twilio + OpenAI Realtime
Wire Twilio Media Streams to OpenAI Realtime in under 50 lines of Node.js. Real working code, g711_ulaw passthrough (no mu-law to PCM16 transcoding), server VAD, barge-in, and production tips.
TL;DR — A Twilio inbound number, a Node.js WebSocket bridge, and the gpt-4o-realtime-preview-2025-06-03 model are all you need for a sub-800ms voice agent. The whole bridge fits in 50 lines if you keep it tight.
What you'll build
A working inbound voice agent: a caller dials your Twilio number, Twilio opens a bidirectional Media Stream to your Node.js server, and your server pipes audio to OpenAI Realtime and back. You'll hear the model speak with natural turn-taking, barge-in interruption, and server-side voice activity detection. Total round-trip latency lands between 600ms and 900ms on a US east-coast box.
Prerequisites
- Twilio account with one purchased phone number ($1.15/mo).
- OpenAI API key with Realtime access (gpt-4o-realtime-preview-2025-06-03).
- Node.js 20+ and a public HTTPS endpoint (use cloudflared tunnel for dev).
- npm install ws express — that's it for deps.
- Familiarity with mu-law 8kHz audio (Twilio) vs PCM16 24kHz (OpenAI).
Architecture
```mermaid
sequenceDiagram
    participant C as Caller (PSTN)
    participant T as Twilio
    participant B as Bridge (Node.js)
    participant O as OpenAI Realtime
    C->>T: Dials number
    T->>B: HTTP POST /incoming (TwiML)
    B-->>T: <Connect><Stream wss://.../media>
    T->>B: WS open + start event
    T->>B: media frames (mu-law 8k)
    B->>O: input_audio_buffer.append (g711_ulaw)
    O-->>B: response.audio.delta (g711_ulaw)
    B-->>T: media event (base64 mu-law)
    T-->>C: speaks audio
```
Step 1 — TwiML to start the stream
When Twilio receives a call, it hits your webhook for TwiML. Return a <Connect><Stream> pointing at your WebSocket:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-domain.example/media" />
  </Connect>
</Response>
```
<Connect> is bidirectional; <Start> is one-way (caller-to-server only). You almost always want <Connect> for AI agents.
Step 2 — Boot an Express + ws server
```js
import express from "express";
import http from "http";
import { WebSocketServer } from "ws";

const app = express();

// Twilio hits this webhook on an inbound call; answer with TwiML.
// PUBLIC_HOST is an assumed env var holding your tunnel hostname (e.g. from cloudflared).
app.post("/incoming", (_, res) => {
  res.type("text/xml").send(`<Response>
  <Connect><Stream url="wss://${process.env.PUBLIC_HOST}/media" /></Connect>
</Response>`);
});

const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: "/media" });
server.listen(8080);
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Connect to OpenAI Realtime per call
For each Twilio WebSocket, open a paired OpenAI WebSocket. Use the g711_ulaw audio format on both sides — OpenAI accepts mu-law natively, so no transcoding required.
```js
import WebSocket from "ws";

const URL =
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03";

wss.on("connection", (twilio) => {
  let streamSid = null;

  const ai = new WebSocket(URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });

  // Configure the session as soon as the OpenAI socket opens
  ai.on("open", () =>
    ai.send(JSON.stringify({
      type: "session.update",
      session: {
        voice: "alloy",
        input_audio_format: "g711_ulaw",
        output_audio_format: "g711_ulaw",
        turn_detection: { type: "server_vad", threshold: 0.5 },
        instructions:
          "You are CallSphere, a friendly receptionist. Keep replies under 2 sentences.",
      },
    })),
  );
```
Step 4 — Pipe Twilio audio into OpenAI
```js
  twilio.on("message", (raw) => {
    const m = JSON.parse(raw.toString());
    if (m.event === "start") streamSid = m.start.streamSid;
    if (m.event === "media" && ai.readyState === WebSocket.OPEN) {
      ai.send(JSON.stringify({
        type: "input_audio_buffer.append",
        audio: m.media.payload, // already base64 mu-law
      }));
    }
    if (m.event === "stop") ai.close();
  });
```
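A detail that trips people up: because the session uses server_vad, OpenAI commits the input buffer and starts a response automatically when the caller stops talking — which is why Step 4 never calls commit. If you turn detection off (push-to-talk style), you end the turn yourself; a sketch:

```js
// Only needed when session.turn_detection is null (manual turn-taking):
ai.send(JSON.stringify({ type: "input_audio_buffer.commit" })); // close the caller's turn
ai.send(JSON.stringify({ type: "response.create" }));           // ask the model to respond
```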
Step 5 — Pipe OpenAI audio back to Twilio
```js
  ai.on("message", (raw) => {
    const e = JSON.parse(raw.toString());
    if (e.type === "response.audio.delta" && streamSid) {
      twilio.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: e.delta }, // base64 mu-law from OpenAI
      }));
    }
    if (e.type === "input_audio_buffer.speech_started") {
      // Caller started talking — clear Twilio's buffer for true barge-in
      twilio.send(JSON.stringify({ event: "clear", streamSid }));
    }
  });
});
```
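The clear event stops playback on the call, but OpenAI may still be streaming the rest of its answer. A common production refinement is to also cancel the in-flight response so you aren't buffering (and paying for) audio you'll never play — a sketch using the Realtime response.cancel event, extending the speech_started branch above:

```js
if (e.type === "input_audio_buffer.speech_started") {
  twilio.send(JSON.stringify({ event: "clear", streamSid })); // stop Twilio playback
  ai.send(JSON.stringify({ type: "response.cancel" }));       // stop generation upstream
}
```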
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Test with a real call
Expose port 8080 with cloudflared tunnel --url http://localhost:8080, paste the URL into your Twilio number's Voice config (HTTP POST to /incoming), and dial. You should hear "Hello, this is CallSphere" within a second. Interrupt the model — it should stop instantly because of the clear event in Step 5.
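If the call connects but stays silent, check the webhook first. A quick sanity check from Node, assuming the bridge is on port 8080 as above:

```js
// Expect TwiML containing <Connect><Stream> back from your own server
fetch("http://localhost:8080/incoming", { method: "POST" })
  .then((r) => r.text())
  .then(console.log);
```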
Common pitfalls
- Wrong audio format: defaulting to pcm16 on either side means double transcoding. Use g711_ulaw end-to-end with Twilio (if you genuinely need PCM16, see the sketch after this list).
- No streamSid on outbound media: Twilio silently drops it. Capture the value from the start event.
- No barge-in: without the clear event, the model keeps talking over the caller. Always wire speech_started.
- One OpenAI socket for many calls: each concurrent call needs its own WS — Realtime sessions are per-conversation.
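For the cases where PCM16 is unavoidable (for example, the 24kHz browser path mentioned in the FAQ), the decode step is small. A minimal sketch of G.711 mu-law to PCM16 with naive 8kHz→24kHz upsampling — fine for experiments, but use a real resampler in production:

```js
// Decode one G.711 mu-law byte to a signed 16-bit PCM sample
function ulawByteToPcm16(b) {
  const u = ~b & 0xff;              // mu-law bytes are stored bit-inverted
  let t = ((u & 0x0f) << 3) + 0x84; // mantissa plus bias
  t <<= (u & 0x70) >> 4;            // apply exponent
  return u & 0x80 ? 0x84 - t : t - 0x84;
}

// Base64 mu-law frame (8kHz) -> base64 PCM16 at 24kHz.
// Naive upsampling repeats each sample 3x: crude but serviceable.
function ulawFrameToPcm24k(base64) {
  const mulaw = Buffer.from(base64, "base64");
  const pcm = Buffer.alloc(mulaw.length * 3 * 2);
  for (let i = 0; i < mulaw.length; i++) {
    const s = ulawByteToPcm16(mulaw[i]);
    for (let j = 0; j < 3; j++) pcm.writeInt16LE(s, (i * 3 + j) * 2);
  }
  return pcm.toString("base64");
}
```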
How CallSphere does this in production
CallSphere's Healthcare receptionist runs the same pattern but at PCM16 24kHz with server VAD threshold 0.55, plus a transcript sidecar that writes every user/assistant turn to Postgres for post-call analytics (sentiment –1.0 to 1.0, lead score 0–100). The Real Estate OneRoof agent uses the OpenAI Agents SDK with a Go gateway and NATS for fan-out. Across 37 production agents, 90+ tools, and 115+ DB tables, this Twilio + Realtime path is the inbound default. Try it on the 14-day trial or see the demo.
FAQ
Does OpenAI Realtime accept mu-law? Yes — set input_audio_format and output_audio_format to g711_ulaw to skip transcoding entirely.
What's the max call length? OpenAI Realtime sessions cap around 30 minutes by default. For longer calls, persist transcript state and re-open a session.
How do I add tools (booking, lookup)? Add a tools array to session.update with JSON Schema; handle response.function_call_arguments.done to execute and reply with a tool result. A sketch follows after this FAQ.
Latency too high? Pin Twilio region close to your bridge, use a server in us-east-1, and avoid streaming through a CDN.
Mu-law or PCM16? Mu-law is fine for telephony fidelity. Use PCM16 24kHz only when the audio path is browser → server → OpenAI.
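To make the tools answer concrete, here's a sketch with a hypothetical book_appointment function (the name and schema are illustrative, not part of the tutorial): register it in session.update, then hand its output back to the model when the arguments arrive:

```js
// 1) Declare the tool inside the session.update payload from Step 3:
const session = {
  tools: [{
    type: "function",
    name: "book_appointment", // hypothetical example tool
    description: "Book an appointment slot for the caller",
    parameters: {
      type: "object",
      properties: { date: { type: "string" }, time: { type: "string" } },
      required: ["date", "time"],
    },
  }],
  tool_choice: "auto",
};

// 2) Execute the call and return the result so the model can speak it:
ai.on("message", (raw) => {
  const e = JSON.parse(raw.toString());
  if (e.type === "response.function_call_arguments.done") {
    const args = JSON.parse(e.arguments);
    // ...run your booking logic with args.date / args.time, then:
    ai.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: e.call_id,
        output: JSON.stringify({ ok: true, slot: `${args.date} ${args.time}` }),
      },
    }));
    ai.send(JSON.stringify({ type: "response.create" })); // have the model announce the result
  }
});
```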
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.