---
title: "How to Build a Voice AI Agent in 50 Lines: Twilio + OpenAI Realtime"
description: "Wire Twilio Media Streams to OpenAI Realtime in under 50 lines of Node.js. Real working code, mu-law to PCM16 transcoding, server VAD, barge-in, and production tips."
canonical: https://callsphere.ai/blog/vw1h-build-voice-ai-agent-twilio-openai-realtime-50-lines
category: "AI Voice Agents"
tags: ["Tutorial", "Build", "Twilio", "OpenAI Realtime", "Node.js"]
author: "CallSphere Team"
published: 2026-03-15T00:00:00.000Z
updated: 2026-05-07T06:44:59.522Z
---

# How to Build a Voice AI Agent in 50 Lines: Twilio + OpenAI Realtime

> Wire Twilio Media Streams to OpenAI Realtime in under 50 lines of Node.js. Real working code, mu-law to PCM16 transcoding, server VAD, barge-in, and production tips.

> **TL;DR** — A Twilio inbound number, a Node.js WebSocket bridge, and the `gpt-4o-realtime-preview-2025-06-03` model are all you need for a sub-800ms voice agent. The whole bridge fits in 50 lines if you keep it tight.

## What you'll build

A working inbound voice agent: a caller dials your Twilio number, Twilio opens a bidirectional Media Stream to your Node.js server, and your server pipes audio to OpenAI Realtime and back. You'll hear the model speak with natural turn-taking, barge-in interruption, and server-side voice activity detection. Total round-trip latency lands between 600ms and 900ms on a US east-coast box.

## Prerequisites

1. Twilio account with one purchased phone number ($1.15/mo).
2. OpenAI API key with Realtime access (`gpt-4o-realtime-preview-2025-06-03`).
3. Node.js 20+ and a public HTTPS endpoint (use `cloudflared tunnel` for dev).
4. `npm install ws express` — that's it for deps.
5. Familiarity with mu-law 8kHz audio (Twilio) vs PCM16 24kHz (OpenAI).
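
You won't actually transcode in this build (OpenAI accepts `g711_ulaw` natively, as Step 3 shows), but it helps to know what a mu-law byte encodes. A reference sketch of the standard G.711 mu-law-to-PCM16 expansion:

```javascript
// Standard G.711 mu-law decode: one 8-bit byte expands to a 16-bit sample.
// Reference only; the bridge below never needs this, because both Twilio and
// OpenAI speak g711_ulaw end-to-end.
function ulawToPcm16(byte) {
  const u = ~byte & 0xff;          // mu-law bytes are stored bit-complemented
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -sample : sample;  // range: -32124 .. 32124
}
```

`0xFF` decodes to 0 (silence) and `0x00` to -32124, the loudest negative sample; the logarithmic spacing is why mu-law keeps fine resolution near silence at only 8 bits per sample.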

## Architecture

```mermaid
sequenceDiagram
  participant C as Caller (PSTN)
  participant T as Twilio
  participant B as Bridge (Node.js)
  participant O as OpenAI Realtime
  C->>T: Dials number
  T->>B: HTTP POST /incoming (TwiML)
  B-->>T: TwiML (Connect + Stream)
  T->>B: WS open + start event
  T->>B: media frames (mu-law 8k)
  B->>O: input_audio_buffer.append (g711_ulaw)
  O-->>B: response.audio.delta (g711_ulaw)
  B-->>T: media event (base64 mu-law)
  T-->>C: speaks audio
```

## Step 1 — TwiML to start the stream

When Twilio receives a call, it hits your webhook for TwiML. Return a `<Connect><Stream>` pointing at your WebSocket:

```xml
<Response>
  <Connect>
    <Stream url="wss://YOUR_DOMAIN/media" />
  </Connect>
</Response>
```

`<Connect><Stream>` is bidirectional; `<Start><Stream>` is one-way (caller-to-server only). You almost always want `<Connect>` for AI agents.

## Step 2 — Boot an Express + ws server

```js
import express from "express";
import { WebSocketServer } from "ws";
import http from "http";

const app = express();
app.post("/incoming", (_, res) => {
  res.type("text/xml").send(`<Response>
  <Connect>
    <Stream url="wss://YOUR_DOMAIN/media" />
  </Connect>
</Response>`);
});
const server = http.createServer(app);
const wss = new WebSocketServer({ server, path: "/media" });
server.listen(8080);
```

## Step 3 — Connect to OpenAI Realtime per call

For each Twilio WebSocket, open a paired OpenAI WebSocket. Use the `g711_ulaw` audio format on both sides — OpenAI accepts mu-law natively, so no transcoding required.

```js
import WebSocket from "ws";
const URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03";

wss.on("connection", (twilio) => {
  let streamSid = null;
  const ai = new WebSocket(URL, {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  });

  ai.on("open", () => ai.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "alloy",
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      turn_detection: { type: "server_vad", threshold: 0.5 },
      instructions: "You are CallSphere, a friendly receptionist. Keep replies under 2 sentences."
    }
  })));
```
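
Note a gap in the setup above: with `server_vad`, the model waits for the caller to speak before it responds. To have the agent greet first, send a `response.create` event right after `session.update`. A minimal sketch (the helper name and greeting text are ours, not part of the API):

```javascript
// Builds a response.create event that asks the model to speak immediately,
// without waiting for caller audio. Send it on the OpenAI socket right after
// the session.update above, inside the same "open" handler.
function greetingEvent() {
  return JSON.stringify({
    type: "response.create",
    response: {
      instructions: 'Greet the caller: "Hello, this is CallSphere. How can I help?"',
    },
  });
}
```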

## Step 4 — Pipe Twilio audio into OpenAI

```js
  twilio.on("message", (raw) => {
    const m = JSON.parse(raw.toString());
    if (m.event === "start") streamSid = m.start.streamSid;
    if (m.event === "media" && ai.readyState === 1) {
      ai.send(JSON.stringify({
        type: "input_audio_buffer.append",
        audio: m.media.payload, // already base64 mu-law
      }));
    }
    if (m.event === "stop") ai.close();
  });
```
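
The handler above is easy to unit-test if you factor the forwarding decision into a pure function. A sketch mirroring the same logic (the function name is ours):

```javascript
// Maps a parsed Twilio Media Stream event to the OpenAI Realtime event that
// should be forwarded, or null for events handled as side effects
// ("start" captures streamSid, "stop" closes the socket).
function twilioToOpenAI(m) {
  if (m.event === "media") {
    return { type: "input_audio_buffer.append", audio: m.media.payload };
  }
  return null;
}
```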

## Step 5 — Pipe OpenAI audio back to Twilio

```js
  ai.on("message", (raw) => {
    const e = JSON.parse(raw.toString());
    if (e.type === "response.audio.delta" && streamSid) {
      twilio.send(JSON.stringify({
        event: "media",
        streamSid,
        media: { payload: e.delta }, // base64 mu-law from OpenAI
      }));
    }
    if (e.type === "input_audio_buffer.speech_started") {
      // Caller started talking — clear Twilio buffer for true barge-in
      twilio.send(JSON.stringify({ event: "clear", streamSid }));
    }
  });
});
```
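
The reverse direction factors the same way. A testable sketch of Step 5's routing, including the barge-in branch (again, the function name is ours):

```javascript
// Maps an OpenAI Realtime event to the Twilio Media Stream message to send,
// or null when there is nothing to relay. Returns null before the "start"
// event arrives, since Twilio drops outbound media without a streamSid.
function openAIToTwilio(e, streamSid) {
  if (!streamSid) return null;
  if (e.type === "response.audio.delta") {
    return { event: "media", streamSid, media: { payload: e.delta } };
  }
  if (e.type === "input_audio_buffer.speech_started") {
    return { event: "clear", streamSid }; // barge-in: flush queued audio
  }
  return null;
}
```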

## Step 6 — Test with a real call

Expose port 8080 with `cloudflared tunnel --url http://localhost:8080`, paste the URL into your Twilio number's Voice config (HTTP POST to `/incoming`), and dial. Say hello: with `server_vad`, the model answers once it detects your speech, and the reply should land within a second. (To make the agent greet first instead of waiting, send a `response.create` event right after `session.update`.) Interrupt the model mid-sentence; it should stop instantly because of the `clear` event in Step 5.

## Common pitfalls

- **Wrong audio format**: defaulting to `pcm16` on either side means double transcoding. Use `g711_ulaw` end-to-end with Twilio.
- **No `streamSid` on outbound media**: Twilio silently drops it. Capture the value from the `start` event.
- **No barge-in**: without the `clear` event, the model keeps talking over the caller. Always wire `speech_started`.
- **One OpenAI socket for many calls**: each concurrent call needs its own WS — Realtime sessions are per-conversation.

## How CallSphere does this in production

CallSphere's Healthcare receptionist runs the same pattern but at PCM16 24kHz with server VAD threshold 0.55, plus a transcript sidecar that writes every user/assistant turn to Postgres for post-call analytics (sentiment –1.0 to 1.0, lead score 0–100). The Real Estate OneRoof agent uses the OpenAI Agents SDK with a Go gateway and NATS for fan-out. Across 37 production agents, 90+ tools, and 115+ DB tables, this Twilio + Realtime path is the inbound default. [Try it on the 14-day trial](/trial) or [see the demo](/demo).

## FAQ

**Does OpenAI Realtime accept mu-law?** Yes — set `input_audio_format` and `output_audio_format` to `g711_ulaw` to skip transcoding entirely.

**What's the max call length?** OpenAI Realtime sessions cap around 30 minutes by default. For longer calls, persist transcript state and re-open a session.

**How do I add tools (booking, lookup)?** Add a `tools` array to `session.update` with JSON Schema; handle `response.function_call_arguments.done` to execute and reply with a tool result.
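
As a concrete sketch of that answer, here is one hypothetical booking tool declared in `session.update` (the tool name and parameters are illustrative, not CallSphere's actual toolset):

```javascript
// A session.update payload declaring one function tool with JSON Schema
// parameters. The model calls it by name; you execute the function and send
// the result back as a conversation item before requesting another response.
const sessionUpdateWithTools = {
  type: "session.update",
  session: {
    tools: [
      {
        type: "function",
        name: "book_appointment",
        description: "Book an appointment slot for the caller",
        parameters: {
          type: "object",
          properties: {
            date: { type: "string", description: "ISO 8601 date, e.g. 2026-03-15" },
            time: { type: "string", description: "24-hour HH:MM local time" },
          },
          required: ["date", "time"],
        },
      },
    ],
  },
};
```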

**Latency too high?** Pin Twilio region close to your bridge, use a server in `us-east-1`, and avoid streaming through a CDN.

**Mu-law or PCM16?** Mu-law is fine for telephony fidelity. Use PCM16 24kHz only when the audio path is browser → server → OpenAI.

## Sources

- [OpenAI Realtime API guide](https://platform.openai.com/docs/guides/realtime)
- [Twilio Media Streams WebSocket messages](https://www.twilio.com/docs/voice/media-streams/websocket-messages)
- [Twilio TwiML Stream verb](https://www.twilio.com/docs/voice/twiml/stream)
- [OpenAI Realtime model card](https://developers.openai.com/api/docs/models/gpt-4o-realtime-preview)
- [Twilio + OpenAI Realtime tutorial (Python)](https://www.twilio.com/en-us/blog/voice-ai-assistant-openai-realtime-api-python)

---

Source: https://callsphere.ai/blog/vw1h-build-voice-ai-agent-twilio-openai-realtime-50-lines
