
Replace Vapi with Twilio Media Streams + GPT-4o Realtime

Vapi orchestrates STT, LLM, and TTS as separate services — fast to ship, but you pay three vendor markups. Collapse the stack to Twilio + GPT-4o Realtime and own the orchestration.

TL;DR — Vapi is a great prototyping layer (5-minute "hello phone"), but at scale the orchestration cost and the inability to share state across calls become painful. Move to Twilio Media Streams + GPT-4o Realtime, keep the same assistant config in code, and cut per-minute cost by ~30%.

What you'll build

A Node.js service that exposes the same surface as a Vapi assistant (POST /assistant/start, server-tool webhooks, transcript stream) but speaks directly to GPT-4o Realtime. Your existing Vapi tool URLs keep working with a thin shim.

Prerequisites

  1. Vapi account with at least one assistant in production and 30+ days of call.ended webhooks logged.
  2. Twilio account with a programmable voice number.
  3. OpenAI API key with Realtime access.
  4. Node.js 22+, ws, fastify, twilio.
  5. A diff tool — you'll be comparing tool-call sequences.

Architecture

```mermaid
flowchart LR
  C[Caller] -->|PSTN| TW[Twilio Number]
  TW -->|Media Streams WSS| SH[Shim Service]
  SH -->|WSS Realtime| OAI[GPT-4o Realtime]
  SH -->|HTTPS webhooks| TOOLS[Your existing Vapi tool URLs]
```

Step 1 — Export the Vapi assistant config

```bash
curl -H "Authorization: Bearer $VAPI_KEY" \
  https://api.vapi.ai/assistant/$ID > assistant.json
```

You only need three sections: model.messages (system prompt), model.tools (function definitions), and voice settings.
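
For orientation, the trimmed shape looks roughly like this — field names taken from the sections above, values purely illustrative:

```js
// Illustrative only — a real export carries many more keys you can ignore.
const assistant = {
  model: {
    messages: [{ role: "system", content: "You are the clinic's booking assistant…" }],
    tools: [
      { type: "function", function: { name: "check_availability", description: "…", parameters: {} } },
    ],
  },
  voice: { voiceId: "rachel" },
};
```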

Step 2 — Translate to OpenAI session.update

```js
import fs from "node:fs";

const v = JSON.parse(fs.readFileSync("assistant.json"));

const sessionUpdate = {
  type: "session.update",
  session: {
    // System prompt carries over verbatim.
    instructions: v.model.messages.find((m) => m.role === "system").content,
    // Map your Vapi voice to the closest OpenAI voice.
    voice: v.voice?.voiceId === "rachel" ? "shimmer" : "alloy",
    // Twilio Media Streams speaks 8 kHz G.711 µ-law in both directions.
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
    turn_detection: { type: "server_vad", silence_duration_ms: 320 },
    tools: v.model.tools.map((t) => ({
      type: "function",
      name: t.function.name,
      description: t.function.description,
      parameters: t.function.parameters,
    })),
  },
};
```


Step 3 — Wire Twilio Media Streams to OpenAI

```js
import Fastify from "fastify";
import websocket from "@fastify/websocket";
import WebSocket from "ws";

const app = Fastify();
await app.register(websocket);

app.get("/media", { websocket: true }, (conn) => {
  const oai = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, "OpenAI-Beta": "realtime=v1" } },
  );

  // Push the Step 2 session config as soon as the Realtime socket opens.
  oai.on("open", () => oai.send(JSON.stringify(sessionUpdate)));

  // Twilio requires the streamSid on every frame we send back; grab it from "start".
  let streamSid;

  // Twilio -> OpenAI: forward the caller's µ-law audio.
  conn.socket.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.event === "start") streamSid = ev.start.streamSid;
    if (ev.event === "media") {
      oai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: ev.media.payload }));
    }
  });

  // OpenAI -> Twilio: forward synthesized audio deltas back to the caller.
  oai.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.type === "response.audio.delta") {
      conn.socket.send(JSON.stringify({ event: "media", streamSid, media: { payload: ev.delta } }));
    }
  });
});

await app.listen({ port: 8080, host: "0.0.0.0" });
```

Step 4 — Forward tool calls to your existing Vapi webhooks

Vapi posts `{ message: { toolCalls: [{ function: { name, arguments } }] } }`. Mirror that envelope so endpoints don't change:

```js
async function handleToolCall(name, args, callId) {
  // Same per-tool URL scheme as before, e.g. TOOL_URL_CHECK_AVAILABILITY.
  const url = process.env[`TOOL_URL_${name.toUpperCase()}`];
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      message: {
        toolCalls: [
          { id: callId, type: "function", function: { name, arguments: JSON.stringify(args) } },
        ],
      },
    }),
  });
  const data = await res.json();
  // Vapi-style tools answer { results: [{ result }] }; fall back to the raw body.
  return data.results?.[0]?.result ?? data;
}
```
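
handleToolCall still needs to be triggered from the Realtime socket. A hedged sketch, assuming the beta event names (response.function_call_arguments.done when the model finishes a call, function_call_output to hand back the result); the oai.on("message") handler from Step 3 becomes async:

```js
// Inside the (now async) oai.on("message") handler from Step 3 — sketch only.
if (ev.type === "response.function_call_arguments.done") {
  const result = await handleToolCall(ev.name, JSON.parse(ev.arguments), ev.call_id);
  // Hand the tool result back to the model as a conversation item...
  oai.send(JSON.stringify({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: ev.call_id, output: JSON.stringify(result) },
  }));
  // ...then ask it to keep talking with that result in context.
  oai.send(JSON.stringify({ type: "response.create" }));
}
```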

Step 5 — Stream transcripts to your warehouse

OpenAI emits response.audio_transcript.done for the assistant side; enable input transcription to capture the caller side too, then write both to BigQuery or Postgres for the same dashboards Vapi gave you.
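
A minimal sink sketch, assuming a Postgres pool (pg) and a transcripts(call_id, role, text) table of your own design:

```js
import pg from "pg";

const pool = new pg.Pool(); // reads the standard PG* env vars

// Hedged sketch: call from the oai.on("message") handler with each parsed event.
async function recordTranscript(ev, callId) {
  // Assistant side — emitted after each spoken response.
  if (ev.type === "response.audio_transcript.done") {
    await pool.query(
      "INSERT INTO transcripts (call_id, role, text) VALUES ($1, $2, $3)",
      [callId, "assistant", ev.transcript],
    );
  }
  // Caller side — only fires if Step 2's session.update also sets
  // input_audio_transcription: { model: "whisper-1" }.
  if (ev.type === "conversation.item.input_audio_transcription.completed") {
    await pool.query(
      "INSERT INTO transcripts (call_id, role, text) VALUES ($1, $2, $3)",
      [callId, "caller", ev.transcript],
    );
  }
}
```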

Step 6 — Smoke-test against the Vapi reference

Hit the same DID through both stacks 50 times and diff tool-call order. Anything > 5% drift means your prompt is too implicit — add explicit tool-use guardrails.
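
One way to score that drift, assuming you logged each call's tool names in order (the log shape here is hypothetical):

```js
// vapiRuns / directRuns: arrays of tool-name sequences, one per paired call,
// e.g. [["check_availability", "book_appointment"], ...].
function toolCallDrift(vapiRuns, directRuns) {
  let mismatched = 0;
  for (let i = 0; i < vapiRuns.length; i++) {
    if (vapiRuns[i].join(" > ") !== (directRuns[i] ?? []).join(" > ")) mismatched++;
  }
  return mismatched / vapiRuns.length; // > 0.05 means tighten the prompt
}
```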


Step 7 — Cut over per-DID

Twilio lets you point each number at a different webhook. Move one number a day for a week.
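
The repoint itself is one REST call per number. With the twilio SDK (PHONE_SID identifies the DID; SHIM_HOST is the placeholder from Step 3):

```js
import twilio from "twilio";

const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

// Repoint one DID at the shim's TwiML endpoint; rollback is writing the old URL back.
await client.incomingPhoneNumbers(process.env.PHONE_SID).update({
  voiceUrl: `https://${process.env.SHIM_HOST}/voice`,
});
```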

Common pitfalls

  • Tool schemas with oneOf. OpenAI Realtime is stricter than Vapi — flatten unions (see the sketch after this list).
  • Voice mismatch. Pre-record a 10-second comparison clip per voice to avoid customer surprise.
  • Silent failures. Always log response.error events; Vapi hid these.
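
For the oneOf pitfall, the usual flattening is to merge the union branches into one object with a discriminator field the prompt explains. An illustrative sketch, assuming a union of single-date and date-range payloads:

```js
// Before (fine in Vapi, rejected by Realtime):
//   parameters: { oneOf: [ /* single-date shape */, /* date-range shape */ ] }
// After: one flat object plus a discriminator.
const parameters = {
  type: "object",
  properties: {
    kind: { type: "string", enum: ["single_date", "date_range"] },
    date: { type: "string", description: "ISO date; set when kind=single_date" },
    start: { type: "string", description: "ISO date; set when kind=date_range" },
    end: { type: "string", description: "ISO date; set when kind=date_range" },
  },
  required: ["kind"],
};
```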

How CallSphere does this in production

CallSphere does not use Vapi or any orchestration vendor — every voice path is direct OpenAI Realtime, ElevenLabs, or self-hosted Whisper, glued together by a CallSphere-owned dispatcher: 37 specialist agents, 90+ tools, 115+ DB tables. Healthcare runs FastAPI on :8084 with HIPAA logging; OneRoof Property dispatches across 10 specialists over WebRTC + Pion + NATS; Salon ships ElevenLabs with GB-YYYYMMDD-### booking IDs. Try the demo or compare on /compare/vapi.

FAQ

Does Vapi block exports? No — assistants and tools are JSON-exportable.

What about Vapi's analytics? Replace with Postgres + Metabase or Honeycomb; richer and cheaper at scale.

Can I keep Vapi for prototyping? Yes — many teams prototype on Vapi and ship on the direct stack.

Latency parity? OpenAI Realtime hits 600–800ms; Vapi runs 750–1100ms. Direct usually wins.

Cost at 50k min/mo? Vapi: ~$5,250 (≈$0.105/min). Direct: ~$3,400 (≈$0.068/min) plus Twilio's per-minute PSTN charges.

