Replace Vapi with Twilio Media Streams + GPT-4o Realtime
Vapi orchestrates STT, LLM and TTS as separate services — fast, but you pay 3 vendor markups. Collapse the stack to Twilio + GPT-4o Realtime and own the orchestration.
TL;DR — Vapi is a great prototyping layer (5-minute "hello phone"), but at scale the orchestration cost and the inability to share state across calls become painful. Move to Twilio Media Streams + GPT-4o Realtime, keep the same assistant config in code, and cut per-minute cost by roughly 30%.
What you'll build
A Node.js service that exposes the same surface as a Vapi assistant (POST /assistant/start, server-tool webhooks, transcript stream) but speaks directly to GPT-4o Realtime. Your existing Vapi tool URLs keep working with a thin shim.
Prerequisites
- Vapi account with at least one assistant in production and 30+ days of `call.ended` webhooks logged.
- Twilio account with a programmable voice number.
- OpenAI API key with Realtime access.
- Node.js 22+, `ws`, `fastify`, `twilio`.
- A diff tool — you'll be comparing tool-call sequences.
Architecture
```mermaid
flowchart LR
  C[Caller] -->|PSTN| TW[Twilio Number]
  TW -->|Media Streams WSS| SH[Shim Service]
  SH -->|WSS Realtime| OAI[GPT-4o Realtime]
  SH -->|HTTPS webhooks| TOOLS[Your existing Vapi tool URLs]
```
Step 1 — Export the Vapi assistant config
```bash
curl -H "Authorization: Bearer $VAPI_KEY" \
  https://api.vapi.ai/assistant/$ID > assistant.json
```
You only need three sections: `model.messages` (system prompt), `model.tools` (function definitions), and the voice settings.
Step 2 — Translate to OpenAI session.update
```js
import fs from "node:fs";

const v = JSON.parse(fs.readFileSync("assistant.json"));

const sessionUpdate = {
  type: "session.update",
  session: {
    instructions: v.model.messages.find((m) => m.role === "system").content,
    voice: v.voice?.voiceId === "rachel" ? "shimmer" : "alloy",
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
    turn_detection: { type: "server_vad", silence_duration_ms: 320 },
    tools: v.model.tools.map((t) => ({
      type: "function",
      name: t.function.name,
      description: t.function.description,
      parameters: t.function.parameters,
    })),
  },
};
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Wire Twilio Media Streams to OpenAI
```js
import Fastify from "fastify";
import websocket from "@fastify/websocket";
import WebSocket from "ws";

const app = Fastify();
await app.register(websocket);

app.get("/media", { websocket: true }, (conn) => {
  let streamSid; // Twilio rejects outbound media frames without this
  const oai = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                 "OpenAI-Beta": "realtime=v1" } },
  );
  oai.on("open", () => oai.send(JSON.stringify(sessionUpdate)));

  // Twilio -> OpenAI: forward caller audio, capture the stream ID
  conn.socket.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.event === "start") streamSid = ev.start.streamSid;
    if (ev.event === "media")
      oai.send(JSON.stringify({ type: "input_audio_buffer.append",
                                audio: ev.media.payload }));
  });

  // OpenAI -> Twilio: forward synthesized audio back to the caller
  oai.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.type === "response.audio.delta")
      conn.socket.send(JSON.stringify({ event: "media", streamSid,
                                        media: { payload: ev.delta } }));
  });
});
```
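One piece of glue the steps above skip: the Twilio number needs a voice webhook that answers the call and opens the Media Stream toward `/media`. A minimal sketch — the `connectStreamTwiml` helper and the `shim.example.com` hostname are ours, substitute your deployed domain:

```javascript
// Build the TwiML that tells Twilio to open a bidirectional Media Stream
// to the shim's /media WebSocket endpoint.
function connectStreamTwiml(host) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    `  <Connect><Stream url="wss://${host}/media" /></Connect>`,
    "</Response>",
  ].join("\n");
}

// In the Fastify app from Step 3, serve it as the number's
// "A call comes in" webhook:
// app.post("/voice", (_req, reply) =>
//   reply.type("text/xml").send(connectStreamTwiml("shim.example.com")));

console.log(connectStreamTwiml("shim.example.com"));
```

`<Connect><Stream>` (as opposed to `<Start><Stream>`) is what makes the stream bidirectional, so the shim can send audio back.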
Step 4 — Forward tool calls to your existing Vapi webhooks
Vapi posts `{ message: { toolCalls: [{ function: { name, arguments } }] } }`. Mirror that envelope so endpoints don't change:
```js
async function handleToolCall(name, args, callId) {
  const url = process.env[`TOOL_URL_${name.toUpperCase()}`];
const res = await fetch(url, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ message: { toolCalls: [{ id: callId,
type: "function", function: { name, arguments: JSON.stringify(args) }}]}}),
});
const data = await res.json();
return data.results?.[0]?.result ?? data;
}
```
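`handleToolCall` still needs to be triggered by the model. A sketch of that glue, assuming the event shapes from the OpenAI Realtime API docs — completed calls arrive as `function_call` items on `response.output_item.done`, and results go back as a `function_call_output` item followed by `response.create` (the helper names are ours):

```javascript
// Detect a completed tool call in a Realtime event.
function extractFunctionCall(ev) {
  if (ev.type !== "response.output_item.done") return null;
  const item = ev.item;
  if (!item || item.type !== "function_call") return null;
  return { name: item.name, args: JSON.parse(item.arguments), callId: item.call_id };
}

// Build the two messages that feed the tool result back to the model.
function toolResultMessages(callId, result) {
  return [
    { type: "conversation.item.create",
      item: { type: "function_call_output", call_id: callId,
              output: JSON.stringify(result) } },
    { type: "response.create" }, // ask the model to speak the result
  ];
}

// In the oai.on("message") handler from Step 3:
// const call = extractFunctionCall(ev);
// if (call) {
//   const result = await handleToolCall(call.name, call.args, call.callId);
//   for (const m of toolResultMessages(call.callId, result))
//     oai.send(JSON.stringify(m));
// }
```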
Step 5 — Stream transcripts to your warehouse
OpenAI emits `response.audio_transcript.done` for the assistant's speech; enable input audio transcription in the session to get the caller's side. Capture both and write to BigQuery or Postgres for the same dashboards Vapi gave you.
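A minimal sketch of the event-to-row mapping — the `transcriptRow` helper is ours, and caller-side text only appears if the Step 2 `session.update` also sets `input_audio_transcription` (e.g. `{ model: "whisper-1" }`):

```javascript
// Map a Realtime transcript event to a warehouse row, or null if the
// event carries no transcript.
function transcriptRow(ev, callSid) {
  if (ev.type === "response.audio_transcript.done")
    return { call_sid: callSid, role: "assistant", text: ev.transcript, ts: Date.now() };
  if (ev.type === "conversation.item.input_audio_transcription.completed")
    return { call_sid: callSid, role: "caller", text: ev.transcript, ts: Date.now() };
  return null;
}

// In the oai.on("message") handler, buffer and flush in batches:
// const row = transcriptRow(ev, callSid);
// if (row) rows.push(row);
```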
Step 6 — Smoke-test against the Vapi reference
Hit the same DID through both stacks 50 times and diff tool-call order. Anything > 5% drift means your prompt is too implicit — add explicit tool-use guardrails.
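"Diff tool-call order" can be as simple as one number: the fraction of test calls whose tool sequence deviates from the Vapi reference. A sketch, assuming you've logged each run's tool names in order:

```javascript
// Fraction of runs whose tool-call sequence differs from the reference
// recorded on Vapi. Above 0.05, tighten the prompt's tool-use guardrails.
function toolCallDrift(referenceRuns, candidateRuns) {
  let drifted = 0;
  for (let i = 0; i < referenceRuns.length; i++) {
    const a = referenceRuns[i].join(">");
    const b = (candidateRuns[i] || []).join(">");
    if (a !== b) drifted++;
  }
  return drifted / referenceRuns.length;
}
```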
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 7 — Cut over per-DID
Twilio lets you point each number at a different webhook. Move one number a day for a week.
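The one-number-a-day schedule can be driven by a tiny helper plus the Twilio SDK's real `incomingPhoneNumbers(sid).update()` call — the `numbersToCutOver` helper and the webhook URL are ours:

```javascript
// Given the rollout start date and the ordered list of number SIDs,
// return the SIDs that should already point at the shim today.
function numbersToCutOver(sids, startDate, today = new Date()) {
  const days = Math.floor((today - startDate) / 86_400_000);
  return sids.slice(0, Math.max(0, days + 1));
}

// Apply with the Twilio SDK:
// import twilio from "twilio";
// const client = twilio(process.env.TWILIO_SID, process.env.TWILIO_TOKEN);
// for (const sid of numbersToCutOver(allSids, start))
//   await client.incomingPhoneNumbers(sid).update({
//     voiceUrl: "https://shim.example.com/voice" });
```

Run it from a daily cron; re-running is idempotent, since already-moved numbers are just updated to the same URL.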
Common pitfalls
- Tool schemas with `oneOf`. OpenAI Realtime is stricter than Vapi — flatten unions.
- Voice mismatch. Pre-record a 10-second comparison clip per voice to avoid customer surprise.
- Silent failures. Always log `response.error` events; Vapi hid these.
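For the common case — a top-level `oneOf` over object branches — "flatten unions" can mean merging the branches into one object schema and letting the system prompt disambiguate. A sketch (the `flattenOneOf` helper is ours, not a library function):

```javascript
// Flatten a parameter schema whose top level is a oneOf of object
// branches into a single object schema. Branch-specific required fields
// become optional; the system prompt must say when to send which.
function flattenOneOf(schema) {
  if (!schema.oneOf) return schema;
  const properties = {};
  for (const branch of schema.oneOf) Object.assign(properties, branch.properties);
  // Keep only fields required in *every* branch.
  const required = (schema.oneOf[0].required || []).filter((key) =>
    schema.oneOf.every((b) => (b.required || []).includes(key)));
  return { type: "object", properties, required };
}
```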
How CallSphere does this in production
CallSphere does not use Vapi or any orchestration vendor — every voice path is direct OpenAI Realtime, ElevenLabs or self-hosted Whisper, glued by a CallSphere-owned dispatcher. 37 specialist agents, 90+ tools, 115+ DB tables. Healthcare runs FastAPI on :8084 with HIPAA logging, OneRoof Property dispatches across 10 specialists over WebRTC + Pion + NATS, Salon ships ElevenLabs with GB-YYYYMMDD-### booking IDs. Try the demo or compare on /compare/vapi.
FAQ
Does Vapi block exports? No — assistants and tools are JSON-exportable.
What about Vapi's analytics? Replace with Postgres + Metabase or Honeycomb; richer and cheaper at scale.
Can I keep Vapi for prototyping? Yes — many teams prototype on Vapi, ship on direct.
Latency parity? OpenAI Realtime hits 600–800ms; Vapi runs 750–1100ms. Direct usually wins.
Cost at 50k min/mo? Vapi: ~$5,250. Direct: ~$3,400 + Twilio.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.