By Sagar Shankaran, Founder of CallSphere
Vapi orchestrates STT, LLM and TTS as separate services — fast, but you pay 3 vendor markups. Collapse the stack to Twilio + GPT-4o Realtime and own the orchestration.
Key takeaways
TL;DR: Vapi is a great prototyping layer (5-minute "hello phone"), but at scale the orchestration cost and the inability to share state across calls become painful. Move to Twilio Media Streams + GPT-4o Realtime, keep the same assistant config in code, and cut roughly 30% of the per-minute cost.
A Node.js service that exposes the same surface as a Vapi assistant (POST /assistant/start, server-tool webhooks, transcript stream) but speaks directly to GPT-4o Realtime. Your existing Vapi tool URLs keep working with a thin shim.
call.ended webhooks logged.
ws, fastify, twilio.
flowchart LR
C[Caller] -->|PSTN| TW[Twilio Number]
TW -->|Media Streams WSS| SH[Shim Service]
SH -->|WSS Realtime| OAI[GPT-4o Realtime]
SH -->|HTTPS webhooks| TOOLS[Your existing Vapi tool URLs]
```bash
curl -H "Authorization: Bearer $VAPI_KEY" \
  https://api.vapi.ai/assistant/$ID > assistant.json
```
You only need three sections: model.messages (system prompt), model.tools (function definitions), and voice settings.
```js
import fs from "node:fs";

// Load the exported Vapi assistant and translate it into a Realtime session.update.
const v = JSON.parse(fs.readFileSync("assistant.json", "utf8"));

const sessionUpdate = {
  type: "session.update",
  session: {
    instructions: v.model.messages.find(m => m.role === "system").content,
    // Vapi voice IDs have no direct Realtime equivalent; pick the closest preset.
    voice: v.voice?.voiceId === "rachel" ? "shimmer" : "alloy",
    // Twilio Media Streams carries 8 kHz mu-law audio, so no transcoding is needed.
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
    turn_detection: { type: "server_vad", silence_duration_ms: 320 },
    tools: v.model.tools.map(t => ({
      type: "function",
      name: t.function.name,
      description: t.function.description,
      parameters: t.function.parameters,
    })),
  },
};
```
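```js
import Fastify from "fastify";
import websocket from "@fastify/websocket";
import WebSocket from "ws";

const app = Fastify();
await app.register(websocket);

app.get("/media", { websocket: true }, (conn) => {
  // @fastify/websocket v10+ passes the socket itself; older versions wrap it in conn.socket.
  const twilio = conn.socket ?? conn;
  let streamSid = null; // Twilio expects this on every outbound media message

  const oai = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
                 "OpenAI-Beta": "realtime=v1" }});
  oai.on("open", () => oai.send(JSON.stringify(sessionUpdate)));
  oai.on("error", (err) => console.error("realtime socket error", err));

  // Twilio -> OpenAI: forward base64 mu-law frames once the upstream socket is open.
  twilio.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.event === "start") streamSid = ev.start.streamSid;
    if (ev.event === "media" && oai.readyState === WebSocket.OPEN)
      oai.send(JSON.stringify({ type: "input_audio_buffer.append",
                                audio: ev.media.payload }));
  });

  // OpenAI -> Twilio: stream synthesized audio back to the caller.
  oai.on("message", (raw) => {
    const ev = JSON.parse(raw);
    if (ev.type === "response.audio.delta" && streamSid)
      twilio.send(JSON.stringify({ event: "media", streamSid,
                                   media: { payload: ev.delta }}));
  });

  // Release the upstream socket when the caller hangs up.
  twilio.on("close", () => oai.close());
});

// The port is an assumption; use whatever your deployment exposes.
await app.listen({ port: Number(process.env.PORT ?? 3000), host: "0.0.0.0" });
```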
Vapi posts {message:{toolCalls:[{function:{name,arguments}}]}}. Mirror that envelope so endpoints don't change:
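```js
// Forwards a Realtime function call to the existing Vapi tool endpoint,
// wrapped in the envelope Vapi would have sent.
async function handleToolCall(name, args, callId) {
  const url = process.env[`TOOL_URL_${name.toUpperCase()}`];
  if (!url) throw new Error(`No TOOL_URL_${name.toUpperCase()} configured`);
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message: { toolCalls: [{ id: callId,
      type: "function", function: { name, arguments: JSON.stringify(args) }}]}}),
  });
  if (!res.ok) throw new Error(`Tool ${name} returned HTTP ${res.status}`);
  const data = await res.json();
  // Vapi-style responses nest the value under results[0].result.
  return data.results?.[0]?.result ?? data;
}
```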
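The shim still has to be connected to the Realtime event stream. The following is a minimal sketch, assuming the beta Realtime event shapes: a finished function call arrives as a `response.output_item.done` event whose `item.type` is `function_call`, and the result goes back as a `function_call_output` item followed by `response.create`. The helper names `parseFunctionCall` and `buildToolResultEvents` are hypothetical, not part of any SDK.

```js
// Hypothetical helpers; event shapes assume the OpenAI Realtime beta API.

// Returns { name, args, callId } for a finished function call, else null.
function parseFunctionCall(ev) {
  if (ev.type !== "response.output_item.done") return null;
  const item = ev.item;
  if (!item || item.type !== "function_call") return null;
  let args = {};
  try {
    args = JSON.parse(item.arguments || "{}");
  } catch {
    // Malformed arguments: fall back to an empty object rather than crash the call.
  }
  return { name: item.name, args, callId: item.call_id };
}

// Builds the two events that hand a tool result back to the model.
function buildToolResultEvents(callId, result) {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: callId,
        output: typeof result === "string" ? result : JSON.stringify(result),
      },
    },
    { type: "response.create" }, // ask the model to continue with the result
  ];
}
```

Inside the `oai.on("message")` handler, one way to use these is to call `parseFunctionCall(ev)`, pass a non-null result to `handleToolCall(call.name, call.args, call.callId)`, and send each event from `buildToolResultEvents` back over the OpenAI socket.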
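OpenAI emits response.audio_transcript.done for the assistant's speech. The caller's side arrives as conversation.item.input_audio_transcription.completed, and only when input_audio_transcription is enabled in the session. Capture both and write them to BigQuery or Postgres for the same dashboards Vapi gave you.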
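As a rough sketch of the capture step, the function below turns transcript events into flat rows ready for an insert. It assumes the beta Realtime events carry a `transcript` string and an `item_id`, and that caller-side transcripts come from `conversation.item.input_audio_transcription.completed`. The function name `toTranscriptRow` and the row columns are hypothetical; adapt them to your schema.

```js
// Hypothetical helper: maps Realtime transcript events to flat rows.
// Assumes the session enables input_audio_transcription so caller-side
// transcripts are emitted at all.
const SPEAKER_BY_EVENT = new Map([
  ["response.audio_transcript.done", "assistant"],
  ["conversation.item.input_audio_transcription.completed", "caller"],
]);

function toTranscriptRow(ev, callSid, now = new Date()) {
  const speaker = SPEAKER_BY_EVENT.get(ev.type);
  if (!speaker || typeof ev.transcript !== "string") return null;
  return {
    call_sid: callSid,          // Twilio call identifier, used as the join key
    speaker,                    // "assistant" or "caller"
    text: ev.transcript.trim(),
    item_id: ev.item_id ?? null,
    at: now.toISOString(),      // receive time, not utterance time
  };
}
```

Rows returned as `null` are events that carry no transcript and can be skipped before writing to the database.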
Call the same DID through both stacks 50 times and diff the tool-call order. Drift above 5% means your prompt is too implicit; add explicit tool-use guardrails.
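The article does not define how drift is measured, so the sketch below uses one plausible definition: the fraction of paired test calls whose ordered tool-call sequence differs between the two stacks. It assumes run `i` on each stack used the same test script. The function name `toolOrderDrift` is hypothetical.

```js
// Hypothetical helper: fraction of paired test calls whose tool-call order differs.
// Each argument is an array of calls; each call is an array of tool names in
// the order they were invoked. Runs are assumed to be paired by index.
function toolOrderDrift(vapiRuns, directRuns) {
  const n = Math.min(vapiRuns.length, directRuns.length);
  if (n === 0) return 0;
  let mismatches = 0;
  for (let i = 0; i < n; i++) {
    // JSON.stringify gives an unambiguous comparison of ordered name lists.
    if (JSON.stringify(vapiRuns[i]) !== JSON.stringify(directRuns[i])) mismatches++;
  }
  return mismatches / n;
}
```

With 50 paired calls, a return value above 0.05 corresponds to three or more calls with a different tool order.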
Twilio lets you point each number at a different webhook. Move one number a day for a week.
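For each number you move, the voice webhook can return TwiML that connects the call to the shim's `/media` WebSocket through `<Connect><Stream>`. A minimal sketch follows; the helper name `streamTwiml` and the host `shim.example.com` are placeholders, not values from this article.

```js
// Hypothetical helper: TwiML that connects an inbound call to the shim's
// /media WebSocket for a bidirectional Media Stream.
function streamTwiml(wssUrl) {
  // Escape characters that would break the XML attribute.
  const escaped = wssUrl
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Connect><Stream url="${escaped}" /></Connect></Response>`
  );
}

// Example: the body a migrated number's voice webhook would return.
const twiml = streamTwiml("wss://shim.example.com/media");
```

Serve the string with a `text/xml` content type. Numbers still on Vapi keep their existing webhook, which makes a per-number rollback a one-field change.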
oneOf. OpenAI Realtime is stricter than Vapi; flatten unions.
response.error events; Vapi hid these.
CallSphere does not use Vapi or any orchestration vendor: every voice path is direct OpenAI Realtime, ElevenLabs or self-hosted Whisper, glued by a CallSphere-owned dispatcher. 37 specialist agents, 90+ tools, 115+ DB tables. Healthcare runs FastAPI on :8084 with HIPAA logging, OneRoof Property dispatches across 10 specialists over WebRTC + Pion + NATS, Salon ships ElevenLabs with GB-YYYYMMDD-### booking IDs. Try the demo or compare on /compare/vapi.
Does Vapi block exports? No — assistants and tools are JSON-exportable.
What about Vapi's analytics? Replace with Postgres + Metabase or Honeycomb; richer and cheaper at scale.
Can I keep Vapi for prototyping? Yes — many teams prototype on Vapi, ship on direct.
Latency parity? OpenAI Realtime hits 600–800ms; Vapi runs 750–1100ms. Direct usually wins.
Cost at 50k min/mo? Vapi: ~$5,250. Direct: ~$3,400 + Twilio.
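The cost figures above work out as follows. This is plain arithmetic on the numbers quoted in the FAQ; Twilio charges are left out because the article does not state them, so the real saving is smaller than the percentage printed here.

```js
// Figures quoted in the FAQ; Twilio charges are excluded (not stated).
const minutes = 50_000;
const vapiMonthly = 5250;
const directMonthly = 3400; // before Twilio

const perMinute = (monthly) => monthly / minutes;
const savingsPct = ((vapiMonthly - directMonthly) / vapiMonthly) * 100;

console.log(perMinute(vapiMonthly).toFixed(3));   // 0.105
console.log(perMinute(directMonthly).toFixed(3)); // 0.068
console.log(savingsPct.toFixed(1));               // 35.2
```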
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.