By Sagar Shankaran, Founder of CallSphere
On May 4 2026 OpenAI published its Realtime stack rebuild — split-relay plus transceiver edge. Here is what changed and what it means for production voice agents.
flowchart LR
User --> Edge[Cloudflare Edge]
Edge --> WS[(WebSocket Bridge)]
WS --> LLM[OpenAI Realtime gpt-4o]
LLM --> Tool[Tool Call]
Tool --> CRM[(CRM API)]
Tool --> EHR[(EHR API)]
LLM --> User
On May 4, 2026, OpenAI's engineering team published "Delivering low-latency voice AI at scale," documenting how they rearchitected the Realtime API's WebRTC stack to serve 900M+ weekly active ChatGPT users plus the developer Realtime API.
The headline change is a split between two services. A thin edge transceiver terminates the client WebRTC connection — owning ICE, DTLS, SRTP keys, and session lifecycle — and converts media into a simpler internal protocol. A separate relay layer carries that internal protocol over a small set of stable UDP addresses to the inference, transcription, TTS, tool, and orchestration services in the data center.
Why this matters: in classic WebRTC, the model server has to speak the full protocol (jitter buffers, NACK, FEC, RED, congestion control) and live next to the user. In OpenAI's split design, only the edge service speaks WebRTC; the model can sit in the cheapest, hottest GPU pool on the planet without sacrificing the conversational feel.
The April 27 2026 release of the Symphony engineering spec made this pattern public: the transceiver is the only service that owns full WebRTC state.
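The split described above can be made concrete with a small sketch. It assumes a hypothetical internal frame format (session id, sequence number, timestamp, payload length); OpenAI's post does not publish its internal protocol, so the `InternalFrame` name and the header layout here are illustrative only. The point is that once the edge has terminated WebRTC, what travels to the inference tier can be a flat datagram with no ICE, DTLS, or SRTP state attached.

```python
import struct
from dataclasses import dataclass

# Hypothetical header layout (an assumption, not OpenAI's format):
# session id (u64), sequence number (u32), timestamp in ms (u32), payload length (u16).
# "!" selects network byte order with no padding, so the header is 18 bytes.
_HEADER = struct.Struct("!QIIH")


@dataclass(frozen=True)
class InternalFrame:
    """One media frame as the edge might hand it to the relay layer."""

    session_id: int
    seq: int
    timestamp_ms: int
    payload: bytes  # e.g. one Opus packet, already decrypted at the edge

    def pack(self) -> bytes:
        """Serialize header plus payload into a single UDP-sized datagram."""
        header = _HEADER.pack(
            self.session_id, self.seq, self.timestamp_ms, len(self.payload)
        )
        return header + self.payload

    @classmethod
    def unpack(cls, datagram: bytes) -> "InternalFrame":
        """Parse a datagram, rejecting frames whose payload was cut short."""
        session_id, seq, timestamp_ms, length = _HEADER.unpack_from(datagram)
        payload = datagram[_HEADER.size:_HEADER.size + length]
        if len(payload) != length:
            raise ValueError("truncated frame")
        return cls(session_id, seq, timestamp_ms, payload)


# Round trip: what the edge packs, the downstream service can unpack.
frame = InternalFrame(session_id=7, seq=1, timestamp_ms=20, payload=b"\x01\x02\x03")
assert InternalFrame.unpack(frame.pack()) == frame
```

In this sketch the downstream service only needs `unpack`; everything WebRTC-specific stays behind the edge, which is the property the split design is after.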
Three things flow downstream from this change:
For builders, the takeaway is that "model + WebRTC SDK on a single VM" is now the slow architecture. You either consume the hosted Realtime API or you replicate the split-relay pattern with a third-party edge (LiveKit, Daily, Pipecat + Cloudflare).
CallSphere's voice stack is built on OpenAI Realtime (gpt-4o-realtime-preview-2025-06-03) for the Healthcare Voice Agent and the OneRoof Real Estate suite — both consume the new split-relay infra automatically. The Healthcare agent runs FastAPI on :8084 with 14 tools; OneRoof runs 10 specialist agents over the OpenAI Agents SDK with WebRTC.
We measured a drop in median voice-to-voice latency from 612 ms to 437 ms across our last 4 weeks of US East and US West Healthcare traffic, a 28.6% improvement that required no code changes on our side. Post-call analytics (sentiment –1.0 to 1.0, lead score 0–100) now arrive 200 ms or more sooner because the orchestration tier returns events while audio is still playing.
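For readers who want to check the arithmetic, here is a minimal sketch of how the 28.6% figure follows from the two medians quoted above. The function name `percent_reduction` is ours, introduced only for illustration.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative reduction from before_ms to after_ms, as a percentage."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100.0


saved_ms = 612 - 437
reduction = percent_reduction(612, 437)
print(f"{saved_ms} ms saved, {reduction:.1f}% lower")  # 175 ms saved, 28.6% lower
```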
For non-Realtime products like Salon GlamBook (which uses ElevenLabs TTS/STT plus a separate LLM), we mirrored the pattern: a thin Cloudflare Workers edge terminates WebRTC and forwards Opus to the inference plane. It is the same architectural lesson, applied at our scale (37 agents, 90+ tools, 115+ DB tables, 6 verticals, 57+ languages, HIPAA + SOC 2 aligned).
websocket -> model design, move WebRTC termination to a dedicated edge service. LiveKit Cloud, Daily Bots, or Cloudflare Realtime all work.
What is the OpenAI Realtime API split-relay architecture? A two-tier WebRTC design where an edge transceiver owns the WebRTC session and forwards audio over a simpler internal protocol to centralized inference services. Documented by OpenAI on May 4 2026.
How much latency does it actually save? OpenAI does not publish a single number, but in our production fleet we measured 28.6% lower median voice-to-voice latency on the Healthcare Voice Agent after the rollout completed.
Do I need to change my code to benefit? No. If you consume the hosted Realtime API, the change is server-side. If you self-host the model, you need to adopt a similar split-relay design yourself.
When should I build my own edge versus use a hosted one? Below ~10M minutes per month, hosted edges (LiveKit, Daily, hosted Realtime) are cheaper. Above that, a custom edge starts to pay for itself through lower egress and per-minute costs.
Is this the same as Symphony Engineering? Symphony is the broader open-source orchestration spec OpenAI released April 27 2026. The May 4 post is the WebRTC subset of that effort.
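The build-versus-buy threshold in the FAQ above is a break-even calculation, and a rough sketch shows its shape. All three inputs below are placeholder values chosen only so the result lands near the ~10M minutes figure; they are not real pricing from any vendor, and the function name `break_even_minutes` is hypothetical. Substitute your own quotes.

```python
def break_even_minutes(
    fixed_monthly_cost: float, hosted_per_min: float, custom_per_min: float
) -> float:
    """Monthly minutes at which a custom edge costs the same as a hosted one.

    Hosted cost is modeled as hosted_per_min * minutes.
    Custom cost is modeled as fixed_monthly_cost + custom_per_min * minutes.
    """
    if hosted_per_min <= custom_per_min:
        raise ValueError("custom edge never pays back at these per-minute rates")
    return fixed_monthly_cost / (hosted_per_min - custom_per_min)


# Placeholder inputs (assumptions, not real prices):
# 20,000 per month of fixed engineering and infrastructure cost for a custom edge,
# 0.004 per minute on a hosted edge, 0.002 per minute marginal cost on your own.
threshold = break_even_minutes(20_000, 0.004, 0.002)
print(f"break-even at roughly {threshold / 1e6:.0f}M minutes per month")
```

Below the threshold the fixed cost dominates and the hosted edge is cheaper; above it, the per-minute savings outweigh the fixed cost.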
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.