OpenAI Realtime: WebSocket vs WebRTC Tradeoffs in 2026
WebSocket gives you server-side control. WebRTC gives you sub-second latency. Here is the honest engineering tradeoff for production AI voice agents in 2026.
WebRTC is the highway. WebSocket is the freight train. Pick the wrong one and you either lose 400 ms of latency or lose half your audit log.
Why does this choice matter for production voice agents?
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]

It matters because the OpenAI Realtime API exposes both transports and the wrong choice silently degrades either user experience or compliance. WebRTC gets you 150–250 ms first-token latency in browsers because it skips TCP head-of-line blocking and pipes Opus directly through the browser stack. WebSocket gets you a server-mediated path where every frame is observable, redactable, and auditable — which is exactly what HIPAA and SOC 2 want.
We see teams pick WebSocket because it "looks" simpler, then ship a chatbot whose first phoneme arrives 700 ms after the user stops speaking. We also see teams pick WebRTC for a healthcare line and discover six weeks later they cannot answer the auditor's "show me the transcript" question because nothing on the server ever saw the audio. Pick deliberately.
How does each transport actually work?
WebSocket runs on TCP. The browser opens a single long-lived connection through your server, and you forward audio frames upstream to OpenAI in 20 ms PCM chunks. Your server sees every byte and every event. You can inject system prompts mid-call, redact PII before it hits the model, and persist a full conversation log. The cost is TCP head-of-line blocking — when one packet stalls, every following packet waits.
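To make the upstream path concrete, here is a minimal sketch of forwarding one of those 20 ms chunks as a Realtime API input_audio_buffer.append event. It assumes a server-side upstream socket to OpenAI and raw PCM16 chunks arriving from your client; the helper name is ours, not the SDK's.

import WebSocket from "ws";

// Forward one ~20 ms PCM16 chunk upstream as a Realtime API event.
// A PII-redaction pass could run on `chunk` here, before anything reaches the model.
function sendAudioChunk(upstream: WebSocket, chunk: Buffer) {
  upstream.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: chunk.toString("base64"), // the API expects base64-encoded audio
    })
  );
}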
WebRTC runs on UDP via SRTP. The browser opens a peer connection directly to OpenAI's media servers (or to your SFU). Opus is bandwidth-adaptive, jitter-tolerant, and packet-loss-resilient. The cost is opacity: by default your server never sees the media path, so you have to build a parallel control channel for tool calls, transcripts, and audit.
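A browser-side sketch of that flow, following OpenAI's documented WebRTC handshake (the ephemeral key comes from your own token endpoint, which is not shown here):

// Browser: connect directly to OpenAI's Realtime media servers over WebRTC.
async function connectRealtime(ephemeralKey: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Microphone audio rides the peer connection as Opus.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0], mic);

  // JSON events (tool calls, transcripts) use a data channel, not the media path.
  const events = pc.createDataChannel("oai-events");
  events.onmessage = (e) => console.log(JSON.parse(e.data));

  // Standard SDP offer/answer exchange over HTTPS.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const resp = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-realtime",
    {
      method: "POST",
      body: offer.sdp,
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
    }
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });
  return pc;
}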
CallSphere's implementation
We run both transports across our six verticals because the right answer depends on the surface:
- Healthcare voice agent — OpenAI Realtime over WebSocket on FastAPI port 8084. HIPAA needs full server-side observability of every patient utterance.
- Real Estate voice agent — WebRTC for browser-to-agent calls. Zero server-side audio path, lowest latency for showing properties live.
- Sales Calling and After-hours — Twilio Media Streams over WebSocket because these are PSTN inbound calls; there is no browser, only a phone.
- Sales Calling dashboard — Socket.IO WebSocket for the agent dashboard, propagating live call state to 37 agents and their managers.
That gives us the latency of WebRTC where the user-perceived experience matters most, and the auditability of WebSocket where compliance matters most.
Code: hybrid pattern with server-side WebSocket bridge
// Server bridges client WebSocket to OpenAI Realtime WebSocket.
// auditLog is the app's structured-log sink, stubbed here so the sketch compiles.
import WebSocket from "ws";

const auditLog = (sessionId: string, frame: WebSocket.RawData) =>
  console.log(sessionId, frame.toString().slice(0, 120));

export function bridgeRealtime(clientWs: WebSocket, sessionId: string) {
  const upstream = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    }
  );

  // Queue client frames until upstream is open so early audio is not dropped.
  const pending: WebSocket.RawData[] = [];
  clientWs.on("message", (frame) => {
    if (upstream.readyState === WebSocket.OPEN) upstream.send(frame);
    else pending.push(frame);
  });
  upstream.on("open", () => {
    pending.splice(0).forEach((frame) => upstream.send(frame));
  });

  upstream.on("message", (frame) => {
    auditLog(sessionId, frame); // server-side capture
    clientWs.send(frame);
  });

  // Tear down both halves together so neither connection leaks.
  upstream.on("close", () => clientWs.close());
  clientWs.on("close", () => upstream.close());
}
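Wiring the bridge into a listener is then one line per connection; a minimal sketch, assuming the same ws package:

import { WebSocketServer } from "ws";
import { randomUUID } from "node:crypto";

// Each inbound client connection gets its own upstream OpenAI session.
new WebSocketServer({ port: 8084 }).on("connection", (clientWs) =>
  bridgeRealtime(clientWs, randomUUID())
);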
Build steps
- Decide per surface: browser + low latency + no compliance bar = WebRTC; phone or compliance = WebSocket.
- For WebSocket, run a thin bridge server (FastAPI or Node) that owns the OpenAI side of the connection and never exposes the API key to the client.
- Implement a control channel either way: tool calls, interruption, and conversation state should travel as JSON events, never inferred from audio (see the sketch after this list).
- Capture audit events server-side. Even WebRTC deployments need a parallel WebSocket for transcripts.
- Set first-byte and first-audio SLOs (we use 250 ms for WebRTC, 450 ms for WebSocket) and alert on regressions.
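One way to type those control events, shown as an illustrative shape rather than a fixed schema. The structural parameter lets sendControl work over a server WebSocket or a WebRTC data channel alike:

// Control-plane events: explicit JSON, never inferred from the audio stream.
type ControlEvent =
  | { type: "tool_call"; name: string; args: Record<string, unknown> }
  | { type: "interrupt"; reason: "barge_in" | "hangup" }
  | { type: "state"; call: "ringing" | "active" | "wrap_up" };

function sendControl(channel: { send(data: string): void }, event: ControlEvent) {
  channel.send(JSON.stringify(event)); // WebSocket or RTCDataChannel, same shape
}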
FAQ
Can I use WebRTC on the phone? Not directly. PSTN goes through Twilio Media Streams, which is WebSocket only. If the phone matters, you are bridging WebSocket to OpenAI.
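For context, an inbound Twilio Media Streams frame is a JSON envelope around base64 μ-law audio; a simplified sketch of unpacking it, with resampling left as a comment:

// Shape of an inbound Twilio Media Streams frame (simplified).
interface TwilioMediaFrame {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded 8 kHz μ-law audio
}

function onTwilioFrame(raw: string): Buffer {
  const frame = JSON.parse(raw) as TwilioMediaFrame;
  // μ-law → PCM16 decode and 8 kHz → 24 kHz resampling happen in the bridge.
  return Buffer.from(frame.media.payload, "base64");
}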
Does WebSocket really add 200 ms? In a same-region deployment with a warm, long-lived connection, the typical penalty is 80–150 ms versus WebRTC. The 200 ms figure assumes one extra hop and a TCP retransmit.
Can I run both at once? Yes. We do — WebRTC to the user, WebSocket to OpenAI through our server. Best of both worlds at the cost of extra infrastructure.
Does the OpenAI Realtime SDK pick for me? No. You choose by calling either the WebSocket or WebRTC client. Default depends on the SDK language.
What about latency under packet loss? WebRTC degrades gracefully (Opus FEC). WebSocket on TCP retransmits, which can add 100+ ms spikes on bad networks. WebRTC wins on flaky Wi-Fi.
CallSphere ships 37 agents, 90+ tools, and 115+ database tables across six verticals with HIPAA + SOC 2 controls. Start a 14-day trial or book a demo — pricing is $149/$499/$1499.