By Sagar Shankaran, Founder of CallSphere
WebSocket gives you server-side control. WebRTC gives you sub-second latency. Here is the honest engineering tradeoff for production AI voice agents in 2026.
Key takeaways
WebRTC is the highway. WebSocket is the freight train. Pick the wrong one and you either lose 400 ms of latency or lose half your audit log.
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]It matters because the OpenAI Realtime API exposes both transports and the wrong choice silently degrades either user experience or compliance. WebRTC gets you 150–250 ms first-token latency in browsers because it skips TCP head-of-line blocking and pipes Opus directly through the browser stack. WebSocket gets you a server-mediated path where every frame is observable, redactable, and auditable — which is exactly what HIPAA and SOC 2 want.
We see teams pick WebSocket because it "looks" simpler, then ship a chatbot whose first phoneme arrives 700 ms after the user stops speaking. We also see teams pick WebRTC for a healthcare line and discover six weeks later they cannot answer the auditor's "show me the transcript" question because nothing on the server ever saw the audio. Pick deliberately.
WebSocket runs on TCP. The browser opens a single long-lived connection through your server, and you forward audio frames upstream to OpenAI in 20 ms PCM chunks. Your server sees every byte and every event. You can inject system prompts mid-call, redact PII before it hits the model, and persist a full conversation log. The cost is TCP head-of-line blocking — when one packet stalls, every following packet waits.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
WebRTC runs on UDP via SRTP. The browser opens a peer connection directly to OpenAI's media servers (or to your SFU). Opus is bandwidth-adaptive, jitter-tolerant, and packet-loss-resilient. The cost is opacity: by default your server never sees the media path, so you have to build a parallel control channel for tool calls, transcripts, and audit.
We run both transports across our six verticals because the right answer depends on the surface:
That gives us the latency of WebRTC where the user-perceived experience matters most, and the auditability of WebSocket where compliance matters most.
// Server bridges client WebSocket to OpenAI Realtime WebSocket
import WebSocket from "ws";
export function bridgeRealtime(clientWs: WebSocket, sessionId: string) {
const upstream = new WebSocket(
"wss://api.openai.com/v1/realtime?model=gpt-realtime",
{
headers: {
Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
"OpenAI-Beta": "realtime=v1",
},
}
);
clientWs.on("message", (frame) => upstream.send(frame));
upstream.on("message", (frame) => {
auditLog(sessionId, frame); // server-side capture
clientWs.send(frame);
});
upstream.on("close", () => clientWs.close());
}
Can I use WebRTC on the phone? Not directly. PSTN goes through Twilio Media Streams, which is WebSocket only. If the phone matters, you are bridging WebSocket to OpenAI.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Does WebSocket really add 200 ms? In a same-region deployment with HTTP/2 keepalive, the typical penalty is 80–150 ms versus WebRTC. The 200 ms figure assumes one extra hop and a TCP retransmit.
Can I run both at once? Yes. We do — WebRTC to the user, WebSocket to OpenAI through our server. Best of both worlds at the cost of extra infrastructure.
Does the OpenAI Realtime SDK pick for me? No. You choose by calling either the WebSocket or WebRTC client. Default depends on the SDK language.
What about latency under packet loss? WebRTC degrades gracefully (Opus FEC). WebSocket on TCP retransmits, which can add 100+ ms spikes on bad networks. WebRTC wins on flaky Wi-Fi.
CallSphere ships 37 agents, 90+ tools, and 115+ database tables across six verticals with HIPAA + SOC 2 controls. Start a 14-day trial or book a demo — pricing is $149/$499/$1499.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
How to actually observe a WebSocket fleet: ping/pong heartbeats, Prometheus metrics that matter, dead-man switches, and the alerts that fire before customers notice.
BrowserStack offers 30,000+ real devices; Sauce Labs ships deep Appium automation. Here is how AI voice agent teams use both for WebRTC mobile QA in 2026.
WebTransport is Baseline as of March 2026. Media Over QUIC ships in production within the year. Here is what changes for AI voice agents — and what stays the same.
The 2024 NPRM proposes mandatory penetration tests every 12 months and vulnerability scans every 6 months. Here is how an AI voice agent should be tested in 2026.
AWS HealthScribe became the open scribe layer EHR vendors built on top of in 2026. Here's the API surface, the per-encounter pricing, the BAA terms.
On May 4 2026 OpenAI published its Realtime stack rebuild — split-relay plus transceiver edge. Here is what changed and what it means for production voice agents.
© 2026 CallSphere LLC. All rights reserved.