
OpenAI Realtime API: WebRTC vs WebSocket — When to Pick Which in 2026

OpenAI's Realtime API speaks both WebRTC and WebSocket. Here is the production playbook CallSphere uses across 37 agents to decide which transport fits which call path.

OpenAI documents two transports for the Realtime API: WebRTC and WebSocket. The right answer is not "pick one." It is "pick per hop." CallSphere uses WebRTC at the browser edge and WebSocket on every server-to-server hop.

What it is and why now

```mermaid
flowchart LR
  Mobile[iOS / Android SDK] --> WHIP[WHIP ingest]
  WHIP --> Mux[Mux / LiveKit]
  Mux --> Brain[AI brain]
  Brain --> WHEP[WHEP egress]
  WHEP --> Web[Web viewer]
```
CallSphere reference architecture

The Realtime API exposes `gpt-realtime` (and `gpt-4o-realtime`) over two wire formats. WebRTC is the recommended path for browsers and mobile clients; WebSocket is the recommended path for middle-tier servers running inside controlled networks. The difference is not academic. WebRTC runs over UDP/SRTP with a built-in jitter buffer, AEC, AGC, and noise suppression. WebSocket runs over TCP — every dropped packet stalls the stream while it retransmits, which is fine for tokens but devastating for live audio.

In 2026 the question matters more than ever because most teams have started embedding voice in marketing pages and product UIs, not just in phone systems. A live page-embed running over WebSocket on flaky Wi-Fi sounds like a 1990s VoIP call. The same path over WebRTC sounds like FaceTime.

How WebRTC fits AI voice (architecture)

The peer connection lifecycle for a Realtime call:

  1. Browser asks your server for an ephemeral session token (so the long-lived API key never leaves the backend).
  2. Browser creates an `RTCPeerConnection`, captures the mic into a track, and creates a data channel for events.
  3. ICE gathers candidates (host, server-reflexive via STUN, relay via TURN).
  4. SDP offer is sent to OpenAI's Realtime endpoint with the ephemeral token; OpenAI returns the SDP answer.
  5. SRTP carries Opus audio both directions; the data channel carries JSON events (`response.create`, `input_audio_buffer.commit`, function-call deltas).

Where WebSocket wins: server agents that need to mediate tool calls, redact PII, write audit logs, or talk to phone networks. There the server keeps a WebSocket open to OpenAI and bridges audio to whichever transport the user uses.
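Here is what that server leg looks like in miniature. This is a sketch assuming a Node runtime with the `ws` package; the event names follow OpenAI's published Realtime event schema, but header and session-field requirements shift between API versions, so verify against current docs.

```ts
import WebSocket from "ws";

// Server-side Realtime leg: the long-lived API key never leaves this process.
const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});

ws.on("open", () => {
  // Configure the session before bridging any audio.
  ws.send(JSON.stringify({
    type: "session.update",
    session: { modalities: ["audio", "text"] },
  }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Tool calls arrive as JSON events; mediate them here:
  // redact PII, write the audit log, then execute and reply.
  if (event.type === "response.function_call_arguments.done") {
    // handleToolCall(event) is a hypothetical dispatcher, not shown here.
  }
});
```

Everything the browser does over the data channel, the server does over this socket; the difference is that audio rides inside JSON events as base64 frames rather than as SRTP packets.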

CallSphere implementation

CallSphere runs both transports in production:

  • Browser /demo — WebRTC peer connection straight to OpenAI Realtime with an ephemeral key minted by our Next.js API route (see the sketch after this list). No backend audio relay. End-to-end median first-audio is 380 ms.
  • Real Estate (OneRoof) — Browser dials in over WebRTC; our Go 1.23 gateway keeps a WebSocket to Realtime so it can fan out tool calls to NATS and the 6-container pod (CRM writer, calendar, MLS lookup, SMS, audit, transcript).
  • Healthcare — Same pattern, HIPAA-isolated. The WebSocket leg lives entirely inside the VPC so audit and PHI redaction happen before anything leaves us.
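The token mint in the first bullet is small enough to show. Here is a minimal sketch of a Next.js route handler, assuming the App Router and OpenAI's ephemeral-session endpoint; the endpoint path and field names match the published API but are worth re-checking against current docs.

```ts
// app/api/realtime/token/route.ts
export async function GET() {
  // Exchange the long-lived key (server-side only) for a short-lived
  // client secret the browser can use to open the peer connection.
  const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-realtime" }),
  });
  const session = await res.json();
  // Return only the ephemeral secret, never the API key.
  return Response.json({ client_secret: session.client_secret?.value });
}
```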

Across 37 agents and 90+ tools we touch 115+ database tables. The ability to keep WebRTC at the user edge while WebSocket carries the controlled-network legs is what lets us claim sub-second perceived latency without giving up SOC 2 controls.

Code snippet (TypeScript, browser side)

```ts
async function startRealtime() {
  // 1. Fetch a short-lived session token; the real API key stays server-side.
  const tokenRes = await fetch("/api/realtime/token");
  const { client_secret } = await tokenRes.json();

  // 2. Peer connection: play whatever audio track OpenAI sends back.
  const pc = new RTCPeerConnection();
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;
  pc.ontrack = (e) => { audioEl.srcObject = e.streams[0]; };

  // 3. Capture the mic and send it upstream.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getAudioTracks()[0]);

  // 4. The data channel carries the JSON event protocol alongside the audio.
  const dc = pc.createDataChannel("oai-events");
  dc.onmessage = (e) => console.log("event", JSON.parse(e.data));

  // 5. SDP offer/answer exchange with the Realtime endpoint.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  const res = await fetch("https://api.openai.com/v1/realtime?model=gpt-realtime", {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${client_secret}`,
      "Content-Type": "application/sdp",
    },
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
}
```
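One gotcha worth a sentence: kick off `startRealtime()` from a user gesture such as a click handler, since most browsers gate audio autoplay (and the mic permission prompt) behind user activation.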

Build / migration steps

  1. Mint short-lived ephemeral tokens server-side; never ship long-lived keys to the browser.
  2. Stand up an `RTCPeerConnection` in the client; attach the mic; create a data channel for events.
  3. Generate the SDP offer, POST it to the Realtime SDP endpoint, set the answer.
  4. For controlled-network hops (telephony bridges, agent workers), open a WebSocket from your server to `wss://api.openai.com/v1/realtime`.
  5. Wire tool calls and audit logging on the WebSocket leg only; keep the browser leg pure transport.
  6. Add a `getStats` poller for the peer connection to track packet loss and jitter (we sample every 2 s; a sketch follows this list).
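A minimal poller for step 6, assuming the `pc` from the browser snippet above:

```ts
// Sample inbound audio stats every 2 s and surface loss/jitter.
function pollStats(pc: RTCPeerConnection): void {
  setInterval(async () => {
    const stats = await pc.getStats();
    stats.forEach((report) => {
      if (report.type === "inbound-rtp" && report.kind === "audio") {
        console.log({
          packetsLost: report.packetsLost,
          jitter: report.jitter, // seconds, per the WebRTC stats spec
        });
      }
    });
  }, 2000);
}
```

A common VoIP rule of thumb treats sustained jitter above roughly 30 ms as audible, so that is a reasonable alerting threshold to start from.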

FAQ

Can I run both at once? Yes. CallSphere uses WebRTC client-side and WebSocket server-side; they connect to the same model.

Does WebRTC work on iOS Safari? Yes, since iOS 11, and Safari 26.4 (March 2026) ships first-party WebTransport too.

What about telephony? Phone calls hit our SIP gateway, which bridges into a WebSocket Realtime session. Browser callers stay on WebRTC.

Do I still need TURN? Yes — about 8–10% of users live behind symmetric NATs that fail STUN.

How long is a session? OpenAI caps Realtime sessions at 30 minutes; renew before that.


Try the WebRTC path live on our /demo, see the bundle in /pricing, or start a /trial.
