
WebRTC vs WebSocket Voice: CallSphere Architecture Edge Over Vapi

WebRTC vs WebSocket for voice AI: when each transport wins on NAT traversal, jitter, codec choice, and latency. CallSphere runs both; Vapi locks you in.

TL;DR

CallSphere runs two transports in production: WebRTC for the Real Estate vertical and WebSocket for Healthcare. The choice is not religious — each transport wins under different network conditions, codecs, and latency budgets. WebRTC owns NAT traversal, jitter resilience, and adaptive bitrate. WebSocket via the OpenAI Realtime API owns determinism, server-side VAD, and clean PCM16 audio frames. Vapi.ai funnels everything through its own opinionated pipeline, which is fine until your callers sit behind a corporate firewall, hit a flaky 4G tower, or need <1s end-to-end latency on a non-PSTN channel. This post walks through when each transport shines and why a single transport is a foot-gun.

The Transport Layer Decides Everything Downstream

A voice AI stack is only as good as the bytes it gets. If the audio arrives jittered, clipped, or codec-mangled, no LLM in the world will produce a clean turn-by-turn conversation. The transport layer is the foundation, and the two real choices in 2026 are WebRTC and WebSocket (with PSTN bridges feeding either).

WebRTC was designed for peer-to-peer real-time media. It has built-in jitter buffers, packet loss concealment, ICE for NAT traversal, DTLS-SRTP for encryption, and Opus as the default codec with adaptive bitrate. It is what Google Meet, Zoom Web, and Discord run on.

WebSocket is a far thinner pipe — a TCP-based duplex stream with framing. There is no built-in jitter handling, no codec negotiation, no NAT magic. You ship raw PCM frames or compressed audio over TLS and the application layer handles everything.

How Vapi Handles Transport

Vapi gives you a single hosted pipeline. You bring your STT, LLM, and TTS choices, and Vapi terminates the audio on your behalf. Their telephony layer maps to PSTN, SIP, and a Web SDK. Under the hood, it's an opinionated WebSocket-style stream into their compute, with their own VAD and barge-in logic. It works well when your call paths look like Vapi's reference architecture: a US/Canada phone number, English caller, simple turn structure.

Where it strains:

  • Corporate NAT and symmetric firewalls. Inbound web sessions from enterprise networks routinely fail without TURN.
  • Mobile network jitter. Without the WebRTC jitter buffer, occasional packet loss surfaces as cut-off words.
  • Custom codecs. You take what the platform serves; you cannot swap to G.722 or wideband Opus on demand.
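For browser callers behind strict NAT, forcing TURN over TCP on port 443 is the usual escape hatch, because the relay traffic is indistinguishable from ordinary HTTPS at the firewall. A minimal sketch of the ICE configuration, with placeholder server names and credentials:

```javascript
// Hypothetical STUN/TURN endpoints -- substitute your own relay fleet.
const iceServers = [
  { urls: 'stun:stun.example.com:3478' },
  {
    // turns: (TURN over TLS) on 443 traverses most corporate firewalls.
    urls: 'turns:turn.example.com:443?transport=tcp',
    username: 'caller-1234',        // typically short-lived, minted per call
    credential: 'ephemeral-token',
  },
];

// In a browser this feeds the peer connection directly:
//   new RTCPeerConnection({ iceServers, iceTransportPolicy: 'relay' });
```

Setting `iceTransportPolicy: 'relay'` forces every candidate through TURN, which is a useful diagnostic mode when you suspect the direct path is the problem.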

How CallSphere Picks Transport Per Vertical

CallSphere chooses transport vertical-by-vertical because the network conditions and latency targets differ.

Real Estate runs WebRTC. A 6-container pod (frontend, Go gateway, AI worker, voice server, NATS, Redis) terminates WebRTC at the gateway. Buyers calling from open houses, parking lots, and conference rooms get the jitter resilience and adaptive bitrate the situation demands. Vision payloads (property photos sent mid-call) ride the same data channel.

Healthcare runs WebSocket straight into the OpenAI Realtime API at PCM16, 24kHz mono. Server-side VAD on OpenAI's side handles turn detection. Latency target is <1s end-to-end. The deterministic frame timing of WebSocket plus a high-trust hosted endpoint beats WebRTC for a controlled telephony bridge.
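A sketch of what that session setup looks like against the Realtime API's event protocol. The event and field names (`session.update`, `server_vad`, `input_audio_buffer.append`) follow OpenAI's documented schema, but the specific threshold and silence values here are illustrative, not CallSphere's production settings:

```javascript
// Sent once after the WebSocket connection opens: enable server-side VAD.
const sessionUpdate = {
  type: 'session.update',
  session: {
    input_audio_format: 'pcm16',   // 24kHz mono PCM16, little-endian
    output_audio_format: 'pcm16',
    turn_detection: {
      type: 'server_vad',          // let OpenAI segment turns
      threshold: 0.5,              // illustrative value
      silence_duration_ms: 500,    // illustrative value
    },
  },
};

// Audio frames ride inside JSON events, base64-encoded.
function appendFrame(pcm16Buffer) {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16Buffer.toString('base64'),
  });
}
```

With server VAD enabled, the pipeline never decides when a turn ends; it just keeps appending frames and reacts to the model's response events.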

The After-Hours, IT Helpdesk, Salon, and Sales verticals each pick the transport that matches their dominant call path — PSTN bridges into WebSocket where deterministic timing matters, WebRTC where the caller is on a browser with variable network quality.

Transport Decision Matrix

| Concern | WebRTC | WebSocket (Realtime API) |
|---|---|---|
| NAT traversal | Native (ICE, STUN, TURN) | None — relies on TLS over TCP |
| Jitter buffer | Built-in | App must implement |
| Packet loss concealment | Native (Opus PLC) | App-level only |
| Codec flexibility | Opus, G.722, PCMU, custom | Whatever you frame |
| Adaptive bitrate | Yes | No |
| Latency floor | ~50ms RTT under good conditions | ~100ms+ TCP RTT |
| Browser support | First-class | First-class |
| Server-side VAD | DIY | Native via OpenAI Realtime |
| Encryption | DTLS-SRTP (mandatory) | TLS |
| Vision/data channel | Yes, same session | Separate channel needed |

Sequence Diagrams: Same Call, Two Transports

sequenceDiagram
    participant C as Caller
    participant GW as Go Gateway (WebRTC)
    participant AI as AI Worker
    participant WS as Voice Server (WebSocket)
    participant OAI as OpenAI Realtime
    Note over C,GW: Real Estate path (WebRTC)
    C->>GW: ICE candidate exchange
    C->>GW: DTLS-SRTP handshake
    C->>GW: Opus audio (adaptive bitrate)
    GW->>AI: Decoded PCM frames
    AI->>OAI: PCM16 24kHz over WS
    OAI-->>AI: Response audio + tool call
    AI-->>GW: TTS frames
    GW-->>C: Opus stream back
    Note over C,WS: Healthcare path (WebSocket)
    C->>WS: TLS upgrade to WS
    C->>WS: PCM16 frames
    WS->>OAI: Forward as-is
    OAI-->>WS: Server VAD + response
    WS-->>C: PCM16 back

The diagram shows why we picked each transport: Real Estate needs the gateway to absorb network variance before forwarding clean PCM into OpenAI; Healthcare lets OpenAI handle VAD natively because the network conditions on the hospital side are tightly controlled.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

When WebRTC Wins

  • Browser-originated calls where the caller has Chrome, Edge, Safari, or Firefox.
  • Mobile carriers with variable jitter (4G, 5G with handoffs, hotel Wi-Fi).
  • Vision and data alongside audio in the same session — for example a buyer texting a listing photo mid-call.
  • Corporate firewall traversal that requires TURN over 443.

When WebSocket Wins

  • PSTN-bridged calls where the carrier already cleaned up jitter and NAT.
  • Direct integration with hosted realtime models that expect framed PCM.
  • Deterministic latency targets where you want zero adaptive bitrate decisions.
  • Server-side VAD pipelines where the model itself segments turns.

Mini Code Sketch: PCM16 Frame Sender

// Hypothetical endpoint; the real URL and auth depend on your voice server.
const ws = new WebSocket('wss://voice.example.com/stream');
ws.binaryType = 'arraybuffer';
ws.onopen = () => {
  // 100ms of 24kHz mono PCM16: 2400 samples, 4800 bytes
  const pcm = new Int16Array(2400);
  ws.send(pcm.buffer);
};

A 100ms frame at 24kHz mono PCM16 is exactly 4800 bytes. The OpenAI Realtime API expects this framing; CallSphere's voice servers chunk to it before forwarding. WebRTC, by contrast, never sees raw PCM on the wire — Opus does the compression and the gateway decodes only when needed.
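The framing arithmetic generalizes: 24,000 samples/s × 0.1 s × 2 bytes/sample = 4,800 bytes per frame. A chunker that enforces that boundary, carrying any partial tail over to the next write — a sketch, since CallSphere's actual voice-server code is not public:

```javascript
const FRAME_BYTES = 24000 * 0.1 * 2; // 4800: 100ms of 24kHz mono PCM16

// Splits an incoming byte stream into exact 4800-byte frames,
// holding any remainder until enough bytes arrive to fill the next frame.
function makeChunker(frameBytes = FRAME_BYTES) {
  let pending = Buffer.alloc(0);
  return function push(bytes) {
    pending = Buffer.concat([pending, bytes]);
    const frames = [];
    while (pending.length >= frameBytes) {
      frames.push(pending.subarray(0, frameBytes));
      pending = pending.subarray(frameBytes);
    }
    return frames;
  };
}
```

The carry-over matters in practice: telephony bridges rarely deliver audio in tidy 100ms increments, so the chunker absorbs the mismatch between carrier packetization and the framing the model expects.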

Cost and Operations Tradeoff

WebRTC infrastructure costs more to operate. You run TURN servers, you pay for media relay, you debug ICE failures. WebSocket pipelines cost less but expose you to network fragility. Vapi hides the choice entirely, which is convenient but locks you in. CallSphere exposes the choice because verticals differ. A real-estate WebRTC pod has different SLO targets than a healthcare WebSocket pipeline, and the architecture reflects that.

Engineering teams evaluating voice AI in 2026 should ask their vendor: what transport are you on, and can I change it? If the answer is "you take what we give you," that's a red flag for any non-trivial vertical. CallSphere's answer is "we picked the right transport for your vertical, and we'll show you the trace." Try a demo or read the features overview to see both stacks.

FAQ

Is WebRTC always lower latency than WebSocket?

No. Under clean network conditions and with a hosted endpoint co-located with your gateway, WebSocket can match or beat WebRTC because there is no ICE negotiation tax. WebRTC wins on bad networks; WebSocket wins on controlled ones.

Can CallSphere bridge PSTN into WebRTC?

Yes. Twilio Programmable Voice or any SIP carrier can terminate at our Go gateway and convert to WebRTC for browser handoff, or remain as a PCM stream into the WebSocket pipeline. The choice is made per vertical.

Does Vapi support WebRTC?

Vapi does support WebRTC for browser SDK paths, but the transport selection and tuning are not as exposed as CallSphere's. You cannot opt a vertical into one transport vs the other based on caller geography or codec needs.

What about packet loss handling?

WebRTC's Opus codec includes Packet Loss Concealment that interpolates missing audio. WebSocket pipelines have to implement PLC at the application layer, or accept the gaps. CallSphere's WebSocket pipeline targets controlled networks where PLC is rarely needed.

Why does Healthcare use WebSocket instead of WebRTC?

Healthcare callers route through hospital PBXes, which already absorb jitter. The OpenAI Realtime API's server-side VAD is best fed clean PCM16 over WebSocket, and the integration is dramatically simpler than maintaining a WebRTC pod for every call. The right tool for the right network.


Operational Lessons from Running Both Transports

After running WebRTC and WebSocket pipelines side by side, a few operational patterns stand out.

TURN cost matters. WebRTC sounds great in demos, but a TURN relay pulling 64kbps Opus per call adds up at scale. We co-locate TURN in the same datacenter as the gateway and use long-lived connection reuse to keep cost predictable.

Health checks must understand transport. A 200 OK on an HTTP endpoint says nothing about whether your WebRTC pod can negotiate ICE. We added synthetic call probes that establish a real WebRTC session every 60 seconds and measure first-audio-out latency. The probe catches NAT path failures a port check misses.
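Once a probe runs, you still need to reduce its event log to a single number you can alert on. A sketch of that reduction — the event names (`offer_sent`, `first_audio`) are illustrative, not the probe's real schema:

```javascript
// Probe log entries: { event, t } with t in ms (e.g. from performance.now()).
// Returns first-audio-out latency in ms, or null if the probe never
// completed -- a null is itself worth alerting on (ICE/NAT path failure).
function firstAudioLatency(log) {
  const start = log.find((e) => e.event === 'offer_sent');
  const audio = log.find((e) => e.event === 'first_audio');
  if (!start || !audio) return null;
  return audio.t - start.t;
}
```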

Codec choice is not just quality. Opus at 24kbps on a clean line sounds nearly as good as 64kbps and uses a quarter of the bandwidth. We negotiate codec parameters per call based on the SDP offer and the caller's reported network type. WebSocket has no such negotiation, which is fine for the Healthcare pipeline because the carrier already chose the codec.
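Per-call Opus tuning happens in the SDP. A sketch that caps the average bitrate by rewriting the Opus fmtp line — it assumes payload type 111 (the common browser default), whereas production code should resolve the payload type from the rtpmap instead of hardcoding it:

```javascript
// Appends maxaveragebitrate to the Opus fmtp line of an SDP blob.
// Assumes payload type 111; real code should parse the a=rtpmap line.
function capOpusBitrate(sdp, bps) {
  return sdp.replace(
    /^(a=fmtp:111 [^\r\n]*)/m,
    (line) => `${line};maxaveragebitrate=${bps}`
  );
}
```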

Observability per transport. WebRTC stats (RTCStatsReport) are rich — jitter, packets lost, round-trip time, audio level. WebSocket gives you frame timestamps and that's it. We emit Prometheus metrics for both, but the WebRTC dashboards are dramatically more useful for diagnosing live call quality. If a customer reports a bad call, the WebRTC trace tells us within minutes whether the problem was network, codec, or model.
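The stats report is a Map of stat records, and distilling it to the handful of gauges worth exporting is a small pure function. The field names below (`jitter`, `packetsLost`, `currentRoundTripTime`) are the standardized ones for inbound-rtp and candidate-pair entries; the reduction itself is a sketch:

```javascript
// Reduce a WebRTC stats report (a Map of stat records, as returned by
// pc.getStats()) to call-quality gauges. Accepting any Map makes it
// trivial to unit-test without a live peer connection.
function summarizeStats(report) {
  const out = { jitterMs: null, packetsLost: null, rttMs: null };
  for (const stat of report.values()) {
    if (stat.type === 'inbound-rtp' && stat.kind === 'audio') {
      out.jitterMs = stat.jitter * 1000;            // spec reports seconds
      out.packetsLost = stat.packetsLost;
    }
    if (stat.type === 'candidate-pair' && stat.nominated) {
      out.rttMs = stat.currentRoundTripTime * 1000; // seconds -> ms
    }
  }
  return out;
}
```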

Migration Path: WebSocket First, WebRTC When Justified

If you are starting a voice AI build today, our recommendation is WebSocket first. The OpenAI Realtime API is the simpler integration. Carrier-bridged calls absorb the network variance you would otherwise need WebRTC to handle. You can ship a working agent in days, not weeks, and you avoid the operational overhead of TURN servers and ICE debugging.

Add WebRTC when one of three things is true: callers are originating from browsers at scale, vision payloads need to ride alongside audio in one session, or your callers sit in environments where carrier-bridged calls are not the dominant path (open houses, retail floors, conferences). CallSphere's Real Estate vertical hit all three at once, which is why that pod runs WebRTC. Healthcare's clinics never did, which is why that pipeline stays on WebSocket. The decision is per-vertical, and the cost of getting it wrong is mostly engineering time, not user experience.

Try CallSphere

See the dual-transport architecture in production. Book a demo or browse Healthcare and Real Estate deep-dives.

