By Sagar Shankaran, Founder of CallSphere
Insertable Streams unlocks frame-level access for noise suppression, watermarking, and AI inference inside a WebRTC pipeline. Here is the 2026 production playbook.
Key takeaways
Insertable Streams was the first browser API that let you put your own code between the encoder and the network. For AI voice in 2026 it is the difference between shipping noise suppression as a Chrome flag and shipping it as a feature.
Browsers have always had acoustic echo cancellation (AEC), automatic gain control (AGC), and basic noise suppression baked into the WebRTC pipeline. The problem is that those defaults were tuned for two-way video conferencing calls, not for AI voice agents that have to pick out a single user speaking over a cellular connection or coffee-shop Wi-Fi. Insertable Streams (and its successor, the Encoded Transform API) let you stream raw or encoded frames into a Worker, run a model or DSP block, and stream the result back into the peer connection. Every stage downstream of your transform, including the network and the receiving SFU, sees the transformed frames; the rest of the WebRTC pipeline does not change.
Production reasons teams reach for Insertable Streams in 2026:
The economic argument is simple: the heaviest improvements in voice quality and privacy now live below the codec, and you cannot reach below the codec without this API.
```mermaid
flowchart LR
    Mic[Microphone] --> Track[MediaStreamTrack]
    Track --> Sender[RTCRtpSender]
    Sender -.->|encoded frames| Worker[Web Worker - AI model]
    Worker -.->|transform stream| Sender
    Sender --> Net[(Network / SFU)]
```
The flow uses two streams: a `ReadableStream` of frames coming out of the encoder, and a `WritableStream` going back into the packetizer. Your transformer reads, modifies, and writes. Critically, all of this runs off the main thread inside a `Worker`, so model inference does not block UI rendering.
CallSphere uses Insertable Streams in the browser-side voice paths for two of its six verticals (real estate, healthcare, behavioral health, legal, salon, insurance):
Across 37 agents, 90+ tools, and 115+ database tables we keep the heavy DSP off the server hot path, which lets every agent share the same OpenAI Realtime endpoint. SOC 2 + HIPAA controls cover the recorded artifacts, and the worker model files themselves are signed, hashed, and pinned to a SHA-256 in the manifest. Pricing remains $149/$499/$1499 with the 14-day trial; affiliates earn 22% — see /affiliate.
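```ts
// main.ts
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const [track] = stream.getAudioTracks();

// Chromium only exposes createEncodedStreams() when the connection opts in.
const pc = new RTCPeerConnection({
  // @ts-expect-error - encodedInsertableStreams is a non-standard Chromium option
  encodedInsertableStreams: true,
});
const sender = pc.addTrack(track, stream);

// @ts-expect-error - createEncodedStreams is non-standard but Chromium-supported
const { readable, writable } = sender.createEncodedStreams();

// Transfer both streams so the transform runs off the main thread.
const worker = new Worker("/audio-transform.js", { type: "module" });
worker.postMessage({ readable, writable }, [readable, writable]);
```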
```ts
// audio-transform.js (runs inside the Worker)
self.onmessage = ({ data: { readable, writable } }) => {
  const transformer = new TransformStream({
    async transform(chunk, controller) {
      // chunk is an RTCEncodedAudioFrame
      const view = new DataView(chunk.data);
      // ... run noise suppression / watermarking / VAD on view ...
      controller.enqueue(chunk);
    },
  });
  readable.pipeThrough(transformer).pipeTo(writable);
};
```
Does Safari support Insertable Streams? Not the legacy `createEncodedStreams` API, which is Chromium-only. Safari has shipped the standard `RTCRtpScriptTransform` (Encoded Transform) interface since version 15.4, so feature-detect and use that path on Safari.
Does it slow audio down? Each transform adds 1–3 ms in our measurements. Stay under 30 ms total per frame to avoid jitter buffer growth.
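One way to keep an eye on that budget is to time each stage and total the result per frame. The sketch below is illustrative only: `FrameBudget` and its method names are hypothetical, the 30 ms default is the ceiling quoted above, and it only measures synchronous stages.

```ts
// Hypothetical helper (not part of any WebRTC API): per-frame latency accounting.
type Clock = () => number; // returns a timestamp in milliseconds

class FrameBudget {
  overruns = 0;
  private spentMs = 0;
  private readonly budgetMs: number;
  private readonly now: Clock;

  constructor(budgetMs = 30, now: Clock = () => performance.now()) {
    this.budgetMs = budgetMs;
    this.now = now;
  }

  // Run one synchronous transform stage and add its duration to the frame total.
  measure<T>(stage: () => T): T {
    const start = this.now();
    try {
      return stage();
    } finally {
      this.spentMs += this.now() - start;
    }
  }

  // Call once per frame after all stages have run; reports the total and resets it.
  endFrame(): { totalMs: number; overBudget: boolean } {
    const totalMs = this.spentMs;
    const overBudget = totalMs > this.budgetMs;
    if (overBudget) this.overruns += 1;
    this.spentMs = 0;
    return { totalMs, overBudget };
  }
}
```

Inside a transformer you would wrap each stage in `budget.measure(...)` and call `budget.endFrame()` just before `controller.enqueue(chunk)`, then report `overruns` to your telemetry.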
Can I run an AI model in the Worker? Yes. onnxruntime-web with WebGPU runs RNNoise-style models comfortably under 5 ms per 10 ms frame on an M2 MacBook Air.
Is this still the right API? For new code, target RTCRtpScriptTransform. For existing Chromium-only code, `createEncodedStreams` is fine until ~2027.
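A rough sketch of how the two paths can coexist while you migrate: prefer `RTCRtpScriptTransform` when the browser exposes it and fall back to `createEncodedStreams` otherwise. The `attachTransform` function, its structural types, and the `{ kind: "audio" }` options payload are assumptions made for illustration; the types are deliberately loose so the snippet compiles without DOM typings for these APIs.

```ts
// Loose structural stand-ins for RTCRtpSender and the RTCRtpScriptTransform constructor.
type SenderLike = {
  transform?: unknown;
  createEncodedStreams?: () => { readable: unknown; writable: unknown };
};
type ScriptTransformCtor = new (worker: unknown, options?: unknown) => unknown;
type WorkerLike = { postMessage(message: unknown, transfer: unknown[]): void };
type AttachResult = "script-transform" | "encoded-streams" | "unsupported";

function attachTransform(
  sender: SenderLike,
  worker: WorkerLike,
  scope: { RTCRtpScriptTransform?: ScriptTransformCtor } = globalThis as any,
): AttachResult {
  if (typeof scope.RTCRtpScriptTransform === "function") {
    // Standard path: the Worker receives an `rtctransform` event whose
    // `event.transformer` carries `readable`, `writable`, and these options.
    sender.transform = new scope.RTCRtpScriptTransform(worker, { kind: "audio" });
    return "script-transform";
  }
  if (typeof sender.createEncodedStreams === "function") {
    // Legacy Chromium path: requires `encodedInsertableStreams: true`
    // in the RTCPeerConnection configuration.
    const { readable, writable } = sender.createEncodedStreams();
    worker.postMessage({ readable, writable }, [readable, writable]);
    return "encoded-streams";
  }
  return "unsupported";
}
```

On the standard path the Worker listens with `self.onrtctransform` instead of `self.onmessage`; the `pipeThrough(...).pipeTo(...)` body can stay the same.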
Can I do video transforms with the same API? Yes — `RTCEncodedVideoFrame` exposes the same shape; SFrame examples in the W3C explainer cover both.
Does it stack with E2EE? Yes — an SFrame worker can sit alongside the noise-suppression worker as long as you order them correctly: encrypt last on the sender, decrypt first on the receiver.
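The ordering rule can be sketched as plain function composition over frame payloads. Everything here is a stand-in: `xorWith` is a placeholder and is not encryption (a real deployment would use an SFrame implementation), and `appendMarker`/`stripMarker` are hypothetical payload-level stages used only to make the order visible.

```ts
type Stage = (payload: Uint8Array) => Uint8Array;

// Run stages left to right over one frame payload.
const pipeline = (...stages: Stage[]): Stage => (payload) =>
  stages.reduce((bytes, stage) => stage(bytes), payload);

// PLACEHOLDER ONLY: XOR is not encryption; it stands in for the SFrame stage.
const xorWith = (key: number): Stage => (bytes) => bytes.map((b) => b ^ key);

// Hypothetical payload-level stage (e.g. a watermark marker) and its inverse.
const appendMarker: Stage = (bytes) => {
  const out = new Uint8Array(bytes.length + 1);
  out.set(bytes);
  out[bytes.length] = 0x7f;
  return out;
};
const stripMarker: Stage = (bytes) => bytes.slice(0, -1);

// Sender: process first, encrypt last. Receiver: decrypt first, then process.
const senderChain = pipeline(appendMarker, xorWith(0x5a));
const receiverChain = pipeline(xorWith(0x5a), stripMarker);
```

Swapping the order on either side means a stage operates on ciphertext, which is why the encrypt step has to be the outermost one.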
Does it work with simulcast? For audio there is no simulcast; for video the answer is yes, with one transform per layer.
Can I drop frames? Yes. Call `controller.enqueue` only when you want to keep the frame. Be careful, though: dropping audio frames produces audible glitches.
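A minimal sketch of a dropping transformer, assuming a hypothetical `makeDroppingTransformer` factory and a caller-supplied `shouldKeep` predicate. The structural types stand in for `RTCEncodedAudioFrame` and the stream controller so the snippet does not depend on DOM typings.

```ts
// Structural stand-ins for RTCEncodedAudioFrame and TransformStreamDefaultController.
type EncodedFrameLike = { data: ArrayBuffer };
type ControllerLike<T> = { enqueue(chunk: T): void };

function makeDroppingTransformer<T extends EncodedFrameLike>(
  shouldKeep: (frame: T) => boolean,
) {
  let dropped = 0;
  return {
    transform(frame: T, controller: ControllerLike<T>): void {
      if (shouldKeep(frame)) {
        controller.enqueue(frame);
      } else {
        dropped += 1; // never enqueued, so the frame does not reach the packetizer
      }
    },
    // Expose the count so glitch-inducing drops can be monitored.
    droppedCount: (): number => dropped,
  };
}

// Usage inside the Worker (assumes `readable`/`writable` came from the main thread):
// const t = makeDroppingTransformer((f) => f.data.byteLength > 0);
// readable.pipeThrough(new TransformStream({ transform: t.transform })).pipeTo(writable);
```

Tracking `droppedCount()` alongside call-quality metrics makes it easier to tell whether a predicate is too aggressive.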
Three production rules survive contact with reality:
We pin every model artifact by SHA-256, refuse to load anything unsigned, and gate model upgrades behind a 2% canary that compares concealment ratio in production before promoting.
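The canary gate could look roughly like the sketch below. It is not CallSphere's implementation: the FNV-1a bucketing, the function names, and the `maxRegression` tolerance are all assumptions for illustration. The two stats fields, `concealedSamples` and `totalSamplesReceived`, are the ones defined on inbound RTP audio stats in the WebRTC statistics API.

```ts
// Subset of RTCInboundRtpStreamStats used to compute the concealment ratio.
type AudioInboundStats = { concealedSamples: number; totalSamplesReceived: number };

function concealmentRatio(s: AudioInboundStats): number {
  return s.totalSamplesReceived === 0 ? 0 : s.concealedSamples / s.totalSamplesReceived;
}

// FNV-1a (32-bit): a simple deterministic hash so a session always lands in the same bucket.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Roughly `percent` out of every 100 hash buckets receive the candidate model.
function inCanary(sessionId: string, percent = 2): boolean {
  return fnv1a(sessionId) % 100 < percent;
}

// Promote only if the canary's concealment ratio is not meaningfully worse.
// `maxRegression` is an assumed tolerance; tune it against your own baseline.
function shouldPromote(
  control: AudioInboundStats,
  canary: AudioInboundStats,
  maxRegression = 0.005,
): boolean {
  return concealmentRatio(canary) <= concealmentRatio(control) + maxRegression;
}
```

Hashing the session ID rather than sampling randomly keeps a given caller on one model for the whole call, so the two populations stay comparable.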
Try the WebRTC + Worker path live on our /demo, see the bundle in /pricing, or start a /trial.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.