By Sagar Shankaran, Founder of CallSphere
Run Whisper, Kokoro, and LFM2.5-Audio entirely in the browser with ONNX Runtime Web + WebGPU. Flash Attention, qMoE, sub-100ms latency on a laptop. Privacy-first voice without a backend.
Key takeaways
TL;DR: By 2026, ONNX Runtime Web + WebGPU has matured enough to run Whisper-tiny + Kokoro + LFM2.5-Audio-1.5B entirely in a browser tab. Recent updates include Flash Attention, graph capture, Split-K MatMul, and qMoE support. WebGPU works in Chrome, Edge, Safari (26+), and Firefox Nightly. The result is a voice agent that sends zero audio to your servers, which suits HIPAA- and GDPR-sensitive deployments as well as latency-critical ones.
flowchart LR
MIC[Mic Stream] --> ORT[ONNX Runtime Web]
ORT -->|WebGPU| STT[Whisper-tiny INT8]
STT -->|text| LLM{LLM Choice}
LLM -->|local| LFM[LFM2.5-Audio 1.5B in-browser]
LLM -->|remote| API[Cloudflare / Groq]
LFM & API -->|tokens| TTS[Kokoro-82M WebGPU]
TTS -->|PCM| OUT[Audio Output]
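The first hop in the diagram (mic stream into Whisper) needs 16 kHz mono float samples, while browser audio graphs usually run at 44.1 or 48 kHz. The sketch below shows one way to downmix and resample a captured block before handing it to the model. The function names are hypothetical, and linear interpolation is a simplification: it applies no low-pass filter, so some aliasing is possible. Constructing the `AudioContext` with `{ sampleRate: 16000 }` and letting the browser resample is an alternative worth considering.

```javascript
const TARGET_RATE = 16000; // Whisper models expect 16 kHz mono input

// Average per-channel Float32Arrays (the shape an AudioWorklet delivers)
// into a single mono buffer.
function downmixToMono(channels) {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i];
  }
  for (let i = 0; i < length; i++) mono[i] /= channels.length;
  return mono;
}

// Resample by linear interpolation between neighbouring samples.
function resampleLinear(samples, fromRate, toRate = TARGET_RATE) {
  if (fromRate === toRate) return samples.slice();
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, samples.length - 1);
    const frac = pos - left;
    out[i] = samples[left] * (1 - frac) + samples[right] * frac;
  }
  return out;
}
```

In a worklet, `downmixToMono(inputs[0])` followed by `resampleLinear(mono, sampleRate)` would produce the frames to accumulate for the Whisper session.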
CallSphere ships a Browser Voice Widget (<script src="cdn.callsphere.ai/widget.js">) that runs Whisper-tiny + Kokoro fully in-browser, then calls our remote LLM only for the language step. 37 agents · 90+ tools · 115+ DB tables · 6 verticals. Plans: $149 / $499 / $1,499, with a 14-day trial (/trial) and a 22% affiliate program (/affiliate).
1. Install the runtime: npm install onnxruntime-web@1.21 (latest 2026 build with WebGPU EP improvements).
2. Export Whisper to ONNX: optimum-cli export onnx --model openai/whisper-tiny --task automatic-speech-recognition <output_dir>
3. Quantize to INT8 with onnxruntime.quantization for a 4× smaller download.
4. Create the session: const session = await ort.InferenceSession.create(url, { executionProviders: ['webgpu'] });
5. Use an AudioWorklet to grab 16 kHz mono frames and feed them into the Whisper session.
6. Play the synthesized audio through an AudioBufferSourceNode.
Q: Can I run Llama in-browser? A: Yes, for small models: LFM2.5-Audio-1.5B (Liquid AI) and Phi-3.5-mini run on WebGPU at 15–30 tok/s on an M2 MacBook.
Q: Whisper-large? A: Too big (1.6GB). Stick with tiny/base in-browser; route long-form to a server.
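Q: HIPAA? A: In-browser inference means audio never leaves the device, which suits healthcare deployments (/industries/healthcare). If you route the language step to a remote LLM, the transcript text still goes to that API.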
Q: Mobile? A: WebGPU ships by default in Safari on iOS 26+ and in Chrome 121+ on Android. Performance varies widely by GPU.
Q: Cost? A: Inference costs you nothing at runtime because the compute runs on the user's device. The CallSphere widget is included on the Growth plan and above (/pricing).
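Because WebGPU availability and performance differ across browsers and devices, it can help to probe for a usable adapter and list a WASM fallback when creating the session. The following is a minimal sketch under stated assumptions: `providersFor`, `hasUsableWebGPU`, and `createSession` are hypothetical helper names, and the `ort` module is passed in by the caller because the right import path (the plain `onnxruntime-web` entry point or a WebGPU-specific bundle) depends on the version you install.

```javascript
// Execution providers in priority order; 'wasm' is the CPU fallback.
function providersFor(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// navigator.gpu can exist while no adapter is available, so ask for one.
async function hasUsableWebGPU(nav) {
  if (!nav || !nav.gpu) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

// `ort` is the onnxruntime-web module object supplied by the caller.
async function createSession(ort, modelUrl, nav = globalThis.navigator) {
  const executionProviders = providersFor(await hasUsableWebGPU(nav));
  return ort.InferenceSession.create(modelUrl, { executionProviders });
}
```

On a device without a usable GPU adapter this requests only the WASM provider, which is slower but keeps the agent functional.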
For browser voice agents built on ONNX Runtime + WebGPU, the architecture ultimately resolves into one engineering question: when do you use the OpenAI Realtime API versus an async pipeline? Realtime wins on latency for live calls. Async wins on cost, retries, and structured tool reliability for callbacks and SMS flows. Most teams need both, and the routing layer between them becomes the most load-bearing piece of the stack.
Production AI agents live or die on three loops: evals, retries, and handoff state. CallSphere runs 37 agents across 6 verticals, each with its own eval suite: synthetic call transcripts replayed nightly with assertion checks on extracted entities (date, time, party size, insurance, address). Without that loop, prompt regressions ship silently and you only find out when bookings drop.
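As an illustration of what such an assertion check could look like, here is a minimal sketch. It is not CallSphere's eval harness: the case shape (`id`, `transcript`, `expected`) and both function names are assumptions, and `extract` is a synchronous stand-in for the agent under test, where a real agent call would be asynchronous.

```javascript
// Compare the entities an agent extracted with the expected values
// for one synthetic transcript; return one entry per mismatch.
function checkEntities(expected, actual) {
  const failures = [];
  for (const [field, want] of Object.entries(expected)) {
    const got = actual[field];
    if (got !== want) failures.push({ field, want, got });
  }
  return failures;
}

// Replay every case through `extract` and summarise the pass rate.
function runEvalSuite(cases, extract) {
  const results = cases.map((c) => ({
    id: c.id,
    failures: checkEntities(c.expected, extract(c.transcript)),
  }));
  const passed = results.filter((r) => r.failures.length === 0).length;
  const passRate = cases.length === 0 ? 0 : passed / cases.length;
  return { passRate, results };
}
```

A nightly job could fail the build whenever `passRate` drops below a chosen threshold, so a prompt regression surfaces before it reaches callers.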
Structured tools beat free-form text every time. Our 90+ function tools all enforce JSON schemas validated server-side; if the model hallucinates an integer where a string is required, we retry with a corrective system message before falling back to a deterministic path. For long-running flows, we treat agent handoffs as a state machine (booking → confirmation → SMS) so context survives turn boundaries.
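The validate, retry, then fall back pattern described above can be sketched as follows. This is an illustration under assumptions rather than the production implementation: the schema here is a simplified `{ field: 'string' | 'number' | 'boolean' }` map instead of full JSON Schema, and `callModel` and `fallback` are hypothetical callbacks supplied by the caller.

```javascript
// Type-check tool arguments against the simplified schema.
function validateArgs(schema, args) {
  if (args === null || typeof args !== 'object') {
    return ['arguments must be an object'];
  }
  const errors = [];
  for (const [field, type] of Object.entries(schema)) {
    if (!(field in args)) {
      errors.push(`${field}: missing`);
    } else if (typeof args[field] !== type) {
      errors.push(`${field}: expected ${type}, got ${typeof args[field]}`);
    }
  }
  return errors;
}

// Ask the model for arguments; on a schema violation, retry with a
// corrective message, then use the deterministic fallback.
async function callToolWithRetry({ schema, callModel, fallback, maxRetries = 1 }) {
  let correction = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const args = await callModel(correction);
    const errors = validateArgs(schema, args);
    if (errors.length === 0) {
      return { args, source: 'model', attempts: attempt + 1 };
    }
    correction = `Your last tool call was invalid: ${errors.join('; ')}. ` +
      'Return arguments that match the schema.';
  }
  return { args: fallback(), source: 'fallback', attempts: maxRetries + 1 };
}
```

Returning `source` alongside the arguments lets downstream logging distinguish model-produced calls from fallback ones.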
The Realtime API vs. async decision usually comes down to "is the user holding the phone right now?" If yes, Realtime; if no (callback queue, after-hours voicemail), async wins on cost-per-conversation, which we track per agent in 115+ database tables spanning all 6 verticals.
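That rule of thumb is small enough to write down directly. In this sketch the interaction kinds and the `connected` flag are hypothetical labels for the cases named above, not fields from any real API.

```javascript
// Kinds of interaction where a person may be on the line.
const LIVE_KINDS = new Set(['inbound_call', 'outbound_call']);

// "Is the user holding the phone right now?"
function choosePipeline(interaction) {
  const userIsOnTheLine =
    LIVE_KINDS.has(interaction.kind) && interaction.connected === true;
  return userIsOnTheLine ? 'realtime' : 'async';
}
```

Callback queues, voicemail, and SMS all fall through to the async path, where cost per conversation matters more than latency.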
Q: Why does ONNX Runtime + WebGPU for browser voice agents matter for revenue, not just engineering? A: 57+ languages are supported out of the box, and the platform is HIPAA and SOC 2 aligned, which removes most of the procurement friction in regulated verticals. You're not starting from scratch; you're configuring an agent template that's already been hardened across thousands of conversations.
Q: What are the most common mistakes teams make on day one? A: Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Days two through five are shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.
Q: How does CallSphere's stack handle this differently than a generic chatbot? A: The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest (observability, retries, multi-region routing) without your team owning the GPU layer.
Want to see how this maps to your stack? Book a live walkthrough at calendly.com/sagar-callsphere/new-meeting, or try the vertical-specific demo at urackit.callsphere.tech. 14-day trial, no credit card, pilot live in 3–5 business days.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.