By Sagar Shankaran, Founder of CallSphere
AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls.
Key takeaways
TL;DR — AssemblyAI's Universal-3 Pro Streaming is purpose-built for voice agents in 2026: 307ms p50 latency, immutable transcripts (no flicker), intelligent endpointing, and unlimited concurrency. The simplest possible voice agent is just three WebSockets — no framework needed.
What you'll build: a 100-line Node.js server that streams browser mic audio to AssemblyAI, sends finalized transcripts to GPT-4o, and pipes the response into ElevenLabs streaming TTS, all over plain WebSockets.
```mermaid
flowchart LR
  BR[Browser mic] -- WS PCM16 16k --> SV[Node server]
  SV -- WS --> AA[AssemblyAI Universal-3]
  AA -- final transcript --> SV
  SV --> OA[OpenAI GPT-4o stream]
  OA -- text deltas --> SV --> EL[ElevenLabs WS TTS]
  EL -- audio --> SV --> BR
```
```bash
npm i ws assemblyai openai @elevenlabs/elevenlabs-js
```
```ts
import { AssemblyAI } from "assemblyai";

const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY! });

// Streaming transcriber: 16 kHz PCM in, formatted turns out.
const stt = aai.streaming.transcriber({
  sampleRate: 16000,
  formatTurns: true,
  endOfTurnConfidenceThreshold: 0.7,
  minEndOfTurnSilenceWhenConfident: 200, // ms
});

stt.on("turn", async (turn) => {
  // Ignore partial results; act only once the turn has ended.
  if (!turn.end_of_turn) return;
  await onUserTurn(turn.transcript);
});

await stt.connect();
```
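The turn handler above forwards every event whose `end_of_turn` flag is set. A minimal sketch of a stricter filter follows. It rests on an assumption that is not confirmed by this article: that with `formatTurns` enabled a finished turn can arrive twice (a raw copy, then a formatted copy flagged `turn_is_formatted`). The `TurnLike` interface and `shouldHandleTurn` helper are hypothetical names used only for illustration.

```typescript
// Hypothetical shape: only the fields this sketch reads from a turn event.
interface TurnLike {
  transcript: string;
  end_of_turn: boolean;
  turn_is_formatted?: boolean;
}

// Decide whether a turn event should be forwarded to the LLM.
// Assumption: with formatTurns enabled, a finished turn may be delivered
// as a raw copy and then as a formatted copy, so only the latter is accepted.
function shouldHandleTurn(turn: TurnLike, formatTurns: boolean): boolean {
  if (!turn.end_of_turn) return false; // caller is still speaking
  if (turn.transcript.trim() === "") return false; // nothing to answer
  if (formatTurns && !turn.turn_is_formatted) return false; // wait for formatted copy
  return true;
}
```

If the assumption holds for your account, the handler body could start with `if (!shouldHandleTurn(turn, true)) return;` so the LLM is called once per turn.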
```ts
// client.ts
const ctx = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

// pcm-worklet.js must register a "pcm" processor that posts PCM16 chunks.
await ctx.audioWorklet.addModule("/pcm-worklet.js");
const node = new AudioWorkletNode(ctx, "pcm");
ctx.createMediaStreamSource(stream).connect(node);

const ws = new WebSocket("ws://localhost:8080");
// Forward audio only while the socket is open.
node.port.onmessage = (e) => {
  if (ws.readyState === WebSocket.OPEN) ws.send(e.data);
};
```
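The client loads `/pcm-worklet.js`, which the article does not show. Its core job is turning Web Audio's float samples into the 16-bit PCM the server forwards. A minimal sketch of that conversion follows; `floatTo16BitPCM` is a hypothetical helper name, and the little-endian output is an assumption about what the receiving side expects.

```typescript
// Convert Web Audio float samples (range -1..1) into 16-bit PCM.
// Assumption: the consumer expects little-endian signed 16-bit samples.
function floatTo16BitPCM(input: Float32Array): ArrayBuffer {
  const buf = new ArrayBuffer(input.length * 2); // 2 bytes per sample
  const view = new DataView(buf);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, input[i] ?? 0));
    // Scale negatives by 32768 and positives by 32767 to fill the int16 range.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buf;
}
```

Inside an `AudioWorkletProcessor` registered under the name `"pcm"`, the `process()` callback could pass its first input channel through this function and post the result with `this.port.postMessage`.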
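```ts
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (client) => {
  // Browser mic audio arrives as binary frames; forward it to AssemblyAI.
  client.on("message", (chunk) => stt.sendAudio(new Uint8Array(chunk as Buffer)));
  // Single-client shortcut: the latest connection receives all TTS audio.
  (globalThis as any).replyTo = (audio: Buffer) => client.send(audio);
});
```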
```ts
import OpenAI from "openai";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const oa = new OpenAI();
const eleven = new ElevenLabsClient();
const history: any[] = [{ role: "system", content: "You are a concierge." }];

async function onUserTurn(text: string) {
  history.push({ role: "user", content: text });
  const stream = await oa.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    stream: true,
  });
  let buffer = "";
  let reply = "";
  for await (const c of stream) {
    const delta = c.choices[0]?.delta?.content ?? "";
    buffer += delta;
    reply += delta;
    if (/[.!?]\s/.test(buffer)) {
      // sentence boundary → speak
      await speak(buffer);
      buffer = "";
    }
  }
  if (buffer) await speak(buffer);
  // Keep the assistant reply so the next turn has full context.
  history.push({ role: "assistant", content: reply });
}

async function speak(text: string) {
  // "rachel" is a placeholder; the API expects a voice ID from your account.
  const audio = await eleven.textToSpeech.stream("rachel", {
    text,
    modelId: "eleven_turbo_v2_5",
    outputFormat: "mp3_44100_128",
  });
  for await (const chunk of audio) (globalThis as any).replyTo(Buffer.from(chunk));
}
```
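The loop above speaks the whole buffer as soon as any sentence boundary appears, so a trailing partial sentence can be sent to TTS along with the finished one. A minimal sketch of a variant follows: it splits at the last boundary and carries the unfinished tail forward. `splitSentences` is a hypothetical helper, not part of any SDK, and it reuses the same `/[.!?]\s/` boundary rule as the article.

```typescript
// Split accumulated LLM text into complete sentences plus an unfinished tail.
function splitSentences(buffer: string): { ready: string; rest: string } {
  // Same boundary rule as the article: terminator followed by whitespace.
  const boundary = /[.!?]\s/g;
  let cut = -1;
  let match: RegExpExecArray | null;
  // Walk all boundaries and remember the position just after the last terminator.
  while ((match = boundary.exec(buffer)) !== null) cut = match.index + 1;
  if (cut === -1) return { ready: "", rest: buffer }; // no full sentence yet
  return {
    ready: buffer.slice(0, cut).trim(),
    rest: buffer.slice(cut).trimStart(),
  };
}
```

In the streaming loop, one would call `splitSentences(buffer)` after each delta, speak `ready` when it is non-empty, and assign `rest` back to `buffer`.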
Universal-3 Pro exposes endOfTurnConfidenceThreshold (0-1) and minEndOfTurnSilenceWhenConfident (ms). For chatty callers drop confidence to 0.55 and silence to 150ms; for elderly callers raise to 0.85 / 400ms.
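The tuning advice above can be captured as presets. This is a minimal sketch: the `chatty` and `elderly` values come from the paragraph above, the `default` values from the transcriber setup earlier in the article, and the `CallerProfile` type, `PRESETS` table, and `endpointingFor` helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical caller categories for this sketch.
type CallerProfile = "chatty" | "default" | "elderly";

interface EndpointingConfig {
  endOfTurnConfidenceThreshold: number; // 0-1
  minEndOfTurnSilenceWhenConfident: number; // milliseconds
}

// Values taken from the article; tune them against your own call recordings.
const PRESETS: Record<CallerProfile, EndpointingConfig> = {
  chatty: { endOfTurnConfidenceThreshold: 0.55, minEndOfTurnSilenceWhenConfident: 150 },
  default: { endOfTurnConfidenceThreshold: 0.7, minEndOfTurnSilenceWhenConfident: 200 },
  elderly: { endOfTurnConfidenceThreshold: 0.85, minEndOfTurnSilenceWhenConfident: 400 },
};

// Return a copy so callers can adjust values without mutating the presets.
function endpointingFor(profile: CallerProfile): EndpointingConfig {
  return { ...PRESETS[profile] };
}
```

The result can be spread into the transcriber options shown earlier, for example `{ sampleRate: 16000, formatTurns: true, ...endpointingFor("elderly") }`.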
If you need tools, swap GPT-4o for AssemblyAI's build-voice-agent-function-calling reference implementation, which handles tool calls inline with the same STT.
format_turns (formatTurns in the Node SDK): Off by default; turn it on for capitalized, punctuated turns.
CallSphere uses AssemblyAI Universal-3 across 37 agents · 90+ tools · 115+ DB tables · 6 verticals, hitting ~310ms STT p50 in production. $149/$499/$1,499 · 14-day trial · 22% affiliate.
Pricing? $0.15/hr streaming as of mid-2026 — cheaper than Deepgram Nova-3 at scale.
Languages? English-first with strong Spanish, French, German, Portuguese; for 60+ languages use Soniox v4.
Diarization? Yes via speakers_expected and speaker_labels post-call; live diarization is in beta.
LiveKit/Pipecat plugins? Both ship first-class AssemblyAI plugins — same model, less code.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.