Build a Voice Agent with AssemblyAI Universal Streaming (2026)
AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls.
TL;DR — AssemblyAI's Universal-3 Pro Streaming is purpose-built for voice agents in 2026: 307ms p50 latency, immutable transcripts (no flicker), intelligent endpointing, and unlimited concurrency. The simplest possible voice agent is just three WebSockets — no framework needed.
What you'll build
A 100-line Node.js server that streams browser mic audio to AssemblyAI, sends finalized transcripts to GPT-4o, and pipes the response into ElevenLabs streaming TTS — all over plain WebSockets.
Architecture
```mermaid
flowchart LR
  BR[Browser mic] -- WS PCM16 16k --> SV[Node server]
  SV -- WS --> AA[AssemblyAI Universal-3]
  AA -- final transcript --> SV
  SV --> OA[OpenAI GPT-4o stream]
  OA -- text deltas --> SV --> EL[ElevenLabs WS TTS]
  EL -- audio --> SV --> BR
```
Step 1 — Install
```bash
npm i ws assemblyai openai @elevenlabs/elevenlabs-js
```
Step 2 — STT WebSocket
```ts
import { AssemblyAI } from "assemblyai";

const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY! });

const stt = aai.streaming.transcriber({
  sampleRate: 16000,
  formatTurns: true,
  endOfTurnConfidenceThreshold: 0.7,
  minEndOfTurnSilenceWhenConfident: 200,
});

stt.on("turn", async (turn) => {
  if (!turn.end_of_turn) return;
  await onUserTurn(turn.transcript);
});

await stt.connect();
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Browser mic capture
```ts
// client.ts
const ctx = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
await ctx.audioWorklet.addModule("/pcm-worklet.js");
const node = new AudioWorkletNode(ctx, "pcm");
ctx.createMediaStreamSource(stream).connect(node);

const ws = new WebSocket("ws://localhost:8080");
node.port.onmessage = (e) => ws.readyState === 1 && ws.send(e.data);
```
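The client loads a /pcm-worklet.js file that this guide doesn't show. Here is a minimal sketch of what it might contain, assuming the processor name "pcm" used by the client code; the Float32-to-PCM16 conversion is the part Universal-3 actually cares about (Int16Array buffers are little-endian on every mainstream platform):

```javascript
// pcm-worklet.js (hypothetical contents; processor name "pcm" matches client.ts)

// Clamp a Float32 sample to [-1, 1] and scale it to signed 16-bit.
function floatTo16(f32) {
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    const s = Math.max(-1, Math.min(1, f32[i]));
    i16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return i16;
}

// registerProcessor only exists inside an AudioWorkletGlobalScope; the guard
// lets the conversion helper above load anywhere.
if (typeof AudioWorkletProcessor !== "undefined") {
  class PCMProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0] && inputs[0][0]; // first input, first channel
      if (channel) this.port.postMessage(floatTo16(channel).buffer);
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm", PCMProcessor);
}
```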
Step 4 — Server bridge
```ts
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (client) => {
  client.on("message", (chunk) => stt.sendAudio(new Uint8Array(chunk as Buffer)));
  // Single-caller demo shortcut: stash the reply channel globally.
  globalThis.replyTo = (audio: Buffer) => client.send(audio);
});
```
Step 5 — LLM + TTS
```ts
import OpenAI from "openai";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const oa = new OpenAI();
const eleven = new ElevenLabsClient();
const history: any[] = [{ role: "system", content: "You are a concierge." }];

async function onUserTurn(text: string) {
  history.push({ role: "user", content: text });
  const stream = await oa.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    stream: true,
  });
  let buffer = "";
  let full = "";
  for await (const c of stream) {
    const delta = c.choices[0]?.delta?.content ?? "";
    buffer += delta;
    full += delta;
    if (/[.!?]\s/.test(buffer)) {
      // sentence boundary → speak early instead of waiting for the full reply
      await speak(buffer);
      buffer = "";
    }
  }
  if (buffer) await speak(buffer);
  // Keep the assistant's reply in history so the next turn has context.
  history.push({ role: "assistant", content: full });
}

async function speak(text: string) {
  const audio = await eleven.textToSpeech.stream("rachel", {
    text,
    modelId: "eleven_turbo_v2_5",
    outputFormat: "mp3_44100_128",
  });
  for await (const chunk of audio) globalThis.replyTo(Buffer.from(chunk));
}
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Tune endpointing
Universal-3 Pro exposes endOfTurnConfidenceThreshold (0-1) and minEndOfTurnSilenceWhenConfident (ms). For chatty callers drop confidence to 0.55 and silence to 150ms; for elderly callers raise to 0.85 / 400ms.
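As a sketch, those tunings can live next to the transcriber config as named presets. The preset names and values here are starting points, not official recommendations:

```typescript
// Hypothetical endpointing presets; parameter names match the transcriber
// options from Step 2, values are rough starting points to tune per audience.
type EndpointingPreset = {
  endOfTurnConfidenceThreshold: number;       // 0-1
  minEndOfTurnSilenceWhenConfident: number;   // ms
};

const presets: Record<string, EndpointingPreset> = {
  default: { endOfTurnConfidenceThreshold: 0.7,  minEndOfTurnSilenceWhenConfident: 200 },
  chatty:  { endOfTurnConfidenceThreshold: 0.55, minEndOfTurnSilenceWhenConfident: 150 },
  patient: { endOfTurnConfidenceThreshold: 0.85, minEndOfTurnSilenceWhenConfident: 400 },
};

// Spread the chosen preset into the transcriber options:
// aai.streaming.transcriber({ sampleRate: 16000, formatTurns: true, ...presets.chatty });
```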
Step 7 — Function calling fallback
If you need tools, swap the plain GPT-4o call for AssemblyAI's build-voice-agent-function-calling reference implementation, which handles tool calls inline with the same STT stream.
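If you'd rather stay on raw OpenAI tool calling instead of the reference implementation, a minimal sketch looks like this. The book_appointment tool and its handler are hypothetical; the request/response shapes follow OpenAI's chat-completions tool-calling API:

```typescript
// Hypothetical tool definition in OpenAI's tool-calling schema.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "book_appointment",
      description: "Book an appointment slot",
      parameters: {
        type: "object",
        properties: { time: { type: "string" } },
        required: ["time"],
      },
    },
  },
];

// Local dispatch table: keeps tool execution next to the STT loop.
const handlers: Record<string, (args: { time: string }) => string> = {
  book_appointment: ({ time }) => `Booked for ${time}.`,
};

// Run one turn: if the model asks for a tool, execute it locally, feed the
// result back, and return the model's final phrasing. `oa` is the OpenAI
// client from Step 5 (typed loosely here to keep the sketch self-contained).
async function runToolTurn(oa: any, history: any[]): Promise<string> {
  const res = await oa.chat.completions.create({ model: "gpt-4o", messages: history, tools });
  const msg = res.choices[0].message;
  const call = msg.tool_calls?.[0];
  if (!call) return msg.content ?? "";
  const result = handlers[call.function.name](JSON.parse(call.function.arguments));
  history.push(msg, { role: "tool", tool_call_id: call.id, content: result });
  const followUp = await oa.chat.completions.create({ model: "gpt-4o", messages: history });
  return followUp.choices[0].message.content ?? "";
}
```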
Pitfalls
- PCM format: Universal-3 Streaming expects 16kHz mono PCM16 little-endian — anything else produces garbage transcripts.
- format_turns: off by default; turn it on for capitalized, punctuated turns.
- Concurrency: truly unlimited, but billing aggregates per minute — rate-limit on your side to avoid surprise invoices.
- Region: default is us-east; pin EU for GDPR data residency.
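For that concurrency pitfall, a minimal in-process cap might look like this. SessionGate is a hypothetical helper, not part of any SDK, and a single shared counter only works for a single-process server:

```typescript
// Hypothetical guard: AssemblyAI won't stop you from opening 500 sessions,
// so cap concurrent STT sockets yourself before the invoice does.
class SessionGate {
  private active = 0;
  constructor(private readonly max: number) {}

  tryAcquire(): boolean {
    if (this.active >= this.max) return false;
    this.active += 1;
    return true;
  }

  release(): void {
    this.active = Math.max(0, this.active - 1);
  }
}

// Usage sketch: check the gate before opening the STT socket, release on close.
// if (!gate.tryAcquire()) return client.close(1013, "at capacity");
```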
How CallSphere does this
CallSphere uses AssemblyAI Universal-3 across 37 agents · 90+ tools · 115+ DB tables · 6 verticals, hitting ~310ms STT p50 in production. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Pricing? $0.15/hr streaming as of mid-2026 — cheaper than Deepgram Nova-3 at scale.
Languages? English-first with strong Spanish, French, German, Portuguese; for 60+ languages use Soniox v4.
Diarization? Yes via speakers_expected and speaker_labels post-call; live diarization is in beta.
LiveKit/Pipecat plugins? Both ship first-class AssemblyAI plugins — same model, less code.
Sources
- AssemblyAI Blog - Raw WebSocket Voice Agent - https://www.assemblyai.com/blog/raw-websocket-voice-agent-with-assemblyai-universal-3-pro-streaming
- AssemblyAI Blog - Introducing Universal-Streaming - https://www.assemblyai.com/blog/introducing-universal-streaming
- AssemblyAI Blog - Voice Agent with Function Calling - https://www.assemblyai.com/blog/build-voice-agent-function-calling
- AssemblyAI Blog - Phone-Based Voice Agent 2026 Guide - https://www.assemblyai.com/blog/how-to-create-phone-based-voice-agent
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.