
Build a Voice Agent with AssemblyAI Universal Streaming (2026)

AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls.

TL;DR — AssemblyAI's Universal-3 Pro Streaming is purpose-built for voice agents in 2026: 307ms p50 latency, immutable transcripts (no flicker), intelligent endpointing, and unlimited concurrency. The simplest possible voice agent is just three WebSockets — no framework needed.

What you'll build

A 100-line Node.js server that streams browser mic audio to AssemblyAI, sends finalized transcripts to GPT-4o, and pipes the response into ElevenLabs streaming TTS — all over plain WebSockets.

Architecture

```mermaid
flowchart LR
  BR[Browser mic] -- WS PCM16 16k --> SV[Node server]
  SV -- WS --> AA[AssemblyAI Universal-3]
  AA -- final transcript --> SV
  SV --> OA[OpenAI GPT-4o stream]
  OA -- text deltas --> SV --> EL[ElevenLabs WS TTS]
  EL -- audio --> SV --> BR
```

Step 1 — Install

```bash
npm i ws assemblyai openai @elevenlabs/elevenlabs-js
```

Step 2 — STT WebSocket

```ts
import { AssemblyAI } from "assemblyai";

const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY! });

const stt = aai.streaming.transcriber({
  sampleRate: 16000,
  formatTurns: true,
  endOfTurnConfidenceThreshold: 0.7,
  minEndOfTurnSilenceWhenConfident: 200,
});

stt.on("turn", async (turn) => {
  if (!turn.end_of_turn) return;
  await onUserTurn(turn.transcript);
});

await stt.connect();
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.


Step 3 — Browser mic capture

```ts
// client.ts
const ctx = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
await ctx.audioWorklet.addModule("/pcm-worklet.js");
const node = new AudioWorkletNode(ctx, "pcm");
ctx.createMediaStreamSource(stream).connect(node);

const ws = new WebSocket("ws://localhost:8080");
node.port.onmessage = (e) => ws.readyState === 1 && ws.send(e.data);
```
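The /pcm-worklet.js module wired up above isn't shown here; its core job is converting Web Audio's Float32 samples into the 16kHz mono PCM16 little-endian bytes Universal Streaming expects. A minimal sketch of that conversion (the helper name floatTo16BitPCM is chosen for illustration, not part of any SDK):

```typescript
// Convert Web Audio Float32 samples (range -1..1) to 16-bit little-endian
// PCM. Clamp first so out-of-range samples can't wrap around; scale
// negatives by 0x8000 and positives by 0x7fff to use the full int16 range.
function floatTo16BitPCM(input: Float32Array): ArrayBuffer {
  const out = new DataView(new ArrayBuffer(input.length * 2));
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  return out.buffer;
}
```

Inside the worklet's process() callback you would run each 128-sample block through this and post the resulting buffer to the page via this.port.postMessage.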

Step 4 — Server bridge

```ts
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (client) => {
  client.on("message", (chunk) => stt.sendAudio(new Uint8Array(chunk as Buffer)));
  // Single-caller shortcut: the TTS side replies through this global.
  globalThis.replyTo = (audio: Buffer) => client.send(audio);
});
```
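The globalThis.replyTo shortcut only works for one caller at a time; a second connection silently steals the first caller's audio. For concurrent calls, a small per-connection registry keeps each caller's reply channel separate (a sketch with illustrative names, not part of the ws API):

```typescript
// Per-connection reply registry: the STT/TTS pipeline addresses audio by
// call ID instead of clobbering a single global callback.
type AudioSink = (audio: Buffer) => void;

const sinks = new Map<string, AudioSink>();

function register(id: string, sink: AudioSink): void {
  sinks.set(id, sink);
}

function unregister(id: string): void {
  sinks.delete(id);
}

// Safe no-op if the caller already hung up.
function replyTo(id: string, audio: Buffer): void {
  sinks.get(id)?.(audio);
}
```

In the connection handler you would register((someCallId), (audio) => client.send(audio)) and unregister on close.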

Step 5 — LLM + TTS

```ts
import OpenAI from "openai";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const oa = new OpenAI();
const eleven = new ElevenLabsClient();
const history: any[] = [{ role: "system", content: "You are a concierge." }];

async function onUserTurn(text: string) {
  history.push({ role: "user", content: text });
  const stream = await oa.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    stream: true,
  });
  let buffer = "";
  let full = "";
  for await (const c of stream) {
    const delta = c.choices[0]?.delta?.content ?? "";
    buffer += delta;
    full += delta;
    if (/[.!?]\s/.test(buffer)) {
      // sentence boundary → speak
      await speak(buffer);
      buffer = "";
    }
  }
  if (buffer) await speak(buffer);
  // Keep the assistant's reply in history so the next turn has context.
  history.push({ role: "assistant", content: full });
}

async function speak(text: string) {
  const audio = await eleven.textToSpeech.stream("rachel", {
    text,
    modelId: "eleven_turbo_v2_5",
    outputFormat: "mp3_44100_128",
  });
  for await (const chunk of audio) globalThis.replyTo(Buffer.from(chunk));
}
```
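One subtlety in the flush-on-punctuation loop above: the check runs after each delta is appended, so the flushed buffer can already contain the first words of the next sentence. A splitter that returns only complete sentences plus the trailing remainder avoids that (a sketch; splitSentences is an illustrative helper, not an SDK function):

```typescript
// Split buffered LLM text into complete sentences (ending in . ! or ?
// followed by whitespace) and the incomplete remainder still streaming in.
function splitSentences(buffer: string): { complete: string[]; rest: string } {
  const complete: string[] = [];
  let rest = buffer;
  let m: RegExpMatchArray | null;
  while ((m = rest.match(/[.!?]\s/)) && m.index !== undefined) {
    complete.push(rest.slice(0, m.index + 1).trim()); // keep the punctuation
    rest = rest.slice(m.index + 2); // drop the separating whitespace
  }
  return { complete, rest };
}
```

In the streaming loop you would speak each entry of complete and carry rest forward as the new buffer.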

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Tune endpointing

Universal-3 Pro exposes endOfTurnConfidenceThreshold (0-1) and minEndOfTurnSilenceWhenConfident (ms). For chatty callers drop confidence to 0.55 and silence to 150ms; for elderly callers raise to 0.85 / 400ms.
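Those profiles can be captured as named presets and passed straight into the transcriber config from Step 2 (a sketch using the article's suggested values; the preset names themselves are illustrative):

```typescript
// Endpointing presets per caller profile. Values are the article's suggested
// tunings, not AssemblyAI defaults.
type EndpointingConfig = {
  endOfTurnConfidenceThreshold: number;     // 0-1: higher = wait for more certainty
  minEndOfTurnSilenceWhenConfident: number; // ms of silence once confident
};

const ENDPOINTING_PRESETS: Record<string, EndpointingConfig> = {
  chatty:     { endOfTurnConfidenceThreshold: 0.55, minEndOfTurnSilenceWhenConfident: 150 },
  default:    { endOfTurnConfidenceThreshold: 0.7,  minEndOfTurnSilenceWhenConfident: 200 },
  deliberate: { endOfTurnConfidenceThreshold: 0.85, minEndOfTurnSilenceWhenConfident: 400 },
};

function endpointingFor(profile: string): EndpointingConfig {
  return ENDPOINTING_PRESETS[profile] ?? ENDPOINTING_PRESETS.default;
}
```

Spread the result into the transcriber options, e.g. aai.streaming.transcriber({ sampleRate: 16000, ...endpointingFor("chatty") }).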

Step 7 — Function calling fallback

If you need tools, start from AssemblyAI's build-voice-agent-function-calling reference implementation, which handles tool calls inline with the same STT.

Pitfalls

  • PCM format: Universal-3 Streaming wants 16kHz mono PCM16 little-endian — anything else returns garbage transcripts.
  • format_turns: Off by default; turn it on for capitalized, punctuated turns.
  • Concurrency: Truly unlimited but billing aggregates per minute — rate-limit on YOUR side to avoid surprise invoices.
  • Region: Default is us-east; pin EU for GDPR data residency.
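A concrete version of that client-side rate limit is a simple session gate: refuse to open a new streaming session past a fixed cap, rather than relying on the provider to push back (a sketch; the class and names are illustrative):

```typescript
// Cap concurrent STT sessions on our side so billing can't run away even
// though AssemblyAI accepts unlimited connections.
class SessionGate {
  private active = 0;

  constructor(private maxSessions: number) {}

  // Returns true if a slot was taken; false means reject or queue the call.
  tryAcquire(): boolean {
    if (this.active >= this.maxSessions) return false;
    this.active++;
    return true;
  }

  // Call when a session's WebSocket closes.
  release(): void {
    this.active = Math.max(0, this.active - 1);
  }
}
```

Acquire before stt.connect() and release in the connection's close handler.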

How CallSphere does this

CallSphere uses AssemblyAI Universal-3 across 37 agents · 90+ tools · 115+ DB tables · 6 verticals, hitting ~310ms STT p50 in production. $149/$499/$1,499 · 14-day trial · 22% affiliate.

FAQ

Pricing? $0.15/hr streaming as of mid-2026 — cheaper than Deepgram Nova-3 at scale.

Languages? English-first with strong Spanish, French, German, Portuguese; for 60+ languages use Soniox v4.

Diarization? Yes via speakers_expected and speaker_labels post-call; live diarization is in beta.

LiveKit/Pipecat plugins? Both ship first-class AssemblyAI plugins — same model, less code.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.
