Build a Voice Agent with AssemblyAI Universal Streaming (2026)
AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls.
TL;DR — AssemblyAI's Universal-3 Pro Streaming is purpose-built for voice agents in 2026: 307ms p50 latency, immutable transcripts (no flicker), intelligent endpointing, and unlimited concurrency. The simplest possible voice agent is just three WebSockets — no framework needed.
What you'll build
A 100-line Node.js server that streams browser mic audio to AssemblyAI, sends finalized transcripts to GPT-4o, and pipes the response into ElevenLabs streaming TTS — all over plain WebSockets.
Architecture
```mermaid
flowchart LR
  BR[Browser mic] -- WS PCM16 16k --> SV[Node server]
  SV -- WS --> AA[AssemblyAI Universal-3]
  AA -- final transcript --> SV
  SV --> OA[OpenAI GPT-4o stream]
  OA -- text deltas --> SV --> EL[ElevenLabs WS TTS]
  EL -- audio --> SV --> BR
```
Step 1 — Install
```bash
npm i ws assemblyai openai @elevenlabs/elevenlabs-js
```
Step 2 — STT WebSocket
```ts
import { AssemblyAI } from "assemblyai";

const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY! });

const stt = aai.streaming.transcriber({
  sampleRate: 16000,
  formatTurns: true,
  endOfTurnConfidenceThreshold: 0.7,
  minEndOfTurnSilenceWhenConfident: 200,
});

stt.on("turn", async (turn) => {
  if (!turn.end_of_turn) return;
  await onUserTurn(turn.transcript);
});

await stt.connect();
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Browser mic capture
```ts
// client.ts
const ctx = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
await ctx.audioWorklet.addModule("/pcm-worklet.js");
const node = new AudioWorkletNode(ctx, "pcm");
ctx.createMediaStreamSource(stream).connect(node);

const ws = new WebSocket("ws://localhost:8080");
node.port.onmessage = (e) => ws.readyState === 1 && ws.send(e.data);
```
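The client loads a /pcm-worklet.js file that this guide doesn't show. Here is a minimal sketch of what it might contain, assuming the processor name "pcm" used by the client code; the Float32-to-PCM16 conversion is the part Universal-3 actually cares about (Int16Array buffers are little-endian on every mainstream platform):

```javascript
// pcm-worklet.js (hypothetical contents; processor name "pcm" matches client.ts)

// Clamp a Float32 sample to [-1, 1] and scale it to signed 16-bit.
function floatTo16(f32) {
  const i16 = new Int16Array(f32.length);
  for (let i = 0; i < f32.length; i++) {
    const s = Math.max(-1, Math.min(1, f32[i]));
    i16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return i16;
}

// registerProcessor only exists inside an AudioWorkletGlobalScope; the guard
// lets the conversion helper above load anywhere.
if (typeof AudioWorkletProcessor !== "undefined") {
  class PCMProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0] && inputs[0][0]; // first input, first channel
      if (channel) this.port.postMessage(floatTo16(channel).buffer);
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm", PCMProcessor);
}
```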
Step 4 — Server bridge
```ts
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (client) => {
  client.on("message", (chunk) => stt.sendAudio(new Uint8Array(chunk as Buffer)));
  // Single-caller demo shortcut: stash the reply channel globally.
  globalThis.replyTo = (audio: Buffer) => client.send(audio);
});
```
Step 5 — LLM + TTS
```ts
import OpenAI from "openai";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const oa = new OpenAI();
const eleven = new ElevenLabsClient();
const history: any[] = [{ role: "system", content: "You are a concierge." }];

async function onUserTurn(text: string) {
  history.push({ role: "user", content: text });
  const stream = await oa.chat.completions.create({
    model: "gpt-4o",
    messages: history,
    stream: true,
  });
  let buffer = "";
  let full = "";
  for await (const c of stream) {
    const delta = c.choices[0]?.delta?.content ?? "";
    buffer += delta;
    full += delta;
    if (/[.!?]\s/.test(buffer)) {
      // sentence boundary → speak early instead of waiting for the full reply
      await speak(buffer);
      buffer = "";
    }
  }
  if (buffer) await speak(buffer);
  // Keep the assistant's reply in history so the next turn has context.
  history.push({ role: "assistant", content: full });
}

async function speak(text: string) {
  const audio = await eleven.textToSpeech.stream("rachel", {
    text,
    modelId: "eleven_turbo_v2_5",
    outputFormat: "mp3_44100_128",
  });
  for await (const chunk of audio) globalThis.replyTo(Buffer.from(chunk));
}
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Tune endpointing
Universal-3 Pro exposes endOfTurnConfidenceThreshold (0-1) and minEndOfTurnSilenceWhenConfident (ms). For chatty callers drop confidence to 0.55 and silence to 150ms; for elderly callers raise to 0.85 / 400ms.
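As a sketch, those tunings can live next to the transcriber config as named presets. The preset names and values here are starting points, not official recommendations:

```typescript
// Hypothetical endpointing presets; parameter names match the transcriber
// options from Step 2, values are rough starting points to tune per audience.
type EndpointingPreset = {
  endOfTurnConfidenceThreshold: number;       // 0-1
  minEndOfTurnSilenceWhenConfident: number;   // ms
};

const presets: Record<string, EndpointingPreset> = {
  default: { endOfTurnConfidenceThreshold: 0.7,  minEndOfTurnSilenceWhenConfident: 200 },
  chatty:  { endOfTurnConfidenceThreshold: 0.55, minEndOfTurnSilenceWhenConfident: 150 },
  patient: { endOfTurnConfidenceThreshold: 0.85, minEndOfTurnSilenceWhenConfident: 400 },
};

// Spread the chosen preset into the transcriber options:
// aai.streaming.transcriber({ sampleRate: 16000, formatTurns: true, ...presets.chatty });
```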
Step 7 — Function calling fallback
If you need tools, swap the plain GPT-4o call for AssemblyAI's build-voice-agent-function-calling reference implementation, which handles tool calls inline with the same STT stream.
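If you'd rather stay on raw OpenAI tool calling instead of the reference implementation, a minimal sketch looks like this. The book_appointment tool and its handler are hypothetical; the request/response shapes follow OpenAI's chat-completions tool-calling API:

```typescript
// Hypothetical tool definition in OpenAI's tool-calling schema.
const tools = [
  {
    type: "function" as const,
    function: {
      name: "book_appointment",
      description: "Book an appointment slot",
      parameters: {
        type: "object",
        properties: { time: { type: "string" } },
        required: ["time"],
      },
    },
  },
];

// Local dispatch table: keeps tool execution next to the STT loop.
const handlers: Record<string, (args: { time: string }) => string> = {
  book_appointment: ({ time }) => `Booked for ${time}.`,
};

// Run one turn: if the model asks for a tool, execute it locally, feed the
// result back, and return the model's final phrasing. `oa` is the OpenAI
// client from Step 5 (typed loosely here to keep the sketch self-contained).
async function runToolTurn(oa: any, history: any[]): Promise<string> {
  const res = await oa.chat.completions.create({ model: "gpt-4o", messages: history, tools });
  const msg = res.choices[0].message;
  const call = msg.tool_calls?.[0];
  if (!call) return msg.content ?? "";
  const result = handlers[call.function.name](JSON.parse(call.function.arguments));
  history.push(msg, { role: "tool", tool_call_id: call.id, content: result });
  const followUp = await oa.chat.completions.create({ model: "gpt-4o", messages: history });
  return followUp.choices[0].message.content ?? "";
}
```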
Pitfalls
- PCM format: Universal-3 Streaming expects 16kHz mono PCM16 little-endian — anything else produces garbage transcripts.
- format_turns: off by default; turn it on for capitalized, punctuated turns.
- Concurrency: truly unlimited, but billing aggregates per minute — rate-limit on your side to avoid surprise invoices.
- Region: default is us-east; pin EU for GDPR data residency.
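For that concurrency pitfall, a minimal in-process cap might look like this. SessionGate is a hypothetical helper, not part of any SDK, and a single shared counter only works for a single-process server:

```typescript
// Hypothetical guard: AssemblyAI won't stop you from opening 500 sessions,
// so cap concurrent STT sockets yourself before the invoice does.
class SessionGate {
  private active = 0;
  constructor(private readonly max: number) {}

  tryAcquire(): boolean {
    if (this.active >= this.max) return false;
    this.active += 1;
    return true;
  }

  release(): void {
    this.active = Math.max(0, this.active - 1);
  }
}

// Usage sketch: check the gate before opening the STT socket, release on close.
// if (!gate.tryAcquire()) return client.close(1013, "at capacity");
```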
How CallSphere does this
CallSphere uses AssemblyAI Universal-3 across 37 agents · 90+ tools · 115+ DB tables · 6 verticals, hitting ~310ms STT p50 in production. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Pricing? $0.15/hr streaming as of mid-2026 — cheaper than Deepgram Nova-3 at scale.
Languages? English-first with strong Spanish, French, German, Portuguese; for 60+ languages use Soniox v4.
Diarization? Yes via speakers_expected and speaker_labels post-call; live diarization is in beta.
LiveKit/Pipecat plugins? Both ship first-class AssemblyAI plugins — same model, less code.
Sources
- AssemblyAI Blog - Raw WebSocket Voice Agent - https://www.assemblyai.com/blog/raw-websocket-voice-agent-with-assemblyai-universal-3-pro-streaming
- AssemblyAI Blog - Introducing Universal-Streaming - https://www.assemblyai.com/blog/introducing-universal-streaming
- AssemblyAI Blog - Voice Agent with Function Calling - https://www.assemblyai.com/blog/build-voice-agent-function-calling
- AssemblyAI Blog - Phone-Based Voice Agent 2026 Guide - https://www.assemblyai.com/blog/how-to-create-phone-based-voice-agent
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.