---
title: "Build a Voice Agent with AssemblyAI Universal Streaming (2026)"
description: "AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls."
canonical: https://callsphere.ai/blog/vw9h-build-voice-agent-assemblyai-universal-streaming-2026
category: "AI Voice Agents"
tags: ["AssemblyAI", "Universal Streaming", "Voice Agent", "Node.js", "STT"]
author: "CallSphere Team"
published: 2026-03-31T00:00:00.000Z
updated: 2026-05-08T03:13:54.006Z
---

# Build a Voice Agent with AssemblyAI Universal Streaming (2026)

> AssemblyAI Universal-3 Pro Streaming returns immutable transcripts in ~300ms. Build a raw-WebSocket voice agent — Node.js code, endpointing, pitfalls.

> **TL;DR** — AssemblyAI's Universal-3 Pro Streaming is purpose-built for voice agents in 2026: 307ms p50 latency, immutable transcripts (no flicker), intelligent endpointing, and unlimited concurrency. The simplest possible voice agent is just three WebSockets — no framework needed.

## What you'll build

A 100-line Node.js server that streams browser mic audio to AssemblyAI, sends finalized transcripts to GPT-4o, and pipes the response into ElevenLabs streaming TTS — all over plain WebSockets.

## Architecture

```mermaid
flowchart LR
  BR[Browser mic] -- WS PCM16 16k --> SV[Node server]
  SV -- WS --> AA[AssemblyAI Universal-3]
  AA -- final transcript --> SV
  SV --> OA[OpenAI GPT-4o stream]
  OA -- text deltas --> SV --> EL[ElevenLabs WS TTS]
  EL -- audio --> SV --> BR
```

## Step 1 — Install

```bash
npm i ws assemblyai openai @elevenlabs/elevenlabs-js
```

## Step 2 — STT WebSocket

```ts
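import { AssemblyAI } from "assemblyai";

const aai = new AssemblyAI({ apiKey: process.env.AAI_KEY! });
const stt = aai.streaming.transcriber({
  sampleRate: 16000,                      // must match the browser AudioContext rate
  formatTurns: true,                      // capitalized, punctuated final turns
  endOfTurnConfidenceThreshold: 0.7,      // 0-1
  minEndOfTurnSilenceWhenConfident: 200,  // ms
});
stt.on("turn", (turn) => {
  // Skip partial turns. With formatTurns on, the formatted copy of a finished turn
  // may arrive as its own event, so wait for it to avoid replying twice.
  if (!turn.end_of_turn || !turn.turn_is_formatted) return;
  // onUserTurn is defined in Step 5; catch so a failed reply cannot crash the process.
  onUserTurn(turn.transcript).catch((err) => console.error("reply failed:", err));
});
stt.on("error", (err) => console.error("STT error:", err));
await stt.connect();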
```

## Step 3 — Browser mic capture

```ts
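// client.ts
// Request a 16 kHz context so the audio needs no resampling before it reaches the STT.
const ctx = new AudioContext({ sampleRate: 16000 });
const stream = await navigator.mediaDevices.getUserMedia({ audio: { channelCount: 1 } });
// pcm-worklet.js (not shown) must register a "pcm" processor that posts PCM16 buffers.
await ctx.audioWorklet.addModule("/pcm-worklet.js");
const node = new AudioWorkletNode(ctx, "pcm");
ctx.createMediaStreamSource(stream).connect(node);
const ws = new WebSocket("ws://localhost:8080");
// Forward each PCM chunk only while the socket is open.
node.port.onmessage = (e) => { if (ws.readyState === WebSocket.OPEN) ws.send(e.data); };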
```
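
The client above relies on a `pcm` worklet that is not shown. Its core job is converting the Float32 samples Web Audio produces into 16-bit PCM. The following is a minimal sketch of that conversion, not the original worklet; the function name `floatToPcm16` is hypothetical, and it assumes a little-endian platform (which typed arrays follow on common hardware).

```typescript
// Hypothetical helper: convert Web Audio Float32 samples (nominally -1..1)
// into 16-bit signed PCM, the sample format the STT socket expects.
function floatToPcm16(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // The negative range of a 16-bit integer is one step larger than the positive range.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

Inside an `AudioWorkletProcessor`, `process(inputs)` would pass `inputs[0][0]` (the first channel) to this function and post the resulting `buffer` through `this.port.postMessage`, which is what the `node.port.onmessage` handler receives.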

## Step 4 — Server bridge

```ts
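import { WebSocketServer } from "ws";
const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (client) => {
  // Forward raw PCM16 from the browser straight to AssemblyAI.
  client.on("message", (chunk) => stt.sendAudio(new Uint8Array(chunk as Buffer)));
  // Single-client shortcut: the most recent connection receives the TTS audio.
  (globalThis as any).replyTo = (audio: Buffer) => {
    if (client.readyState === client.OPEN) client.send(audio);
  };
});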
```

## Step 5 — LLM + TTS

```ts
import OpenAI from "openai";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const oa = new OpenAI();               // reads OPENAI_API_KEY
const eleven = new ElevenLabsClient(); // reads ELEVENLABS_API_KEY
const history: any[] = [{ role: "system", content: "You are a concierge." }];

async function onUserTurn(text: string) {
  history.push({ role: "user", content: text });
  const stream = await oa.chat.completions.create({
    model: "gpt-4o", messages: history, stream: true,
  });
  let buffer = "";
  let reply = "";
  for await (const c of stream) {
    const delta = c.choices[0]?.delta?.content ?? "";
    buffer += delta;
    reply += delta;
    if (/[.!?]\s/.test(buffer)) {  // sentence boundary → speak
      await speak(buffer);
      buffer = "";
    }
  }
  if (buffer.trim()) await speak(buffer);
  // Store the assistant reply so the next turn sees the full conversation.
  history.push({ role: "assistant", content: reply });
}

async function speak(text: string) {
  // The first argument is an ElevenLabs voice ID; replace "rachel" with an ID from your account.
  const audio = await eleven.textToSpeech.stream("rachel",
    { text, modelId: "eleven_turbo_v2_5", outputFormat: "mp3_44100_128" });
  for await (const chunk of audio) (globalThis as any).replyTo?.(Buffer.from(chunk));
}
```
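
The loop in Step 5 speaks the whole buffer as soon as it contains a sentence boundary, so the first words of the next sentence can be cut off with it. One possible refinement, sketched here as an assumption rather than as part of the original server, is to split off only the complete sentences and keep the remainder for the next delta. The helper name `takeSentences` is hypothetical.

```typescript
// Hypothetical helper: split a growing LLM output buffer into complete
// sentences plus the unfinished remainder.
function takeSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  // A sentence ends at one or more of . ! ? followed by whitespace.
  const boundary = /[.!?]+\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(buffer)) !== null) {
    const end = match.index + match[0].length;
    const sentence = buffer.slice(start, end).trim();
    if (sentence) sentences.push(sentence);
    start = end;
  }
  // Everything after the last boundary is still being generated.
  return { sentences, rest: buffer.slice(start) };
}
```

In the streaming loop, each delta would be appended to the buffer, every entry in `sentences` passed to `speak`, and `rest` carried forward as the new buffer. Abbreviations such as "e.g. " still count as boundaries with this pattern, so some sentences may be split early.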

## Step 6 — Tune endpointing

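Universal-3 Pro exposes two endpointing knobs. `endOfTurnConfidenceThreshold` (0-1) sets how confident the model must be before it ends a turn, and `minEndOfTurnSilenceWhenConfident` (ms) sets how much silence it waits for once that confidence is reached. For chatty callers who expect quick replies, drop confidence to 0.55 and silence to 150ms; for elderly callers, who tend to pause mid-sentence, raise them to 0.85 / 400ms.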
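
The tuning values above can be kept as named presets so a profile can be chosen per call. This is a small sketch: the profile names and the `endpointingFor` helper are hypothetical, and only the two option names and the numbers come from this article.

```typescript
// Endpointing options as used in Step 2.
type EndpointingOptions = {
  endOfTurnConfidenceThreshold: number;     // 0-1
  minEndOfTurnSilenceWhenConfident: number; // milliseconds
};

// Hypothetical profile names for the caller types discussed in this step.
type CallerProfile = "default" | "chatty" | "elderly";

const ENDPOINTING_PRESETS: Record<CallerProfile, EndpointingOptions> = {
  default: { endOfTurnConfidenceThreshold: 0.7, minEndOfTurnSilenceWhenConfident: 200 },
  chatty: { endOfTurnConfidenceThreshold: 0.55, minEndOfTurnSilenceWhenConfident: 150 },
  elderly: { endOfTurnConfidenceThreshold: 0.85, minEndOfTurnSilenceWhenConfident: 400 },
};

// Return a copy of the preset so callers can adjust it without mutating the table.
function endpointingFor(profile: CallerProfile): EndpointingOptions {
  const preset = ENDPOINTING_PRESETS[profile];
  const t = preset.endOfTurnConfidenceThreshold;
  if (t < 0 || t > 1) throw new RangeError("confidence threshold must be within 0-1");
  return { ...preset };
}
```

The result can be spread into the transcriber options from Step 2, for example `aai.streaming.transcriber({ sampleRate: 16000, formatTurns: true, ...endpointingFor("elderly") })`.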

## Step 7 — Function calling fallback

If you need tools, start from AssemblyAI's `build-voice-agent-function-calling` reference implementation (linked in Sources) instead of the plain GPT-4o call in Step 5; it handles tool calls inline with the same STT.

## Pitfalls

- **PCM format**: Universal-3 Streaming wants 16kHz mono PCM16 little-endian — anything else returns garbage transcripts.
- **`format_turns`**: Off by default; turn it on (`formatTurns: true` in the Node SDK, as in Step 2) for capitalized, punctuated turns.
- **Concurrency**: Unlimited on AssemblyAI's side, but billing aggregates per minute, so rate-limit on your own side to avoid surprise invoices.
- **Region**: Default is us-east; pin EU for GDPR data residency.
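
For the concurrency pitfall, a cap on simultaneous sessions is one simple way to rate-limit on your own side. The class below is a minimal sketch under that assumption; `SessionLimiter` and its method names are hypothetical and not part of any SDK.

```typescript
// Hypothetical in-process cap on concurrent streaming sessions.
class SessionLimiter {
  private active = 0;
  private readonly maxSessions: number;

  constructor(maxSessions: number) {
    this.maxSessions = maxSessions;
  }

  // Reserve a slot; returns false when the cap is already reached.
  tryAcquire(): boolean {
    if (this.active >= this.maxSessions) return false;
    this.active++;
    return true;
  }

  // Free a slot; never drops below zero if called too often.
  release(): void {
    if (this.active > 0) this.active--;
  }

  get inUse(): number {
    return this.active;
  }
}
```

In the Step 4 bridge, the connection handler could call `tryAcquire()` first, close the socket with `client.close(1013, "Try again later")` when it returns `false`, and call `release()` from the client's `close` event. A single-process counter like this does not coordinate across multiple server instances.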

## How CallSphere does this

CallSphere uses AssemblyAI Universal-3 across **37 agents · 90+ tools · 115+ DB tables · 6 verticals**, hitting ~310ms STT p50 in production. **$149/$499/$1,499 · 14-day trial · 22% affiliate**.

## FAQ

**Pricing?** $0.15/hr streaming as of mid-2026 — cheaper than Deepgram Nova-3 at scale.

**Languages?** English-first with strong Spanish, French, German, Portuguese; for 60+ languages use Soniox v4.

**Diarization?** Yes via `speakers_expected` and `speaker_labels` post-call; live diarization is in beta.

**LiveKit/Pipecat plugins?** Both ship first-class AssemblyAI plugins — same model, less code.

## Sources

- AssemblyAI Blog - Raw WebSocket Voice Agent - [https://www.assemblyai.com/blog/raw-websocket-voice-agent-with-assemblyai-universal-3-pro-streaming](https://www.assemblyai.com/blog/raw-websocket-voice-agent-with-assemblyai-universal-3-pro-streaming)
- AssemblyAI Blog - Introducing Universal-Streaming - [https://www.assemblyai.com/blog/introducing-universal-streaming](https://www.assemblyai.com/blog/introducing-universal-streaming)
- AssemblyAI Blog - Voice Agent with Function Calling - [https://www.assemblyai.com/blog/build-voice-agent-function-calling](https://www.assemblyai.com/blog/build-voice-agent-function-calling)
- AssemblyAI Blog - Phone-Based Voice Agent 2026 Guide - [https://www.assemblyai.com/blog/how-to-create-phone-based-voice-agent](https://www.assemblyai.com/blog/how-to-create-phone-based-voice-agent)
