
Build a Voice Agent on Cloudflare Workers AI (No External LLM)

Run STT, LLM, and TTS entirely on Cloudflare's edge — no OpenAI, no ElevenLabs. Real working code with Whisper, Llama 3.3 70B, and Deepgram Aura.

TL;DR — Cloudflare Workers AI ships Whisper, Llama 3.3 70B, and Deepgram Aura behind one AI binding. Build a voice agent with zero external API keys, zero per-token surprise bills, and global edge co-location for free.

What you'll build

A Worker that accepts PCM16 audio frames over a WebSocket, transcribes them via @cf/openai/whisper-large-v3-turbo, generates a reply via @cf/meta/llama-3.3-70b-instruct, synthesizes speech via @cf/deepgram/aura-1, and streams the audio back. End-to-end on the Cloudflare edge.
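Before diving in, here is the wire protocol those steps imply, as a minimal TypeScript sketch (the type names are illustrative, not from any SDK):

```typescript
// What the browser sends over the WebSocket:
//   - binary frames: raw PCM16 audio bytes
//   - the literal string "flush": utterance finished, run STT -> LLM -> TTS
type ClientMessage = ArrayBuffer | "flush";

// What the Worker sends back:
//   - JSON transcript events for both sides of the conversation
//   - binary frames: PCM16 audio chunks streamed from Aura
type TranscriptEvent = { type: "transcript"; role: "user" | "assistant"; text: string };
type ServerMessage = TranscriptEvent | ArrayBuffer;
```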

Prerequisites

  1. Cloudflare account with Workers Paid ($5/mo) and Workers AI access.
  2. wrangler 4+.
  3. npm i agents (the Cloudflare Agents SDK).
  4. A static client that records 16kHz PCM via AudioWorklet.
  5. Familiarity with TypeScript.

Architecture

```mermaid
flowchart LR
  B[Browser PCM16] -- ws --> W[Worker]
  W -- AI binding --> ST[@cf Whisper]
  W -- AI binding --> LL[@cf Llama 3.3 70B]
  W -- AI binding --> TT[@cf Aura]
  W -- ws --> B
```

Step 1 — wrangler.jsonc

```jsonc
{
  "name": "callsphere-cf-only",
  "main": "src/index.ts",
  "compatibility_date": "2026-05-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": { "binding": "AI" }
}
```
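With the `ai` binding declared, `npx wrangler deploy` ships the Worker; there are no API keys or model endpoints to configure.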

Step 2 — Worker that upgrades to WebSocket

```typescript
type Env = { AI: Ai };

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const url = new URL(req.url);
    if (url.pathname !== "/voice") return new Response("nf", { status: 404 });

    const upgrade = req.headers.get("Upgrade");
    if (upgrade !== "websocket") return new Response("ws only", { status: 400 });

    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair) as [WebSocket, WebSocket];
    server.accept();
    handle(server, env); // registers message listeners; no await needed
    return new Response(null, { status: 101, webSocket: client });
  },
};
```
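`WebSocketPair` gives you both ends of the socket: the `client` end goes back to the browser in the `101` response, while `server.accept()` tells the runtime to deliver that socket's events to this Worker.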

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Step 3 — Receive audio + run Whisper

Workers AI Whisper expects an `audio` field containing a plain array of byte values (WAV, Opus, or raw PCM):

```typescript
async function handle(ws: WebSocket, env: Env) {
  const buffer: number[] = [];
  const history: { role: string; content: string }[] = [
    { role: "system", content: "You are CallSphere on Cloudflare. Reply in 1-2 sentences." },
  ];

  ws.addEventListener("message", async (e) => {
    // Text frame: the client sends "flush" when the utterance is complete.
    if (typeof e.data === "string") {
      if (e.data === "flush") await transcribeAndReply(ws, env, buffer, history);
      return;
    }
    // Binary frame: accumulate raw PCM16 bytes until the next flush.
    const u8 = new Uint8Array(e.data as ArrayBuffer);
    for (const b of u8) buffer.push(b);
  });
}
```

```typescript
async function transcribeAndReply(
  ws: WebSocket,
  env: Env,
  buffer: number[],
  history: { role: string; content: string }[]
) {
  // Whisper wants a plain number[]; drain and reset the shared buffer so
  // nothing leaks into the next utterance.
  const audio = Array.from(buffer);
  buffer.length = 0;

  const stt = await env.AI.run("@cf/openai/whisper-large-v3-turbo", { audio });
  const text = (stt as any).text as string;
  if (!text || text.length < 2) return; // ignore empty or one-character transcripts

  history.push({ role: "user", content: text });
  ws.send(JSON.stringify({ type: "transcript", role: "user", text }));
  // ...continues in Step 4
```

Step 4 — LLM with Llama 3.3 70B

```typescript
  const llm = await env.AI.run("@cf/meta/llama-3.3-70b-instruct", {
    messages: history,
    max_tokens: 200,
  });
  const reply = (llm as any).response as string;

  history.push({ role: "assistant", content: reply });
  ws.send(JSON.stringify({ type: "transcript", role: "assistant", text: reply }));
```

Step 5 — TTS with Aura, stream chunks back

```typescript
  const tts = await env.AI.run("@cf/deepgram/aura-1", {
    text: reply,
    speaker: "asteria-en",
    encoding: "linear16",
    sample_rate: 16000,
  });

  // tts is a ReadableStream of PCM16 audio; relay chunks as they arrive.
  const reader = (tts as ReadableStream).getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    ws.send(value);
  }
}
```
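If waiting for the full reply before TTS feels sluggish, Workers AI text models also accept `stream: true` and return server-sent events. A hedged sketch of sentence-level chunking (the `streamSentences` helper and its SSE parsing are mine and simplified; events can in principle split across chunks):

```typescript
// Sketch: yield complete sentences from a streaming Llama call so each one
// can be handed to Aura immediately. Assumes the Env type from Step 2.
async function* streamSentences(
  env: Env,
  history: { role: string; content: string }[]
): AsyncGenerator<string> {
  const stream = (await env.AI.run("@cf/meta/llama-3.3-70b-instruct", {
    messages: history,
    max_tokens: 200,
    stream: true,
  })) as ReadableStream<Uint8Array>;

  const decoder = new TextDecoder();
  let pending = "";
  // Workers' ReadableStream supports async iteration (not in TS's DOM types).
  for await (const chunk of stream as unknown as AsyncIterable<Uint8Array>) {
    for (const line of decoder.decode(chunk, { stream: true }).split("\n")) {
      if (!line.startsWith("data: ") || line.includes("[DONE]")) continue;
      pending += JSON.parse(line.slice(6)).response ?? "";
      const m = pending.match(/^([\s\S]+?[.!?])\s+([\s\S]*)$/); // sentence boundary
      if (m) {
        yield m[1];
        pending = m[2];
      }
    }
  }
  if (pending.trim()) yield pending.trim();
}
```

Each sentence it yields can be fed straight to the Aura call above, shaving most of the LLM wait out of time-to-first-audio.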

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Browser client (16kHz mic, AudioWorklet)

A minimal client sketch, assuming the browser honors the 16 kHz `AudioContext` sample-rate hint (resample if yours doesn't):

```html
<!doctype html>
<button id="start">Start mic</button>
<button id="send">Send utterance</button>
<script type="module">
  const ws = new WebSocket(`wss://${location.host}/voice`);
  ws.binaryType = "arraybuffer";

  // 16 kHz context so our PCM16 matches what the Worker asks Aura for.
  const ctx = new AudioContext({ sampleRate: 16000 });

  // Inline AudioWorklet: post each 128-sample Float32 frame to the main thread.
  const worklet = `
    class Capture extends AudioWorkletProcessor {
      process(inputs) {
        if (inputs[0][0]) this.port.postMessage(inputs[0][0].slice(0));
        return true;
      }
    }
    registerProcessor("capture", Capture);
  `;

  document.getElementById("start").onclick = async () => {
    await ctx.resume();
    const url = URL.createObjectURL(new Blob([worklet], { type: "text/javascript" }));
    await ctx.audioWorklet.addModule(url);
    const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
    const node = new AudioWorkletNode(ctx, "capture");
    node.port.onmessage = (e) => {
      // Float32 [-1, 1] -> PCM16 little-endian bytes for the Worker.
      const f32 = e.data;
      const i16 = new Int16Array(f32.length);
      for (let i = 0; i < f32.length; i++) {
        i16[i] = Math.max(-1, Math.min(1, f32[i])) * 0x7fff;
      }
      if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
    };
    ctx.createMediaStreamSource(mic).connect(node);
  };

  // End of utterance: tell the Worker to run STT -> LLM -> TTS.
  document.getElementById("send").onclick = () => ws.send("flush");

  // Transcripts arrive as JSON; audio replies arrive as PCM16 binary frames.
  ws.onmessage = (e) => {
    if (typeof e.data === "string") { console.log(JSON.parse(e.data)); return; }
    const i16 = new Int16Array(e.data);
    const buf = ctx.createBuffer(1, i16.length, 16000);
    const out = buf.getChannelData(0);
    for (let i = 0; i < i16.length; i++) out[i] = i16[i] / 0x8000;
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    src.start();
  };
</script>
```
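Serve the page from the same hostname as the Worker (Cloudflare Pages with a Worker route works) so `wss://.../voice` resolves. Note that playing each chunk as it arrives can click at chunk boundaries; a production client would queue buffers and schedule them back-to-back.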

Common pitfalls

  • Whisper expects a plain array, not a typed array; convert with Array.from (see the sketch after this list).
  • Aura sample-rate mismatch — match client AudioContext rate (16k or 24k).
  • Worker CPU cap — AI.run calls are awaited I/O, not CPU time, so even the 70B model won't hit the limit.
  • Audio buffer leaks across sessions — reset on each flush.
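A compact sketch covering the first and last pitfalls together (the helper name is mine):

```typescript
// Convert the session buffer into the plain number[] Whisper accepts, and
// reset it in place so no audio leaks into the next utterance.
function drainForWhisper(buffer: number[]): number[] {
  const audio = Array.from(buffer); // plain array; a Uint8Array would be rejected
  buffer.length = 0;                // reset on each flush, not once per session
  return audio;
}
```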

How CallSphere does this in production

We use Cloudflare Workers AI for /llms-full.txt rendering and lightweight FAQ agents on landing pages — see /lp/healthcare and /lp/salon. For full call routing, our 24/7 voice plane stays on dedicated GPUs (37 agents, 6 verticals, 90+ tools, HIPAA + SOC 2). Pricing is on /pricing, with a 14-day trial and a 22% affiliate program.

FAQ

Cost? Workers AI bills per neuron; expect ~$0.003 per voice round-trip (Whisper + Llama + Aura).

Quality vs OpenAI? Llama 3.3 70B holds its own for short replies; long agentic chains favor GPT-4o.

Latency? ~700–900ms end-to-end on the same colo.

Can I add my own model? Yes — @cf/custom/... via Workers AI Custom Models.

Persistence? Pair with Durable Objects (see post #8) for chat history.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available — no signup required.