Build a Voice Agent on Cloudflare Workers AI (No External LLM)
Run STT, LLM, and TTS entirely on Cloudflare's edge — no OpenAI, no ElevenLabs. Real working code with Whisper, Llama 3.3 70B, and Deepgram Aura.
TL;DR — Cloudflare Workers AI ships Whisper, Llama 3.3 70B, and Deepgram Aura behind one AI binding. Build a voice agent with zero external API keys, zero per-token surprise bills, and global edge co-location for free.
What you'll build
A Worker that takes a WebSocket of PCM16 audio frames, transcribes via @cf/openai/whisper-large-v3-turbo, generates a reply via @cf/meta/llama-3.3-70b-instruct, synthesizes via @cf/deepgram/aura-1, and streams audio back. End-to-end on the Cloudflare edge.
Prerequisites
- Cloudflare account with Workers Paid ($5/mo) and Workers AI access.
- wrangler 4+.
- npm i agents (the Cloudflare Agents SDK).
- A static client that records 16kHz PCM via AudioWorklet.
- Familiarity with TypeScript.
Architecture
```mermaid
flowchart LR
  B[Browser PCM16] -- ws --> W[Worker]
  W -- AI binding --> ST[@cf Whisper]
  W -- AI binding --> LL[@cf Llama 3.3 70B]
  W -- AI binding --> TT[@cf Aura]
  W -- ws --> B
```
Step 1 — wrangler.jsonc
```jsonc
{
  "name": "callsphere-cf-only",
  "main": "src/index.ts",
  "compatibility_date": "2026-05-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": { "binding": "AI" }
}
```
Step 2 — Worker that upgrades to WebSocket
```typescript
type Env = { AI: Ai };

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair) as [WebSocket, WebSocket];
    server.accept();
    handle(server, env);
    return new Response(null, { status: 101, webSocket: client });
  },
};
```
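The handler above assumes every request is a WebSocket upgrade. A small guard (my addition, not part of the original snippet) rejects plain HTTP requests before the upgrade is attempted:

```typescript
// Hypothetical guard: return 426 Upgrade Required for non-WebSocket requests.
function checkUpgrade(headers: Headers): Response | null {
  const upgrade = headers.get("Upgrade");
  if (upgrade?.toLowerCase() !== "websocket") {
    return new Response("Expected Upgrade: websocket", { status: 426 });
  }
  return null; // null means: proceed with the upgrade
}
```

Call it first thing in fetch — `const err = checkUpgrade(req.headers); if (err) return err;` — so browsers hitting the URL directly get a clean error instead of a failed WebSocketPair.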
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Receive audio + run Whisper
Workers AI Whisper takes its audio input as a plain array of byte values (the bytes of a WAV, Opus, or raw audio file):
```typescript
async function handle(ws: WebSocket, env: Env) {
  const buffer: number[] = [];
  let history: { role: string; content: string }[] = [
    { role: "system", content: "You are CallSphere on Cloudflare. Reply in 1-2 sentences." },
  ];

  ws.addEventListener("message", async (e) => {
    if (typeof e.data === "string") {
      if (e.data === "flush") await transcribeAndReply(ws, env, buffer, history);
      return;
    }
    const u8 = new Uint8Array(e.data as ArrayBuffer);
    for (const b of u8) buffer.push(b);
  });
}
```
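Pushing bytes one at a time into a number[] works, but it allocates heavily on long utterances. An alternative sketch (my assumption, not from the original code): keep each incoming frame as a Uint8Array and concatenate once at flush time:

```typescript
// Concatenate buffered audio frames into one contiguous byte array.
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
// Whisper still wants a plain array: Array.from(concatChunks(chunks))
```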
```typescript
async function transcribeAndReply(
  ws: WebSocket,
  env: Env,
  buffer: number[],
  history: { role: string; content: string }[]
) {
  const audio = Array.from(buffer);
  buffer.length = 0;

  const stt = await env.AI.run("@cf/openai/whisper-large-v3-turbo", { audio });
  const text = (stt as any).text as string;
  if (!text || text.length < 2) return;

  history.push({ role: "user", content: text });
  ws.send(JSON.stringify({ type: "transcript", role: "user", text }));
```
Step 4 — LLM with Llama 3.3 70B
```typescript
  const llm = await env.AI.run("@cf/meta/llama-3.3-70b-instruct", {
    messages: history,
    max_tokens: 200,
  });
  const reply = (llm as any).response as string;
  history.push({ role: "assistant", content: reply });
  ws.send(JSON.stringify({ type: "transcript", role: "assistant", text: reply }));
```
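Note that history grows every turn, so a long call will eventually spill past the model's context window. A trimming sketch (a hypothetical helper, not in the original code) that keeps the system prompt plus the most recent exchanges:

```typescript
type Msg = { role: string; content: string };

// Keep the system prompt plus the last maxTurns user/assistant exchanges.
function trimHistory(history: Msg[], maxTurns = 8): Msg[] {
  const system = history.filter((m) => m.role === "system");
  const turns = history.filter((m) => m.role !== "system");
  return [...system, ...turns.slice(-maxTurns * 2)];
}
```

Swap it in at the call site — `messages: trimHistory(history)` — and the full transcript can still live in memory for logging.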
Step 5 — TTS with Aura, stream chunks back
```typescript
  const tts = await env.AI.run("@cf/deepgram/aura-1", {
    text: reply,
    speaker: "asteria-en",
    encoding: "linear16",
    sample_rate: 16000,
  });

  // tts is a ReadableStream of raw PCM16 — forward each chunk as it arrives
  const reader = (tts as ReadableStream<Uint8Array>).getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    ws.send(value);
  }
  // Hypothetical end-of-audio marker so the client knows playback can finish
  ws.send(JSON.stringify({ type: "audio_end" }));
}
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Browser client (16kHz mic, AudioWorklet)
The original client code is omitted here; a minimal sketch (assuming a pcm-worklet.js AudioWorklet module that posts Float32Array frames to its port) looks like:

```html
<button id="talk">Hold to talk</button>
<script type="module">
  const ws = new WebSocket(`wss://${location.host}/`);
  ws.binaryType = "arraybuffer";

  const ctx = new AudioContext({ sampleRate: 16000 });
  await ctx.audioWorklet.addModule("pcm-worklet.js");
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const node = new AudioWorkletNode(ctx, "pcm-worklet");
  ctx.createMediaStreamSource(mic).connect(node);

  // Convert each Float32 frame to little-endian PCM16 and ship it
  node.port.onmessage = ({ data }) => {
    const f32 = data;
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i]));
      i16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
  };

  const talk = document.getElementById("talk");
  talk.onmousedown = () => ctx.resume();      // audio needs a user gesture
  talk.onmouseup = () => ws.send("flush");    // end of utterance
</script>
```
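Coming back the other way, the browser receives raw PCM16 chunks from Aura. A decoding sketch (assuming linear16, little-endian, 16 kHz, matching the TTS call above) that converts a chunk into Float32 samples for Web Audio playback:

```typescript
// Convert little-endian PCM16 bytes to Float32 samples in [-1, 1).
function pcm16ToFloat(bytes: Uint8Array): Float32Array {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(bytes.byteLength >> 1);
  for (let i = 0; i < out.length; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000; // true = little-endian
  }
  return out;
}
// Playback idea: copy the result into ctx.createBuffer(1, out.length, 16000)
// and schedule it with an AudioBufferSourceNode, queueing chunks back to back.
```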
Common pitfalls
- Whisper expects a plain array, not a typed array — use Array.from.
- Aura sample-rate mismatch — match the client AudioContext rate (16k or 24k).
- Worker CPU cap — large LLM calls run as async AI.run; CPU time is fine.
- Audio buffer leaks across sessions — reset the buffer on each flush.
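The flush message above has to come from somewhere. A crude way for the client to send it automatically (a sketch assuming Float32 frames from the worklet; production systems would use a real VAD):

```typescript
// RMS energy check: true when a frame is quieter than the threshold.
function isSilent(samples: Float32Array, threshold = 0.01): boolean {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length) < threshold;
}
// Client idea: count consecutive silent frames, and after ~600ms of
// continuous silence send ws.send("flush"), then reset the counter.
```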
How CallSphere does this in production
We use Cloudflare Workers AI for /llms-full.txt rendering and lightweight FAQ agents on landing pages — see /lp/healthcare and /lp/salon. For full call routing, our 24/7 voice plane stays on dedicated GPUs (37 agents, 6 verticals, 90+ tools, HIPAA + SOC 2). Pricing on /pricing; 14-day trial; 22% affiliate.
FAQ
Cost? Workers AI is per-neuron; ~$0.003 per voice round-trip (Whisper + Llama + Aura).
Quality vs OpenAI? Llama 3.3 70B holds its own for short replies; long agentic chains favor GPT-4o.
Latency? ~700–900ms end-to-end on the same colo.
Can I add my own model? Yes — @cf/custom/... via Workers AI Custom Models.
Persistence? Pair with Durable Objects (see post #8) for chat history.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.