By Sagar Shankaran, Founder of CallSphere
Run STT, LLM, and TTS entirely on Cloudflare's edge — no OpenAI, no ElevenLabs. Real working code with Whisper, Llama 3.3 70B, and Deepgram Aura.
Key takeaways
TL;DR: Cloudflare Workers AI ships Whisper, Llama 3.3 70B, and Deepgram Aura behind one AI binding. Build a voice agent with zero external API keys, zero per-token surprise bills, and global edge co-location for free.
A Worker that takes a WebSocket of PCM16 audio frames, transcribes via @cf/openai/whisper-large-v3-turbo, generates a reply via @cf/meta/llama-3.3-70b-instruct, synthesizes via @cf/deepgram/aura-1, and streams audio back. End-to-end on the Cloudflare edge.
- wrangler 4+.
- npm i agents (the Cloudflare Agents SDK).
- AudioWorklet.

```mermaid
flowchart LR
  B["Browser PCM16"] -- ws --> W["Worker"]
  W -- AI binding --> ST["@cf Whisper"]
  W -- AI binding --> LL["@cf Llama 3.3 70B"]
  W -- AI binding --> TT["@cf Aura"]
  W -- ws --> B
```
wrangler.jsonc

```jsonc
{
  "name": "callsphere-cf-only",
  "main": "src/index.ts",
  "compatibility_date": "2026-05-01",
  "compatibility_flags": ["nodejs_compat"],
  "ai": { "binding": "AI" }
}
```
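```typescript
type Env = { AI: Ai };

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    // Only WebSocket upgrade requests are handled by this Worker.
    if (req.headers.get("Upgrade") !== "websocket") {
      return new Response("Expected a WebSocket upgrade", { status: 426 });
    }
    // One end goes back to the browser, the other stays in the Worker.
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair) as [WebSocket, WebSocket];
    server.accept();
    handle(server, env); // registers the message listener; defined in the next step
    return new Response(null, { status: 101, webSocket: client });
  },
};
```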
Workers AI Whisper accepts audio as an array of byte values (the Uint8 contents of a WAV, Opus, or raw stream):
```typescript
async function handle(ws: WebSocket, env: Env) {
  // Raw audio bytes accumulated since the last "flush".
  const buffer: number[] = [];
  const history: { role: string; content: string }[] = [
    { role: "system", content: "You are CallSphere on Cloudflare. Reply in 1-2 sentences." },
  ];

  ws.addEventListener("message", async (e) => {
    // Text frames are control messages; "flush" marks the end of an utterance.
    if (typeof e.data === "string") {
      if (e.data === "flush") await transcribeAndReply(ws, env, buffer, history);
      return;
    }
    // Binary frames carry audio bytes.
    const u8 = new Uint8Array(e.data as ArrayBuffer);
    for (const b of u8) buffer.push(b);
  });
}
```
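The handler above pushes every incoming byte into a `number[]`. For longer utterances, one alternative is to keep each frame as a `Uint8Array` chunk and join them once at flush time. The sketch below is an illustration under that assumption; `ChunkBuffer`, `append`, and `drain` are hypothetical names, not part of the original listing or of any Cloudflare API.

```typescript
// Hypothetical helper (not from the original post): collects binary frames
// as Uint8Array chunks and joins them once, instead of pushing byte by byte.
class ChunkBuffer {
  private chunks: Uint8Array[] = [];
  private size = 0;

  // Store a copy of the frame so later reuse of the source memory is harmless.
  append(frame: Uint8Array): void {
    const copy = frame.slice();
    this.chunks.push(copy);
    this.size += copy.byteLength;
  }

  // Total number of buffered bytes.
  get byteLength(): number {
    return this.size;
  }

  // Return all buffered bytes as one Uint8Array and reset the buffer.
  drain(): Uint8Array {
    const out = new Uint8Array(this.size);
    let offset = 0;
    for (const chunk of this.chunks) {
      out.set(chunk, offset);
      offset += chunk.byteLength;
    }
    this.chunks = [];
    this.size = 0;
    return out;
  }
}
```

In the message listener this would be used as `buf.append(new Uint8Array(e.data as ArrayBuffer))`, with `Array.from(buf.drain())` producing the `number[]` at flush if the model call needs one.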
```typescript
async function transcribeAndReply(
  ws: WebSocket,
  env: Env,
  buffer: number[],
  history: { role: string; content: string }[]
) {
  // Snapshot the buffered audio, then clear the buffer for the next utterance.
  const audio = Array.from(buffer);
  buffer.length = 0;
  const stt = await env.AI.run("@cf/openai/whisper-large-v3-turbo", { audio });
  const text = (stt as any).text as string;
  // Skip empty or one-character transcripts.
  if (!text || text.length < 2) return;

  history.push({ role: "user", content: text });
  ws.send(JSON.stringify({ type: "transcript", role: "user", text }));
  // (the function body continues in the next steps)
```
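The browser in this walkthrough sends headerless PCM16 frames. If the speech model turns out to need a container format rather than bare samples, one option is to prepend a standard 44-byte WAV header before the call. This is a minimal sketch under that assumption; `pcm16ToWav` is a hypothetical helper, and the mono 16 kHz defaults are guesses about the capture settings. Whether the model wants a byte array or a base64 string is decided by its documented input schema, which is worth checking before relying on either form.

```typescript
// Hypothetical helper (not from the original post): wraps little-endian
// PCM16 samples in a canonical 44-byte RIFF/WAVE header.
function pcm16ToWav(pcm: Uint8Array, sampleRate = 16000, channels = 1): Uint8Array {
  const bytesPerSample = 2; // 16-bit samples
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.byteLength, true); // file size minus 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, pcm.byteLength, true); // size of the sample data

  const wav = new Uint8Array(44 + pcm.byteLength);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcm, 44);
  return wav;
}
```

With the `number[]` buffer used above, the call site would look like `Array.from(pcm16ToWav(Uint8Array.from(buffer)))`.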
```typescript
  // Generate a short reply from the running conversation.
  const llm = await env.AI.run("@cf/meta/llama-3.3-70b-instruct", {
    messages: history,
    max_tokens: 200,
  });
  const reply = (llm as any).response as string;
  history.push({ role: "assistant", content: reply });
  ws.send(JSON.stringify({ type: "transcript", role: "assistant", text: reply }));
```
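The `history` array in this step grows by two messages per turn and is sent to the model in full each time. A long session might therefore benefit from a cap. The sketch below keeps the system prompt plus the most recent messages; `trimHistory` and the default of 12 messages are assumptions for illustration, not something the original code does.

```typescript
type ChatMessage = { role: string; content: string };

// Hypothetical helper (not from the original post): keeps every system
// message and only the last `maxMessages` non-system messages.
function trimHistory(history: ChatMessage[], maxMessages = 12): ChatMessage[] {
  const system = history.filter((m) => m.role === "system");
  const rest = history.filter((m) => m.role !== "system");
  // slice(-0) would return everything, so handle non-positive limits explicitly.
  const recent = maxMessages > 0 ? rest.slice(-maxMessages) : [];
  return [...system, ...recent];
}
```

Passing `messages: trimHistory(history)` to the model call would bound the prompt size while leaving the stored history untouched.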
```typescript
  const tts = await env.AI.run("@cf/deepgram/aura-1", {
    text: reply,
    speaker: "asteria-en",
    encoding: "linear16",
    sample_rate: 16000,
  });
  // tts is a ReadableStream
  // Assumed completion (the original listing is cut off here): forward each
  // audio chunk to the client as a binary WebSocket frame.
  const reader = (tts as unknown as ReadableStream<Uint8Array>).getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    ws.send(value);
  }
}
```
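The TTS call above asks for `linear16` audio at 16000 Hz, while the browser's Web Audio API works with floating-point samples in the range -1 to 1. The sketch below shows one way the client could convert a received binary frame. `pcm16ToFloat32` is a hypothetical helper, and it assumes little-endian samples and frames that never split a sample in half (a trailing odd byte is dropped).

```typescript
// Hypothetical client-side helper (not from the original post): converts
// little-endian signed 16-bit PCM bytes to Float32 samples in [-1, 1).
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2);
  // Respect byteOffset in case `bytes` is a view into a larger buffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

The resulting samples can be copied into an `AudioBuffer` created with the same sample rate as the TTS request, for example via `getChannelData(0).set(samples)`, and scheduled for playback from there.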
- Array.from.
- AudioContext rate (16k or 24k).
- AI.run; CPU is fine.
- flush.

We use Cloudflare Workers AI for /llms-full.txt rendering and lightweight FAQ agents on landing pages (see /lp/healthcare and /lp/salon). For full call routing, our 24/7 voice plane stays on dedicated GPUs (37 agents, 6 verticals, 90+ tools, HIPAA + SOC 2). Pricing on /pricing; 14-day trial; 22% affiliate.
Cost? Workers AI is per-neuron; ~$0.003 per voice round-trip (Whisper + Llama + Aura).
Quality vs OpenAI? Llama 3.3 70B holds its own for short replies; long agentic chains favor GPT-4o.
Latency? ~700–900ms end-to-end on the same colo.
Can I add my own model? Yes — @cf/custom/... via Workers AI Custom Models.
Persistence? Pair with Durable Objects (see post #8) for chat history.
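To put the cost figure above in context, a rough monthly estimate can be derived from the quoted ~$0.003 per round trip. The sketch below is back-of-the-envelope arithmetic only; `estimateMonthlyCostUsd`, the call volume, and the turns per call are hypothetical inputs, and actual billing depends on the neurons consumed per request.

```typescript
// Approximate per-round-trip figure quoted in the FAQ above (an estimate).
const COST_PER_ROUND_TRIP_USD = 0.003;

// Hypothetical helper: multiplies out calls, turns, and days.
function estimateMonthlyCostUsd(callsPerDay: number, turnsPerCall: number, days = 30): number {
  return callsPerDay * turnsPerCall * days * COST_PER_ROUND_TRIP_USD;
}

// Example: 100 calls a day with 8 voice round trips each.
console.log(estimateMonthlyCostUsd(100, 8).toFixed(2));
```

Under those assumed volumes the estimate comes to roughly $72 per month, which scales linearly with either calls per day or turns per call.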
Written by Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.