---
title: "Build a Voice Agent on Cloudflare Calls + Workers AI (2026)"
description: "Build a voice agent that runs entirely on Cloudflare's edge: Calls SFU for WebRTC, withVoice mixin for STT/TTS, Workers AI for inference. No external infra, sub-300ms hops."
canonical: https://callsphere.ai/blog/vw5h-build-voice-agent-cloudflare-calls-workers-ai
category: "AI Voice Agents"
tags: ["Cloudflare", "Workers AI", "Calls", "Edge", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-05T00:00:00.000Z
updated: 2026-05-07T16:30:07.570Z
---

# Build a Voice Agent on Cloudflare Calls + Workers AI (2026)


> **TL;DR** — Cloudflare's Agents SDK now ships `withVoice`, a mixin that adds STT (Deepgram Nova/Flux), sentence chunking, TTS (Aura), and conversation persistence to a regular Agent class. Combined with Cloudflare Calls (their WebRTC SFU) and Workers AI, you get an end-to-end voice agent on the same edge network — no external API keys for the happy path.

## What you'll build

A `@cloudflare/voice`-powered Agent deployed to Workers, with audio transported over WebSocket from a browser client. The agent uses Workers AI's Llama 3.3 70B for reasoning, Deepgram Flux for streaming STT, and Aura for TTS — all bound natively. Optional: hand the WebSocket to Cloudflare Calls for browser-to-browser group voice rooms with the AI as a participant.

## Prerequisites

1. Cloudflare account with Workers paid plan (Workers AI requires it).
2. `wrangler@4` CLI authenticated.
3. Node 20 + Vite + React for the client scaffold.
4. (Optional) Cloudflare Calls app for SFU multi-participant rooms.

## Architecture

```mermaid
flowchart LR
  B[Browser React] -->|wss| AGT[Worker Agent withVoice]
  AGT -->|STT| DG[Deepgram Flux Workers AI]
  AGT -->|LLM| WA[Workers AI Llama 3.3 70B]
  AGT -->|TTS| AU[Deepgram Aura]
  AGT --> B
  B -->|WebRTC SFU| CALLS[Cloudflare Calls]
```

## Step 1 — Scaffold the project

```bash
npm create cloudflare@latest voice-agent -- --type=workers-ai
cd voice-agent
npm install @cloudflare/voice agents
```

## Step 2 — Define the Agent

```ts
// src/agent.ts
import { Agent } from "agents";
import { withVoice } from "@cloudflare/voice";

export class Receptionist extends withVoice(Agent) {
  async onChatMessage(message: string) {
    const reply = await this.env.AI.run("@cf/meta/llama-3.3-70b-instruct-fp8-fast", {
      messages: [
        { role: "system", content: "You are a friendly receptionist. Keep replies short." },
        { role: "user", content: message }
      ]
    });
    return reply.response;
  }
}
```

The `withVoice` mixin handles audio in/out automatically; you only implement `onChatMessage(text)`.
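The "sentence chunking" mentioned in the TL;DR means splitting the LLM's streamed text at sentence boundaries so TTS can start speaking before the full reply has generated. A minimal sketch of that idea (an illustration, not the `@cloudflare/voice` internals):

```ts
// Illustrative sentence chunker: pull complete sentences out of a streamed
// text buffer so TTS can start early. A sketch, not the SDK's implementation.
export function chunkSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  const re = /([^.!?]*[.!?])\s+/g; // sentence-ending punctuation followed by whitespace
  let lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(buffer)) !== null) {
    sentences.push(m[1].trim());
    lastIndex = re.lastIndex;
  }
  // Whatever hasn't hit a sentence boundary yet stays buffered for the next chunk.
  return { sentences, rest: buffer.slice(lastIndex) };
}
```

Call it on every streamed token batch: send `sentences` to TTS immediately and carry `rest` forward.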

## Step 3 — Wrangler config

```toml
name = "voice-agent"
main = "src/index.ts"
compatibility_date = "2026-05-01"

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "RECEPTIONIST"
class_name = "Receptionist"

[[migrations]]
tag = "v1"
new_sqlite_classes = ["Receptionist"]
```

## Step 4 — Worker entry

```ts
import { Receptionist } from "./agent";
export { Receptionist };
export default {
  async fetch(req: Request, env: Env) {
    const url = new URL(req.url);
    if (url.pathname === "/voice") {
      const id = env.RECEPTIONIST.idFromName(url.searchParams.get("session") ?? crypto.randomUUID());
      return env.RECEPTIONIST.get(id).fetch(req);
    }
    return new Response("ok");
  }
};
```

## Step 5 — React client with browser audio

```tsx
import { VoiceClient } from "@cloudflare/voice/client";
const client = new VoiceClient({
  url: `wss://voice-agent.you.workers.dev/voice?session=${crypto.randomUUID()}`,
  inputSampleRate: 16000,
  outputSampleRate: 24000,
});
await client.start();
```

The client handles `getUserMedia`, AudioWorklet capture, WS framing, and Web Audio playback.

## Step 6 — Configure voice options

In `Receptionist.constructor`:

```ts
super(state, env, {
  voice: {
    stt: { provider: "deepgram-flux", model: "flux", language: "en" },
    tts: { provider: "deepgram-aura", voice: "aura-asteria-en" },
    vad: { silenceThresholdMs: 500 }
  }
});
```
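The `silenceThresholdMs` option governs end-of-turn detection: after roughly that many milliseconds of low audio energy, the agent treats the speaker as finished. A toy energy-based version of that logic (illustrative only, not the mixin's code):

```ts
// Toy end-of-turn detector: counts consecutive silent frames until the
// configured silence threshold is reached. Illustrative, not the mixin's VAD.
export class SilenceDetector {
  private silentMs = 0;
  constructor(
    private thresholdMs: number, // e.g. 500, matching silenceThresholdMs above
    private frameMs: number,     // duration of each audio frame in ms
    private energyFloor = 0.01   // RMS below this counts as silence
  ) {}

  // Returns true once the speaker is judged to have finished their turn.
  push(frame: Float32Array): boolean {
    let sum = 0;
    for (const s of frame) sum += s * s;
    const rms = Math.sqrt(sum / frame.length);
    this.silentMs = rms < this.energyFloor ? this.silentMs + this.frameMs : 0;
    return this.silentMs >= this.thresholdMs;
  }
}
```

Tuning the floor matters in practice: too low and background hiss keeps the turn open forever, too high and quiet speakers get cut off.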

## Step 7 — Optional: Cloudflare Calls for multi-party rooms

If you want the AI to join a multi-party WebRTC room (caller + AI + supervisor), use Cloudflare Calls' SFU API to publish/subscribe tracks. The agent's `withVoice` audio output becomes a track published into the SFU room; humans subscribe via standard WebRTC.

`fetch("https://rtc.live.cloudflare.com/v1/apps/" + appId + "/sessions/new", { method: "POST", headers: { Authorization: "Bearer " + env.CALLS_TOKEN } })`

## Pitfalls

- **Workers limits** — paid Workers default to 30s of CPU time, configurable up to 5min, but an idle open WebSocket burns wall-clock, not CPU. The `withVoice` mixin runs inside a Durable Object, which is built for long-lived sockets and holds conversation state; a plain stateless Worker wouldn't.
- **Workers AI cold starts** for 70B models can be 1-2s; warm with a `Promise.race` against a smaller model (8B) for first-token TTS.
- **Deepgram Flux** is the conversational streaming choice, with turn detection built in; the general-purpose Nova models need your own endpointing layered on top, which adds latency (~200ms).
- **Free tier** Workers AI has tight limits — voice agents will burn through them in minutes.
- **Audio format**: `withVoice` expects PCM16 mono; resample on the client if your mic capture differs.
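On the last pitfall: browser capture is typically 48 kHz Float32 from an AudioWorklet, so it has to be converted before sending. A naive decimating converter, assuming the capture rate is an integer multiple of the target rate:

```ts
// Convert Float32 audio (e.g. 48 kHz from an AudioWorklet) to 16 kHz PCM16 mono
// by simple decimation. Assumes inputRate is an integer multiple of outputRate;
// production code should low-pass filter before decimating to avoid aliasing.
export function toPcm16(
  input: Float32Array,
  inputRate = 48000,
  outputRate = 16000
): Int16Array {
  const step = inputRate / outputRate; // 3 for 48k -> 16k
  const out = new Int16Array(Math.floor(input.length / step));
  for (let i = 0; i < out.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i * step])); // clamp to [-1, 1]
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;             // scale to int16 range
  }
  return out;
}
```

Send `out.buffer` over the WebSocket as binary frames.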

## How CallSphere does this in production

CallSphere doesn't run its voice path on Cloudflare Workers: our HIPAA healthcare vertical needs Postgres-resident audit logs that fit our 115+ table schema. We do put Cloudflare in front of our FastAPI service (port 8084) as a CDN and DDoS layer, and we've experimented with Workers AI for the chat fallback in our OneRoof multi-family vertical. Today CallSphere runs 37 voice agents with 90+ tools across 6 verticals, with plans at $149/$499/$1,499, a 14-day trial, and a 22% affiliate program.

## FAQ

**Q: How do I add my own STT/TTS provider?**
`withVoice` accepts `stt: { provider: "custom", run: async (pcm) => string }`. Plug in OpenAI Whisper, Azure, anything.

**Q: Can I use Workers AI alone without Deepgram?**
Yes — Workers AI ships `@cf/openai/whisper-large-v3-turbo` for STT and `@cf/myshell-ai/melotts` for TTS. Quality trails Deepgram's Flux and Aura, but both run under the AI binding with no external API keys.

**Q: Latency target?**
~500-700ms voice-to-voice on Cloudflare's network because everything is colocated. The win vs other clouds is no inter-service hops.

**Q: PSTN?**
Not natively — front the Worker with Twilio Voice and Media Streams as a PSTN bridge: the Worker terminates the Media Streams WebSocket and translates its 8 kHz μ-law frames into the PCM16 format `withVoice` expects.
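Twilio Media Streams deliver base64-encoded G.711 μ-law audio, so the translation step is a μ-law expansion per byte. The standard decode, which any Twilio bridge needs:

```ts
// Decode one G.711 mu-law byte (as sent by Twilio Media Streams) to a PCM16
// sample. Standard mu-law expansion; Twilio frames arrive base64-encoded,
// 8 kHz mono, so decode after base64 and upsample for 16 kHz STT.
export function muLawToPcm16(mu: number): number {
  mu = ~mu & 0xff;                  // mu-law bytes are stored inverted
  const sign = mu & 0x80;
  const exponent = (mu >> 4) & 0x07;
  const mantissa = mu & 0x0f;
  let sample = ((mantissa << 3) + 0x84) << exponent;
  sample -= 0x84;                   // remove the bias added at encode time
  return sign ? -sample : sample;
}
```

Run it over each payload byte to get an `Int16Array` frame, then resample 8 kHz → 16 kHz before handing audio to STT.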

**Q: Cost?**
Workers AI Llama 3.3 70B is $0.4 per 1M input tokens, $0.6 per 1M output. Deepgram Flux STT is included in the `@cloudflare/voice` binding pricing — call it $0.005/min STT + $0.012/min TTS + $0.02/min LLM = ~$0.04/min all-in.
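That back-of-envelope math is easy to sanity-check with a tiny helper; the per-minute rates below are the estimates quoted above, not published Cloudflare or Deepgram pricing:

```ts
// Per-minute cost estimate using the rates quoted above
// (rough estimates, not official pricing).
const RATES = { sttPerMin: 0.005, ttsPerMin: 0.012, llmPerMin: 0.02 };

export function costPerCall(minutes: number): number {
  const perMin = RATES.sttPerMin + RATES.ttsPerMin + RATES.llmPerMin; // $0.037/min
  return +(minutes * perMin).toFixed(3);
}
```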

## Sources

- [Build a voice agent — Cloudflare Agents docs](https://developers.cloudflare.com/agents/guides/build-a-voice-agent/)
- [Voice agents API reference — Cloudflare](https://developers.cloudflare.com/agents/api-reference/voice/)
- [cloudflare/agents on GitHub](https://github.com/cloudflare/agents)
- [Cloudflare is the best place to build realtime voice agents — Cloudflare Blog](https://blog.cloudflare.com/cloudflare-realtime-voice-ai/)
- [Add voice to your agent — Cloudflare Blog](https://blog.cloudflare.com/voice-agents/)

