---
title: "Build a Voice Agent with Daily Bots: Hosted Pipecat Cloud (2026)"
description: "Daily Bots gives you Pipecat as a managed service — POST a config, get a WebRTC bot. Real curl + RTVI client code, model swaps, and prod pitfalls."
canonical: https://callsphere.ai/blog/vw9h-build-voice-agent-daily-bots-cloud-platform-2026
category: "AI Voice Agents"
tags: ["Daily Bots", "Voice Agent", "Pipecat", "WebRTC", "RTVI"]
author: "CallSphere Team"
published: 2026-03-19T00:00:00.000Z
updated: 2026-05-08T17:25:15.769Z
---

# Build a Voice Agent with Daily Bots: Hosted Pipecat Cloud (2026)

> Daily Bots gives you Pipecat as a managed service — POST a config, get a WebRTC bot. Real curl + RTVI client code, model swaps, and prod pitfalls.

> **TL;DR** — Daily Bots is the hosted version of Pipecat. You POST a JSON config to `/start`, get a Daily room URL back, and connect any browser/iOS/Android client with the RTVI SDK. No bot infrastructure to run yourself.

## What you'll build

A short Next.js route handler that spins up a Cartesia-voiced GPT-4o bot in a fresh Daily room, plus a ~30-line React client that joins it with mic + speaker — all running through Daily's global SFU.

## Architecture

```mermaid
flowchart LR
  CL[Browser RTVI client] -- WebRTC --> RM[Daily room]
  AP[Your /start endpoint] -- POST /bots/start --> DB[Daily Bots API]
  DB -- spawns --> BOT[Hosted Pipecat bot]
  BOT -- audio --> RM --> CL
```

## Step 1 — Get keys

Sign up at dashboard.daily.co for a Daily Bots account (separate from the Daily video API). Add your OpenAI + Cartesia keys in the dashboard secrets vault — Daily Bots references them by name, not value.

## Step 2 — Server: POST /start

```ts
// app/api/start/route.ts
export async function POST() {
  const r = await fetch("https://api.daily.co/v1/bots/start", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      bot_profile: "voice_2024_10",
      max_duration: 600,
      services: { stt: "deepgram", llm: "openai", tts: "cartesia" },
      config: [
        { service: "vad",  options: [{ name: "params", value: { stop_secs: 0.4 } }] },
        { service: "tts",  options: [{ name: "voice", value: "79a125e8-cd45-4c13-8a67-188112f4dd22" }] },
        { service: "llm",  options: [
          { name: "model", value: "gpt-4o" },
          { name: "initial_messages", value: [
            { role: "system", content: "You are a friendly clinic concierge." },
          ] },
          { name: "run_on_config", value: true },
        ]},
      ],
    }),
  });
  return Response.json(await r.json());
}
```
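The client in Step 3 destructures `room_url` and `token` from this route, so a small guard keeps a failed or reshaped response from leaking `undefined` into the RTVI client. This is a minimal sketch that assumes those two field names; verify them against the actual `/bots/start` response for your account:

```typescript
// Shape the client in Step 3 destructures from /api/start.
interface StartResponse {
  room_url: string;
  token: string;
}

// Narrow an unknown JSON payload to StartResponse, throwing a
// descriptive error instead of passing undefined to the client.
export function parseStartResponse(json: unknown): StartResponse {
  const o = json as { room_url?: unknown; token?: unknown };
  if (typeof o?.room_url !== "string" || typeof o?.token !== "string") {
    throw new Error(`unexpected /bots/start payload: ${JSON.stringify(json)}`);
  }
  return { room_url: o.room_url, token: o.token };
}
```

In the route handler, the last line becomes `return Response.json(parseStartResponse(await r.json()))` so a misconfigured key fails loudly server-side instead of silently client-side.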

## Step 3 — Client: RTVI React

```tsx
"use client";
import { RTVIClient } from "@pipecat-ai/client-js";
import { DailyTransport } from "@pipecat-ai/daily-transport";

export function VoiceBot() {
  async function connect() {
    const { room_url, token } = await fetch("/api/start", { method: "POST" })
      .then((r) => r.json());
    const client = new RTVIClient({
      transport: new DailyTransport(),
      params: { baseUrl: room_url, token },
      enableMic: true, enableCam: false,
    });
    await client.connect();
  }
  return <button onClick={connect}>Talk to bot</button>;
}
```

## Step 4 — Swap the LLM live

POST `/bots/{bot_id}/action` (using the bot ID returned by `/bots/start`) with `{"service":"llm","action":"set_model","arguments":[{"name":"model","value":"claude-3-5-sonnet"}]}` and the bot hot-swaps providers mid-session.
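Building the action payload in code keeps the shape in one place. This sketch assumes the endpoint and body format quoted above; `botId` is whatever your `/bots/start` call returned:

```typescript
// Body for a live provider swap via the bot action endpoint.
export function setModelAction(model: string) {
  return {
    service: "llm",
    action: "set_model",
    arguments: [{ name: "model", value: model }],
  };
}

// Usage sketch (botId and DAILY_API_KEY are assumptions):
// await fetch(`https://api.daily.co/v1/bots/${botId}/action`, {
//   method: "POST",
//   headers: {
//     Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
//     "Content-Type": "application/json",
//   },
//   body: JSON.stringify(setModelAction("claude-3-5-sonnet")),
// });
```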

## Step 5 — Function calls

Add `tools` to the `llm` config. When the LLM emits a tool call, Daily Bots forwards it to your webhook URL and resumes once you POST the result back.
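Concretely, a `tools` entry might look like the following. Only the OpenAI function-calling schema itself is standard here; the option name `tools` and the webhook wiring are assumptions to check against the Daily Bots docs for your bot profile:

```typescript
// A hypothetical tools option for the llm config block, using the
// OpenAI function-calling schema for a clinic booking tool.
export const bookAppointmentTool = {
  name: "tools",
  value: [
    {
      type: "function",
      function: {
        name: "book_appointment",
        description: "Book a clinic appointment for the caller.",
        parameters: {
          type: "object",
          properties: {
            date: { type: "string", description: "ISO date, e.g. 2026-04-01" },
            reason: { type: "string", description: "Why the caller is coming in" },
          },
          required: ["date"],
        },
      },
    },
  ],
};
```

Append it to the `options` array of the `llm` entry in the Step 2 config.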

## Step 6 — Inspect transcripts

Subscribe to the `transcript` RTVI message type on the client to render live captions, or pull the recording + transcript from the Daily REST API after the call.
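On the client, the caption state itself can stay a pure function, which keeps the RTVI wiring thin. The `transcript` message fields below (`role`, `text`, `final`) are assumptions; log a few real messages before relying on them:

```typescript
// One transcript fragment as we assume it arrives over RTVI.
type TranscriptMsg = { role: "user" | "bot"; text: string; final: boolean };

// Append only finalized fragments; interim results just flicker.
export function appendCaption(lines: string[], msg: TranscriptMsg): string[] {
  if (!msg.final) return lines;
  return [...lines, `${msg.role}: ${msg.text}`];
}
```

Feed this from whatever handler your RTVI client exposes for transcript messages, then render `lines` as live captions.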

## Pitfalls

- **Bot profile pin**: Always pin `bot_profile: "voice_2024_10"` (or newer dated tag) — `"voice_latest"` can change overnight and break configs.
- **`run_on_config: true`**: Without it, the bot waits silently until the user speaks first — unfriendly for outbound calls.
- **Region selection**: Pass `{ "geo": "us-east" }` for SIP/PSTN bridging tasks — round-trip latency matters more than for browser-only bots.
- **Concurrency limits**: Default is 5 concurrent bots — file a support ticket before scaling promos.
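Several of these pitfalls are just defaults you should never retype by hand. A tiny factory pins them once; note the `geo` key name is an assumption to confirm against the docs:

```typescript
// Base /bots/start body with the pitfall fixes baked in.
export function startBody(geo: string = "us-east") {
  return {
    bot_profile: "voice_2024_10", // pinned; never "voice_latest"
    max_duration: 600,
    geo,                          // region hint for SIP/PSTN work
  };
}
```

Spread it into the Step 2 request (`{ ...startBody(), services, config }`) so every call site inherits the pinned profile.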

## How CallSphere does this

CallSphere uses Daily Bots for spike-traffic webinar lines and demo lines while running Pipecat directly on k3s for steady traffic. **37 agents · 90+ tools · 115+ DB tables · 6 verticals · $149/$499/$1,499 · 14-day trial · 22% affiliate**.

## FAQ

**Pricing?** Per-bot-minute, billed against the model providers you choose plus a Daily margin — typically $0.05-0.20/min.

**SIP/PSTN?** Yes, via Daily's Pinless SIP — POST `{"sip": {"display_name": "..."}}` to bridge a phone number into the bot's room.

**Recording?** Set `recording_settings: { type: "cloud" }` — MP4 + transcript appear in your S3 bucket in ~30s.

**Open-source fallback?** Run the same Pipecat config on your own infra — same bot code, just self-hosted.

## Sources

- Daily Bots Docs - Introduction - [https://docs.dailybots.ai/introduction](https://docs.dailybots.ai/introduction)
- Daily Bots Docs - Build Your First Bot - [https://docs.dailybots.ai/tutorial/01-setup](https://docs.dailybots.ai/tutorial/01-setup)
- Daily Bots Docs - Client Code - [https://docs.dailybots.ai/tutorial/04-client-code](https://docs.dailybots.ai/tutorial/04-client-code)
- Daily Bots Demo - [https://demo.dailybots.ai/](https://demo.dailybots.ai/)

## How this plays out in production

Where this gets non-obvious in production is the latency budget: every leg of the audio loop (capture, ASR, reasoning, TTS, transport) eats into the <1s response window callers expect. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast tend to instrument the loop end to end before they tune any single component, because the bottleneck is rarely where intuition puts it.

## Voice agent architecture, end to end

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer — typically OpenAI Realtime or ElevenLabs Conversational AI — with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable, otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence. Post-call, every transcript is run through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and a normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption-at-rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.
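To make the sub-second SLO concrete, here is an illustrative per-turn budget. Every number below is an assumption, not a measurement; the point is that the legs sum to the whole window before any tool call or retry runs:

```typescript
// Illustrative latency budget for one conversational turn (ms).
const budgetMs = {
  vadTail: 400,   // stop_secs: 0.4 from the Step 2 config
  asr: 150,       // streaming STT finalization
  llm: 300,       // time to first token from the model
  tts: 100,       // time to first audio chunk
  transport: 50,  // WebRTC both directions through the SFU
};

export const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
// totalMs = 1000: the full window is spent before any tool call runs.
```

This is why a 0.4s `stop_secs` is already generous: shaving the VAD tail is often the cheapest win, but only measurement will tell you whether it is yours.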

## Production FAQ

**What changes when you move this voice agent architecture into production?**

Treat the architecture in this post as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target < 1s for voice, < 3s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

**Where does this break down for voice agent deployments at scale?**

The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.

**How does the CallSphere healthcare voice agent handle a typical patient intake?**

The healthcare stack runs 14 specialist tools against 20+ database tables, captures intent and slots in real time, and produces a post-call sentiment score, lead score, and escalation flag for every conversation — so the front desk inherits a triaged queue, not a stack of voicemails.

## See it live

Book a 30-minute working session at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting) and bring a real call flow — we will walk it through the live healthcare voice agent at [healthcare.callsphere.tech](https://healthcare.callsphere.tech) and show you exactly where the production wiring sits.

---

Source: https://callsphere.ai/blog/vw9h-build-voice-agent-daily-bots-cloud-platform-2026
