---
title: "Build a Voice Agent with Vercel AI SDK + Twilio (2026)"
description: "Use the Vercel AI SDK's transcription/speech functions plus Twilio Media Streams to ship a voice agent on Vercel Edge Runtime. Real Next.js Route Handler, working code, deploy in 5 min."
canonical: https://callsphere.ai/blog/vw5h-build-voice-agent-vercel-ai-sdk-twilio
category: "AI Voice Agents"
tags: ["Vercel", "AI SDK", "Twilio", "Next.js", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-11T00:00:00.000Z
updated: 2026-05-07T16:30:08.055Z
---

# Build a Voice Agent with Vercel AI SDK + Twilio (2026)

> Use the Vercel AI SDK's transcription/speech functions plus Twilio Media Streams to ship a voice agent on Vercel Edge Runtime. Real Next.js Route Handler, working code, deploy in 5 min.

> **TL;DR** — Vercel AI SDK 5 ships `transcribe()` and `speak()` as first-class functions across providers, plus the `generateText` / `streamText` agent loop. Combined with Twilio Media Streams over WebSockets in a Next.js Route Handler running on Node runtime (not Edge for WS), you get a voice agent deployed to Vercel in five minutes.

## What you'll build

A Next.js 15 app with three routes:

- `POST /api/twilio/voice` returns TwiML pointing at a WS endpoint
- `GET /api/twilio/media` (WebSocket upgrade) bridges Twilio audio to a sandwich agent
- The sandwich uses `transcribe(whisper-1)` → `streamText(gpt-5)` → `speak(elevenlabs)`

Deployed to Vercel with one push. Twilio webhook hits the production URL.

## Prerequisites

1. Vercel project + Twilio account with a number.
2. Node 20, Next.js 15, `ai` v5, `@ai-sdk/openai`, `@ai-sdk/elevenlabs`.
3. Twilio webhook URL set to your Vercel deployment.
4. `OPENAI_API_KEY`, `ELEVENLABS_API_KEY` in Vercel env.

## Architecture

```mermaid
flowchart LR
  C[Caller] --> T[Twilio]
  T -->|HTTP TwiML| API[/api/twilio/voice]
  API -->|| T
  T -->|wss media| WS[/api/twilio/media]
  WS -->|transcribe| W[Whisper]
  W -->|text| LLM[streamText gpt-5]
  LLM -->|text| TTS[speak ElevenLabs]
  TTS --> WS
  WS --> T --> C
```

## Step 1 — TwiML route

```ts
// app/api/twilio/voice/route.ts
export async function POST(req: Request) {
  const host = req.headers.get("host");
  const xml = `
`;
  return new Response(xml, { headers: { "content-type": "text/xml" } });
}
```

## Step 2 — WebSocket Route Handler (Node runtime)

```ts
// app/api/twilio/media/route.ts
export const runtime = "nodejs";
import { WebSocketServer } from "ws";
import { transcribe, generateText, experimental_generateSpeech as speak } from "ai";
import { openai } from "@ai-sdk/openai";
import { elevenlabs } from "@ai-sdk/elevenlabs";

let wss: WebSocketServer | null = null;
function init(server: any) {
  if (wss) return;
  wss = new WebSocketServer({ server, path: "/api/twilio/media" });
  wss.on("connection", handleConn);
}

async function handleConn(ws: any) {
  const buffer: Buffer[] = [];
  let streamSid = "";
  ws.on("message", async (raw: any) => {
    const ev = JSON.parse(raw.toString());
    if (ev.event === "start") streamSid = ev.streamSid;
    if (ev.event === "media") {
      buffer.push(Buffer.from(ev.media.payload, "base64"));
      if (buffer.length > 50) await respond(ws, streamSid, buffer.splice(0));
    }
  });
}

async function respond(ws: any, sid: string, frames: Buffer[]) {
  const wav = mulawToWav(Buffer.concat(frames));
  const { text } = await transcribe({ model: openai.transcription("whisper-1"), audio: wav });
  if (!text) return;
  const { text: reply } = await generateText({
    model: openai("gpt-5"),
    system: "You are a friendly receptionist. Reply in one sentence.",
    prompt: text,
  });
  const audio = await speak({ model: elevenlabs.speech("eleven_turbo_v2_5"), text: reply });
  for (const chunk of chunked(audio.audio, 320)) {
    ws.send(JSON.stringify({ event: "media", streamSid: sid, media: { payload: pcmToMulaw(chunk).toString("base64") } }));
  }
}
```

## Step 3 — Mu-law transcoding helpers

Twilio sends mu-law 8kHz; Whisper wants WAV/PCM. ElevenLabs returns PCM 16kHz; convert to mu-law 8kHz before sending back.

```ts
import { Readable } from "stream";
function mulawToWav(mulaw: Buffer): Buffer {
  // 44-byte WAV header + linear PCM 8kHz mono
  // (use pcm-util or write manually)
  return makeWavHeader(8000, decodeMulaw(mulaw));
}
```

For production, use `@scramjet/audio-utils` or a small WASM transcoder.

## Step 4 — Hot-path streaming with `streamText` + sentence-by-sentence TTS

Replace `generateText` with `streamText` and chunk on sentence boundaries to start TTS earlier:

```ts
const { textStream } = await streamText({ model: openai("gpt-5"), prompt: text });
let buf = "";
for await (const delta of textStream) {
  buf += delta;
  const m = buf.match(/^([^.!?]+[.!?])\s*/);
  if (m) { speakAndSend(m[1]); buf = buf.slice(m[0].length); }
}
```

Cuts perceived latency by ~40%.

## Step 5 — Twilio webhook config

In the Twilio console → your number → `A Call Comes In` → Webhook → `https://your-app.vercel.app/api/twilio/voice`. Done.

## Step 6 — Deploy

```bash
vercel --prod
```

WebSocket support requires the **Hobby+** plan with Functions running on Node runtime, not Edge. Set `runtime: "nodejs"` in the route file.

## Step 7 — Add tools (function calling)

```ts
import { tool } from "ai";
import { z } from "zod";

const tools = {
  lookup_appointment: tool({
    description: "Get next appointment for a patient",
    parameters: z.object({ patient_id: z.string() }),
    execute: async ({ patient_id }) => fetchAppt(patient_id),
  })
};
const { text: reply } = await generateText({ model: openai("gpt-5"), tools, prompt: text });
```

## Pitfalls

- **Edge runtime can't open outbound WS** — must use Node runtime for the Twilio WS endpoint.
- **Vercel function timeout** is 60s on Hobby; voice calls are longer. Bump to Pro (300s) or move WS to a long-running service.
- **Cold starts**: pin Pro to `fluid: true` for warm function pools, or use Vercel's new "Functions / Sandbox" preview that holds connections.
- **Twilio buffer size**: 50 frames ~= 1 second; tune based on barge-in needs.
- **AI SDK `experimental_generateSpeech`**: API name may move out of experimental; pin SDK version.

## How CallSphere does this in production

CallSphere's Healthcare voice path uses OpenAI Realtime directly over Twilio (not the sandwich pattern) because Realtime cuts ~400ms vs STT+LLM+TTS chains. We use Vercel only for the marketing site. Our voice agent lives on FastAPI :8084. 37 agents, 90+ tools, 115+ DB tables, 6 verticals, $149/$499/$1499, 14-day trial, 22% affiliate.

## FAQ

**Q: Why not use Realtime API directly?**
You can — replace the sandwich with the Vercel AI SDK's `experimental_realtime` (in beta May 2026) for native bidirectional. The sandwich pattern is more debuggable, Realtime is faster.

**Q: Does this work on Vercel Edge?**
No, Edge runtime doesn't support WebSocket server upgrades. Use Node runtime.

**Q: Latency target?**
Sandwich pattern: ~1.2s voice-to-voice. Realtime: ~700ms.

**Q: ElevenLabs vs OpenAI TTS?**
ElevenLabs Turbo v2.5 is ~150ms first-byte vs OpenAI TTS-1 at ~300ms. ElevenLabs voices sound better. Cost: about the same.

**Q: How do I add a vector store / RAG?**
Use `@ai-sdk/openai` embeddings + Vercel KV for cheap dev; for prod, point at Pinecone or pg-vector.

## Sources

- [How to build AI Agents with Vercel and the AI SDK](https://vercel.com/kb/guide/how-to-build-ai-agents-with-vercel-and-the-ai-sdk)
- [Vercel AI SDK Voice Elements changelog](https://vercel.com/changelog/ai-voice-elements)
- [Build an AI Voice Assistant with Twilio Voice + OpenAI Realtime SDK + Node](https://www.twilio.com/en-us/blog/developers/tutorials/product/speech-assistant-realtime-agents-sdk-node)
- [Vercel AI SDK introduction](https://ai-sdk.dev/docs/introduction)
- [Build a voice agent in JavaScript with Vercel AI SDK — DEV](https://dev.to/mkp_bijit/build-a-voice-agent-in-javascript-with-vercel-ai-sdk-1dc3)

---

Source: https://callsphere.ai/blog/vw5h-build-voice-agent-vercel-ai-sdk-twilio