---
title: "Build a Voice Agent with Retell's Custom LLM URL (BYO Model, 2026)"
description: "Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls."
canonical: https://callsphere.ai/blog/vw9h-build-voice-agent-retell-custom-llm-url-2026
category: "AI Voice Agents"
tags: ["Retell", "Custom LLM", "Voice Agent", "WebSocket", "BYO Model"]
author: "CallSphere Team"
published: 2026-03-27T00:00:00.000Z
updated: 2026-05-08T03:13:53.429Z
---

# Build a Voice Agent with Retell's Custom LLM URL (BYO Model, 2026)

> Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls.

> **TL;DR** — Retell exposes a Custom LLM WebSocket contract. You expose `wss://yourhost/llm-websocket/:call_id`, paste it into the Retell agent config, and Retell will stream user transcripts to you and consume your token deltas as the spoken response. This is how you bring Claude, a fine-tune, or any non-OpenAI brain into Retell's sub-500ms voice stack.

## What you'll build

A FastAPI WebSocket server that adapts Retell's protocol to OpenAI/Claude streaming, giving you control over context, tools, and guardrails while Retell handles STT, TTS, VAD, and PSTN.

## Architecture

```mermaid
flowchart LR
  CL[Caller PSTN] --> RT[Retell voice runtime]
  RT -- WS user transcript --> SV[Your /llm-websocket]
  SV -- WS token deltas --> RT
  SV -- HTTP --> OA[OpenAI / Anthropic]
```

## Step 1 — Bootstrap server

```bash
pip install fastapi "uvicorn[standard]" openai anthropic websockets
```

## Step 2 — Implement the contract

```python
# server.py
import json
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import AsyncOpenAI

app = FastAPI()
oa = AsyncOpenAI()

SYS = "You are Ava, a friendly clinic concierge. Confirm slots; never invent times."

@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    await ws.accept()
    history = [{"role": "system", "content": SYS}]
    # 1. Send config first frame
    await ws.send_json({
        "response_type": "config",
        "config": {"auto_reconnect": True, "call_details": True},
    })
    # 2. Optional begin message
    await ws.send_json({
        "response_type": "response",
        "response_id": 0,
        "content": "Hi — Sunrise Clinic. How can I help?",
        "content_complete": True, "end_call": False,
    })
    try:
        while True:
            msg = json.loads(await ws.receive_text())
            if msg["interaction_type"] == "ping_pong":
                await ws.send_json({"response_type": "ping_pong",
                                    "timestamp": msg["timestamp"]})
                continue
            if msg["interaction_type"] != "response_required":
                continue
            history.append({"role": "user",
                            "content": msg["transcript"][-1]["content"]})
            stream = await oa.chat.completions.create(
                model="gpt-4o", messages=history, stream=True,
            )
            full = ""
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if not delta: continue
                full += delta
                await ws.send_json({
                    "response_type": "response",
                    "response_id": msg["response_id"],
                    "content": delta,
                    "content_complete": False,
                })
            await ws.send_json({
                "response_type": "response",
                "response_id": msg["response_id"],
                "content": "", "content_complete": True,
            })
            history.append({"role": "assistant", "content": full})
    except WebSocketDisconnect:
        pass
```
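
For orientation, the frames Retell pushes to this socket look roughly like the dict below. The shape follows Retell's custom LLM protocol as used in its demo repo, but treat the exact fields as something to verify against the current docs:

```python
# Illustrative "response_required" frame from Retell (abridged).
# Retell also sends "update_only", "reminder_required", "ping_pong",
# and "call_details" frames; the handler above echoes pings and skips the rest.
incoming = {
    "interaction_type": "response_required",
    "response_id": 3,                 # echo this id back on every delta
    "transcript": [                   # running transcript, newest last
        {"role": "agent", "content": "Hi — Sunrise Clinic. How can I help?"},
        {"role": "user", "content": "Can I book tomorrow afternoon?"},
    ],
}
```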

## Step 3 — Configure Retell

In dash.retellai.com → Agents → Edit → LLM, switch from "Retell LLM" to **Custom LLM** and paste:
```
wss://yourhost.com/llm-websocket
```
Retell appends the call ID to that path for each call, which is why the route in Step 2 ends in `/{call_id}`.
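
You can also set this programmatically. A minimal sketch using the `retell-sdk` Python client; the field names (`response_engine`, `llm_websocket_url`) follow Retell's API reference at the time of writing, so double-check them against your SDK version:

```python
# pip install retell-sdk
# Assumes RETELL_API_KEY is set; agent name and voice id are illustrative.
import os
from retell import Retell

client = Retell(api_key=os.environ["RETELL_API_KEY"])

agent = client.agent.create(
    agent_name="Ava - Sunrise Clinic",
    voice_id="11labs-Adrian",
    response_engine={
        "type": "custom-llm",
        "llm_websocket_url": "wss://yourhost.com/llm-websocket",
    },
)
print(agent.agent_id)
```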

## Step 4 — Add functions

Define functions in the Retell dashboard with a `url` field. When the LLM should call one, emit:

```python
await ws.send_json({
    "response_type": "tool_call_invocation",
    "tool_call_id": "tc_1",
    "name": "book_slot",
    "arguments": json.dumps({"iso": "2026-05-08T15:00:00Z"}),
})
# Retell calls your function URL and returns a tool_call_result frame
```
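
Retell then calls the function's `url` and pushes the outcome back over the same socket as a `tool_call_result` frame. One way to fold that into the Step 2 loop; the exact keys on the result frame are an assumption here, so verify them against the protocol docs:

```python
# Add this branch to the receive loop in Step 2, above the
# response_required handling. Key names on the result frame
# ("tool_call_id", "content") are assumed; check Retell's docs.
if msg.get("interaction_type") == "tool_call_result":
    history.append({
        "role": "assistant",
        "content": f"[tool {msg.get('tool_call_id')} returned: {msg.get('content')}]",
    })
    continue  # the next response_required turn can reference the result
```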

## Step 5 — Swap to Claude

Replace the OpenAI streaming block inside the `response_required` branch with Anthropic's streaming client:

```python
from anthropic import AsyncAnthropic
an = AsyncAnthropic()

rid = msg["response_id"]   # the turn id from the response_required frame
full = ""
# Anthropic takes the system prompt separately, so skip history[0].
async with an.messages.stream(model="claude-3-5-sonnet-latest",
                              max_tokens=512, system=SYS,
                              messages=history[1:]) as s:
    async for delta in s.text_stream:
        full += delta
        await ws.send_json({"response_type": "response",
                            "response_id": rid, "content": delta,
                            "content_complete": False})
# Close the turn and keep the transcript, exactly as in Step 2.
await ws.send_json({"response_type": "response", "response_id": rid,
                    "content": "", "content_complete": True})
history.append({"role": "assistant", "content": full})
```

## Step 6 — Deploy

```bash
uvicorn server:app --host 0.0.0.0 --port 8443 --ssl-keyfile k.pem --ssl-certfile c.pem
```

Retell requires a `wss://` URL, so serve the socket over TLS: either point uvicorn at a certificate as above, or run plain HTTP and terminate TLS at your load balancer.

## Pitfalls

- **First message**: Send the config frame and the `response_id: 0` greeting immediately after `accept()`, or the agent stays silent at the start of the call.
- **Ping-pong**: Respond within 1s or Retell tears down the call.
- **`content_complete`**: Send `true` exactly once per turn — multiple completes confuse the runtime.
- **Reconnect loops**: Set `auto_reconnect: true` in the config frame; otherwise transient WS hiccups end the call.

## How CallSphere does this

CallSphere uses Retell + custom LLM for the Behavioral Health vertical where Claude's tone control beats GPT-4o; the same pattern feeds **37 agents** across **6 verticals** with **90+ tools** and **115+ DB tables**. **$149/$499/$1,499 · 14-day trial · 22% affiliate**.

## FAQ

**Latency vs Retell LLM?** ~+50-150ms because of the extra WS hop — still under 600ms p50 with Claude.

**Tool calls?** Define in Retell dashboard, emit `tool_call_invocation` frames, handle results in `tool_call_result`.

**Auth?** Add a query param token; verify it in your WS `accept` handler.
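
A minimal version of that check, assuming you append `?token=...` to the URL you register in Retell (the parameter name and env var here are illustrative):

```python
# Drop-in replacement for the handler signature in Step 2.
import hmac, os
from fastapi import WebSocket, status

EXPECTED = os.environ["LLM_WS_TOKEN"]  # same secret you append to the URL in Retell

@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    token = ws.query_params.get("token", "")
    if not hmac.compare_digest(token, EXPECTED):
        # Reject before accepting the upgrade; Retell never gets a transcript stream.
        await ws.close(code=status.WS_1008_POLICY_VIOLATION)
        return
    await ws.accept()
    # ... rest of the handler from Step 2 ...
```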

**Audio access?** No — Retell handles STT/TTS; you only see text. For raw audio, use a different vendor (LiveKit/Pipecat).

## Sources

- Retell - Connect AI Call Agent to Custom LLM - [https://www.retellai.com/integrations/custom-llm](https://www.retellai.com/integrations/custom-llm)
- GitHub - RetellAI/retell-custom-llm-python-demo - [https://github.com/RetellAI/retell-custom-llm-python-demo](https://github.com/RetellAI/retell-custom-llm-python-demo)
- AssemblyAI Blog - Retell AI + AssemblyAI Custom LLM - [https://www.assemblyai.com/blog/retell-ai-assemblyai-custom-llm-and-post-call-analytics](https://www.assemblyai.com/blog/retell-ai-assemblyai-custom-llm-and-post-call-analytics)
- Sacesta - Retell AI Function Calling Guide 2026 - [https://www.sacesta.com/our-work/blog/complete-guide-retell-ai-function-calling-custom-tools](https://www.sacesta.com/our-work/blog/complete-guide-retell-ai-function-calling-custom-tools)
