
Build a Voice Agent with Retell's Custom LLM URL (BYO Model, 2026)

Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls.

TL;DR — Retell exposes a Custom LLM WebSocket contract. You expose wss://yourhost/llm-websocket/:call_id, paste it into the Retell agent config, and Retell will stream user transcripts to you and consume your token deltas as the spoken response. This is how you bring Claude, a fine-tune, or any non-OpenAI brain into Retell's sub-500ms voice stack.

What you'll build

A FastAPI WebSocket server that adapts Retell's protocol to OpenAI/Claude streaming, giving you control over context, tools, and guardrails while Retell handles STT, TTS, VAD, and PSTN.

Architecture

```mermaid
flowchart LR
  CL[Caller PSTN] --> RT[Retell voice runtime]
  RT -- WS user transcript --> SV[Your /llm-websocket]
  SV -- WS token deltas --> RT
  SV -- HTTP --> OA[OpenAI / Anthropic]
```
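The WS contract is easiest to see as typed frames. A minimal sketch of the two message shapes the server below sends (field names taken from the server code; this is not an exhaustive list of Retell's frame types):

```python
from typing import TypedDict, Literal

class ConfigFrame(TypedDict):
    response_type: Literal["config"]
    config: dict  # e.g. {"auto_reconnect": True, "call_details": True}

class ResponseFrame(TypedDict):
    response_type: Literal["response"]
    response_id: int        # echo the id from the response_required frame
    content: str            # token delta to be spoken by TTS
    content_complete: bool  # True exactly once per turn
    end_call: bool

def make_delta(response_id: int, text: str) -> ResponseFrame:
    """Build one streaming token frame for Retell."""
    return {
        "response_type": "response",
        "response_id": response_id,
        "content": text,
        "content_complete": False,
        "end_call": False,
    }
```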

Step 1 — Bootstrap server

```bash
pip install fastapi "uvicorn[standard]" openai anthropic websockets
```

Step 2 — Implement the contract

```python
# server.py
import json, os

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import AsyncOpenAI

app = FastAPI()
oa = AsyncOpenAI()


SYS = "You are Ava, a friendly clinic concierge. Confirm slots; never invent times."

@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    await ws.accept()
    history = [{"role": "system", "content": SYS}]
    # 1. Send config first frame
    await ws.send_json({
        "response_type": "config",
        "config": {"auto_reconnect": True, "call_details": True},
    })
    # 2. Optional begin message
    await ws.send_json({
        "response_type": "response",
        "response_id": 0,
        "content": "Hi — Sunrise Clinic. How can I help?",
        "content_complete": True,
        "end_call": False,
    })
    try:
        while True:
            msg = json.loads(await ws.receive_text())
            if msg["interaction_type"] == "ping_pong":
                await ws.send_json({"response_type": "ping_pong", "timestamp": msg["timestamp"]})
                continue
            if msg["interaction_type"] != "response_required":
                continue
            history.append({"role": "user", "content": msg["transcript"][-1]["content"]})
            stream = await oa.chat.completions.create(
                model="gpt-4o",
                messages=history,
                stream=True,
            )
            full = ""
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if not delta:
                    continue
                full += delta
                await ws.send_json({
                    "response_type": "response",
                    "response_id": msg["response_id"],
                    "content": delta,
                    "content_complete": False,
                })
            await ws.send_json({
                "response_type": "response",
                "response_id": msg["response_id"],
                "content": "",
                "content_complete": True,
            })
            history.append({"role": "assistant", "content": full})
    except WebSocketDisconnect:
        pass
```

Step 3 — Configure Retell

In dash.retellai.com → Agents → Edit → LLM, switch from "Retell LLM" to Custom LLM and paste:

```
wss://yourhost.com/llm-websocket
```

Retell appends `/<call_id>` per call.

Step 4 — Add functions

Define functions in the Retell dashboard with a url field. When the LLM should call one, emit:

```python
await ws.send_json({
    "response_type": "tool_call_invocation",
    "tool_call_id": "tc_1",
    "name": "book_slot",
    "arguments": json.dumps({"iso": "2026-05-08T15:00:00Z"}),
})
```

Retell calls your function URL and returns a `tool_call_result` frame.
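On the return path, one way to fold that result back into the conversation so the model can phrase a spoken confirmation. The field names read from the frame here (`name`, `content`) are assumptions modeled on the invocation frame above — verify them against the actual `tool_call_result` frames in your logs:

```python
import json

def fold_tool_result(frame: dict, history: list) -> list:
    """Append a tool outcome to chat history so the next
    completion can phrase a reply for the caller."""
    # Frame field names ("name", "content") are assumptions --
    # check them against Retell's real tool_call_result payload.
    result = frame.get("content", "")
    history.append({
        "role": "system",
        "content": f"Tool {frame.get('name', 'unknown')} returned: {result}",
    })
    return history

history = [{"role": "system", "content": "You are Ava..."}]
fold_tool_result({"name": "book_slot", "content": json.dumps({"ok": True})}, history)
```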

Step 5 — Swap to Claude

Replace the OpenAI block with Anthropic streaming:


```python
from anthropic import AsyncAnthropic

an = AsyncAnthropic()

# Inside the handler, replacing the OpenAI streaming block:
async with an.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=SYS,            # Anthropic takes the system prompt separately,
    messages=history[1:],  # so drop the system turn from history
) as s:
    async for delta in s.text_stream:
        await ws.send_json({
            "response_type": "response",
            "response_id": msg["response_id"],
            "content": delta,
            "content_complete": False,
        })
```

Step 6 — Deploy

```bash
uvicorn server:app --host 0.0.0.0 --port 8443 --ssl-keyfile k.pem --ssl-certfile c.pem
```

Retell requires a wss:// URL: either serve TLS directly from uvicorn as above, or terminate TLS at your load balancer and run uvicorn plain behind it.
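If you terminate TLS at the edge, the proxy has to forward the WebSocket Upgrade handshake or Retell's connection will fail. A sketch for nginx (hostname, cert paths, and upstream port are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name yourhost.com;              # placeholder
    ssl_certificate     /etc/ssl/c.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/k.pem;

    location /llm-websocket/ {
        proxy_pass http://127.0.0.1:8000;  # uvicorn without --ssl-* flags
        proxy_http_version 1.1;            # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;           # calls outlive the 60s default
    }
}
```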

Pitfalls

  • First message: you MUST send the `response_id: 0` message immediately or Retell stays silent.
  • Ping-pong: respond within 1s or Retell tears down the call.
  • `content_complete`: send `true` exactly once per turn — multiple completes confuse the runtime.
  • Reconnect loops: set `auto_reconnect: true` in the config frame; otherwise transient WS hiccups end the call.
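A small guard makes the last two pitfalls mechanical: track which `response_id`s have already been completed, and drop frames for turns that have been superseded (Retell bumps `response_id` when the caller interrupts). A minimal sketch — call `allow()` before each `ws.send_json` in the server above:

```python
class TurnGuard:
    """Enforce: one content_complete per response_id, no stale frames."""

    def __init__(self):
        self.latest_id = -1
        self.completed = set()

    def allow(self, response_id: int, complete: bool) -> bool:
        if response_id < self.latest_id:
            return False        # stale turn: caller already interrupted
        self.latest_id = max(self.latest_id, response_id)
        if complete:
            if response_id in self.completed:
                return False    # second complete for this turn: drop it
            self.completed.add(response_id)
        return True

g = TurnGuard()
assert g.allow(0, False)      # stream a delta for turn 0
assert g.allow(0, True)       # first complete: allowed
assert not g.allow(0, True)   # duplicate complete: dropped
assert g.allow(1, False)      # new turn after interruption
assert not g.allow(0, False)  # late frame from the old turn: dropped
```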

How CallSphere does this

CallSphere uses Retell + custom LLM for the Behavioral Health vertical where Claude's tone control beats GPT-4o; the same pattern feeds 37 agents across 6 verticals with 90+ tools and 115+ DB tables. $149/$499/$1,499 · 14-day trial · 22% affiliate.

FAQ

Latency vs Retell LLM? ~+50-150ms because of the extra WS hop — still under 600ms p50 with Claude.

Tool calls? Define in Retell dashboard, emit tool_call_invocation frames, handle results in tool_call_result.

Auth? Add a query param token; verify it in your WS accept handler.

Audio access? No — Retell handles STT/TTS; you only see text. For raw audio, use a different vendor (LiveKit/Pipecat).



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available — no signup required.