Build a Voice Agent with Retell's Custom LLM URL (BYO Model, 2026)
Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls.
TL;DR — Retell exposes a Custom LLM WebSocket contract. You expose `wss://yourhost/llm-websocket/:call_id`, paste it into the Retell agent config, and Retell streams user transcripts to you and consumes your token deltas as the spoken response. This is how you bring Claude, a fine-tune, or any non-OpenAI brain into Retell's sub-500ms voice stack.
What you'll build
A FastAPI WebSocket server that adapts Retell's protocol to OpenAI/Claude streaming, giving you control over context, tools, and guardrails while Retell handles STT, TTS, VAD, and PSTN.
Architecture
```mermaid
flowchart LR
    CL[Caller PSTN] --> RT[Retell voice runtime]
    RT -- WS user transcript --> SV[Your /llm-websocket]
    SV -- WS token deltas --> RT
    SV -- HTTP --> OA[OpenAI / Anthropic]
```
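Before wiring up the socket, it helps to see the two frame types your server emits as plain dicts. The helper names below are ours, not Retell SDK functions; the field names match the frames used in the steps that follow.

```python
def make_config_frame(auto_reconnect: bool = True, call_details: bool = True) -> dict:
    """First frame your server sends after accepting the socket."""
    return {
        "response_type": "config",
        "config": {"auto_reconnect": auto_reconnect, "call_details": call_details},
    }


def make_response_frame(response_id: int, content: str, complete: bool) -> dict:
    """Token-delta frame Retell consumes and speaks via TTS."""
    return {
        "response_type": "response",
        "response_id": response_id,
        "content": content,          # "" is fine on the final, complete frame
        "content_complete": complete,
    }
```

Every delta for one turn carries the same `response_id`; only the last frame of the turn sets `content_complete` to true.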
Step 1 — Bootstrap server
```bash
pip install fastapi "uvicorn[standard]" openai anthropic websockets
```
Step 2 — Implement the contract
```python
# server.py
import json

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import AsyncOpenAI

app = FastAPI()
oa = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
SYS = "You are Ava, a friendly clinic concierge. Confirm slots; never invent times."

@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    await ws.accept()
    history = [{"role": "system", "content": SYS}]

    # 1. Send the config frame first
    await ws.send_json({
        "response_type": "config",
        "config": {"auto_reconnect": True, "call_details": True},
    })

    # 2. Optional begin message (response_id 0)
    await ws.send_json({
        "response_type": "response",
        "response_id": 0,
        "content": "Hi — Sunrise Clinic. How can I help?",
        "content_complete": True,
        "end_call": False,
    })

    try:
        while True:
            msg = json.loads(await ws.receive_text())

            # Answer keepalives immediately
            if msg["interaction_type"] == "ping_pong":
                await ws.send_json({
                    "response_type": "ping_pong",
                    "timestamp": msg["timestamp"],
                })
                continue
            if msg["interaction_type"] != "response_required":
                continue

            history.append({"role": "user", "content": msg["transcript"][-1]["content"]})

            stream = await oa.chat.completions.create(
                model="gpt-4o",
                messages=history,
                stream=True,
            )
            full = ""
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if not delta:
                    continue
                full += delta
                await ws.send_json({
                    "response_type": "response",
                    "response_id": msg["response_id"],
                    "content": delta,
                    "content_complete": False,
                })

            # Exactly one content_complete per turn
            await ws.send_json({
                "response_type": "response",
                "response_id": msg["response_id"],
                "content": "",
                "content_complete": True,
            })
            history.append({"role": "assistant", "content": full})
    except WebSocketDisconnect:
        pass
```
Step 3 — Configure Retell
In dash.retellai.com → Agents → Edit → LLM, switch from "Retell LLM" to Custom LLM and paste:
```
wss://yourhost.com/llm-websocket
```
Retell appends `/<call_id>` to this URL on each call.
Step 4 — Add functions
Define functions in the Retell dashboard with a `url` field. When the LLM should call one, emit:
```python
await ws.send_json({
    "response_type": "tool_call_invocation",
    "tool_call_id": "tc_1",
    "name": "book_slot",
    "arguments": json.dumps({"iso": "2026-05-08T15:00:00Z"}),
})
```

Retell calls your function URL and sends you back a tool_call_result frame.
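When the result comes back, fold it into the running history so the next turn can use it. A minimal sketch, assuming the result frame carries `tool_call_id` and `content` fields (verify the exact shape against Retell's custom-LLM docs):

```python
def apply_tool_result(history: list[dict], msg: dict) -> list[dict]:
    """Append a tool result to the history so the next LLM turn sees it.

    Helper name is ours; the frame field names are assumptions, not a
    confirmed Retell schema.
    """
    history.append({
        "role": "assistant",
        "content": f"[tool {msg.get('tool_call_id')} returned: {msg.get('content')}]",
    })
    return history
```

In the receive loop you would dispatch on the frame type, alongside the `ping_pong` and `response_required` branches, and `continue` after applying the result.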
Step 5 — Swap to Claude
Replace the OpenAI block with Anthropic streaming:
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
```python
from anthropic import AsyncAnthropic

an = AsyncAnthropic()

async with an.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=SYS,
    messages=history[1:],  # Anthropic takes the system prompt separately
) as s:
    async for delta in s.text_stream:
        await ws.send_json({
            "response_type": "response",
            "response_id": rid,  # response_id from the current response_required frame
            "content": delta,
            "content_complete": False,
        })
```
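Note that Anthropic's Messages API takes the system prompt as a separate `system` parameter, which is why the snippet passes `history[1:]`. A small helper (name ours) makes the split explicit for histories shaped like the one in Step 2:

```python
def to_anthropic(history: list[dict]) -> tuple[str, list[dict]]:
    """Split an OpenAI-style history into (system_prompt, messages).

    Only handles the simple shape used in this post: one system message,
    then alternating user/assistant turns.
    """
    system = ""
    messages = []
    for m in history:
        if m["role"] == "system":
            system = m["content"]
        else:
            messages.append({"role": m["role"], "content": m["content"]})
    return system, messages
```

Then pass `system=sys_prompt, messages=msgs` to `an.messages.stream(...)`.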
Step 6 — Deploy
```bash
uvicorn server:app --host 0.0.0.0 --port 8443 --ssl-keyfile k.pem --ssl-certfile c.pem
```
Retell requires wss://. Serve TLS directly as above, or terminate it at your load balancer and run uvicorn without the --ssl-* flags behind it.
Pitfalls
- First message: You MUST send `response_id: 0` immediately or Retell stays silent.
- Ping-pong: Respond within 1s or Retell tears down the call.
- content_complete: Send `true` exactly once per turn — multiple completes confuse the runtime.
- Reconnect loops: Set `auto_reconnect: true` in the config frame; otherwise transient WS hiccups end the call.
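One more guard worth adding: when the caller barges in, Retell issues a new `response_required` with a higher `response_id`, and a still-running stream for the old id should stop sending deltas. A minimal sketch, assuming ids only increase within a call; the class is ours, not part of Retell's protocol:

```python
class ResponseGate:
    """Tracks the newest response_id so stale streams can bail out early."""

    def __init__(self) -> None:
        self.current = -1

    def begin(self, response_id: int) -> None:
        # Called on each response_required frame
        self.current = max(self.current, response_id)

    def live(self, response_id: int) -> bool:
        # True while this turn is still the newest one
        return response_id >= self.current
```

Call `gate.begin(msg['response_id'])` when a turn starts, and check `gate.live(rid)` inside the token loop before each `send_json`, breaking out if it returns False.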
How CallSphere does this
CallSphere uses Retell + custom LLM for the Behavioral Health vertical where Claude's tone control beats GPT-4o; the same pattern feeds 37 agents across 6 verticals with 90+ tools and 115+ DB tables. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Latency vs Retell LLM? ~+50-150ms because of the extra WS hop — still under 600ms p50 with Claude.
Tool calls? Define in Retell dashboard, emit tool_call_invocation frames, handle results in tool_call_result.
Auth? Add a query param token; verify it in your WS accept handler.
Audio access? No — Retell handles STT/TTS; you only see text. For raw audio, use a different vendor (LiveKit/Pipecat).
Sources
- Retell - Connect AI Call Agent to Custom LLM - https://www.retellai.com/integrations/custom-llm
- GitHub - RetellAI/retell-custom-llm-python-demo - https://github.com/RetellAI/retell-custom-llm-python-demo
- AssemblyAI Blog - Retell AI + AssemblyAI Custom LLM - https://www.assemblyai.com/blog/retell-ai-assemblyai-custom-llm-and-post-call-analytics
- Sacesta - Retell AI Function Calling Guide 2026 - https://www.sacesta.com/our-work/blog/complete-guide-retell-ai-function-calling-custom-tools
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.