
Build a Voice Agent with Retell's Custom LLM URL (BYO Model, 2026)

Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls.

TL;DR — Retell exposes a Custom LLM WebSocket contract. You expose wss://yourhost/llm-websocket/:call_id, paste it into the Retell agent config, and Retell will stream user transcripts to you and consume your token deltas as the spoken response. This is how you bring Claude, a fine-tune, or any non-OpenAI brain into Retell's sub-500ms voice stack.

What you'll build

A FastAPI WebSocket server that adapts Retell's protocol to OpenAI/Claude streaming, giving you control over context, tools, and guardrails while Retell handles STT, TTS, VAD, and PSTN.

Architecture

```mermaid
flowchart LR
  CL[Caller PSTN] --> RT[Retell voice runtime]
  RT -- WS user transcript --> SV[Your /llm-websocket]
  SV -- WS token deltas --> RT
  SV -- HTTP --> OA[OpenAI / Anthropic]
```
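The WS contract is easiest to see as typed frames. A minimal sketch of the two message shapes the server below sends (field names taken from the server code; this is not an exhaustive list of Retell's frame types):

```python
from typing import TypedDict, Literal

class ConfigFrame(TypedDict):
    response_type: Literal["config"]
    config: dict  # e.g. {"auto_reconnect": True, "call_details": True}

class ResponseFrame(TypedDict):
    response_type: Literal["response"]
    response_id: int        # echo the id from the response_required frame
    content: str            # token delta to be spoken by TTS
    content_complete: bool  # True exactly once per turn
    end_call: bool

def make_delta(response_id: int, text: str) -> ResponseFrame:
    """Build one streaming token frame for Retell."""
    return {
        "response_type": "response",
        "response_id": response_id,
        "content": text,
        "content_complete": False,
        "end_call": False,
    }
```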

Step 1 — Bootstrap server

```bash
pip install fastapi "uvicorn[standard]" openai anthropic websockets
```

Step 2 — Implement the contract

```python
# server.py
import json, os

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import AsyncOpenAI

app = FastAPI()
oa = AsyncOpenAI()


SYS = "You are Ava, a friendly clinic concierge. Confirm slots; never invent times."

@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    await ws.accept()
    history = [{"role": "system", "content": SYS}]
    # 1. Send config first frame
    await ws.send_json({
        "response_type": "config",
        "config": {"auto_reconnect": True, "call_details": True},
    })
    # 2. Optional begin message
    await ws.send_json({
        "response_type": "response",
        "response_id": 0,
        "content": "Hi — Sunrise Clinic. How can I help?",
        "content_complete": True,
        "end_call": False,
    })
    try:
        while True:
            msg = json.loads(await ws.receive_text())
            if msg["interaction_type"] == "ping_pong":
                await ws.send_json({"response_type": "ping_pong", "timestamp": msg["timestamp"]})
                continue
            if msg["interaction_type"] != "response_required":
                continue
            history.append({"role": "user", "content": msg["transcript"][-1]["content"]})
            stream = await oa.chat.completions.create(
                model="gpt-4o",
                messages=history,
                stream=True,
            )
            full = ""
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if not delta:
                    continue
                full += delta
                await ws.send_json({
                    "response_type": "response",
                    "response_id": msg["response_id"],
                    "content": delta,
                    "content_complete": False,
                })
            await ws.send_json({
                "response_type": "response",
                "response_id": msg["response_id"],
                "content": "",
                "content_complete": True,
            })
            history.append({"role": "assistant", "content": full})
    except WebSocketDisconnect:
        pass
```

Step 3 — Configure Retell

In dash.retellai.com → Agents → Edit → LLM, switch from "Retell LLM" to Custom LLM and paste:

```
wss://yourhost.com/llm-websocket
```

Retell appends `/<call_id>` per call.

Step 4 — Add functions

Define functions in the Retell dashboard with a url field. When the LLM should call one, emit:

```python
await ws.send_json({
    "response_type": "tool_call_invocation",
    "tool_call_id": "tc_1",
    "name": "book_slot",
    "arguments": json.dumps({"iso": "2026-05-08T15:00:00Z"}),
})
```

Retell calls your function URL and returns a `tool_call_result` frame.
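On the return path, one way to fold that result back into the conversation so the model can phrase a spoken confirmation. The field names read from the frame here (`name`, `content`) are assumptions modeled on the invocation frame above — verify them against the actual `tool_call_result` frames in your logs:

```python
import json

def fold_tool_result(frame: dict, history: list) -> list:
    """Append a tool outcome to chat history so the next
    completion can phrase a reply for the caller."""
    # Frame field names ("name", "content") are assumptions --
    # check them against Retell's real tool_call_result payload.
    result = frame.get("content", "")
    history.append({
        "role": "system",
        "content": f"Tool {frame.get('name', 'unknown')} returned: {result}",
    })
    return history

history = [{"role": "system", "content": "You are Ava..."}]
fold_tool_result({"name": "book_slot", "content": json.dumps({"ok": True})}, history)
```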

Step 5 — Swap to Claude

Replace the OpenAI block with Anthropic streaming:


```python
from anthropic import AsyncAnthropic

an = AsyncAnthropic()

# Inside the handler, replacing the OpenAI streaming block:
async with an.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=SYS,            # Anthropic takes the system prompt separately,
    messages=history[1:],  # so drop the system turn from history
) as s:
    async for delta in s.text_stream:
        await ws.send_json({
            "response_type": "response",
            "response_id": msg["response_id"],
            "content": delta,
            "content_complete": False,
        })
```

Step 6 — Deploy

```bash
uvicorn server:app --host 0.0.0.0 --port 8443 --ssl-keyfile k.pem --ssl-certfile c.pem
```

Retell requires a wss:// URL: either serve TLS directly from uvicorn as above, or terminate TLS at your load balancer and run uvicorn plain behind it.
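If you terminate TLS at the edge, the proxy has to forward the WebSocket Upgrade handshake or Retell's connection will fail. A sketch for nginx (hostname, cert paths, and upstream port are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name yourhost.com;              # placeholder
    ssl_certificate     /etc/ssl/c.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/k.pem;

    location /llm-websocket/ {
        proxy_pass http://127.0.0.1:8000;  # uvicorn without --ssl-* flags
        proxy_http_version 1.1;            # required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_read_timeout 300s;           # calls outlive the 60s default
    }
}
```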

Pitfalls

  • First message: you MUST send the `response_id: 0` message immediately or Retell stays silent.
  • Ping-pong: respond within 1s or Retell tears down the call.
  • `content_complete`: send `true` exactly once per turn — multiple completes confuse the runtime.
  • Reconnect loops: set `auto_reconnect: true` in the config frame; otherwise transient WS hiccups end the call.
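A small guard makes the last two pitfalls mechanical: track which `response_id`s have already been completed, and drop frames for turns that have been superseded (Retell bumps `response_id` when the caller interrupts). A minimal sketch — call `allow()` before each `ws.send_json` in the server above:

```python
class TurnGuard:
    """Enforce: one content_complete per response_id, no stale frames."""

    def __init__(self):
        self.latest_id = -1
        self.completed = set()

    def allow(self, response_id: int, complete: bool) -> bool:
        if response_id < self.latest_id:
            return False        # stale turn: caller already interrupted
        self.latest_id = max(self.latest_id, response_id)
        if complete:
            if response_id in self.completed:
                return False    # second complete for this turn: drop it
            self.completed.add(response_id)
        return True

g = TurnGuard()
assert g.allow(0, False)      # stream a delta for turn 0
assert g.allow(0, True)       # first complete: allowed
assert not g.allow(0, True)   # duplicate complete: dropped
assert g.allow(1, False)      # new turn after interruption
assert not g.allow(0, False)  # late frame from the old turn: dropped
```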

How CallSphere does this

CallSphere uses Retell + custom LLM for the Behavioral Health vertical where Claude's tone control beats GPT-4o; the same pattern feeds 37 agents across 6 verticals with 90+ tools and 115+ DB tables. $149/$499/$1,499 · 14-day trial · 22% affiliate.

FAQ

Latency vs Retell LLM? ~+50-150ms because of the extra WS hop — still under 600ms p50 with Claude.

Tool calls? Define in Retell dashboard, emit tool_call_invocation frames, handle results in tool_call_result.

Auth? Add a query param token; verify it in your WS accept handler.

Audio access? No — Retell handles STT/TTS; you only see text. For raw audio, use a different vendor (LiveKit/Pipecat).



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available — no signup required.