By Sagar Shankaran, Founder of CallSphere
Retell lets you replace its LLM with a WebSocket of your own. Stream Claude or a fine-tune through Retell's voice runtime — Python WS server + pitfalls.
Key takeaways
TL;DR: Retell exposes a Custom LLM WebSocket contract. You expose wss://yourhost/llm-websocket/:call_id, paste it into the Retell agent config, and Retell streams user transcripts to you and consumes your token deltas as the spoken response. This is how you bring Claude, a fine-tune, or any non-OpenAI brain into Retell's sub-500ms voice stack.
This guide builds a FastAPI WebSocket server that adapts Retell's protocol to OpenAI/Claude streaming, giving you control over context, tools, and guardrails while Retell handles STT, TTS, VAD, and PSTN.
```mermaid
flowchart LR
    CL[Caller PSTN] --> RT[Retell voice runtime]
    RT -- WS user transcript --> SV[Your /llm-websocket]
    SV -- WS token deltas --> RT
    SV -- HTTP --> OA[OpenAI / Anthropic]
```
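The message flow in the diagram can be sketched as a small router that decides what to do with each inbound frame. This is a minimal sketch, not Retell's reference implementation: the field names (`interaction_type`, `timestamp`, `transcript`, `response_id`) are the ones used by the server in this article, while `route_event` and the `"action": "generate"` marker are hypothetical names introduced only for illustration.

```python
import json
from typing import Optional


def route_event(raw: str) -> Optional[dict]:
    """Classify one inbound frame from Retell.

    Returns a ping_pong frame to echo back, a hypothetical
    {"action": "generate", ...} marker when a spoken reply is needed,
    or None when the frame can be ignored.
    """
    msg = json.loads(raw)
    kind = msg.get("interaction_type")

    if kind == "ping_pong":
        # Keep-alive: echo the timestamp straight back.
        return {"response_type": "ping_pong", "timestamp": msg["timestamp"]}

    if kind == "response_required":
        # The last transcript entry is the utterance to answer.
        return {
            "action": "generate",  # hypothetical marker, not a Retell field
            "response_id": msg["response_id"],
            "user_text": msg["transcript"][-1]["content"],
        }

    # Anything else (for example transcript updates) needs no reply.
    return None
```

Keeping this decision in a pure function makes the protocol handling testable without opening a socket; the WebSocket handler then only sends whatever the router returns or starts streaming a reply.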
```bash
pip install fastapi "uvicorn[standard]" openai anthropic websockets
```
```python
import json
import os

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from openai import AsyncOpenAI

app = FastAPI()
oa = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

SYS = "You are Ava, a friendly clinic concierge. Confirm slots; never invent times."


@app.websocket("/llm-websocket/{call_id}")
async def llm(ws: WebSocket, call_id: str):
    await ws.accept()
    history = [{"role": "system", "content": SYS}]

    # 1. Send config first frame
    await ws.send_json({
        "response_type": "config",
        "config": {"auto_reconnect": True, "call_details": True},
    })

    # 2. Optional begin message
    await ws.send_json({
        "response_type": "response",
        "response_id": 0,
        "content": "Hi — Sunrise Clinic. How can I help?",
        "content_complete": True,
        "end_call": False,
    })

    try:
        while True:
            msg = json.loads(await ws.receive_text())

            # Keep-alive: echo the timestamp back
            if msg["interaction_type"] == "ping_pong":
                await ws.send_json({
                    "response_type": "ping_pong",
                    "timestamp": msg["timestamp"],
                })
                continue

            # Only generate a reply when Retell asks for one
            if msg["interaction_type"] != "response_required":
                continue

            history.append({
                "role": "user",
                "content": msg["transcript"][-1]["content"],
            })
            stream = await oa.chat.completions.create(
                model="gpt-4o",
                messages=history,
                stream=True,
            )

            # Forward each token delta as it arrives
            full = ""
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if not delta:
                    continue
                full += delta
                await ws.send_json({
                    "response_type": "response",
                    "response_id": msg["response_id"],
                    "content": delta,
                    "content_complete": False,
                })

            # Close the turn with a single content_complete frame
            await ws.send_json({
                "response_type": "response",
                "response_id": msg["response_id"],
                "content": "",
                "content_complete": True,
            })
            history.append({"role": "assistant", "content": full})
    except WebSocketDisconnect:
        pass
```
In dash.retellai.com → Agents → Edit → LLM, switch from "Retell LLM" to Custom LLM and paste:
```
wss://yourhost.com/llm-websocket
```
Retell appends /<call_id> per call.
Define functions in the Retell dashboard with a url field. When the LLM should call one, emit:
```python
await ws.send_json({
    "response_type": "tool_call_invocation",
    "tool_call_id": "tc_1",
    "name": "book_slot",
    "arguments": json.dumps({"iso": "2026-05-08T15:00:00Z"}),
})
```
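The invocation frame above, and the matching result frame, can be built with two small helpers. This is a sketch under assumptions: the `tool_call_invocation` fields mirror the snippet above, but the `tool_call_result` shape (`tool_call_id` plus a `content` string, sent from your server) is an assumption that should be checked against Retell's Custom LLM reference. The helper names are hypothetical.

```python
import json


def tool_invocation_frame(tool_call_id: str, name: str, arguments: dict) -> dict:
    """Build a tool_call_invocation frame; arguments are JSON-encoded as a string."""
    return {
        "response_type": "tool_call_invocation",
        "tool_call_id": tool_call_id,
        "name": name,
        "arguments": json.dumps(arguments),
    }


def tool_result_frame(tool_call_id: str, content: str) -> dict:
    """Build a tool_call_result frame (field names are an assumption, see lead-in)."""
    return {
        "response_type": "tool_call_result",
        "tool_call_id": tool_call_id,
        "content": content,
    }
```

Reusing the same `tool_call_id` in both frames is what ties a result back to its invocation.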
Replace the OpenAI block with Anthropic streaming:
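```python
from anthropic import AsyncAnthropic

an = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Inside the handler, in place of the OpenAI streaming block:
rid = msg["response_id"]
full = ""
async with an.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    system=SYS,  # Anthropic takes the system prompt as a separate argument
    messages=history[1:],  # so skip the system entry at history[0]
) as s:
    async for delta in s.text_stream:
        full += delta
        await ws.send_json({
            "response_type": "response",
            "response_id": rid,
            "content": delta,
            "content_complete": False,
        })
# Keep the final content_complete frame and history.append(...) from the handler.
```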
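The `history[1:]` slice above works because the system prompt sits at index 0. A slightly more defensive way to do the same split is sketched below; `to_anthropic` is a hypothetical helper name, and it assumes `history` is the OpenAI-style list of `{"role", "content"}` dicts used in this article.

```python
from typing import List, Tuple


def to_anthropic(history: List[dict]) -> Tuple[str, List[dict]]:
    """Split an OpenAI-style history into (system prompt, non-system messages)."""
    # Collect every system entry, wherever it appears in the list.
    system = "\n".join(m["content"] for m in history if m["role"] == "system")
    # Everything else is passed through unchanged as the messages argument.
    messages = [m for m in history if m["role"] != "system"]
    return system, messages
```

The two return values map onto the `system=` and `messages=` arguments of the streaming call, so the handler no longer depends on the system entry being first.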
```bash
uvicorn server:app --host 0.0.0.0 --port 8443 --ssl-keyfile k.pem --ssl-certfile c.pem
```
Retell requires WSS. Serve TLS from uvicorn as above, or terminate TLS at your load balancer and drop the SSL flags.
Begin frame: send response_id: 0 immediately or Retell stays silent.
content_complete: send true exactly once per turn; multiple completes confuse the runtime.
auto_reconnect: set true in the config frame; otherwise transient WS hiccups end the call.

CallSphere uses Retell + custom LLM for the Behavioral Health vertical, where Claude's tone control beats GPT-4o; the same pattern feeds 37 agents across 6 verticals with 90+ tools and 115+ DB tables. $149/$499/$1,499 · 14-day trial · 22% affiliate.
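The content_complete pitfall can be guarded against by building all frames for a turn in one place. The generator below is a minimal sketch (the name `turn_frames` is hypothetical) that turns a stream of token deltas into response frames and always ends with exactly one `content_complete: True` frame, using the same frame fields as the server in this article.

```python
from typing import Iterable, Iterator


def turn_frames(response_id: int, deltas: Iterable[str]) -> Iterator[dict]:
    """Yield one response frame per non-empty delta, then a single closing frame."""
    for delta in deltas:
        if not delta:
            continue  # skip empty chunks rather than sending blank frames
        yield {
            "response_type": "response",
            "response_id": response_id,
            "content": delta,
            "content_complete": False,
        }
    # Exactly one completion frame per turn, sent after the last delta.
    yield {
        "response_type": "response",
        "response_id": response_id,
        "content": "",
        "content_complete": True,
    }
```

In the handler, each yielded frame would be passed to `ws.send_json`, so the completion frame cannot be sent twice or forgotten.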
Latency vs Retell LLM? ~+50-150ms because of the extra WS hop — still under 600ms p50 with Claude.
Tool calls? Define in Retell dashboard, emit tool_call_invocation frames, handle results in tool_call_result.
Auth? Add a query param token; verify it in your WS accept handler.
Audio access? No — Retell handles STT/TTS; you only see text. For raw audio, use a different vendor (LiveKit/Pipecat).
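The query-param auth answer above can be sketched as a small check run before accepting the socket. This assumes the token reaches your server as a `token` query parameter and that the expected value comes from your own configuration; `token_ok` is a hypothetical helper name.

```python
import hmac
from urllib.parse import parse_qs


def token_ok(query_string: str, expected: str) -> bool:
    """Return True when the `token` query parameter matches the expected secret."""
    supplied = parse_qs(query_string).get("token", [""])[0]
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), expected.encode())
```

In a FastAPI handler this could be called with `ws.url.query` before `ws.accept()`, closing the connection (for example with `await ws.close(code=1008)`) when the check fails.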
© 2026 CallSphere LLC. All rights reserved.