How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End
By Sagar Shankaran, Founder of CallSphere
Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz.
Key takeaways
TL;DR: FastAPI's WebSocket route fits naturally between a browser microphone and OpenAI Realtime. Use PCM16 at 24kHz, run two async tasks per session, and you get a clean speech-to-speech loop in ~120 lines of Python.
What you'll build
A FastAPI server that accepts a browser WebSocket carrying PCM16 24kHz audio chunks, forwards them to OpenAI Realtime, and streams model audio deltas back. A simple HTML page captures the microphone, downsamples to 24kHz Int16, and plays the response through the Web Audio API. End-to-end latency: 700–1100ms.
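To get a feel for the data rates involved, here is a small worked calculation. The sample rate and PCM16 format come from the description above; the 100 ms chunk duration is an assumption chosen for illustration, not something the tutorial prescribes.

```python
import base64

SAMPLE_RATE = 24_000     # Hz, as used throughout this tutorial
BYTES_PER_SAMPLE = 2     # PCM16 = one 16-bit sample per frame (mono)
CHUNK_MS = 100           # assumed chunk duration (hypothetical)

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
chunk_bytes = bytes_per_second * CHUNK_MS // 1000

# Audio travels to OpenAI as base64 inside JSON, which adds about 33%.
b64_chars = len(base64.b64encode(bytes(chunk_bytes)))

print(bytes_per_second, chunk_bytes, b64_chars)
```

Under these assumptions the raw stream is 48,000 bytes per second, so a 100 ms chunk is 4,800 bytes of PCM and 6,400 characters once base64-encoded.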
Prerequisites
- Python 3.11+ and pip install fastapi uvicorn websockets.
- OPENAI_API_KEY exported in your shell.
- Modern browser (Chrome/Safari) with microphone permission.
- Basic Float32 → Int16 PCM understanding (browser ships Float32; OpenAI wants Int16).
- Optional: pip install python-dotenv for env loading.
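If the Float32 → Int16 step is new to you, the idea can be sketched in a few lines of Python. The helper name float32_to_pcm16 is hypothetical, and scaling by 32767 is one common convention rather than the only one; the browser does the same thing in JavaScript.

```python
import struct

def float32_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))                 # clamp out-of-range input
        out += struct.pack("<h", int(s * 32767))   # scale to the int16 range
    return bytes(out)

pcm = float32_to_pcm16([0.0, 0.5, -1.0, 2.0])
print(len(pcm))                    # 2 bytes per sample
print(struct.unpack("<4h", pcm))   # decoded back to integers
```

The last input value (2.0) is deliberately out of range to show the clamp; without it, struct.pack would raise on overflow.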
Architecture
flowchart LR
Mic[Browser Mic Float32] --> DS[Downsample 24kHz Int16]
DS -- WS --> FA[FastAPI /ws]
FA -- WS --> OA[OpenAI Realtime]
OA -- audio.delta --> FA
FA -- WS --> AP[AudioPlayer Web Audio]
Step 1 — FastAPI WebSocket endpoint
```python
# app.py
import os, json, asyncio, base64, websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse

app = FastAPI()

OPENAI_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}
```
Step 2 — Configure the OpenAI session
```python
SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.55,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
        "instructions": "You are a concise voice assistant. Reply in 1-2 short sentences.",
    },
}
```
Step 3 — Bridge the two WebSockets
Use asyncio.gather so each direction runs independently. Don't await one before pumping the other — that's how you get echo and choppy audio.
```python
@app.websocket("/ws")
async def ws(client: WebSocket):
    await client.accept()
    async with websockets.connect(OPENAI_URL, additional_headers=HEADERS) as oai:
        await oai.send(json.dumps(SESSION))

        async def client_to_oai():
            # Browser -> OpenAI: wrap each raw PCM16 chunk in an append event.
            try:
                while True:
                    chunk = await client.receive_bytes()  # raw int16 PCM
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode(),
                    }))
            except Exception:
                pass  # browser disconnected or upstream closed
            finally:
                await oai.close()  # ends the loop in oai_to_client as well

        async def oai_to_client():
            # OpenAI -> browser: audio as binary frames, transcripts as JSON text.
            async for raw in oai:
                ev = json.loads(raw)
                if ev["type"] == "response.audio.delta":
                    pcm = base64.b64decode(ev["delta"])
                    await client.send_bytes(pcm)
                elif ev["type"] == "response.audio_transcript.done":
                    await client.send_text(json.dumps({"role": "assistant",
                                                       "text": ev["transcript"]}))

        await asyncio.gather(client_to_oai(), oai_to_client())
```
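The message framing used by the two pumps can be exercised without any network connection. The sketch below pulls it into two pure functions; make_append_event and extract_audio are hypothetical helper names, while the event types and fields are the ones used in the bridge above.

```python
import base64
import json
from typing import Optional

def make_append_event(chunk: bytes) -> str:
    """Wrap raw PCM16 bytes in the JSON event sent upstream."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode(),
    })

def extract_audio(raw: str) -> Optional[bytes]:
    """Return decoded PCM16 bytes for audio deltas, None for other events."""
    ev = json.loads(raw)
    if ev.get("type") == "response.audio.delta":
        return base64.b64decode(ev["delta"])
    return None

# Simulate a downstream audio delta carrying two samples.
sample = b"\x01\x00\xff\x7f"
delta = json.dumps({"type": "response.audio.delta",
                    "delta": base64.b64encode(sample).decode()})
print(extract_audio(delta) == sample)
print(json.loads(make_append_event(sample))["type"])
```

Keeping the framing in plain functions like these makes it straightforward to unit-test the encoding separately from the WebSocket plumbing.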
Step 4 — Browser microphone capture (Float32 → Int16 24kHz)
```html
```
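The rate-conversion half of this step can be illustrated in Python. This is a minimal sketch using linear interpolation with no anti-aliasing filter, so it shows the idea rather than production-quality resampling; resample_linear is a hypothetical helper name, and the browser would perform the equivalent in JavaScript before converting to Int16.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample a list of floats from src_rate to dst_rate by linear interpolation."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate            # input samples per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        j = int(pos)                      # index of the sample at or before pos
        frac = pos - j                    # how far pos sits between j and j + 1
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# 48 kHz -> 24 kHz keeps every second sample.
print(resample_linear([0.0, 1.0, 2.0, 3.0], 48_000, 24_000))
# 10 ms captured at 44.1 kHz becomes 240 samples at 24 kHz.
print(len(resample_linear([0.0] * 441, 44_100, 24_000)))
```

A general ratio is handled here because capture hardware does not always run at 48kHz; 44.1kHz is also common.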
Step 5 — Play model audio in the browser
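```js
ws.binaryType = "arraybuffer"; // so e.data is an ArrayBuffer, not a Blob
let playAt = 0;                // when the next chunk should start, in ctx time

ws.onmessage = (e) => {
  if (typeof e.data === "string") return; // transcript
  const i16 = new Int16Array(e.data);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x7fff;
  const buf = ctx.createBuffer(1, f32.length, 24000);
  buf.copyToChannel(f32, 0);
  const s = ctx.createBufferSource();
  s.buffer = buf;
  s.connect(ctx.destination);
  // Queue chunks back to back; a bare s.start() would overlap them,
  // because deltas arrive faster than real time.
  playAt = Math.max(playAt, ctx.currentTime);
  s.start(playAt);
  playAt += buf.duration;
};
```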
Step 6 — Run it
```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```
Open http://localhost:8000 (serve the HTML separately or mount it on FastAPI), grant mic access, and start talking. The model should reply within ~1 second.
Common pitfalls
- Sample rate mismatch: if your AudioContext runs at 48000 but you tell OpenAI 24000, audio is pitch-shifted. Input captured at 48kHz is heard slowed down by the model, and 24kHz output played as 48kHz sounds chipmunk-fast. Set new AudioContext({ sampleRate: 24000 }).
- Float32 sent as bytes: OpenAI rejects malformed audio silently. Always convert to Int16Array, then send its .buffer.
- Not running both pumps concurrently: a serial loop will deadlock; use asyncio.gather.
- additional_headers vs extra_headers: depends on your websockets lib version. websockets.connect accepts additional_headers from version 14; older releases use extra_headers.
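For the concurrency pitfall, here is a stdlib-only sketch of running two pumps side by side and stopping the survivor as soon as either one ends. It uses asyncio.wait with explicit cancellation rather than asyncio.gather, and the pump coroutines are hypothetical stand-ins for the real WebSocket loops.

```python
import asyncio

async def run_pumps(*pumps):
    """Run coroutines concurrently; when any one finishes, cancel the rest."""
    tasks = [asyncio.create_task(p) for p in pumps]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for t in pending:
        t.cancel()
    # Wait for the cancellations to settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    for t in done:
        t.result()  # re-raise any exception from a finished pump
    return len(done), len(pending)

async def short_pump():
    # Stand-in for a direction that ends, e.g. the browser disconnecting.
    await asyncio.sleep(0.01)

async def endless_pump():
    # Stand-in for a direction that would otherwise keep waiting forever.
    while True:
        await asyncio.sleep(3600)

print(asyncio.run(run_pumps(short_pump(), endless_pump())))
```

The design choice here is that a voice session is only useful while both directions are alive, so the end of either pump tears down the other.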
How CallSphere does this in production
CallSphere's Healthcare line uses this exact PCM16 24kHz pattern with server VAD at 0.55 — chosen because clinicians often pause mid-sentence and a stricter threshold cuts them off. After each call we run a post-call analytics job that scores sentiment (–1.0 to 1.0) and lead intent (0–100) from the transcript. The Salon vertical adds 4 specialist agents and ElevenLabs voices with GB-YYYYMMDD-### booking refs. See it live or start a trial.
FAQ
Why PCM16 24kHz instead of mu-law? Browsers can't encode mu-law cheaply, but PCM16 is one downsample step away from getUserMedia output. Mu-law is for telephony.
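Can I use asyncio.create_task? Yes. Note that gather propagates the first exception to the caller but does not cancel the other pump. If you want both to stop together, create tasks and cancel the survivor yourself, or use asyncio.TaskGroup (Python 3.11+).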
How do I add streaming text output? Subscribe to response.audio_transcript.delta and forward strings — useful for live captions.
Production hosting? Deploy to Fly.io or k3s. Keep one process per region; FastAPI scales horizontally just fine.
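For the streaming text question above, a minimal sketch of turning transcript deltas into a growing caption string might look like this. The captions generator is a hypothetical helper, and it assumes each delta event carries its text in a "delta" field, mirroring the audio delta events handled earlier.

```python
def captions(events):
    """Yield the caption so far after each transcript delta event."""
    text = ""
    for ev in events:
        kind = ev.get("type")
        if kind == "response.audio_transcript.delta":
            text += ev.get("delta", "")   # assumed field name
            yield text
        elif kind == "response.audio_transcript.done":
            text = ""                     # next response starts a fresh caption

stream = [
    {"type": "response.audio_transcript.delta", "delta": "Hel"},
    {"type": "response.audio.delta", "delta": "AAAA"},   # audio, ignored here
    {"type": "response.audio_transcript.delta", "delta": "lo"},
    {"type": "response.audio_transcript.done", "transcript": "Hello"},
]
print(list(captions(stream)))
```

In the bridge, each yielded string would be forwarded to the browser with client.send_text, the same way the final transcript is.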
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.