How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End
Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz.
TL;DR — FastAPI's `websocket` route fits naturally between a browser microphone and OpenAI Realtime. Use PCM16 at 24kHz, run two async tasks per session, and you get a clean speech-to-speech loop in ~120 lines of Python.
What you'll build
A FastAPI server that accepts a browser WebSocket carrying PCM16 24kHz audio chunks, forwards them to OpenAI Realtime, and streams model audio deltas back. A simple HTML page captures the microphone, downsamples to 24kHz Int16, and plays the response through the Web Audio API. End-to-end latency: 700–1100ms.
Prerequisites
- Python 3.11+ and `pip install fastapi uvicorn websockets`.
- `OPENAI_API_KEY` exported in your shell.
- Modern browser (Chrome/Safari) with microphone permission.
- Basic Float32 → Int16 PCM understanding (browser ships Float32; OpenAI wants Int16).
- Optional: `pip install python-dotenv` for env loading (see the sketch below).
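If you opt for dotenv, two lines at the top of app.py load the key. A minimal sketch, assuming a `.env` file sits next to `app.py`:

```python
# Sketch: load OPENAI_API_KEY from a .env file (requires python-dotenv).
from dotenv import load_dotenv

load_dotenv()  # reads .env in the working directory into os.environ
```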
Architecture
```mermaid
flowchart LR
  Mic[Browser Mic Float32] --> DS[Downsample 24kHz Int16]
  DS -- WS --> FA[FastAPI /ws]
  FA -- WS --> OA[OpenAI Realtime]
  OA -- audio.delta --> FA
  FA -- WS --> AP[AudioPlayer Web Audio]
```
Step 1 — FastAPI WebSocket endpoint
```python
# app.py
import os, json, asyncio, base64, websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse

app = FastAPI()

OPENAI_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}
```
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 2 — Configure the OpenAI session
```python
SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",  # OpenAI decides when the user stopped talking
            "threshold": 0.55,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
        "instructions": "You are a concise voice assistant. Reply in 1-2 short sentences.",
    },
}
```
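The API acknowledges the update with a `session.updated` event. If you want to be defensive, a small helper can wait for that ack before audio starts flowing. A sketch, assuming the imports from Step 1 and the connected socket from Step 3; it is not required for the demo:

```python
# Sketch: wait for OpenAI's session.updated ack before streaming audio.
async def wait_for_session(oai):
    async for raw in oai:
        ev = json.loads(raw)
        if ev["type"] == "session.updated":
            return ev["session"]  # effective config, as OpenAI applied it
        if ev["type"] == "error":
            raise RuntimeError(ev)
```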
Step 3 — Bridge the two WebSockets
Use `asyncio.gather` so each direction runs independently. Don't await one before pumping the other — that's how you get echo and choppy audio.
```python
@app.websocket("/ws")
async def ws(client: WebSocket):
    await client.accept()
    async with websockets.connect(OPENAI_URL, additional_headers=HEADERS) as oai:
        await oai.send(json.dumps(SESSION))

        async def client_to_oai():
            try:
                while True:
                    chunk = await client.receive_bytes()  # raw int16 PCM
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode(),
                    }))
            except Exception:
                pass  # client hung up; let the session unwind

        async def oai_to_client():
            async for raw in oai:
                ev = json.loads(raw)
                if ev["type"] == "response.audio.delta":
                    pcm = base64.b64decode(ev["delta"])
                    await client.send_bytes(pcm)
                elif ev["type"] == "response.audio_transcript.done":
                    await client.send_text(json.dumps({"role": "assistant",
                                                       "text": ev["transcript"]}))

        await asyncio.gather(client_to_oai(), oai_to_client())
```
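One refinement worth considering: the bare `except Exception` above also swallows real bugs. A narrower catch, using FastAPI's `WebSocketDisconnect`, ends the pump only when the browser actually hangs up. A sketch of a drop-in replacement for `client_to_oai` inside `ws()`:

```python
from fastapi import WebSocketDisconnect

# Sketch: same pump, but only a client disconnect ends it quietly.
async def client_to_oai():
    try:
        while True:
            chunk = await client.receive_bytes()
            await oai.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode(),
            }))
    except WebSocketDisconnect:
        pass  # browser closed the socket; let gather unwind
```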
Step 4 — Browser microphone capture (Float32 → Int16 24kHz)
A minimal page looks like the sketch below: pin the `AudioContext` to 24kHz so the browser resamples the mic for you, convert each Float32 buffer to Int16, and send the raw bytes. `ScriptProcessorNode` is deprecated but keeps the example short; use an `AudioWorklet` in production.

```html
<!doctype html>
<!-- index.html: capture mic at 24kHz, Float32 -> Int16, stream over /ws -->
<script>
const ctx = new AudioContext({ sampleRate: 24000 });  // matches OpenAI's pcm16 24kHz
const ws = new WebSocket(`ws://${location.host}/ws`);
ws.binaryType = "arraybuffer";

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const src = ctx.createMediaStreamSource(stream);
  const proc = ctx.createScriptProcessor(4096, 1, 1);  // deprecated but simple
  proc.onaudioprocess = (e) => {
    const f32 = e.inputBuffer.getChannelData(0);
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i]));  // clamp before scaling
      i16[i] = s * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
  };
  src.connect(proc);
  proc.connect(ctx.destination);
});
</script>
```
Step 5 — Play model audio in the browser
```js
ws.onmessage = (e) => {
  if (typeof e.data === "string") return;  // transcript
  const i16 = new Int16Array(e.data);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x7fff;
  const buf = ctx.createBuffer(1, f32.length, 24000);
  buf.copyToChannel(f32, 0);
  const s = ctx.createBufferSource();
  s.buffer = buf;
  s.connect(ctx.destination);
  s.start();
};
```
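Starting each delta the instant it arrives is fine for a demo, but back-to-back chunks can leave audible seams. Production players usually keep a running playhead and pass it to `start()` so each buffer is scheduled exactly where the previous one ends.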
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Run it
```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```
Open http://localhost:8000 (serve the HTML separately or mount it on FastAPI), grant mic access, and start talking. The model should reply within ~1 second.
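Mounting the page on FastAPI keeps everything on one port, and it is what the otherwise-unused `HTMLResponse` import from Step 1 is for. A minimal sketch, assuming the Step 4 markup is saved as `index.html` next to `app.py`:

```python
# Sketch: serve the Step 4 page from FastAPI itself.
@app.get("/")
async def index():
    with open("index.html") as f:
        return HTMLResponse(f.read())
```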
Common pitfalls
- Sample rate mismatch: if your AudioContext is 48000 but you tell OpenAI 24000, voices sound chipmunk-fast. Set `new AudioContext({ sampleRate: 24000 })`.
- Float32 sent as bytes: OpenAI rejects malformed audio silently. Always convert to `Int16Array`, then send `.buffer`.
- Not running both pumps concurrently: a serial loop will deadlock — use `asyncio.gather`.
- `additional_headers` vs `extra_headers`: depends on your `websockets` lib version (>= 12 uses `additional_headers`); see the shim below.
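If your code has to run against both `websockets` APIs, a small compatibility shim can pick the right keyword at import time. A sketch; `connect_openai` is an illustrative helper name, and `OPENAI_URL`/`HEADERS` come from Step 1:

```python
# Sketch: choose the header kwarg your installed websockets version accepts.
import inspect
import websockets

_HEADER_KWARG = (
    "additional_headers"
    if "additional_headers" in inspect.signature(websockets.connect).parameters
    else "extra_headers"
)

async def connect_openai():
    return await websockets.connect(OPENAI_URL, **{_HEADER_KWARG: HEADERS})
```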
How CallSphere does this in production
CallSphere's Healthcare line uses this exact PCM16 24kHz pattern with server VAD at 0.55 — chosen because clinicians often pause mid-sentence and a stricter threshold cuts them off. After each call we run a post-call analytics job that scores sentiment (–1.0 to 1.0) and lead intent (0–100) from the transcript. The Salon vertical adds 4 specialist agents and ElevenLabs voices with GB-YYYYMMDD-### booking refs. See it live or start a trial.
FAQ
Why PCM16 24kHz instead of mu-law? Browsers can't encode mu-law cheaply, but PCM16 is one downsample step away from getUserMedia output. Mu-law is for telephony.
Can I use asyncio.create_task? Yes, but `gather` is simpler here: when one pump raises, `gather` re-raises, the `async with` closes the OpenAI socket, and the whole session tears down, which is what you want.
How do I add streaming text output? Subscribe to `response.audio_transcript.delta` and forward strings — useful for live captions (sketch below).
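A sketch of that branch, written as a helper so it reads standalone; call it on every event inside `oai_to_client()`, and note that the `partial` field name is illustrative, not part of the API:

```python
# Sketch: forward partial transcripts for live captions.
# "partial" is an illustrative field name for the browser side.
async def forward_caption(client, ev):
    if ev["type"] == "response.audio_transcript.delta":
        await client.send_text(json.dumps({"role": "assistant",
                                           "partial": ev["delta"]}))
```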
Production hosting? Deploy to Fly.io or k3s. Keep one process per region; FastAPI scales horizontally just fine.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.