AI Engineering

How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End

Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz.

TL;DR — FastAPI's websockets route fits naturally between a browser microphone and OpenAI Realtime. Use PCM16 at 24kHz, run two async tasks per session, and you get a clean speech-to-speech loop in ~120 lines of Python.

What you'll build

A FastAPI server that accepts a browser WebSocket carrying PCM16 24kHz audio chunks, forwards them to OpenAI Realtime, and streams model audio deltas back. A simple HTML page captures the microphone, downsamples to 24kHz Int16, and plays the response through the Web Audio API. End-to-end latency: 700–1100ms.

Prerequisites

  1. Python 3.11+ and pip install fastapi uvicorn websockets.
  2. OPENAI_API_KEY exported in your shell.
  3. Modern browser (Chrome/Safari) with microphone permission.
  4. Basic Float32 → Int16 PCM understanding (browser ships Float32; OpenAI wants Int16).
  5. Optional: pip install python-dotenv for env loading.
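The Float32 → Int16 conversion in item 4 can be sketched in plain Python (the browser does the same thing in JS before sending): clamp each sample to [-1.0, 1.0], then scale by 0x7FFF.

```python
def float32_to_int16(samples):
    """Clamp [-1.0, 1.0] float samples and scale to signed 16-bit PCM."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # out-of-range samples would wrap and click
        out.append(int(s * 0x7FFF))
    return out

print(float32_to_int16([0.0, -1.0, 2.0]))  # [0, -32767, 32767] — 2.0 is clamped
```

The clamp matters: microphone gain spikes can push samples past ±1.0, and without it the cast wraps around and produces audible clicks.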

Architecture

flowchart LR
  Mic[Browser Mic Float32] --> DS[Downsample 24kHz Int16]
  DS -- WS --> FA[FastAPI /ws]
  FA -- WS --> OA[OpenAI Realtime]
  OA -- audio.delta --> FA
  FA -- WS --> AP[AudioPlayer Web Audio]

Step 1 — FastAPI WebSocket endpoint

```python
# app.py
import os, json, asyncio, base64
import websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse

app = FastAPI()
OPENAI_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}
```


Step 2 — Configure the OpenAI session

```python
SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.55,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
        "instructions": "You are a concise voice assistant. Reply in 1-2 short sentences.",
    },
}
```
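To build intuition for `silence_duration_ms`, it helps to translate the VAD wait into raw audio volume — a quick sanity check of how many PCM16 bytes at 24 kHz mono the server sits through before closing a turn (a sketch for tuning, not part of the app):

```python
SAMPLE_RATE = 24_000     # Hz, matches the session's pcm16 format
BYTES_PER_SAMPLE = 2     # Int16

def silence_bytes(ms: int) -> int:
    # Bytes of audio covering `ms` milliseconds of silence at 24 kHz mono PCM16.
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000

print(silence_bytes(500))  # 24000 — half a second of trailing silence
```

Raising `silence_duration_ms` makes the agent less likely to interrupt slow speakers, at the cost of slower turn-taking.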

Step 3 — Bridge the two WebSockets

Use asyncio.gather so each direction runs independently. Don't await one before pumping the other — that's how you get echo and choppy audio.

```python
@app.websocket("/ws")
async def ws(client: WebSocket):
    await client.accept()
    async with websockets.connect(OPENAI_URL, additional_headers=HEADERS) as oai:
        await oai.send(json.dumps(SESSION))

        async def client_to_oai():
            try:
                while True:
                    chunk = await client.receive_bytes()  # raw int16 PCM
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode(),
                    }))
            except Exception:
                pass

        async def oai_to_client():
            async for raw in oai:
                ev = json.loads(raw)
                if ev["type"] == "response.audio.delta":
                    pcm = base64.b64decode(ev["delta"])
                    await client.send_bytes(pcm)
                elif ev["type"] == "response.audio_transcript.done":
                    await client.send_text(json.dumps({
                        "role": "assistant",
                        "text": ev["transcript"],
                    }))

        await asyncio.gather(client_to_oai(), oai_to_client())
```
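The `input_audio_buffer.append` envelope from `client_to_oai` is easy to test in isolation — base64 round-trips the raw bytes, and the chunk length tells you exactly how much audio each message carries:

```python
import base64, json

SAMPLE_RATE = 24_000    # Hz, PCM16 mono
BYTES_PER_SAMPLE = 2

def append_event(chunk: bytes) -> str:
    # Wrap raw PCM16 bytes the way the bridge sends them to OpenAI.
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode(),
    })

chunk = bytes(2048)  # a 2048-byte chunk of silence
ev = json.loads(append_event(chunk))
assert base64.b64decode(ev["audio"]) == chunk  # lossless round-trip

ms = len(chunk) / (SAMPLE_RATE * BYTES_PER_SAMPLE) * 1000
print(round(ms, 1))  # 42.7 — milliseconds of audio per 2048-byte chunk
```

At 48,000 bytes per second, chunks in the 2–4 KB range keep per-message overhead low without adding noticeable buffering latency.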

Step 4 — Browser microphone capture (Float32 → Int16 24kHz)

A minimal capture page (a sketch — `ScriptProcessorNode` is deprecated but the simplest for a demo; use an `AudioWorklet` in production). Requesting the `AudioContext` at 24000 directly avoids a separate resampling step:

```html
<!-- index.html — capture mic, convert Float32 → Int16, send over WS -->
<script>
const ws = new WebSocket("ws://localhost:8000/ws");
ws.binaryType = "arraybuffer";
const ctx = new AudioContext({ sampleRate: 24000 });

navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const src = ctx.createMediaStreamSource(stream);
  const proc = ctx.createScriptProcessor(4096, 1, 1);
  proc.onaudioprocess = (e) => {
    const f32 = e.inputBuffer.getChannelData(0);
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i])); // clamp before scaling
      i16[i] = s * 0x7fff;
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
  };
  src.connect(proc);
  proc.connect(ctx.destination);
});
</script>
```

Step 5 — Play model audio in the browser

```js
ws.onmessage = (e) => {
  if (typeof e.data === "string") return; // transcript, not audio
  const i16 = new Int16Array(e.data);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x7fff;
  const buf = ctx.createBuffer(1, f32.length, 24000);
  buf.copyToChannel(f32, 0);
  const s = ctx.createBufferSource();
  s.buffer = buf;
  s.connect(ctx.destination);
  s.start();
};
```


Step 6 — Run it

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

Open http://localhost:8000 (serve the HTML separately or mount it on FastAPI), grant mic access, and start talking. The model should reply within ~1 second.

Common pitfalls

  • Sample rate mismatch: if your AudioContext is 48000 but you tell OpenAI 24000, voices sound chipmunk-fast. Set new AudioContext({ sampleRate: 24000 }).
  • Float32 sent as bytes: OpenAI rejects malformed audio silently. Always convert to Int16Array then send .buffer.
  • Not running both pumps concurrently: a serial loop will deadlock — use asyncio.gather.
  • additional_headers vs extra_headers: depends on your websockets lib version (>=12 uses additional_headers).
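The last pitfall can be handled at runtime rather than by pinning a version — a small helper (a sketch; the cutover major version follows the note above) that picks the right keyword for `websockets.connect`:

```python
def header_kwarg(version: str) -> str:
    """Pick the connect() keyword for custom headers based on the
    installed websockets version (>=12 renamed it, per the pitfall above)."""
    major = int(version.split(".")[0])
    return "additional_headers" if major >= 12 else "extra_headers"

# Usage: websockets.connect(OPENAI_URL, **{header_kwarg(websockets.__version__): HEADERS})
print(header_kwarg("12.0"))  # additional_headers
print(header_kwarg("11.0"))  # extra_headers
```

Passing the wrong keyword raises a `TypeError` at connect time, so this fails loudly rather than silently dropping your `Authorization` header.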

How CallSphere does this in production

CallSphere's Healthcare line uses this exact PCM16 24kHz pattern with server VAD at 0.55 — chosen because clinicians often pause mid-sentence and a stricter threshold cuts them off. After each call we run a post-call analytics job that scores sentiment (–1.0 to 1.0) and lead intent (0–100) from the transcript. The Salon vertical adds 4 specialist agents and ElevenLabs voices with GB-YYYYMMDD-### booking refs. See it live or start a trial.

FAQ

Why PCM16 24kHz instead of mu-law? Browsers have no cheap native mu-law encoder, while PCM16 is one downsample-and-convert step away from getUserMedia output. Mu-law is for telephony.
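That "one downsample step" is literal when the capture rate is exactly 2× the target: drop every other sample. A sketch (naive decimation — a real pipeline would low-pass filter first to avoid aliasing):

```python
def downsample_48k_to_24k(samples):
    # Keep every other sample: 48 kHz → 24 kHz.
    # Caveat: without a low-pass filter, content above 12 kHz aliases.
    return samples[::2]

print(downsample_48k_to_24k([1, 2, 3, 4, 5, 6]))  # [1, 3, 5]
```

This is why forcing the `AudioContext` to 24000 up front (as in the pitfalls list) is simpler — the browser's resampler does the filtered version for you.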

Can I use asyncio.create_task? Yes, but gather is simpler here: it propagates the first exception, and leaving the async with block then closes the OpenAI socket, which shuts down the other pump — the behavior you want.

How do I add streaming text output? Subscribe to response.audio_transcript.delta and forward strings — useful for live captions.
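The caption change is a small extension of `oai_to_client`'s event routing. A standalone sketch of the routing logic (the event names match the ones used in Step 3 and this FAQ):

```python
import json

def route_event(raw: str):
    """Classify a Realtime event: transcript deltas become caption strings,
    audio deltas stay base64 PCM. Returns (kind, payload) or (None, None)."""
    ev = json.loads(raw)
    if ev["type"] == "response.audio_transcript.delta":
        return ("caption", ev["delta"])
    if ev["type"] == "response.audio.delta":
        return ("audio", ev["delta"])
    return (None, None)

print(route_event('{"type": "response.audio_transcript.delta", "delta": "Hi"}'))
# ('caption', 'Hi')
```

In the bridge you would `send_text` captions and `send_bytes` decoded audio, exactly as Step 3 already does for the `.done` transcript event.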

Production hosting? Deploy to Fly.io or k3s. Keep one process per region; FastAPI scales horizontally just fine.



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.