How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End
Agentic AI & LLMs · 12 min read · 32 views

By Sagar Shankaran, Founder of CallSphere

Quick answer

Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz.

Key takeaways

TL;DR: FastAPI's WebSocket route fits naturally between a browser microphone and OpenAI Realtime. Use PCM16 at 24kHz, run two async tasks per session, and you get a clean speech-to-speech loop in ~120 lines of Python.

What you'll build

A FastAPI server that accepts a browser WebSocket carrying PCM16 24kHz audio chunks, forwards them to OpenAI Realtime, and streams model audio deltas back. A simple HTML page captures the microphone, downsamples to 24kHz Int16, and plays the response through the Web Audio API. End-to-end latency: 700–1100ms.
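
To get a feel for the chunk sizes involved, the arithmetic for mono PCM16 at 24kHz can be sketched as below. The helper names are illustrative and not part of the tutorial's app; the only facts assumed are 2 bytes per sample and one channel.

```python
SAMPLE_RATE_HZ = 24_000  # PCM16 rate used throughout this tutorial
BYTES_PER_SAMPLE = 2     # 16-bit samples, one channel


def pcm16_duration_ms(num_bytes: int, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Duration in milliseconds of a mono PCM16 buffer holding num_bytes bytes."""
    return num_bytes * 1000 / (BYTES_PER_SAMPLE * sample_rate)


def pcm16_bytes_for_ms(duration_ms: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Bytes needed to hold duration_ms of mono PCM16 audio (whole samples only)."""
    return int(sample_rate * duration_ms / 1000) * BYTES_PER_SAMPLE


print(pcm16_bytes_for_ms(20))    # bytes in a 20 ms chunk
print(pcm16_duration_ms(4800))   # milliseconds covered by 4800 bytes
```

One second of audio is 48,000 bytes before base64 encoding, so small chunks (tens of milliseconds) keep each WebSocket frame light.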

Prerequisites

  1. Python 3.11+ and pip install fastapi uvicorn websockets.
  2. OPENAI_API_KEY exported in your shell.
  3. Modern browser (Chrome/Safari) with microphone permission.
  4. Basic Float32 → Int16 PCM understanding (browser ships Float32; OpenAI wants Int16).
  5. Optional: pip install python-dotenv for env loading.
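
For prerequisite 4, the Float32 → Int16 step can be illustrated in plain Python. This is a sketch of the idea only: the browser does the same thing in JavaScript, the function names are made up for this example, and averaging sample pairs is a crude stand-in for a proper resampler.

```python
import struct


def float32_to_pcm16(samples):
    """Clamp floats to [-1.0, 1.0] and pack them as little-endian signed 16-bit."""
    ints = []
    for x in samples:
        x = max(-1.0, min(1.0, x))   # out-of-range input would overflow int16
        ints.append(int(x * 32767))  # same 0x7fff scale the playback code uses
    return struct.pack(f"<{len(ints)}h", *ints)


def decimate_by_two(samples):
    """Naive 48 kHz -> 24 kHz conversion: average each pair of samples."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]


captured = [0.0, 1.0, 0.5, 0.5]        # pretend this came from a 48 kHz microphone
halved = decimate_by_two(captured)      # two samples at 24 kHz
payload = float32_to_pcm16(halved)      # 4 bytes of PCM16
print(len(halved), len(payload))
```

The clamp matters because microphone samples can briefly exceed the nominal range, and an unclamped value would not fit in 16 bits.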

Architecture

```mermaid
flowchart LR
  Mic[Browser Mic Float32] --> DS[Downsample 24kHz Int16]
  DS -- WS --> FA[FastAPI /ws]
  FA -- WS --> OA[OpenAI Realtime]
  OA -- audio.delta --> FA
  FA -- WS --> AP[AudioPlayer Web Audio]
```

Step 1 — FastAPI WebSocket endpoint

```python
# app.py
import asyncio
import base64
import json
import os

import websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse

app = FastAPI()

OPENAI_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}
```

Step 2 — Configure the OpenAI session

```python
SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.55,
            "prefix_padding_ms": 300,
            "silence_duration_ms": 500,
        },
        "instructions": "You are a concise voice assistant. Reply in 1-2 short sentences.",
    },
}
```

Step 3 — Bridge the two WebSockets

Use asyncio.gather so each direction runs independently. Don't await one direction before pumping the other: a serial loop stalls whichever socket is not being read, which shows up as choppy or delayed audio.

```python
@app.websocket("/ws")
async def ws(client: WebSocket):
    await client.accept()
    # Older websockets releases name this keyword extra_headers.
    async with websockets.connect(OPENAI_URL, additional_headers=HEADERS) as oai:
        await oai.send(json.dumps(SESSION))

        async def client_to_oai():
            # Browser -> OpenAI: wrap raw PCM16 in input_audio_buffer.append events.
            try:
                while True:
                    chunk = await client.receive_bytes()  # raw int16 PCM
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode(),
                    }))
            except Exception:
                pass  # browser disconnected or upstream closed
            finally:
                await oai.close()  # lets oai_to_client's loop end too

        async def oai_to_client():
            # OpenAI -> browser: audio as binary frames, transcripts as JSON text.
            async for raw in oai:
                ev = json.loads(raw)
                if ev["type"] == "response.audio.delta":
                    pcm = base64.b64decode(ev["delta"])
                    await client.send_bytes(pcm)
                elif ev["type"] == "response.audio_transcript.done":
                    await client.send_text(json.dumps({"role": "assistant",
                                                        "text": ev["transcript"]}))

        # Both pumps must run inside the async with block so oai stays open.
        await asyncio.gather(client_to_oai(), oai_to_client())
```
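
The two-pump shape can be exercised without any network by swapping the sockets for in-memory queues. The sketch below is a stand-in, not the tutorial's handler: `pump`, `bridge`, and the `None` end-of-stream sentinel are assumptions made for this example, and it uses asyncio.wait so that the surviving pump is cancelled explicitly when the other one ends.

```python
import asyncio


async def pump(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward items from source to sink until a None sentinel arrives."""
    while True:
        item = await source.get()
        if item is None:
            return
        await sink.put(item)


async def bridge(a_in, a_out, b_in, b_out) -> None:
    """Run both directions concurrently; when either ends, cancel the other."""
    tasks = [
        asyncio.create_task(pump(a_in, b_out)),  # side A -> side B
        asyncio.create_task(pump(b_in, a_out)),  # side B -> side A
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)  # wait for cancellation
    for task in done:
        task.result()  # re-raise any error from the pump that finished


async def demo():
    mic, to_model, from_model, speaker = (asyncio.Queue() for _ in range(4))
    mic.put_nowait(b"hello")
    mic.put_nowait(None)             # the browser side hangs up
    from_model.put_nowait(b"reply")
    await bridge(mic, speaker, from_model, to_model)
    return to_model.get_nowait(), speaker.get_nowait()


print(asyncio.run(demo()))
```

Here the model-side pump never sees a sentinel, so without the explicit cancel the bridge would wait forever after the browser side ends.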

Step 4 — Browser microphone capture (Float32 → Int16 24kHz)

```html

```

Step 5 — Play model audio in the browser

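```js
// Assumes ws (WebSocket) and ctx (AudioContext) already exist from the capture page.
ws.binaryType = "arraybuffer"; // deliver binary frames as ArrayBuffer, not Blob
let playhead = 0; // AudioContext time at which the next chunk should start

ws.onmessage = (e) => {
  if (typeof e.data === "string") return; // transcript
  const i16 = new Int16Array(e.data);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 0x7fff;
  const buf = ctx.createBuffer(1, f32.length, 24000);
  buf.copyToChannel(f32, 0);
  const s = ctx.createBufferSource();
  s.buffer = buf;
  s.connect(ctx.destination);
  // Queue chunks back to back; starting each one immediately makes them overlap.
  playhead = Math.max(playhead, ctx.currentTime);
  s.start(playhead);
  playhead += buf.duration;
};
```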

Step 6 — Run it

```bash
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
```

Open http://localhost:8000 (serve the HTML separately or mount it on FastAPI), grant mic access, and start talking. The model should reply within ~1 second.

Common pitfalls

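  • Sample rate mismatch: if your AudioContext runs at 48000 but you label the audio as 24000, everything plays at the wrong speed. Captured speech reaches the model slowed down, and 24kHz model audio written into a 48kHz buffer sounds chipmunk-fast. Set new AudioContext({ sampleRate: 24000 }).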
  • Float32 sent as bytes: OpenAI rejects malformed audio silently. Always convert to Int16Array then send .buffer.
  • Not running both pumps concurrently: a serial loop blocks on one socket while the other backs up. Use asyncio.gather.
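  • additional_headers vs extra_headers: depends on your websockets lib version. The legacy client takes extra_headers; the newer asyncio client, which websockets.connect resolves to from version 14, takes additional_headers.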
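
For the header-keyword pitfall, one way to avoid hard-coding the name is to inspect the connect callable at startup. This is a sketch under an assumption: that `inspect.signature` can read the parameters of whatever `websockets.connect` resolves to in your installed version. `header_kwarg` is a hypothetical helper, and the two stand-in functions exist only so the example runs without the library.

```python
import inspect


def header_kwarg(connect_fn) -> str:
    """Return whichever header keyword connect_fn declares in its signature."""
    params = inspect.signature(connect_fn).parameters
    for name in ("additional_headers", "extra_headers"):
        if name in params:
            return name
    raise TypeError("connect function declares neither header keyword")


# Stand-ins mimicking the two signatures; replace with websockets.connect in real code.
def new_style_connect(uri, *, additional_headers=None):
    return uri, additional_headers


def legacy_style_connect(uri, *, extra_headers=None):
    return uri, extra_headers


print(header_kwarg(new_style_connect))
print(header_kwarg(legacy_style_connect))
```

In the app this would be used as `websockets.connect(OPENAI_URL, **{header_kwarg(websockets.connect): HEADERS})`. A callable that only accepts `**kwargs` would not be detected, so treat the result as a hint rather than a guarantee.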

How CallSphere does this in production

CallSphere's Healthcare line uses this exact PCM16 24kHz pattern with server VAD at 0.55 — chosen because clinicians often pause mid-sentence and a stricter threshold cuts them off. After each call we run a post-call analytics job that scores sentiment (–1.0 to 1.0) and lead intent (0–100) from the transcript. The Salon vertical adds 4 specialist agents and ElevenLabs voices with GB-YYYYMMDD-### booking refs. See it live or start a trial.

FAQ

Why PCM16 24kHz instead of mu-law? Browsers can't encode mu-law cheaply, but PCM16 is one downsample step away from getUserMedia output. Mu-law is for telephony.

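Can I use asyncio.create_task? Yes. Note that gather does not cancel the other pump when one raises; it only propagates the first exception to the caller while the remaining task keeps running. If you want both pumps torn down together, cancel the remaining task yourself or use asyncio.TaskGroup (Python 3.11+).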
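
What happens to the surviving task when one pump raises can be checked with a small stdlib-only experiment. The coroutine names and timings below are made up for the demonstration; the point is only to observe the state of the second task after gather has raised.

```python
import asyncio


async def fails_fast():
    await asyncio.sleep(0.01)
    raise RuntimeError("pump died")


async def runs_long(state):
    try:
        await asyncio.sleep(0.2)
        state["finished"] = True
    except asyncio.CancelledError:
        state["cancelled"] = True
        raise


async def demo():
    state = {"finished": False, "cancelled": False}
    survivor = asyncio.ensure_future(runs_long(state))
    try:
        await asyncio.gather(fails_fast(), survivor)
    except RuntimeError:
        pass
    still_running = not survivor.done()  # checked right after gather raised
    survivor.cancel()                    # explicit cleanup of the leftover task
    await asyncio.gather(survivor, return_exceptions=True)
    return still_running, state


print(asyncio.run(demo()))
```

If `still_running` comes back true, the second task outlived the failure and only stopped because of the explicit `cancel()`.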

How do I add streaming text output? Subscribe to response.audio_transcript.delta and forward strings — useful for live captions.
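
A caption accumulator for those events might look like the sketch below. It assumes each `response.audio_transcript.delta` event carries its text in a `delta` field, mirroring the audio delta handled in Step 3; `CaptionBuffer` is a hypothetical helper, not part of the tutorial's app.

```python
import json


class CaptionBuffer:
    """Collects transcript deltas into a running caption for one response."""

    def __init__(self):
        self._parts = []

    def feed(self, raw: str):
        """Consume one JSON event; return the caption so far, or None if unrelated."""
        ev = json.loads(raw)
        kind = ev.get("type")
        if kind == "response.audio_transcript.delta":
            self._parts.append(ev["delta"])  # assumed field name, see lead-in
            return "".join(self._parts)
        if kind == "response.audio_transcript.done":
            final = ev.get("transcript", "".join(self._parts))
            self._parts.clear()              # ready for the next response
            return final
        return None


captions = CaptionBuffer()
for event in (
    {"type": "response.audio_transcript.delta", "delta": "Hello"},
    {"type": "response.audio.delta", "delta": "AAAA"},
    {"type": "response.audio_transcript.delta", "delta": " there"},
    {"type": "response.audio_transcript.done", "transcript": "Hello there"},
):
    print(captions.feed(json.dumps(event)))
```

Inside `oai_to_client`, each non-None result could be forwarded with `client.send_text` so the page can repaint the caption as it grows.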

Production hosting? Deploy to Fly.io or k3s. Keep one process per region; FastAPI scales horizontally just fine.

Written by

Sagar Shankaran · Founder, CallSphere

Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
