---
title: "How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End"
description: "Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz."
canonical: https://callsphere.ai/blog/vw1h-build-fastapi-websocket-voice-agent-python-end-to-end
category: "AI Engineering"
tags: ["Tutorial", "Build", "FastAPI", "Python", "OpenAI Realtime"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-07T06:44:59.893Z
---

# How to Build a FastAPI WebSocket Voice Agent (Python) End-to-End

> Stream microphone audio from a browser to FastAPI, fan out to OpenAI Realtime over WebSocket, and play model audio back — full Python tutorial with PCM16 24kHz.

> **TL;DR** — FastAPI's WebSocket routes fit naturally between a browser microphone and OpenAI Realtime. Use PCM16 at 24kHz, run two async tasks per session, and you get a clean speech-to-speech loop in ~120 lines of Python.

## What you'll build

A FastAPI server that accepts a browser WebSocket carrying PCM16 24kHz audio chunks, forwards them to OpenAI Realtime, and streams model audio deltas back. A simple HTML page captures the microphone, downsamples to 24kHz Int16, and plays the response through the Web Audio API. End-to-end latency: 700–1100ms.

## Prerequisites

1. Python 3.11+ and `pip install fastapi uvicorn websockets`.
2. `OPENAI_API_KEY` exported in your shell.
3. Modern browser (Chrome/Safari) with microphone permission.
4. Basic Float32 → Int16 PCM understanding (browser ships Float32; OpenAI wants Int16); see the sketch after this list.
5. Optional: `pip install python-dotenv` for env loading.
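
For intuition on item 4, here is the same conversion in NumPy terms. This is a sketch only (NumPy isn't otherwise used in this tutorial; the browser does the equivalent math in JavaScript in Step 4):

```python
import numpy as np

def float32_to_pcm16(samples: np.ndarray) -> bytes:
    """Float32 samples in [-1.0, 1.0] → signed 16-bit PCM bytes."""
    clipped = np.clip(samples, -1.0, 1.0)        # clamp before scaling
    return (clipped * 32767.0).astype(np.int16).tobytes()

# 10ms of silence at 24kHz is 240 samples, i.e. 480 bytes of PCM16.
assert len(float32_to_pcm16(np.zeros(240, dtype=np.float32))) == 480
```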

## Architecture

```mermaid
flowchart LR
  Mic[Browser Mic Float32] --> DS[Downsample 24kHz Int16]
  DS -- WS --> FA[FastAPI /ws]
  FA -- WS --> OA[OpenAI Realtime]
  OA -- audio.delta --> FA
  FA -- WS --> AP[AudioPlayer Web Audio]
```

## Step 1 — FastAPI WebSocket endpoint

```python
# app.py
import os, json, asyncio, base64

import websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse
from fastapi.responses import HTMLResponse

app = FastAPI()
OPENAI_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}
```
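
The `HTMLResponse` import is there so one process can serve both the page and the socket. A minimal sketch of that route (the `index.html` filename is our assumption; Step 4 shows its contents):

```python
@app.get("/")
async def index() -> HTMLResponse:
    # Serve the demo page from the same directory as app.py.
    with open("index.html") as f:
        return HTMLResponse(f.read())
```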

## Step 2 — Configure the OpenAI session

```python
SESSION = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
        "turn_detection": {"type": "server_vad", "threshold": 0.55,
                           "prefix_padding_ms": 300, "silence_duration_ms": 500},
        "instructions": "You are a concise voice assistant. Reply in 1-2 short sentences."
    }
}
```
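
Realtime acknowledges the update with a `session.updated` event. You don't have to wait for it before streaming audio, but doing so is a cheap sanity check. A minimal sketch, assuming it runs right after the `session.update` send in Step 3:

```python
async def wait_for_session_updated(oai) -> None:
    # Skip earlier events (e.g. session.created) until the server confirms our config.
    async for raw in oai:
        if json.loads(raw).get("type") == "session.updated":
            return
```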

## Step 3 — Bridge the two WebSockets

Use `asyncio.gather` so each direction runs independently. Don't await one before pumping the other — that's how you get echo and choppy audio.

```python
@app.websocket("/ws")
async def ws(client: WebSocket):
    await client.accept()
    async with websockets.connect(OPENAI_URL, additional_headers=HEADERS) as oai:
        await oai.send(json.dumps(SESSION))

        async def client_to_oai():
            # Browser → OpenAI: wrap each raw Int16 chunk as a base64 buffer append.
            try:
                while True:
                    chunk = await client.receive_bytes()  # raw Int16 PCM
                    await oai.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(chunk).decode(),
                    }))
            except Exception:
                pass  # client disconnected; let the session wind down

        async def oai_to_client():
            # OpenAI → browser: audio deltas as binary frames, transcripts as text.
            async for raw in oai:
                ev = json.loads(raw)
                if ev["type"] == "response.audio.delta":
                    pcm = base64.b64decode(ev["delta"])
                    await client.send_bytes(pcm)
                elif ev["type"] == "response.audio_transcript.done":
                    await client.send_text(json.dumps({"role": "assistant",
                                                       "text": ev["transcript"]}))

        await asyncio.gather(client_to_oai(), oai_to_client())
```
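
Run the bridge with `uvicorn app:app --reload` and open `http://localhost:8000/` (assuming the index route sketched in Step 1).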

## Step 4 — Browser microphone capture (Float32 → Int16 24kHz)

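A minimal page sketch: the button id, the 2048-sample buffer, and the deprecated-but-simple `ScriptProcessorNode` (rather than an `AudioWorklet`) are our choices. Constructing `AudioContext` with `sampleRate: 24000` makes the browser resample the mic for us, so no manual downsampling loop is needed:

```html
<!-- index.html -->
<!doctype html>
<meta charset="utf-8">
<button id="start">Start talking</button>
<script>
const ws = new WebSocket(`ws://${location.host}/ws`);
ws.binaryType = "arraybuffer";

document.getElementById("start").onclick = async () => {
  const actx = new AudioContext({ sampleRate: 24000 }); // browser resamples to 24kHz
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const src = actx.createMediaStreamSource(stream);
  const proc = actx.createScriptProcessor(2048, 1, 1);  // mono in, mono out
  proc.onaudioprocess = (e) => {
    const f32 = e.inputBuffer.getChannelData(0);
    const i16 = new Int16Array(f32.length);
    for (let i = 0; i < f32.length; i++) {
      const s = Math.max(-1, Math.min(1, f32[i])); // clamp before scaling
      i16[i] = s * 32767;                          // Float32 → Int16
    }
    if (ws.readyState === WebSocket.OPEN) ws.send(i16.buffer);
  };
  src.connect(proc);
  proc.connect(actx.destination); // ScriptProcessor needs a sink to fire
};
</script>
```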

## Step 5 — Play model audio in the browser

```js
ws.binaryType = "arraybuffer"; // so e.data is an ArrayBuffer, not a Blob
ws.onmessage = (e) => {
  if (typeof e.data === "string") return; // transcript JSON, not audio
  const i16 = new Int16Array(e.data);
  const f32 = new Float32Array(i16.length);
  for (let i = 0; i < i16.length; i++) f32[i] = i16[i] / 32768; // Int16 → Float32
  playChunk(f32); // scheduling sketch below
};
```

Server-side gotcha: the Python `websockets` library `>=12` uses `additional_headers` (older releases call the same parameter `extra_headers`), which is why the `connect` call in Step 3 passes `additional_headers`.
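
Deltas arrive faster than real time, so don't `start()` each buffer the moment it lands; schedule chunks back-to-back on a running cursor. A minimal scheduler sketch (`playChunk` and `playHead` are our own names):

```js
// Reuse Step 4's AudioContext if capture and playback share a page.
const actx = new AudioContext({ sampleRate: 24000 });
let playHead = 0; // absolute time where the next chunk should start

function playChunk(f32) {
  const buf = actx.createBuffer(1, f32.length, 24000); // mono at 24kHz
  buf.copyToChannel(f32, 0);
  const src = actx.createBufferSource();
  src.buffer = buf;
  src.connect(actx.destination);
  playHead = Math.max(playHead, actx.currentTime);
  src.start(playHead);      // butt-join chunks for gapless playback
  playHead += buf.duration;
}
```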

## How CallSphere does this in production

CallSphere's Healthcare line uses this exact PCM16 24kHz pattern with server VAD at 0.55 — chosen because clinicians often pause mid-sentence and a stricter threshold cuts them off. After each call we run a post-call analytics job that scores sentiment (–1.0 to 1.0) and lead intent (0–100) from the transcript. The Salon vertical adds 4 specialist agents and ElevenLabs voices with `GB-YYYYMMDD-###` booking refs. [See it live](/industries/healthcare) or [start a trial](/trial).

## FAQ

**Why PCM16 24kHz instead of mu-law?** Browsers can't encode mu-law cheaply, but PCM16 is one downsample step away from getUserMedia output. Mu-law is for telephony.

**Can I use `asyncio.create_task`?** Yes. Note that `gather` propagates the first exception but does not cancel the sibling task; if you want both directions torn down together, reach for `asyncio.TaskGroup` (Python 3.11+), sketched below.
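
A minimal sketch of that variant, dropping in where Step 3's `gather` call sits:

```python
# TaskGroup cancels the surviving task when its sibling raises (Python 3.11+).
async with asyncio.TaskGroup() as tg:
    tg.create_task(client_to_oai())
    tg.create_task(oai_to_client())
```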

**How do I add streaming text output?** Subscribe to `response.audio_transcript.delta` and forward strings — useful for live captions.

**Production hosting?** Deploy to Fly.io or k3s. Keep one process per region; FastAPI scales horizontally just fine.

## Sources

- [OpenAI Realtime guide](https://platform.openai.com/docs/guides/realtime)
- [FastAPI WebSockets](https://fastapi.tiangolo.com/advanced/websockets/)
- [Python websockets library](https://websockets.readthedocs.io/)
- [Web Audio API spec](https://www.w3.org/TR/webaudio/)

