AI Engineering

Build a Rust Voice Agent with axum, tokio, and a WebSocket Bridge

Stand up a production voice bridge in Rust: axum for routing, tokio-tungstenite for the OpenAI Realtime socket, and broadcast channels for fan-out. Real working code.

TL;DR — Rust can handle roughly 2.5x more concurrent voice sessions than Go on the same hardware. axum + tokio-tungstenite is the cleanest path to a production voice bridge that fronts OpenAI Realtime from your own backend.

What you'll build

A Rust HTTP server that accepts a browser WebSocket, opens a paired WebSocket to OpenAI Realtime, and pumps frames between them with backpressure-aware tokio channels. You'll add a simple keep-alive ping, base64 audio framing, and clean shutdown on either side dropping.

Prerequisites

  1. Rust 1.85+ (stable; the 2024 edition requires it).
  2. cargo add axum tokio tokio-tungstenite futures-util serde serde_json.
  3. OPENAI_API_KEY in env, with Realtime access.
  4. Familiarity with async/await and tokio's select!.
  5. A static frontend that sends PCM16 audio frames as base64 strings over the WebSocket.

Architecture

```mermaid
sequenceDiagram
  participant B as Browser
  participant R as Rust axum
  participant O as OpenAI Realtime
  B->>R: WS /voice
  R->>O: WS wss://api.openai.com/v1/realtime
  B->>R: input_audio_buffer.append
  R->>O: forward
  O-->>R: response.audio.delta
  R-->>B: forward
```

Step 1 — Cargo.toml deps

```toml
[dependencies]
axum = { version = "0.7", features = ["ws"] }
tokio = { version = "1", features = ["full"] }
tokio-tungstenite = { version = "0.23", features = ["native-tls"] }
futures-util = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
```

Step 2 — axum router with WebSocket upgrade

```rust
use axum::{
    extract::ws::{WebSocket, WebSocketUpgrade},
    response::IntoResponse,
    routing::get,
    Router,
};

#[tokio::main]
async fn main() {
    let app = Router::new().route("/voice", get(voice_ws));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn voice_ws(ws: WebSocketUpgrade) -> impl IntoResponse {
    ws.on_upgrade(handle_socket)
}
```


Step 3 — Open the OpenAI socket

tokio-tungstenite lets you attach custom headers to the handshake request, which Realtime requires for the OpenAI-Beta header. Build the request with `IntoClientRequest` so the standard WebSocket handshake headers (Host, Sec-WebSocket-Key, and so on) are generated for you, then add your own:

```rust
use tokio_tungstenite::{
    connect_async,
    tungstenite::client::IntoClientRequest,
    MaybeTlsStream, WebSocketStream,
};

async fn open_openai() -> WebSocketStream<MaybeTlsStream<tokio::net::TcpStream>> {
    let key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
    // `into_client_request` fills in the WebSocket handshake headers;
    // we only add the two Realtime needs on top.
    let mut req = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
        .into_client_request()
        .unwrap();
    req.headers_mut()
        .insert("Authorization", format!("Bearer {key}").parse().unwrap());
    req.headers_mut()
        .insert("OpenAI-Beta", "realtime=v1".parse().unwrap());
    let (ws, _) = connect_async(req).await.expect("openai connect");
    ws
}
```

Step 4 — Bidirectional pump with select!

This is the core of the bridge. Two streams, two sinks, one select!. The macro races both streams: whichever yields first runs its arm, and the loop then polls both again.

```rust
use axum::extract::ws::Message as AxMsg;
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::tungstenite::Message as OaMsg;

async fn handle_socket(socket: WebSocket) {
    let (mut bx_tx, mut bx_rx) = socket.split();
    let oai = open_openai().await;
    let (mut oai_tx, mut oai_rx) = oai.split();

    loop {
        tokio::select! {
            // Browser -> OpenAI: forward JSON text frames verbatim.
            Some(Ok(msg)) = bx_rx.next() => {
                match msg {
                    AxMsg::Text(t) => { let _ = oai_tx.send(OaMsg::Text(t)).await; }
                    AxMsg::Close(_) => break,
                    _ => {}
                }
            }
            // OpenAI -> Browser: same, in the other direction.
            Some(Ok(msg)) = oai_rx.next() => {
                match msg {
                    OaMsg::Text(t) => { let _ = bx_tx.send(AxMsg::Text(t)).await; }
                    OaMsg::Close(_) => break,
                    _ => {}
                }
            }
            // Both streams ended or errored: tear the bridge down.
            else => break,
        }
    }
}
```

Step 5 — Inject your system prompt on connect

```rust
let session_update = serde_json::json!({
    "type": "session.update",
    "session": {
        "instructions": "You are CallSphere's Rust-backed voice agent.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": { "type": "server_vad", "threshold": 0.5 }
    }
});
oai_tx.send(OaMsg::Text(session_update.to_string())).await.unwrap();
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Heartbeat and graceful shutdown

Without pings, idle proxies (CloudFront, Cloudflare) will reap the connection — typically after 60–100 s of silence.

```rust
let mut tick = tokio::time::interval(std::time::Duration::from_secs(20));
loop {
    tokio::select! {
        // Ping the browser every 20 s so intermediaries see traffic.
        _ = tick.tick() => {
            let _ = bx_tx.send(AxMsg::Ping(vec![])).await;
        }
        // ...the two forwarding arms from Step 4 go here
    }
}
```

Common pitfalls

  • Forgetting the OpenAI-Beta header. The connection 401s with no helpful error body.
  • Unbounded channels. Use tokio::sync::mpsc::channel(64) between the sockets so a slow client applies backpressure instead of OOMing the server.
  • Mixed message types. Don't pass binary frames through when both sides expect text — Realtime base64-encodes audio inside JSON text frames.
  • Panics vanishing in spawned tasks. tokio::spawn swallows panics into a JoinError; handle results inside the task and log errors instead of unwrapping.

How CallSphere does this in production

CallSphere's healthcare agent runs a Rust admission-router that sits in front of FastAPI :8084 voice workers. The router authenticates HIPAA-scoped JWTs, applies tenant rate limits, and dispatches to one of 37 specialised agents across 6 verticals — all backed by 115+ Postgres tables. See pricing — $149/$499/$1499 with a 14-day trial.

FAQ

Why not just use Node.js? Rust holds ~10MB RSS per session vs ~80MB on Node, and it doesn't GC-stutter mid-call.

Can I terminate WebRTC in Rust? Yes, with webrtc-rs, but it's heavier; for a Realtime bridge, WebSocket is enough.

What about TLS? Use rustls behind nginx, or terminate at Cloudflare.

Does axum 0.7 still ship? Yes, plus 0.8 — the API is stable; pinning 0.7 is fine for prod.

How many sessions per core? ~2k idle, ~600 with active audio at 50kbps each.



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.