Build a Rust Voice Agent with axum, tokio, and a WebSocket Bridge
Stand up a production voice bridge in Rust: axum for routing, tokio-tungstenite for the OpenAI Realtime socket, and broadcast channels for fan-out. Real working code.
TL;DR — Rust handles 2.5x more concurrent voice sessions than Go on the same hardware. axum + tokio-tungstenite is the cleanest path to a production voice bridge that fronts OpenAI Realtime from your own backend.
What you'll build
A Rust HTTP server that accepts a browser WebSocket, opens a paired WebSocket to OpenAI Realtime, and pumps frames between them with backpressure-aware tokio channels. You'll add a simple keep-alive ping, base64 audio framing, and clean shutdown on either side dropping.
Prerequisites
- Rust 1.78+ (stable, 2021 edition).
- `cargo add axum tokio tokio-tungstenite futures-util serde serde_json`.
- `OPENAI_API_KEY` in env, with Realtime access.
- Familiarity with `async/await` and tokio's `select!`.
- A static frontend that sends PCM16 frames as base64 strings over the WebSocket.
Architecture
```mermaid
sequenceDiagram
  participant B as Browser
  participant R as Rust axum
  participant O as OpenAI Realtime
  B->>R: WS /voice
  R->>O: WS wss://api.openai.com/v1/realtime
  B->>R: input_audio_buffer.append
  R->>O: forward
  O-->>R: response.audio.delta
  R-->>B: forward
```
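Both audio legs in the diagram are ordinary JSON text frames; the audio itself rides inside them as base64-encoded PCM16. Here is a minimal sketch of the two payloads the bridge relays, with placeholder values and field names following the Realtime event schema:

```rust
use serde_json::json;

fn main() {
    // Browser -> bridge -> OpenAI: append a short chunk of microphone audio.
    let append = json!({
        "type": "input_audio_buffer.append",
        "audio": "<base64-encoded PCM16 chunk>" // placeholder, not real audio
    });

    // OpenAI -> bridge -> browser: synthesized speech comes back as base64 deltas.
    let delta = json!({
        "type": "response.audio.delta",
        "delta": "<base64-encoded PCM16 chunk>" // placeholder, not real audio
    });

    println!("{append}\n{delta}");
}
```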
Step 1 — Cargo.toml deps
```toml
[dependencies]
axum = { version = "0.7", features = ["ws"] }
tokio = { version = "1", features = ["full"] }
tokio-tungstenite = { version = "0.23", features = ["native-tls"] }
futures-util = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
```
Step 2 — axum router with WebSocket upgrade
```rust
use axum::{
    extract::ws::{WebSocket, WebSocketUpgrade},
    response::IntoResponse,
    routing::get,
    Router,
};

#[tokio::main]
async fn main() {
    let app = Router::new().route("/voice", get(voice_ws));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}

async fn voice_ws(ws: WebSocketUpgrade) -> impl IntoResponse {
    ws.on_upgrade(handle_socket)
}
```
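To sanity-check the upgrade path, point any WebSocket client at the route. Below is a minimal sketch that reuses tokio-tungstenite as the client; it assumes the server above is running locally and `OPENAI_API_KEY` is set so the handler can open its upstream socket. Run it as a separate binary:

```rust
use futures_util::{SinkExt, StreamExt};
use tokio_tungstenite::{connect_async, tungstenite::Message};

#[tokio::main]
async fn main() {
    // Upgrade against the local bridge; expect HTTP 101 Switching Protocols.
    let (mut ws, resp) = connect_async("ws://127.0.0.1:8080/voice")
        .await
        .expect("upgrade failed");
    println!("connected: {}", resp.status());

    // Send any text frame; Realtime answers unknown event types with an error
    // event, which the bridge relays back and which still proves the round trip.
    ws.send(Message::Text(r#"{"type":"ping"}"#.into()))
        .await
        .unwrap();
    if let Some(Ok(msg)) = ws.next().await {
        println!("first frame from bridge: {msg:?}");
    }
}
```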
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3 — Open the OpenAI socket
Realtime requires an `OpenAI-Beta: realtime=v1` header on the handshake. With tokio-tungstenite, build the request from the URL with `IntoClientRequest`, which fills in the required WebSocket handshake headers, then insert `Authorization` and `OpenAI-Beta` on top; a bare `Request::builder()` omits the handshake headers tungstenite expects and the connect fails.
```rust
use tokio::net::TcpStream;
use tokio_tungstenite::{
    connect_async,
    tungstenite::client::IntoClientRequest,
    MaybeTlsStream, WebSocketStream,
};

async fn open_openai() -> WebSocketStream<MaybeTlsStream<TcpStream>> {
    let key = std::env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
    // into_client_request() sets Host, Upgrade, and the Sec-WebSocket-* headers;
    // we only add the two headers Realtime needs.
    let mut req = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03"
        .into_client_request()
        .expect("valid realtime url");
    req.headers_mut()
        .insert("Authorization", format!("Bearer {key}").parse().unwrap());
    req.headers_mut()
        .insert("OpenAI-Beta", "realtime=v1".parse().unwrap());
    let (ws, _) = connect_async(req).await.expect("openai connect");
    ws
}
```
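Before wiring the pump, it is worth reading one frame: Realtime's first server event on a new connection is `session.created`, so checking for it right after `connect_async` confirms the key and beta header were accepted. A minimal sketch (the helper name is ours):

```rust
use futures_util::StreamExt;
use tokio::net::TcpStream;
use tokio_tungstenite::{tungstenite::Message as OaMsg, MaybeTlsStream, WebSocketStream};

// Read the first server event and confirm the session is live.
async fn wait_for_session_created(ws: &mut WebSocketStream<MaybeTlsStream<TcpStream>>) {
    if let Some(Ok(OaMsg::Text(t))) = ws.next().await {
        let event: serde_json::Value = serde_json::from_str(&t).unwrap_or_default();
        if event["type"] == "session.created" {
            println!("realtime session ready: {}", event["session"]["id"]);
        } else {
            eprintln!("unexpected first event: {t}");
        }
    }
}
```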
Step 4 — Bidirectional pump with select!
This is the core of the bridge: two streams, two sinks, one `select!` loop. Each iteration awaits both streams and forwards whichever frame arrives first; when both sides have ended, the `else` arm breaks the loop.
```rust
use futures_util::{SinkExt, StreamExt};
use axum::extract::ws::Message as AxMsg;
use tokio_tungstenite::tungstenite::Message as OaMsg;

async fn handle_socket(socket: WebSocket) {
    // Split each connection into a sink (tx) and a stream (rx).
    let (mut bx_tx, mut bx_rx) = socket.split();
    let oai = open_openai().await;
    let (mut oai_tx, mut oai_rx) = oai.split();

    loop {
        tokio::select! {
            // Browser -> OpenAI
            Some(Ok(msg)) = bx_rx.next() => {
                if let AxMsg::Text(t) = msg {
                    let _ = oai_tx.send(OaMsg::Text(t)).await;
                }
            }
            // OpenAI -> Browser
            Some(Ok(msg)) = oai_rx.next() => {
                if let OaMsg::Text(t) = msg {
                    let _ = bx_tx.send(AxMsg::Text(t)).await;
                }
            }
            // Both streams have ended: tear the bridge down.
            else => break,
        }
    }
}
```
Step 5 — Inject your system prompt on connect
```rust
// Send this right after splitting the OpenAI socket, before entering the pump loop.
let session_update = serde_json::json!({
    "type": "session.update",
    "session": {
        "instructions": "You are CallSphere's Rust-backed voice agent.",
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": { "type": "server_vad", "threshold": 0.5 }
    }
});
oai_tx.send(OaMsg::Text(session_update.to_string())).await.unwrap();
```
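If you want the agent to speak first, follow the session update with a `response.create` event and Realtime will generate an opening turn. A minimal sketch, where the greeting instructions are ours rather than part of the API:

```rust
// Optional: ask the model to open the call instead of waiting for the caller.
let greeting = serde_json::json!({
    "type": "response.create",
    "response": {
        "modalities": ["audio", "text"],
        "instructions": "Greet the caller briefly and ask how you can help."
    }
});
oai_tx.send(OaMsg::Text(greeting.to_string())).await.unwrap();
```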
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 6 — Heartbeat and graceful shutdown
Without pings, proxies such as CloudFront and Cloudflare will reap an idle connection after roughly 60 seconds.
```rust
let mut tick = tokio::time::interval(std::time::Duration::from_secs(20));
loop {
    tokio::select! {
        _ = tick.tick() => {
            let _ = bx_tx.send(AxMsg::Ping(vec![])).await;
        }
        // ...other arms
    }
}
```
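For the shutdown half, don't just drop the sockets when the loop breaks: send each peer a close frame so the browser and OpenAI see a clean close rather than an abrupt reset. A minimal sketch that takes ownership of the two sinks from Step 4 (the function name is ours):

```rust
use axum::extract::ws::{Message as AxMsg, WebSocket};
use futures_util::{stream::SplitSink, SinkExt};
use tokio::net::TcpStream;
use tokio_tungstenite::{tungstenite::Message as OaMsg, MaybeTlsStream, WebSocketStream};

type OaiSink = SplitSink<WebSocketStream<MaybeTlsStream<TcpStream>>, OaMsg>;

// Called after the select! loop exits; errors are ignored because one side
// is usually already gone by the time we get here.
async fn shutdown(mut bx_tx: SplitSink<WebSocket, AxMsg>, mut oai_tx: OaiSink) {
    let _ = bx_tx.send(AxMsg::Close(None)).await;
    let _ = bx_tx.flush().await;
    let _ = oai_tx.send(OaMsg::Close(None)).await;
    let _ = oai_tx.flush().await;
}
```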
Common pitfalls
- Forgetting the `OpenAI-Beta` header. The connection 401s with no helpful error.
- Unbounded channels. Use a bounded `mpsc::channel(64)` so a slow client can't OOM the server (see the sketch after this list).
- Mixed message types. Don't pass binary frames through if both sides expect text; Realtime base64-encodes audio inside JSON text frames.
- Panicking in spawned tasks. Inside `tokio::spawn(async move { ... })`, prefer `let _ = ...` plus logging over `unwrap`, so one bad frame doesn't kill the task.
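On the unbounded-channel point, one way to bound the OpenAI-to-browser path is to put a small `mpsc` queue in front of the browser sink and drop frames when it fills, so a slow client costs a brief audio glitch instead of memory. A minimal sketch (the helper name is ours):

```rust
use axum::extract::ws::{Message as AxMsg, WebSocket};
use futures_util::{stream::SplitSink, SinkExt};
use tokio::sync::mpsc;

/// Own the browser sink in a writer task and expose a bounded sender.
/// In the select! loop, use `tx.try_send(...)` instead of writing to the sink directly.
fn spawn_browser_writer(mut sink: SplitSink<WebSocket, AxMsg>) -> mpsc::Sender<AxMsg> {
    let (tx, mut rx) = mpsc::channel::<AxMsg>(64);
    tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            if sink.send(msg).await.is_err() {
                break; // browser went away; stop draining
            }
        }
    });
    tx
}

// In the OpenAI -> browser arm:
//   if tx.try_send(AxMsg::Text(t)).is_err() {
//       eprintln!("client too slow, dropping frame");
//   }
```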
How CallSphere does this in production
CallSphere's healthcare agent runs a Rust admission-router that sits in front of FastAPI :8084 voice workers. The router authenticates HIPAA-scoped JWTs, applies tenant rate limits, and dispatches to one of 37 specialised agents across 6 verticals — all backed by 115+ Postgres tables. See pricing — $149/$499/$1499 with a 14-day trial.
FAQ
Why not just use Node.js? Rust holds ~10MB RSS per session vs ~80MB on Node, and it doesn't GC-stutter mid-call.
Can I terminate WebRTC in Rust? Yes, with webrtc-rs, but it's heavier; for a Realtime bridge, WebSocket is enough.
What about TLS? Use rustls behind nginx, or terminate at Cloudflare.
Does axum 0.7 still ship? Yes, plus 0.8 — the API is stable; pinning 0.7 is fine for prod.
How many sessions per core? ~2k idle, ~600 with active audio at 50kbps each.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.