Building Voice Agents with WebRTC and OpenAI Realtime API
Build low-latency browser-based voice agents using WebRTC peer connections and OpenAI's Realtime API — from obtaining ephemeral tokens to establishing audio tracks and handling speech-to-speech interactions.
Why WebRTC for Voice Agents
The VoicePipeline approach we covered in previous posts runs the STT-Agent-TTS chain on your server. Every audio packet travels from the client to your server, then to OpenAI's API (for STT, LLM, and TTS), and back. Each network hop adds latency.
WebRTC eliminates the middleman. The browser establishes a direct peer connection with OpenAI's Realtime API servers. Audio flows over UDP with no intermediate server processing. The Realtime API uses a single multimodal model that accepts audio directly and produces audio directly — no separate STT or TTS steps.
The result is sub-300ms response times for voice interactions. The user speaks, and the agent responds almost instantly, creating a conversational experience that feels as natural as talking to another person.
Architecture Overview
The WebRTC voice agent architecture has three components:
[Browser] [Your Server] [OpenAI Realtime API]
| | |
|-- request ephemeral key --> | |
| |-- create ephemeral key ----> |
| |<-- ephemeral key ----------- |
|<-- ephemeral key ---------- | |
| | |
|-- WebRTC offer -------------|----------------------------> |
|<-- WebRTC answer -----------|----------------------------- |
| | |
|<========= direct audio over UDP (WebRTC) ===============> |
| | |
Your backend server has one job: creating ephemeral API keys. You never want your real OpenAI API key exposed in browser JavaScript. The ephemeral key is short-lived (typically 60 seconds) and scoped to a single Realtime session.
Once the WebRTC connection is established, audio flows directly between the browser and OpenAI. Your server is out of the data path entirely.
Step 1: Backend Ephemeral Key Endpoint
Create a simple API endpoint that generates ephemeral keys:
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import httpx
import os

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production
    allow_methods=["POST"],
    allow_headers=["*"],
)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Never hardcode your real key

@app.post("/api/realtime/session")
async def create_realtime_session():
    """Create an ephemeral key for a Realtime API session."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {OPENAI_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": "gpt-4o-realtime-preview",
                "voice": "nova",
                "instructions": "You are a helpful voice assistant. Keep responses concise.",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
            },
        )
    response.raise_for_status()
    data = response.json()
    return {
        "client_secret": data["client_secret"]["value"],
        "session_id": data["id"],
    }
The client_secret is the ephemeral key. It is valid only for establishing a single WebRTC connection and expires quickly. The instructions and voice configure the Realtime session; the voice cannot be changed once the model has produced audio, while most other settings can be adjusted later with session.update events.
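On the browser side, the call to this endpoint can be wrapped in a small helper. This is a sketch assuming the response shape returned by the FastAPI example above (client_secret and session_id); adjust the field names if your backend differs.

```javascript
// Sketch: fetch an ephemeral key from the backend endpoint defined in Step 1.
// Assumes the response shape { client_secret, session_id } shown above.
async function getEphemeralKey(endpoint = "/api/realtime/session") {
  const response = await fetch(endpoint, { method: "POST" });
  if (!response.ok) {
    throw new Error(`Failed to create Realtime session: ${response.status}`);
  }
  const data = await response.json();
  return { key: data.client_secret, sessionId: data.session_id };
}
```

Checking response.ok matters here: a failed session creation otherwise surfaces later as an opaque WebRTC signaling error.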
Step 2: Browser WebRTC Client
The browser side establishes the WebRTC connection and manages audio:
<!DOCTYPE html>
<html>
<head>
  <title>Voice Agent</title>
</head>
<body>
  <h1>Voice Agent</h1>
  <button id="startBtn">Start Conversation</button>
  <button id="stopBtn" disabled>Stop</button>
  <div id="status">Ready</div>
  <div id="transcript"></div>

  <script>
    let peerConnection = null;

    document.getElementById("startBtn").addEventListener("click", startConversation);
    document.getElementById("stopBtn").addEventListener("click", stopConversation);

    async function startConversation() {
      const statusEl = document.getElementById("status");
      statusEl.textContent = "Connecting...";

      // Step 1: Get ephemeral key from your backend
      const tokenResponse = await fetch("/api/realtime/session", {
        method: "POST",
      });
      const tokenData = await tokenResponse.json();
      const ephemeralKey = tokenData.client_secret;

      // Step 2: Create RTCPeerConnection
      peerConnection = new RTCPeerConnection();

      // Step 3: Set up audio output — agent's voice comes through here
      const audioElement = document.createElement("audio");
      audioElement.autoplay = true;
      document.body.appendChild(audioElement);
      peerConnection.ontrack = (event) => {
        audioElement.srcObject = event.streams[0];
      };

      // Step 4: Capture microphone and add audio track
      const mediaStream = await navigator.mediaDevices.getUserMedia({
        audio: {
          sampleRate: 24000,
          channelCount: 1,
          echoCancellation: true,
          noiseSuppression: true,
        },
      });
      mediaStream.getTracks().forEach((track) => {
        peerConnection.addTrack(track, mediaStream);
      });

      // Step 5: Create data channel for events
      const dataChannel = peerConnection.createDataChannel("oai-events");
      setupDataChannel(dataChannel);

      // Step 6: Create and set local offer
      const offer = await peerConnection.createOffer();
      await peerConnection.setLocalDescription(offer);

      // Step 7: Send offer to OpenAI Realtime API
      const sdpResponse = await fetch(
        "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
        {
          method: "POST",
          headers: {
            "Authorization": "Bearer " + ephemeralKey,
            "Content-Type": "application/sdp",
          },
          body: offer.sdp,
        }
      );

      // Step 8: Set remote answer
      const answerSdp = await sdpResponse.text();
      await peerConnection.setRemoteDescription({
        type: "answer",
        sdp: answerSdp,
      });

      statusEl.textContent = "Connected — speak naturally";
      document.getElementById("startBtn").disabled = true;
      document.getElementById("stopBtn").disabled = false;
    }
  </script>
</body>
</html>
Let us walk through each step:
Steps 1-2 obtain the ephemeral key and create a WebRTC peer connection. The RTCPeerConnection is the browser API that manages the UDP-based audio channel.
Step 3 sets up audio output. When OpenAI sends audio back through the WebRTC connection, the browser receives it as a media stream track. Attaching it to an <audio> element plays it through the speakers automatically.
Step 4 captures the user's microphone. The getUserMedia API requests microphone access and returns a media stream. We add each track from this stream to the peer connection so it gets sent to OpenAI. The echoCancellation and noiseSuppression options are critical for preventing feedback loops.
Steps 5-8 complete the WebRTC signaling handshake. The browser creates an SDP (Session Description Protocol) offer describing its audio capabilities, sends it to OpenAI's Realtime endpoint, and receives an SDP answer. Once both sides have exchanged SDPs, the direct audio channel opens.
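One thing the walkthrough above does not handle is a dropped connection (network change, ICE failure). A sketch of one way to detect this, using the standard connectionState property on RTCPeerConnection; the callback wiring is an assumption, not part of the original flow:

```javascript
// Sketch: observe connection state changes on an RTCPeerConnection-like
// object and notify a callback. "failed" or "disconnected" would be the
// trigger for cleanup or a reconnect attempt.
function watchConnectionState(pc, onChange) {
  pc.onconnectionstatechange = () => {
    onChange(pc.connectionState);
  };
}

// Usage (inside startConversation, after creating peerConnection):
// watchConnectionState(peerConnection, (state) => {
//   document.getElementById("status").textContent = "Connection: " + state;
//   if (state === "failed") stopConversation();
// });
```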
Step 3: Handling Data Channel Events
The data channel carries structured events alongside the audio stream — transcripts, function calls, errors, and session updates:
function setupDataChannel(dataChannel) {
  const transcriptEl = document.getElementById("transcript");

  dataChannel.onopen = () => {
    console.log("Data channel open");
    // Optionally send a session update to configure behavior
    dataChannel.send(JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.5,
          prefix_padding_ms: 300,
          silence_duration_ms: 500,
        },
      },
    }));
  };

  dataChannel.onmessage = (event) => {
    const data = JSON.parse(event.data);
    switch (data.type) {
      case "response.audio_transcript.delta":
        // Streaming transcript of agent's response
        transcriptEl.textContent += data.delta;
        break;
      case "response.audio_transcript.done":
        // Agent finished speaking
        transcriptEl.textContent += "\n";
        break;
      case "conversation.item.input_audio_transcription.completed":
        // What the user said (STT result)
        transcriptEl.textContent += "You: " + data.transcript + "\n";
        break;
      case "response.function_call_arguments.done":
        // The model wants to call a function
        handleFunctionCall(data, dataChannel);
        break;
      case "error":
        console.error("Realtime API error:", data.error);
        break;
    }
  };
}
The data channel event model is rich. The Realtime API streams response transcripts token by token (delta events), reports when the agent finishes a response (done events), and emits function call requests that your client can handle.
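If you want complete utterances rather than raw deltas appended to the DOM, the delta/done pairs can be folded into an accumulator. A minimal sketch; the event names match the handler above, but the accumulator structure is an assumption, not a prescribed pattern:

```javascript
// Sketch: accumulate streaming transcript deltas into complete utterances.
function createTranscriptAccumulator() {
  let current = "";
  const completed = [];
  return {
    handle(event) {
      if (event.type === "response.audio_transcript.delta") {
        current += event.delta;
      } else if (event.type === "response.audio_transcript.done") {
        completed.push(current);
        current = "";
      }
    },
    completed,
  };
}
```

Calling handle from the onmessage switch keeps transcript state separate from DOM updates, which makes it easier to log or persist full responses.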
Step 4: Function Calling Over WebRTC
The Realtime API supports function calling, but with a difference: function execution happens on the client side (or on your server via the data channel). Here is how to handle function calls:
async function handleFunctionCall(data, dataChannel) {
  const functionName = data.name;
  const args = JSON.parse(data.arguments);
  const callId = data.call_id;

  let result;
  switch (functionName) {
    case "get_weather":
      result = await fetchWeather(args.city);
      break;
    case "lookup_order":
      result = await fetchOrderStatus(args.order_id);
      break;
    default:
      result = JSON.stringify({ error: "Unknown function" });
  }

  // Send the function result back through the data channel
  dataChannel.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: callId,
      output: typeof result === "string" ? result : JSON.stringify(result),
    },
  }));

  // Tell the model to continue generating a response
  dataChannel.send(JSON.stringify({
    type: "response.create",
  }));
}

async function fetchWeather(city) {
  // Call your backend API
  const response = await fetch(`/api/weather?city=${encodeURIComponent(city)}`);
  const data = await response.json();
  return JSON.stringify(data);
}
The flow: the model decides it needs to call a function and emits a function call event on the data channel; your JavaScript handles the call (often by hitting your backend API), sends the result back through the data channel, and tells the model to continue generating. The model then incorporates the function result into its spoken response.
To register functions with the session, send them during setup:
dataChannel.onopen = () => {
  dataChannel.send(JSON.stringify({
    type: "session.update",
    session: {
      tools: [
        {
          type: "function",
          name: "get_weather",
          description: "Get current weather for a city",
          parameters: {
            type: "object",
            properties: {
              city: {
                type: "string",
                description: "City name",
              },
            },
            required: ["city"],
          },
        },
      ],
    },
  }));
};
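Keeping each tool's schema next to its handler prevents the schema in session.update from drifting out of sync with the switch statement in handleFunctionCall. A sketch of one way to do that; the handler body is a placeholder, not a real weather lookup:

```javascript
// Sketch: one list holds both the JSON schema sent to the model and the
// client-side handler. The handler field is stripped before session.update.
const tools = [
  {
    type: "function",
    name: "get_weather",
    description: "Get current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string", description: "City name" } },
      required: ["city"],
    },
    handler: async (args) => JSON.stringify({ city: args.city, note: "placeholder" }),
  },
];

// Derive the session.update payload (schemas only, no handlers).
function toolSessionUpdate(tools) {
  return {
    type: "session.update",
    session: { tools: tools.map(({ handler, ...schema }) => schema) },
  };
}

// Derive a dispatch table for handleFunctionCall to use instead of a switch.
const toolHandlers = Object.fromEntries(tools.map((t) => [t.name, t.handler]));
```

With this in place, handleFunctionCall can look up toolHandlers[functionName] instead of growing a case per tool.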
Stopping the Conversation
Clean shutdown is important to release resources:
function stopConversation() {
  if (peerConnection) {
    // Stop all audio tracks
    peerConnection.getSenders().forEach((sender) => {
      if (sender.track) {
        sender.track.stop();
      }
    });
    // Close the peer connection
    peerConnection.close();
    peerConnection = null;
  }
  document.getElementById("status").textContent = "Disconnected";
  document.getElementById("startBtn").disabled = false;
  document.getElementById("stopBtn").disabled = true;
}
Stopping the media tracks releases the microphone. Closing the peer connection terminates the WebRTC session and the Realtime API session on OpenAI's side.
Turn Detection with Server VAD
The Realtime API includes server-side voice activity detection. When configured, the server automatically detects when the user starts and stops speaking, eliminating the need for client-side VAD:
dataChannel.send(JSON.stringify({
  type: "session.update",
  session: {
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
    },
  },
}));
With server VAD enabled, the model automatically starts processing when it detects the user has finished speaking. No explicit "end of turn" signal is needed from the client. The user speaks, pauses, and the agent responds — the same natural flow as a phone call.
You can also disable server VAD and manage turns manually by sending input_audio_buffer.commit events through the data channel. This is useful for push-to-talk interfaces.
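For push-to-talk, the client takes over signaling turn boundaries. A sketch of the event payloads involved, assuming turn_detection: null disables server VAD (following the session.update shape used above):

```javascript
// Sketch: push-to-talk event payloads for the data channel.
function disableServerVad() {
  return { type: "session.update", session: { turn_detection: null } };
}

// On push-to-talk release: commit the buffered audio as the user's turn,
// then ask the model to respond.
function endOfTurnEvents() {
  return [
    { type: "input_audio_buffer.commit" },
    { type: "response.create" },
  ];
}

// Usage in a button release handler:
// endOfTurnEvents().forEach((e) => dataChannel.send(JSON.stringify(e)));
```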
VoicePipeline vs Realtime API: Production Tradeoffs
Having now covered both approaches in detail, here is a summary of the production tradeoffs:
Latency: Realtime API wins. Sub-300ms vs 800ms+ for VoicePipeline. If your users are having real-time conversations and low latency is essential, use the Realtime API.
Agent complexity: VoicePipeline wins. It uses the full Agents SDK with native support for handoffs, guardrails, multi-agent workflows, and complex tool chains. The Realtime API supports function calling but lacks the orchestration layer.
Infrastructure control: VoicePipeline wins. Audio processing happens on your servers. You can log, record, analyze, and comply with regulations that require data to stay in your infrastructure.
Cost: Depends on usage. The Realtime API charges for audio tokens (audio input and output). VoicePipeline charges separately for STT, LLM, and TTS. For long conversations with short responses, VoicePipeline may be cheaper. For rapid back-and-forth exchanges, the Realtime API may be more cost-effective.
Browser support: Realtime API wins. WebRTC is natively supported in all modern browsers. VoicePipeline requires a server-side component and a WebSocket or similar transport to connect the browser.
Telephony integration: VoicePipeline wins. SIP and PSTN integrations work with server-side audio processing. WebRTC can work with telephony gateways but adds complexity.
Choose based on your highest-priority requirement. Many production systems use a hybrid: the Realtime API for the conversational interface and a VoicePipeline-based backend for complex processing tasks that get triggered by function calls.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.