---
title: "Building Voice Agents with WebRTC and OpenAI Realtime API"
description: "Build low-latency browser-based voice agents using WebRTC peer connections and OpenAI's Realtime API — from obtaining ephemeral tokens to establishing audio tracks and handling speech-to-speech interactions."
canonical: https://callsphere.ai/blog/voice-agents-webrtc-openai-realtime-api-browser
category: "Learn Agentic AI"
tags: ["OpenAI", "WebRTC", "Realtime API", "Browser", "Voice"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-07T18:04:56.022Z
---

# Building Voice Agents with WebRTC and OpenAI Realtime API

> Build low-latency browser-based voice agents using WebRTC peer connections and OpenAI's Realtime API — from obtaining ephemeral tokens to establishing audio tracks and handling speech-to-speech interactions.

## Why WebRTC for Voice Agents

The VoicePipeline approach we covered in previous posts runs the STT-Agent-TTS chain on your server. Every audio packet travels from the client to your server, then to OpenAI's API (for STT, LLM, and TTS), and back. Each network hop adds latency.

WebRTC eliminates the middleman. The browser establishes a direct peer connection with OpenAI's Realtime API servers. Audio flows over UDP with no intermediate server processing. The Realtime API uses a single multimodal model that accepts audio directly and produces audio directly — no separate STT or TTS steps.

The result is sub-300ms response times for voice interactions. The user speaks, and the agent responds almost instantly, creating a conversational experience that feels as natural as talking to another person.

## Architecture Overview

The WebRTC voice agent architecture has three components:

```mermaid
flowchart LR
    subgraph CLIENT["Browser"]
        MIC["Microphone
getUserMedia"]
        PC["RTCPeerConnection"]
        DC["Data channel
oai-events"]
    end
    subgraph SRV["Your Server"]
        EPH["Ephemeral key
endpoint"]
    end
    subgraph OAI["OpenAI Realtime API"]
        RT["Speech-to-speech
multimodal model"]
    end
    CLIENT -->|"1: request ephemeral key"| EPH
    EPH -->|"2: create session"| RT
    MIC --> PC
    PC <-->|"3: SDP handshake, then audio over UDP"| RT
    DC <-->|"events and function calls"| RT
    style PC fill:#4f46e5,stroke:#4338ca,color:#fff
    style RT fill:#059669,stroke:#047857,color:#fff
```

```
[Browser]                    [Your Server]              [OpenAI Realtime API]
    |                             |                              |
    |-- request ephemeral key --> |                              |
    |                             |-- create ephemeral key ----> |
    |                             |<------ client_secret ------- |
    |<----- ephemeral key ------- |                              |
    |                             |                              |
    |--------- SDP offer (authorized with ephemeral key) ------> |
    |<-------- SDP answer -------------------------------------- |
    |                             |                              |
    |<======== audio + events, direct over WebRTC =============> |
```

Your backend server has one job: creating ephemeral API keys. You never want your real OpenAI API key exposed in browser JavaScript. The ephemeral key is short-lived (typically 60 seconds) and scoped to a single Realtime session.

Once the WebRTC connection is established, audio flows directly between the browser and OpenAI. Your server is out of the data path entirely.

## Step 1: Backend Ephemeral Key Endpoint

Create a simple API endpoint that generates ephemeral keys:

```python
import os

import httpx
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict to your frontend origin in production
    allow_methods=["POST"],
    allow_headers=["*"],
)

# Read the real key from the environment — never hard-code it or ship it to the browser
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

@app.post("/api/realtime/session")
async def create_realtime_session():
    """Create an ephemeral key for a Realtime API session."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {OPENAI_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": "gpt-4o-realtime-preview",
                "voice": "alloy",
                "instructions": "You are a helpful voice assistant. Keep responses concise.",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
            },
        )
        response.raise_for_status()
        data = response.json()
        return {
            "client_secret": data["client_secret"]["value"],
            "session_id": data["id"],
        }
```

The `client_secret` is the ephemeral key. It is only valid for establishing a single WebRTC connection and expires quickly. The `instructions` and `voice` configure the Realtime session. Most settings can be adjusted later with `session.update` events over the data channel, but the voice is fixed once the model has started producing audio.
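Because the key is so short-lived, the client should record when it was issued and check for expiry just before use — `getUserMedia` can stall on a permission prompt long enough for the key to lapse. A minimal sketch; the 60-second TTL is an assumption based on the documented short lifetime, and the helper names are our own:

```javascript
// Assumed TTL for an ephemeral key — verify against current OpenAI docs.
const KEY_TTL_MS = 60_000;

// Fetch a fresh ephemeral key from the backend endpoint shown above and
// remember when it was issued.
async function fetchEphemeralKey(fetchImpl = fetch, now = Date.now) {
    const response = await fetchImpl("/api/realtime/session", { method: "POST" });
    if (!response.ok) {
        throw new Error(`Ephemeral key request failed: ${response.status}`);
    }
    const data = await response.json();
    return { key: data.client_secret, issuedAtMs: now() };
}

// True when the key has probably expired and should be re-fetched.
function keyLikelyExpired(issuedAtMs, nowMs, ttlMs = KEY_TTL_MS) {
    return nowMs - issuedAtMs >= ttlMs;
}
```

Check `keyLikelyExpired` right before posting the SDP offer and re-fetch if it returns true, since a key is single-use anyway.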

## Step 2: Browser WebRTC Client

The browser side establishes the WebRTC connection and manages audio:

```html

<!DOCTYPE html>
<html>
<head>
    <title>Voice Agent</title>
</head>
<body>
    <h1>Voice Agent</h1>
    <button id="startBtn">Start Conversation</button>
    <button id="stopBtn" disabled>Stop</button>
    <p id="status">Ready</p>
    <pre id="transcript"></pre>
    <script>
    let peerConnection = null;

    document.getElementById("startBtn").addEventListener("click", startConversation);
    document.getElementById("stopBtn").addEventListener("click", stopConversation);

    async function startConversation() {
        const statusEl = document.getElementById("status");
        statusEl.textContent = "Connecting...";

        // Step 1: Get ephemeral key from your backend
        const tokenResponse = await fetch("/api/realtime/session", {
            method: "POST",
        });
        const tokenData = await tokenResponse.json();
        const ephemeralKey = tokenData.client_secret;

        // Step 2: Create RTCPeerConnection
        peerConnection = new RTCPeerConnection();

        // Step 3: Set up audio output — agent's voice comes through here
        const audioElement = document.createElement("audio");
        audioElement.autoplay = true;
        document.body.appendChild(audioElement);

        peerConnection.ontrack = (event) => {
            audioElement.srcObject = event.streams[0];
        };

        // Step 4: Capture microphone and add audio track
        const mediaStream = await navigator.mediaDevices.getUserMedia({
            audio: {
                sampleRate: 24000,
                channelCount: 1,
                echoCancellation: true,
                noiseSuppression: true,
            },
        });
        mediaStream.getTracks().forEach((track) => {
            peerConnection.addTrack(track, mediaStream);
        });

        // Step 5: Create data channel for events
        const dataChannel = peerConnection.createDataChannel("oai-events");
        setupDataChannel(dataChannel);

        // Step 6: Create and set local offer
        const offer = await peerConnection.createOffer();
        await peerConnection.setLocalDescription(offer);

        // Step 7: Send offer to OpenAI Realtime API
        const sdpResponse = await fetch(
            "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
            {
                method: "POST",
                headers: {
                    "Authorization": "Bearer " + ephemeralKey,
                    "Content-Type": "application/sdp",
                },
                body: offer.sdp,
            }
        );

        // Step 8: Set remote answer
        const answerSdp = await sdpResponse.text();
        await peerConnection.setRemoteDescription({
            type: "answer",
            sdp: answerSdp,
        });

        statusEl.textContent = "Connected — speak naturally";
        document.getElementById("startBtn").disabled = true;
        document.getElementById("stopBtn").disabled = false;
    }

    </script>
</body>
</html>
```

Let us walk through each step:

**Steps 1-2** obtain the ephemeral key and create a WebRTC peer connection. The `RTCPeerConnection` is the browser API that manages the UDP-based audio channel.

**Step 3** sets up audio output. When OpenAI sends audio back through the WebRTC connection, the browser receives it as a media stream track. Attaching it to an `<audio>` element plays it through the speakers automatically.

**Step 4** captures the user's microphone. The `getUserMedia` API requests microphone access and returns a media stream. We add each track from this stream to the peer connection so it gets sent to OpenAI. The `echoCancellation` and `noiseSuppression` options are critical for preventing feedback loops.

**Steps 5-8** complete the WebRTC signaling handshake. The browser creates an SDP (Session Description Protocol) offer describing its audio capabilities, sends it to OpenAI's Realtime endpoint, and receives an SDP answer. Once both sides have exchanged SDPs, the direct audio channel opens.
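On a flaky network the SDP POST can hang, leaving the page stuck at "Connecting...". One simple guard is a generic timeout wrapper — a sketch of our own, not part of any API:

```javascript
// Reject a promise if it does not settle within `ms` milliseconds, so a
// stalled SDP exchange surfaces as an error instead of hanging forever.
function withTimeout(promise, ms, label = "operation") {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`${label} timed out after ${ms}ms`)),
            ms
        );
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Usage in Step 7 would look like `await withTimeout(fetch(sdpUrl, options), 10_000, "SDP exchange")`.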

## Step 3: Handling Data Channel Events

The data channel carries structured events alongside the audio stream — transcripts, function calls, errors, and session updates:

```javascript
function setupDataChannel(dataChannel) {
    const transcriptEl = document.getElementById("transcript");

    dataChannel.onopen = () => {
        console.log("Data channel open");

        // Optionally send a session update to configure behavior
        dataChannel.send(JSON.stringify({
            type: "session.update",
            session: {
                turn_detection: {
                    type: "server_vad",
                    threshold: 0.5,
                    prefix_padding_ms: 300,
                    silence_duration_ms: 500,
                },
            },
        }));
    };

    dataChannel.onmessage = (event) => {
        const data = JSON.parse(event.data);

        switch (data.type) {
            case "response.audio_transcript.delta":
                // Streaming transcript of agent's response
                transcriptEl.textContent += data.delta;
                break;

            case "response.audio_transcript.done":
                // Agent finished speaking
                transcriptEl.textContent += "\n";
                break;

            case "conversation.item.input_audio_transcription.completed":
                // What the user said (STT result)
                transcriptEl.textContent += "You: " + data.transcript + "\n";
                break;

            case "response.function_call_arguments.done":
                // The model wants to call a function
                handleFunctionCall(data, dataChannel);
                break;

            case "error":
                console.error("Realtime API error:", data.error);
                break;
        }
    };
}
```

The data channel event model is rich. The Realtime API streams response transcripts token by token (`delta` events), reports when the agent finishes a response (`done` events), and emits function call requests that your client can handle.
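The delta/done pattern lends itself to a small reducer that folds events into display lines instead of appending raw text to the DOM. The event type strings match the handler above; the accumulator shape and `"Agent:"`/`"You:"` prefixes are our own sketch:

```javascript
// Fold Realtime API transcript events into finished lines plus the partial
// line the agent is currently speaking.
function reduceTranscript(state, event) {
    switch (event.type) {
        case "response.audio_transcript.delta":
            // Agent is mid-sentence: grow the partial line.
            return { ...state, partial: state.partial + event.delta };
        case "response.audio_transcript.done":
            // Agent finished: promote the partial line to a finished one.
            return { lines: [...state.lines, "Agent: " + state.partial], partial: "" };
        case "conversation.item.input_audio_transcription.completed":
            // User turn transcripts arrive complete, not streamed.
            return { ...state, lines: [...state.lines, "You: " + event.transcript] };
        default:
            return state;
    }
}

const emptyTranscript = { lines: [], partial: "" };
```

Keeping this as a pure function makes the transcript trivially unit-testable and decouples it from the DOM.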

## Step 4: Function Calling Over WebRTC

The Realtime API supports function calling, but with a difference: function execution happens on the client side (or on your server via the data channel). Here is how to handle function calls:

```javascript
async function handleFunctionCall(data, dataChannel) {
    const functionName = data.name;
    const args = JSON.parse(data.arguments);
    const callId = data.call_id;

    let result;

    switch (functionName) {
        case "get_weather":
            result = await fetchWeather(args.city);
            break;
        case "lookup_order":
            result = await fetchOrderStatus(args.order_id);
            break;
        default:
            result = JSON.stringify({ error: "Unknown function" });
    }

    // Send the function result back through the data channel
    dataChannel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: callId,
            output: typeof result === "string" ? result : JSON.stringify(result),
        },
    }));

    // Tell the model to continue generating a response
    dataChannel.send(JSON.stringify({
        type: "response.create",
    }));
}

async function fetchWeather(city) {
    // Call your backend API
    const response = await fetch(`/api/weather?city=${encodeURIComponent(city)}`);
    const data = await response.json();
    return JSON.stringify(data);
}
```

The flow is: the model detects it needs to call a function, sends a function call event through the data channel, your JavaScript handles the call (often by hitting your backend API), sends the result back through the data channel, and tells the model to continue generating. The model then incorporates the function result into its spoken response.
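Because the arguments arrive as model-generated JSON, it is worth validating them against the tool's declared parameter schema before executing anything. A minimal sketch that checks only required keys and string types — real tools may need a full JSON-schema validator:

```javascript
// Defensive check before executing a model-requested function call:
// verify every `required` property is present, and that declared string
// properties actually hold strings.
function validateFunctionArgs(schema, args) {
    const missing = (schema.required || []).filter((key) => !(key in args));
    if (missing.length > 0) {
        return { ok: false, error: `Missing arguments: ${missing.join(", ")}` };
    }
    for (const [key, value] of Object.entries(args)) {
        const expected = schema.properties?.[key]?.type;
        if (expected === "string" && typeof value !== "string") {
            return { ok: false, error: `Argument ${key} must be a string` };
        }
    }
    return { ok: true };
}
```

On failure, send the error back as the `function_call_output` so the model can recover by asking the user a follow-up question.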

To register functions with the session, send them during setup:

```javascript
dataChannel.onopen = () => {
    dataChannel.send(JSON.stringify({
        type: "session.update",
        session: {
            tools: [
                {
                    type: "function",
                    name: "get_weather",
                    description: "Get current weather for a city",
                    parameters: {
                        type: "object",
                        properties: {
                            city: {
                                type: "string",
                                description: "City name",
                            },
                        },
                        required: ["city"],
                    },
                },
            ],
        },
    }));
};
```

## Stopping the Conversation

Clean shutdown is important to release resources:

```javascript
function stopConversation() {
    if (peerConnection) {
        // Stop all audio tracks
        peerConnection.getSenders().forEach((sender) => {
            if (sender.track) {
                sender.track.stop();
            }
        });

        // Close the peer connection
        peerConnection.close();
        peerConnection = null;
    }

    document.getElementById("status").textContent = "Disconnected";
    document.getElementById("startBtn").disabled = false;
    document.getElementById("stopBtn").disabled = true;
}
```

Stopping the media tracks releases the microphone. Closing the peer connection terminates the WebRTC session and the Realtime API session on OpenAI's side.
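Connections can also drop without the user clicking Stop. The standard `RTCPeerConnection.connectionState` property transitions through `"new"`, `"connecting"`, `"connected"`, `"disconnected"`, `"failed"`, and `"closed"`; a sketch of a teardown policy (treating only terminal states as fatal is a design choice, since `"disconnected"` can recover on its own):

```javascript
// Decide whether a connection state warrants tearing down the session.
// "disconnected" is transient and may recover, so only terminal states
// trigger cleanup here.
function shouldTearDown(connectionState) {
    return connectionState === "failed" || connectionState === "closed";
}

// Browser wiring (sketch):
// peerConnection.onconnectionstatechange = () => {
//     if (shouldTearDown(peerConnection.connectionState)) {
//         stopConversation();
//     }
// };
```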

## Turn Detection with Server VAD

The Realtime API includes server-side voice activity detection. When configured, the server automatically detects when the user starts and stops speaking, eliminating the need for client-side VAD:

```javascript
dataChannel.send(JSON.stringify({
    type: "session.update",
    session: {
        turn_detection: {
            type: "server_vad",
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
        },
    },
}));
```

With server VAD enabled, the model automatically starts processing when it detects the user has finished speaking. No explicit "end of turn" signal is needed from the client. The user speaks, pauses, and the agent responds — the same natural flow as a phone call.
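Server VAD also enables barge-in: when the API reports `input_audio_buffer.speech_started` while the agent is still talking, the client can cancel the in-flight response with a `response.cancel` event. A sketch of the decision logic, kept pure so the data-channel wiring stays the same `send` pattern as above (tracking `agentSpeaking` is left to the caller):

```javascript
// Given an incoming server event and whether the agent is currently
// speaking, return the client event to send back (or null). Cancelling on
// speech_started implements barge-in.
function bargeInResponse(event, agentSpeaking) {
    if (event.type === "input_audio_buffer.speech_started" && agentSpeaking) {
        return { type: "response.cancel" };
    }
    return null;
}
```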

You can also disable server VAD and manage turns manually by sending `input_audio_buffer.commit` events through the data channel. This is useful for push-to-talk interfaces.
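For push-to-talk, the sequence on button release is: commit the buffered audio, then request a response. A sketch of the two messages (our reading of the manual-turn flow; it assumes server VAD has been disabled via `session.update`):

```javascript
// Events to send over the data channel when a push-to-talk button is
// released: commit the audio captured since the button was pressed, then
// ask the model to respond.
function pushToTalkReleaseMessages() {
    return [
        { type: "input_audio_buffer.commit" },
        { type: "response.create" },
    ];
}

// Browser wiring (sketch):
// button.onmouseup = () =>
//     pushToTalkReleaseMessages().forEach((m) => dataChannel.send(JSON.stringify(m)));
```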

## VoicePipeline vs Realtime API: Production Tradeoffs

Having now covered both approaches in detail, here is a summary of the production tradeoffs:

**Latency**: Realtime API wins. Sub-300ms vs 800ms+ for VoicePipeline. If your users are having real-time conversations and low latency is essential, use the Realtime API.

**Agent complexity**: VoicePipeline wins. It uses the full Agents SDK with native support for handoffs, guardrails, multi-agent workflows, and complex tool chains. The Realtime API supports function calling but lacks the orchestration layer.

**Infrastructure control**: VoicePipeline wins. Audio processing happens on your servers. You can log, record, analyze, and comply with regulations that require data to stay in your infrastructure.

**Cost**: Depends on usage. The Realtime API charges for audio tokens (audio input and output). VoicePipeline charges separately for STT, LLM, and TTS. For long conversations with short responses, VoicePipeline may be cheaper. For rapid back-and-forth exchanges, the Realtime API may be more cost-effective.

**Browser support**: Realtime API wins. WebRTC is natively supported in all modern browsers. VoicePipeline requires a server-side component and a WebSocket or similar transport to connect the browser.

**Telephony integration**: VoicePipeline wins. SIP and PSTN integrations work with server-side audio processing. WebRTC can work with telephony gateways but adds complexity.

Choose based on your highest-priority requirement. Many production systems use a hybrid: the Realtime API for the conversational interface and a VoicePipeline-based backend for complex processing tasks that get triggered by function calls.

**Sources:**

- [https://platform.openai.com/docs/guides/realtime-webrtc](https://platform.openai.com/docs/guides/realtime-webrtc)
- [https://platform.openai.com/docs/guides/realtime](https://platform.openai.com/docs/guides/realtime)
- [https://platform.openai.com/docs/api-reference/realtime](https://platform.openai.com/docs/api-reference/realtime)

---

Source: https://callsphere.ai/blog/voice-agents-webrtc-openai-realtime-api-browser
