
Building Voice Agents with WebRTC and OpenAI Realtime API

Build low-latency browser-based voice agents using WebRTC peer connections and OpenAI's Realtime API — from obtaining ephemeral tokens to establishing audio tracks and handling speech-to-speech interactions.

Why WebRTC for Voice Agents

The VoicePipeline approach we covered in previous posts runs the STT-Agent-TTS chain on your server. Every audio packet travels from the client to your server, then to OpenAI's API (for STT, LLM, and TTS), and back. Each network hop adds latency.

WebRTC eliminates the middleman. The browser establishes a direct peer connection with OpenAI's Realtime API servers. Audio flows over UDP with no intermediate server processing. The Realtime API uses a single multimodal model that accepts audio directly and produces audio directly — no separate STT or TTS steps.

The result is sub-300ms response times for voice interactions. The user speaks, and the agent responds almost instantly, creating a conversational experience that feels as natural as talking to another person.

Architecture Overview

The WebRTC voice agent architecture has three components:

[Browser]                    [Your Server]              [OpenAI Realtime API]
    |                             |                              |
    |-- request ephemeral key --> |                              |
    |                             |-- create ephemeral key ----> |
    |                             |<-- ephemeral key ----------- |
    |<-- ephemeral key ---------- |                              |
    |                             |                              |
    |-- WebRTC offer -------------|----------------------------> |
    |<-- WebRTC answer -----------|----------------------------- |
    |                             |                              |
    |<========= direct audio over UDP (WebRTC) ===============> |
    |                             |                              |

Your backend server has one job: creating ephemeral API keys. You never want your real OpenAI API key exposed in browser JavaScript. The ephemeral key is short-lived (typically 60 seconds) and scoped to a single Realtime session.

Once the WebRTC connection is established, audio flows directly between the browser and OpenAI. Your server is out of the data path entirely.

Step 1: Backend Ephemeral Key Endpoint

Create a simple API endpoint that generates ephemeral keys:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import httpx
import os

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production
    allow_methods=["POST"],
    allow_headers=["*"],
)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Read from the environment — never hardcode the key

@app.post("/api/realtime/session")
async def create_realtime_session():
    """Create an ephemeral key for a Realtime API session."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {OPENAI_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": "gpt-4o-realtime-preview",
                "voice": "nova",
                "instructions": "You are a helpful voice assistant. Keep responses concise.",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
            },
        )
        response.raise_for_status()
        data = response.json()
        return {
            "client_secret": data["client_secret"]["value"],
            "session_id": data["id"],
        }

The client_secret is the ephemeral key. It is valid only for establishing a single WebRTC connection and expires quickly. The instructions and voice configure the Realtime session. Most settings, including instructions, can be adjusted later over the data channel with session.update, but the voice is fixed once the model has produced audio.

Step 2: Browser WebRTC Client

The browser side establishes the WebRTC connection and manages audio:

<!DOCTYPE html>
<html>
<head>
    <title>Voice Agent</title>
</head>
<body>
    <h1>Voice Agent</h1>
    <button id="startBtn">Start Conversation</button>
    <button id="stopBtn" disabled>Stop</button>
    <div id="status">Ready</div>
    <div id="transcript"></div>

    <script>
    let peerConnection = null;

    document.getElementById("startBtn").addEventListener("click", startConversation);
    document.getElementById("stopBtn").addEventListener("click", stopConversation);

    async function startConversation() {
        const statusEl = document.getElementById("status");
        statusEl.textContent = "Connecting...";

        // Step 1: Get ephemeral key from your backend
        const tokenResponse = await fetch("/api/realtime/session", {
            method: "POST",
        });
        const tokenData = await tokenResponse.json();
        const ephemeralKey = tokenData.client_secret;

        // Step 2: Create RTCPeerConnection
        peerConnection = new RTCPeerConnection();

        // Step 3: Set up audio output — agent's voice comes through here
        const audioElement = document.createElement("audio");
        audioElement.autoplay = true;
        document.body.appendChild(audioElement);

        peerConnection.ontrack = (event) => {
            audioElement.srcObject = event.streams[0];
        };

        // Step 4: Capture microphone and add audio track
        const mediaStream = await navigator.mediaDevices.getUserMedia({
            audio: {
                sampleRate: 24000,
                channelCount: 1,
                echoCancellation: true,
                noiseSuppression: true,
            },
        });
        mediaStream.getTracks().forEach((track) => {
            peerConnection.addTrack(track, mediaStream);
        });

        // Step 5: Create data channel for events
        const dataChannel = peerConnection.createDataChannel("oai-events");
        setupDataChannel(dataChannel);

        // Step 6: Create and set local offer
        const offer = await peerConnection.createOffer();
        await peerConnection.setLocalDescription(offer);

        // Step 7: Send offer to OpenAI Realtime API
        const sdpResponse = await fetch(
            "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
            {
                method: "POST",
                headers: {
                    "Authorization": "Bearer " + ephemeralKey,
                    "Content-Type": "application/sdp",
                },
                body: offer.sdp,
            }
        );

        // Step 8: Set remote answer
        const answerSdp = await sdpResponse.text();
        await peerConnection.setRemoteDescription({
            type: "answer",
            sdp: answerSdp,
        });

        statusEl.textContent = "Connected — speak naturally";
        document.getElementById("startBtn").disabled = true;
        document.getElementById("stopBtn").disabled = false;
    }
    </script>
</body>
</html>

Let us walk through each step:

Steps 1-2 obtain the ephemeral key and create a WebRTC peer connection. The RTCPeerConnection is the browser API that manages the UDP-based audio channel.

Step 3 sets up audio output. When OpenAI sends audio back through the WebRTC connection, the browser receives it as a media stream track. Attaching it to an <audio> element plays it through the speakers automatically.

Step 4 captures the user's microphone. The getUserMedia API requests microphone access and returns a media stream. We add each track from this stream to the peer connection so it gets sent to OpenAI. The echoCancellation and noiseSuppression options are critical for preventing feedback loops.

Steps 5-8 complete the WebRTC signaling handshake. The browser creates an SDP (Session Description Protocol) offer describing its audio capabilities, sends it to OpenAI's Realtime endpoint, and receives an SDP answer. Once both sides have exchanged SDPs, the direct audio channel opens.
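Note that the SDP exchange completing does not mean audio is flowing yet; ICE negotiation can take another moment. A small helper, sketched below using the standard connectionState property and onconnectionstatechange handler of RTCPeerConnection, can surface that progress to the user (statusEl is the status div from the page above):

```javascript
// Sketch: reflect WebRTC connection state changes in the UI.
// Works with any object exposing `connectionState` and an
// `onconnectionstatechange` handler slot, like RTCPeerConnection.
function attachConnectionMonitor(pc, statusEl) {
    pc.onconnectionstatechange = () => {
        const state = pc.connectionState;
        if (state === "connected") {
            statusEl.textContent = "Connected";
        } else if (state === "failed" || state === "disconnected") {
            statusEl.textContent = "Connection lost. Try reconnecting.";
        } else {
            statusEl.textContent = "Connecting (" + state + ")...";
        }
    };
}
```

Call attachConnectionMonitor(peerConnection, statusEl) right after creating the peer connection in Step 2, before the offer is sent.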


Step 3: Handling Data Channel Events

The data channel carries structured events alongside the audio stream — transcripts, function calls, errors, and session updates:

function setupDataChannel(dataChannel) {
    const transcriptEl = document.getElementById("transcript");

    dataChannel.onopen = () => {
        console.log("Data channel open");

        // Optionally send a session update to configure behavior
        dataChannel.send(JSON.stringify({
            type: "session.update",
            session: {
                turn_detection: {
                    type: "server_vad",
                    threshold: 0.5,
                    prefix_padding_ms: 300,
                    silence_duration_ms: 500,
                },
            },
        }));
    };

    dataChannel.onmessage = (event) => {
        const data = JSON.parse(event.data);

        switch (data.type) {
            case "response.audio_transcript.delta":
                // Streaming transcript of agent's response
                transcriptEl.textContent += data.delta;
                break;

            case "response.audio_transcript.done":
                // Agent finished speaking
                transcriptEl.textContent += "\n";
                break;

            case "conversation.item.input_audio_transcription.completed":
                // What the user said (STT result)
                transcriptEl.textContent += "You: " + data.transcript + "\n";
                break;

            case "response.function_call_arguments.done":
                // The model wants to call a function
                handleFunctionCall(data, dataChannel);
                break;

            case "error":
                console.error("Realtime API error:", data.error);
                break;
        }
    };
}

The data channel event model is rich. The Realtime API streams response transcripts token by token (delta events), reports when the agent finishes a response (done events), and emits function call requests that your client can handle.
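The channel is bidirectional beyond configuration: the client can also inject conversation items. As a sketch, sending a typed user message as a conversation.item.create event followed by response.create makes the agent answer in audio without any microphone input (the event shapes below follow the Realtime API's client event format; verify field names against the current event reference):

```javascript
// Sketch: inject a typed user message so the agent replies in audio.
// Event shapes assume the Realtime API client event format.
function sendTextMessage(dataChannel, text) {
    dataChannel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "message",
            role: "user",
            content: [{ type: "input_text", text: text }],
        },
    }));
    // Ask the model to generate a (spoken) response to the new item.
    dataChannel.send(JSON.stringify({ type: "response.create" }));
}
```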

Step 4: Function Calling Over WebRTC

The Realtime API supports function calling, but with a difference: function execution happens on the client side (or on your server via the data channel). Here is how to handle function calls:

async function handleFunctionCall(data, dataChannel) {
    const functionName = data.name;
    const args = JSON.parse(data.arguments);
    const callId = data.call_id;

    let result;

    switch (functionName) {
        case "get_weather":
            result = await fetchWeather(args.city);
            break;
        case "lookup_order":
            result = await fetchOrderStatus(args.order_id);
            break;
        default:
            result = JSON.stringify({ error: "Unknown function" });
    }

    // Send the function result back through the data channel
    dataChannel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: callId,
            output: typeof result === "string" ? result : JSON.stringify(result),
        },
    }));

    // Tell the model to continue generating a response
    dataChannel.send(JSON.stringify({
        type: "response.create",
    }));
}

async function fetchWeather(city) {
    // Call your backend API
    const response = await fetch(`/api/weather?city=${encodeURIComponent(city)}`);
    const data = await response.json();
    return JSON.stringify(data);
}

The flow is: the model detects it needs to call a function, sends a function call event through the data channel, your JavaScript handles the call (often by hitting your backend API), sends the result back through the data channel, and tells the model to continue generating. The model then incorporates the function result into its spoken response.
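Because the round trip only touches dataChannel.send, it can be verified offline with a stub channel that records events instead of transmitting them; a minimal sketch:

```javascript
// Sketch: a stub data channel for testing handlers without a live
// connection. Records parsed outgoing events in `sent`.
function makeStubChannel() {
    const sent = [];
    return { sent, send: (msg) => sent.push(JSON.parse(msg)) };
}

// Minimal result sender mirroring the handler above: one
// function_call_output item, then a response.create.
function sendFunctionResult(channel, callId, result) {
    channel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: callId,
            output: typeof result === "string" ? result : JSON.stringify(result),
        },
    }));
    channel.send(JSON.stringify({ type: "response.create" }));
}
```

After sendFunctionResult(makeStubChannel(), "call_123", { temperature: 21 }), the stub's sent array holds the two events in order, which is exactly what a live session would receive.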

To register functions with the session, send them during setup:

dataChannel.onopen = () => {
    dataChannel.send(JSON.stringify({
        type: "session.update",
        session: {
            tools: [
                {
                    type: "function",
                    name: "get_weather",
                    description: "Get current weather for a city",
                    parameters: {
                        type: "object",
                        properties: {
                            city: {
                                type: "string",
                                description: "City name",
                            },
                        },
                        required: ["city"],
                    },
                },
            ],
        },
    }));
};

Stopping the Conversation

Clean shutdown is important to release resources:

function stopConversation() {
    if (peerConnection) {
        // Stop all audio tracks
        peerConnection.getSenders().forEach((sender) => {
            if (sender.track) {
                sender.track.stop();
            }
        });

        // Close the peer connection
        peerConnection.close();
        peerConnection = null;
    }

    document.getElementById("status").textContent = "Disconnected";
    document.getElementById("startBtn").disabled = false;
    document.getElementById("stopBtn").disabled = true;
}

Stopping the media tracks releases the microphone. Closing the peer connection terminates the WebRTC session and the Realtime API session on OpenAI's side.

Turn Detection with Server VAD

The Realtime API includes server-side voice activity detection. When configured, the server automatically detects when the user starts and stops speaking, eliminating the need for client-side VAD:

dataChannel.send(JSON.stringify({
    type: "session.update",
    session: {
        turn_detection: {
            type: "server_vad",
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
        },
    },
}));

With server VAD enabled, the model automatically starts processing when it detects the user has finished speaking. No explicit "end of turn" signal is needed from the client. The user speaks, pauses, and the agent responds — the same natural flow as a phone call.

You can also disable server VAD and manage turns manually by sending input_audio_buffer.commit events through the data channel. This is useful for push-to-talk interfaces.
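A push-to-talk variant might look like the sketch below, assuming server VAD has been disabled (turn_detection set to null in session.update) and the microphone track stays muted except while the button is held:

```javascript
// Sketch: push-to-talk turn management with server VAD disabled.
// While the button is held the mic track is live; on release we
// commit the buffered audio and request a response.
function setupPushToTalk(button, micTrack, dataChannel) {
    micTrack.enabled = false;  // muted until the button is pressed

    button.onmousedown = () => {
        micTrack.enabled = true;
    };

    button.onmouseup = () => {
        micTrack.enabled = false;
        // Explicitly close the user's turn.
        dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
        dataChannel.send(JSON.stringify({ type: "response.create" }));
    };
}
```

Disabling the track rather than stopping it keeps the WebRTC sender alive between turns, so the next press needs no renegotiation.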

VoicePipeline vs Realtime API: Production Tradeoffs

Having now covered both approaches in detail, here is a summary of the production tradeoffs:

Latency: Realtime API wins. Sub-300ms vs 800ms+ for VoicePipeline. If your users are having real-time conversations and low latency is essential, use the Realtime API.

Agent complexity: VoicePipeline wins. It uses the full Agents SDK with native support for handoffs, guardrails, multi-agent workflows, and complex tool chains. The Realtime API supports function calling but lacks the orchestration layer.

Infrastructure control: VoicePipeline wins. Audio processing happens on your servers. You can log, record, analyze, and comply with regulations that require data to stay in your infrastructure.

Cost: Depends on usage. The Realtime API charges for audio tokens (audio input and output). VoicePipeline charges separately for STT, LLM, and TTS. For long conversations with short responses, VoicePipeline may be cheaper. For rapid back-and-forth exchanges, the Realtime API may be more cost-effective.

Browser support: Realtime API wins. WebRTC is natively supported in all modern browsers. VoicePipeline requires a server-side component and a WebSocket or similar transport to connect the browser.

Telephony integration: VoicePipeline wins. SIP and PSTN integrations work with server-side audio processing. WebRTC can work with telephony gateways but adds complexity.

Choose based on your highest-priority requirement. Many production systems use a hybrid: the Realtime API for the conversational interface and a VoicePipeline-based backend for complex processing tasks that get triggered by function calls.
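The cost tradeoff above reduces to simple arithmetic once you plug in rates. The sketch below uses placeholder per-minute rates — not current OpenAI pricing — so substitute real numbers before drawing conclusions:

```javascript
// Sketch: compare per-conversation cost of the two approaches.
// All RATES values are illustrative placeholders, not real pricing.
const RATES = {
    realtimeAudioInPerMin: 0.06,   // placeholder
    realtimeAudioOutPerMin: 0.24,  // placeholder
    sttPerMin: 0.006,              // placeholder
    ttsPerMin: 0.015,              // placeholder
    llmPerTurn: 0.01,              // placeholder
};

// Realtime API: billed on audio in and audio out.
function realtimeCost(userMinutes, agentMinutes) {
    return userMinutes * RATES.realtimeAudioInPerMin
         + agentMinutes * RATES.realtimeAudioOutPerMin;
}

// VoicePipeline: billed separately for STT, TTS, and LLM calls.
function pipelineCost(userMinutes, agentMinutes, turns) {
    return userMinutes * RATES.sttPerMin
         + agentMinutes * RATES.ttsPerMin
         + turns * RATES.llmPerTurn;
}
```

For a 10-minute call split into 6 minutes of user speech and 4 minutes of agent speech across 20 turns, comparing realtimeCost(6, 4) with pipelineCost(6, 4, 20) makes the tradeoff concrete; under these made-up rates the pipeline comes out cheaper, a result that flips as rates and talk ratios change.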


Written by

CallSphere Team
