
Building Voice Agents with WebRTC and OpenAI Realtime API

Build low-latency browser-based voice agents using WebRTC peer connections and OpenAI's Realtime API — from obtaining ephemeral tokens to establishing audio tracks and handling speech-to-speech interactions.

Why WebRTC for Voice Agents

The VoicePipeline approach we covered in previous posts runs the STT-Agent-TTS chain on your server. Every audio packet travels from the client to your server, then to OpenAI's API (for STT, LLM, and TTS), and back. Each network hop adds latency.

WebRTC eliminates the middleman. The browser establishes a direct peer connection with OpenAI's Realtime API servers. Audio flows over UDP with no intermediate server processing. The Realtime API uses a single multimodal model that accepts audio directly and produces audio directly — no separate STT or TTS steps.

The result is sub-300ms response times for voice interactions. The user speaks, and the agent responds almost instantly, creating a conversational experience that feels as natural as talking to another person.

Architecture Overview

The WebRTC voice agent architecture has three components:

[Browser]                    [Your Server]              [OpenAI Realtime API]
    |                             |                              |
    |-- request ephemeral key --> |                              |
    |                             |-- create ephemeral key ----> |
    |                             |<-- ephemeral key ----------- |
    |<-- ephemeral key ---------- |                              |
    |                             |                              |
    |-- WebRTC offer -------------|----------------------------> |
    |<-- WebRTC answer -----------|----------------------------- |
    |                             |                              |
    |<========= direct audio over UDP (WebRTC) ===============> |
    |                             |                              |

Your backend server has one job: creating ephemeral API keys. You never want your real OpenAI API key exposed in browser JavaScript. The ephemeral key is short-lived (typically 60 seconds) and scoped to a single Realtime session.

Once the WebRTC connection is established, audio flows directly between the browser and OpenAI. Your server is out of the data path entirely.

Step 1: Backend Ephemeral Key Endpoint

Create a simple API endpoint that generates ephemeral keys:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import httpx
import os

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # Restrict in production
    allow_methods=["POST"],
    allow_headers=["*"],
)

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # Read from the environment — never hardcode the key

@app.post("/api/realtime/session")
async def create_realtime_session():
    """Create an ephemeral key for a Realtime API session."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/realtime/sessions",
            headers={
                "Authorization": f"Bearer {OPENAI_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": "gpt-4o-realtime-preview",
                "voice": "nova",
                "instructions": "You are a helpful voice assistant. Keep responses concise.",
                "input_audio_transcription": {
                    "model": "whisper-1",
                },
            },
        )
        response.raise_for_status()
        data = response.json()
        return {
            "client_secret": data["client_secret"]["value"],
            "session_id": data["id"],
        }

The client_secret is the ephemeral key. It is valid only for establishing a single WebRTC connection and expires quickly. The instructions and voice configure the Realtime session. Most settings, including instructions, can be adjusted later over the data channel with session.update, but the voice is fixed once the model has produced audio.

Step 2: Browser WebRTC Client

The browser side establishes the WebRTC connection and manages audio:

<!DOCTYPE html>
<html>
<head>
    <title>Voice Agent</title>
</head>
<body>
    <h1>Voice Agent</h1>
    <button id="startBtn">Start Conversation</button>
    <button id="stopBtn" disabled>Stop</button>
    <div id="status">Ready</div>
    <div id="transcript"></div>

    <script>
    let peerConnection = null;

    document.getElementById("startBtn").addEventListener("click", startConversation);
    document.getElementById("stopBtn").addEventListener("click", stopConversation);

    async function startConversation() {
        const statusEl = document.getElementById("status");
        statusEl.textContent = "Connecting...";

        // Step 1: Get ephemeral key from your backend
        const tokenResponse = await fetch("/api/realtime/session", {
            method: "POST",
        });
        const tokenData = await tokenResponse.json();
        const ephemeralKey = tokenData.client_secret;

        // Step 2: Create RTCPeerConnection
        peerConnection = new RTCPeerConnection();

        // Step 3: Set up audio output — agent's voice comes through here
        const audioElement = document.createElement("audio");
        audioElement.autoplay = true;
        document.body.appendChild(audioElement);

        peerConnection.ontrack = (event) => {
            audioElement.srcObject = event.streams[0];
        };

        // Step 4: Capture microphone and add audio track
        const mediaStream = await navigator.mediaDevices.getUserMedia({
            audio: {
                sampleRate: 24000,
                channelCount: 1,
                echoCancellation: true,
                noiseSuppression: true,
            },
        });
        mediaStream.getTracks().forEach((track) => {
            peerConnection.addTrack(track, mediaStream);
        });

        // Step 5: Create data channel for events
        const dataChannel = peerConnection.createDataChannel("oai-events");
        setupDataChannel(dataChannel);

        // Step 6: Create and set local offer
        const offer = await peerConnection.createOffer();
        await peerConnection.setLocalDescription(offer);

        // Step 7: Send offer to OpenAI Realtime API
        const sdpResponse = await fetch(
            "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
            {
                method: "POST",
                headers: {
                    "Authorization": "Bearer " + ephemeralKey,
                    "Content-Type": "application/sdp",
                },
                body: offer.sdp,
            }
        );

        // Step 8: Set remote answer
        const answerSdp = await sdpResponse.text();
        await peerConnection.setRemoteDescription({
            type: "answer",
            sdp: answerSdp,
        });

        statusEl.textContent = "Connected — speak naturally";
        document.getElementById("startBtn").disabled = true;
        document.getElementById("stopBtn").disabled = false;
    }
    </script>
</body>
</html>

Let us walk through each step:

Steps 1-2 obtain the ephemeral key and create a WebRTC peer connection. The RTCPeerConnection is the browser API that manages the UDP-based audio channel.

Step 3 sets up audio output. When OpenAI sends audio back through the WebRTC connection, the browser receives it as a media stream track. Attaching it to an <audio> element plays it through the speakers automatically.

Step 4 captures the user's microphone. The getUserMedia API requests microphone access and returns a media stream. We add each track from this stream to the peer connection so it gets sent to OpenAI. The echoCancellation and noiseSuppression options are critical for preventing feedback loops.

Steps 5-8 complete the WebRTC signaling handshake. The browser creates an SDP (Session Description Protocol) offer describing its audio capabilities, sends it to OpenAI's Realtime endpoint, and receives an SDP answer. Once both sides have exchanged SDPs, the direct audio channel opens.
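Note that the SDP exchange completing does not mean audio is flowing yet; ICE negotiation can take another moment. A small helper, sketched below using the standard connectionState property and onconnectionstatechange handler of RTCPeerConnection, can surface that progress to the user (statusEl is the status div from the page above):

```javascript
// Sketch: reflect WebRTC connection state changes in the UI.
// Works with any object exposing `connectionState` and an
// `onconnectionstatechange` handler slot, like RTCPeerConnection.
function attachConnectionMonitor(pc, statusEl) {
    pc.onconnectionstatechange = () => {
        const state = pc.connectionState;
        if (state === "connected") {
            statusEl.textContent = "Connected";
        } else if (state === "failed" || state === "disconnected") {
            statusEl.textContent = "Connection lost. Try reconnecting.";
        } else {
            statusEl.textContent = "Connecting (" + state + ")...";
        }
    };
}
```

Call attachConnectionMonitor(peerConnection, statusEl) right after creating the peer connection in Step 2, before the offer is sent.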


Step 3: Handling Data Channel Events

The data channel carries structured events alongside the audio stream — transcripts, function calls, errors, and session updates:

function setupDataChannel(dataChannel) {
    const transcriptEl = document.getElementById("transcript");

    dataChannel.onopen = () => {
        console.log("Data channel open");

        // Optionally send a session update to configure behavior
        dataChannel.send(JSON.stringify({
            type: "session.update",
            session: {
                turn_detection: {
                    type: "server_vad",
                    threshold: 0.5,
                    prefix_padding_ms: 300,
                    silence_duration_ms: 500,
                },
            },
        }));
    };

    dataChannel.onmessage = (event) => {
        const data = JSON.parse(event.data);

        switch (data.type) {
            case "response.audio_transcript.delta":
                // Streaming transcript of agent's response
                transcriptEl.textContent += data.delta;
                break;

            case "response.audio_transcript.done":
                // Agent finished speaking
                transcriptEl.textContent += "\n";
                break;

            case "conversation.item.input_audio_transcription.completed":
                // What the user said (STT result)
                transcriptEl.textContent += "You: " + data.transcript + "\n";
                break;

            case "response.function_call_arguments.done":
                // The model wants to call a function
                handleFunctionCall(data, dataChannel);
                break;

            case "error":
                console.error("Realtime API error:", data.error);
                break;
        }
    };
}

The data channel event model is rich. The Realtime API streams response transcripts token by token (delta events), reports when the agent finishes a response (done events), and emits function call requests that your client can handle.
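The channel is bidirectional beyond configuration: the client can also inject conversation items. As a sketch, sending a typed user message as a conversation.item.create event followed by response.create makes the agent answer in audio without any microphone input (the event shapes below follow the Realtime API's client event format; verify field names against the current event reference):

```javascript
// Sketch: inject a typed user message so the agent replies in audio.
// Event shapes assume the Realtime API client event format.
function sendTextMessage(dataChannel, text) {
    dataChannel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "message",
            role: "user",
            content: [{ type: "input_text", text: text }],
        },
    }));
    // Ask the model to generate a (spoken) response to the new item.
    dataChannel.send(JSON.stringify({ type: "response.create" }));
}
```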

Step 4: Function Calling Over WebRTC

The Realtime API supports function calling, but with a difference: function execution happens on the client side (or on your server via the data channel). Here is how to handle function calls:

async function handleFunctionCall(data, dataChannel) {
    const functionName = data.name;
    const args = JSON.parse(data.arguments);
    const callId = data.call_id;

    let result;

    switch (functionName) {
        case "get_weather":
            result = await fetchWeather(args.city);
            break;
        case "lookup_order":
            result = await fetchOrderStatus(args.order_id);
            break;
        default:
            result = JSON.stringify({ error: "Unknown function" });
    }

    // Send the function result back through the data channel
    dataChannel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: callId,
            output: typeof result === "string" ? result : JSON.stringify(result),
        },
    }));

    // Tell the model to continue generating a response
    dataChannel.send(JSON.stringify({
        type: "response.create",
    }));
}

async function fetchWeather(city) {
    // Call your backend API
    const response = await fetch(`/api/weather?city=${encodeURIComponent(city)}`);
    const data = await response.json();
    return JSON.stringify(data);
}

The flow is: the model detects it needs to call a function, sends a function call event through the data channel, your JavaScript handles the call (often by hitting your backend API), sends the result back through the data channel, and tells the model to continue generating. The model then incorporates the function result into its spoken response.
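Because the round trip only touches dataChannel.send, it can be verified offline with a stub channel that records events instead of transmitting them; a minimal sketch:

```javascript
// Sketch: a stub data channel for testing handlers without a live
// connection. Records parsed outgoing events in `sent`.
function makeStubChannel() {
    const sent = [];
    return { sent, send: (msg) => sent.push(JSON.parse(msg)) };
}

// Minimal result sender mirroring the handler above: one
// function_call_output item, then a response.create.
function sendFunctionResult(channel, callId, result) {
    channel.send(JSON.stringify({
        type: "conversation.item.create",
        item: {
            type: "function_call_output",
            call_id: callId,
            output: typeof result === "string" ? result : JSON.stringify(result),
        },
    }));
    channel.send(JSON.stringify({ type: "response.create" }));
}
```

After sendFunctionResult(makeStubChannel(), "call_123", { temperature: 21 }), the stub's sent array holds the two events in order, which is exactly what a live session would receive.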

To register functions with the session, send them during setup:

dataChannel.onopen = () => {
    dataChannel.send(JSON.stringify({
        type: "session.update",
        session: {
            tools: [
                {
                    type: "function",
                    name: "get_weather",
                    description: "Get current weather for a city",
                    parameters: {
                        type: "object",
                        properties: {
                            city: {
                                type: "string",
                                description: "City name",
                            },
                        },
                        required: ["city"],
                    },
                },
            ],
        },
    }));
};

Stopping the Conversation

Clean shutdown is important to release resources:

function stopConversation() {
    if (peerConnection) {
        // Stop all audio tracks
        peerConnection.getSenders().forEach((sender) => {
            if (sender.track) {
                sender.track.stop();
            }
        });

        // Close the peer connection
        peerConnection.close();
        peerConnection = null;
    }

    document.getElementById("status").textContent = "Disconnected";
    document.getElementById("startBtn").disabled = false;
    document.getElementById("stopBtn").disabled = true;
}

Stopping the media tracks releases the microphone. Closing the peer connection terminates the WebRTC session and the Realtime API session on OpenAI's side.

Turn Detection with Server VAD

The Realtime API includes server-side voice activity detection. When configured, the server automatically detects when the user starts and stops speaking, eliminating the need for client-side VAD:

dataChannel.send(JSON.stringify({
    type: "session.update",
    session: {
        turn_detection: {
            type: "server_vad",
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
        },
    },
}));

With server VAD enabled, the model automatically starts processing when it detects the user has finished speaking. No explicit "end of turn" signal is needed from the client. The user speaks, pauses, and the agent responds — the same natural flow as a phone call.

You can also disable server VAD and manage turns manually by sending input_audio_buffer.commit events through the data channel. This is useful for push-to-talk interfaces.
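A push-to-talk variant might look like the sketch below, assuming server VAD has been disabled (turn_detection set to null in session.update) and the microphone track stays muted except while the button is held:

```javascript
// Sketch: push-to-talk turn management with server VAD disabled.
// While the button is held the mic track is live; on release we
// commit the buffered audio and request a response.
function setupPushToTalk(button, micTrack, dataChannel) {
    micTrack.enabled = false;  // muted until the button is pressed

    button.onmousedown = () => {
        micTrack.enabled = true;
    };

    button.onmouseup = () => {
        micTrack.enabled = false;
        // Explicitly close the user's turn.
        dataChannel.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
        dataChannel.send(JSON.stringify({ type: "response.create" }));
    };
}
```

Disabling the track rather than stopping it keeps the WebRTC sender alive between turns, so the next press needs no renegotiation.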

VoicePipeline vs Realtime API: Production Tradeoffs

Having now covered both approaches in detail, here is a summary of the production tradeoffs:

Latency: Realtime API wins. Sub-300ms vs 800ms+ for VoicePipeline. If your users are having real-time conversations and low latency is essential, use the Realtime API.

Agent complexity: VoicePipeline wins. It uses the full Agents SDK with native support for handoffs, guardrails, multi-agent workflows, and complex tool chains. The Realtime API supports function calling but lacks the orchestration layer.

Infrastructure control: VoicePipeline wins. Audio processing happens on your servers. You can log, record, analyze, and comply with regulations that require data to stay in your infrastructure.

Cost: Depends on usage. The Realtime API charges for audio tokens (audio input and output). VoicePipeline charges separately for STT, LLM, and TTS. For long conversations with short responses, VoicePipeline may be cheaper. For rapid back-and-forth exchanges, the Realtime API may be more cost-effective.

Browser support: Realtime API wins. WebRTC is natively supported in all modern browsers. VoicePipeline requires a server-side component and a WebSocket or similar transport to connect the browser.

Telephony integration: VoicePipeline wins. SIP and PSTN integrations work with server-side audio processing. WebRTC can work with telephony gateways but adds complexity.

Choose based on your highest-priority requirement. Many production systems use a hybrid: the Realtime API for the conversational interface and a VoicePipeline-based backend for complex processing tasks that get triggered by function calls.
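The cost tradeoff above reduces to simple arithmetic once you plug in rates. The sketch below uses placeholder per-minute rates — not current OpenAI pricing — so substitute real numbers before drawing conclusions:

```javascript
// Sketch: compare per-conversation cost of the two approaches.
// All RATES values are illustrative placeholders, not real pricing.
const RATES = {
    realtimeAudioInPerMin: 0.06,   // placeholder
    realtimeAudioOutPerMin: 0.24,  // placeholder
    sttPerMin: 0.006,              // placeholder
    ttsPerMin: 0.015,              // placeholder
    llmPerTurn: 0.01,              // placeholder
};

// Realtime API: billed on audio in and audio out.
function realtimeCost(userMinutes, agentMinutes) {
    return userMinutes * RATES.realtimeAudioInPerMin
         + agentMinutes * RATES.realtimeAudioOutPerMin;
}

// VoicePipeline: billed separately for STT, TTS, and LLM calls.
function pipelineCost(userMinutes, agentMinutes, turns) {
    return userMinutes * RATES.sttPerMin
         + agentMinutes * RATES.ttsPerMin
         + turns * RATES.llmPerTurn;
}
```

For a 10-minute call split into 6 minutes of user speech and 4 minutes of agent speech across 20 turns, comparing realtimeCost(6, 4) with pipelineCost(6, 4, 20) makes the tradeoff concrete; under these made-up rates the pipeline comes out cheaper, a result that flips as rates and talk ratios change.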


Written by

CallSphere Team
