---
title: "Building Voice Agents with the OpenAI Realtime API: Full Tutorial"
description: "Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling."
canonical: https://callsphere.ai/blog/openai-realtime-api-voice-agents-tutorial
category: "Technical Guides"
tags: ["AI Voice Agent", "Technical Guide", "OpenAI", "Realtime API", "WebSocket", "Function Calling", "Tutorial"]
author: "CallSphere Team"
published: 2026-04-08T00:00:00.000Z
updated: 2026-05-06T01:02:47.143Z
---

# Building Voice Agents with the OpenAI Realtime API: Full Tutorial

> Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling.

## Why this API changed the playbook

Before the Realtime API, building a voice agent meant wiring together Whisper (or Deepgram), an LLM, and a TTS service over three separate connections, then fighting a constant battle with latency and interruption handling. The Realtime API collapses all three into one WebSocket that streams audio in and audio out and surfaces a clean event model for interruptions and tool calls.

This is a hands-on tutorial for building a working voice agent on top of the Realtime API. It does not assume a telephony provider — you can run everything locally with a laptop microphone first, then swap in Twilio later.

```
mic  ──PCM16──►  Realtime API  ──PCM16──►  speaker
                      │
                      ├── session.created
                      ├── input_audio_buffer.speech_started
                      ├── response.audio.delta
                      ├── response.function_call_arguments.done
                      └── response.done
```

## Architecture overview

```
┌───────────────────────────────┐
│ Node.js client                │
│ • arecord / aplay audio I/O   │
│ • WebSocket to Realtime API   │
│ • tool dispatcher             │
└───────────────┬───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ OpenAI Realtime API           │
│ gpt-4o-realtime-preview-      │
│ 2025-06-03                    │
└───────────────────────────────┘
```

## Prerequisites

- Node.js 20+ or Python 3.11+.
- An OpenAI API key with Realtime access.
- PortAudio (macOS: `brew install portaudio`, Linux: `apt install libportaudio2`).
- Basic familiarity with WebSocket events.

## Step-by-step walkthrough

### 1. Open the WebSocket and configure the session

```typescript
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
  {
    headers: {
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
      "OpenAI-Beta": "realtime=v1",
    },
  },
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "alloy",
      instructions: "You are a friendly receptionist for Acme Clinic.",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: { type: "server_vad", silence_duration_ms: 400, threshold: 0.5 },
      tools: [
        {
          type: "function",
          name: "check_availability",
          description: "Check provider availability",
          parameters: {
            type: "object",
            properties: {
              provider_id: { type: "string" },
              date: { type: "string", description: "YYYY-MM-DD" },
            },
            required: ["provider_id", "date"],
          },
        },
      ],
    },
  }));
});
```
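
The API acknowledges the update with a `session.updated` event and surfaces configuration problems (for example a malformed tool schema) as `error` events. A small sanity check before you start streaming audio might look like this:

```typescript
// Optional: confirm the config was accepted before streaming audio.
ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "session.updated") {
    console.log("session ready:", evt.session.voice, evt.session.turn_detection?.type);
  }
  if (evt.type === "error") {
    console.error("realtime error:", evt.error);
  }
});
```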

### 2. Stream microphone audio

```typescript
import { spawn } from "child_process";

// arecord pipes PCM16 at 24kHz mono to stdout
const mic = spawn("arecord", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1", "-t", "raw"]);

mic.stdout.on("data", (chunk) => {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: chunk.toString("base64"),
  }));
});
```
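
`arecord` is Linux-only (ALSA). On macOS, one option is SoX (`brew install sox`), whose `rec` command can emit the same raw PCM16 stream to stdout — the flags below are one working combination, not the only one:

```typescript
// macOS alternative to arecord: SoX's `rec` writing raw PCM16 to stdout.
const mic = spawn("rec", [
  "-q",                    // quiet
  "-t", "raw",             // headerless output
  "-r", "24000",           // 24kHz sample rate
  "-e", "signed-integer",  // PCM16 encoding
  "-b", "16",
  "-c", "1",               // mono
  "-",                     // write to stdout
]);
```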

### 3. Play back the model's audio

```typescript
import { spawn as spawn2 } from "child_process";

const speaker = spawn2("aplay", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1"]);

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    speaker.stdin.write(Buffer.from(evt.delta, "base64"));
  }
});
```

### 4. Handle function calls

```typescript
ws.on("message", async (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.function_call_arguments.done") {
    const args = JSON.parse(evt.arguments);
    let result: unknown;
    if (evt.name === "check_availability") {
      result = await checkAvailability(args.provider_id, args.date);
    }
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: evt.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
```
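
The handler assumes a `checkAvailability` helper exists. A minimal stand-in might look like the sketch below — the endpoint and response shape are purely illustrative placeholders for your real scheduling backend:

```typescript
// Illustrative stub only — swap in your actual scheduling backend.
async function checkAvailability(providerId: string, date: string): Promise<unknown> {
  const res = await fetch(
    `https://scheduling.example.com/availability?provider=${encodeURIComponent(providerId)}&date=${date}`,
  );
  if (!res.ok) {
    return { available: false, error: `lookup failed (${res.status})` };
  }
  return res.json(); // e.g. { available: true, slots: ["09:00", "10:30"] }
}
```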

### 5. Handle interruptions

When the caller starts speaking mid-response, clear the output buffer and cancel the in-flight response.


```typescript
if (evt.type === "input_audio_buffer.speech_started") {
  ws.send(JSON.stringify({ type: "response.cancel" }));
}
```
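
`response.cancel` stops new audio deltas, but whatever is already buffered in `aplay` keeps playing. A blunt but workable approach, building on the playback setup from step 3, is to kill and respawn the playback process on barge-in:

```typescript
// Replaces the `const speaker` from step 3 with a reassignable binding.
let speaker = spawn("aplay", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1"]);

function flushPlayback() {
  speaker.kill("SIGKILL");
  speaker = spawn("aplay", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1"]);
}

// Call it alongside the cancel:
if (evt.type === "input_audio_buffer.speech_started") {
  ws.send(JSON.stringify({ type: "response.cancel" }));
  flushPlayback();
}
```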

### 6. Log the transcript

The Realtime API emits transcript deltas for both sides. Collect them for later analysis.

```typescript
if (evt.type === "conversation.item.input_audio_transcription.completed") {
  console.log("user:", evt.transcript);
}
if (evt.type === "response.audio_transcript.done") {
  console.log("agent:", evt.transcript);
}
```
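
If you want those lines somewhere more durable than stdout, a minimal sketch that appends each completed utterance to a local JSONL file (the path is arbitrary — swap in your own sink):

```typescript
import { appendFileSync } from "fs";

// One JSON line per completed utterance, for post-call analysis.
function logTurn(role: "user" | "agent", transcript: string) {
  appendFileSync(
    "transcript.jsonl",
    JSON.stringify({ role, transcript, at: new Date().toISOString() }) + "\n",
  );
}

if (evt.type === "conversation.item.input_audio_transcription.completed") {
  logTurn("user", evt.transcript);
}
if (evt.type === "response.audio_transcript.done") {
  logTurn("agent", evt.transcript);
}
```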

## Production considerations

- **Heartbeats**: send a WebSocket ping every 15s to keep the connection alive through proxies.
- **Reconnects**: on unexpected close, reconnect with exponential backoff and replay the last session config (see the sketch after this list).
- **Rate limits**: the Realtime API has concurrent session limits per org. Monitor and scale your quota.
- **Cost**: billing is per minute of input and output audio. Hang up on silence aggressively.
- **PII**: the transcript contains everything callers say. Encrypt at rest and scope access.
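
A minimal reconnect-plus-heartbeat sketch — `createSocket` and `sessionConfig` are stand-ins for the connection setup and session config from step 1:

```typescript
let attempt = 0;

function connect() {
  const ws = createSocket(); // stand-in: opens the WebSocket with auth headers as in step 1
  let heartbeat: NodeJS.Timeout | undefined;

  ws.on("open", () => {
    attempt = 0;
    ws.send(JSON.stringify({ type: "session.update", session: sessionConfig }));
    // Ping every 15s so idle proxies don't drop the connection.
    heartbeat = setInterval(() => ws.ping(), 15_000);
  });

  ws.on("close", () => {
    clearInterval(heartbeat);
    const delay = Math.min(30_000, 1_000 * 2 ** attempt++); // 1s, 2s, 4s ... capped at 30s
    setTimeout(connect, delay);
  });
}

connect();
```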

## CallSphere's real implementation

CallSphere uses the OpenAI Realtime API with `gpt-4o-realtime-preview-2025-06-03` as the core of its voice and chat agents. Server VAD is on, audio is PCM16 at 24kHz, and every vertical ships its own tool schema: 14 tools for healthcare (insurance verification, appointment booking, provider lookup, and more), 10 agents for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and an ElevenLabs TTS pod with 5 GPT-4 specialists for sales.

Multi-agent handoffs run through the OpenAI Agents SDK so a single caller can be routed from a triage agent to a specialist mid-call without dropping audio. Post-call analytics are handled by a GPT-4o-mini pipeline that writes sentiment, intent, and lead score into per-vertical Postgres. CallSphere supports 57+ languages and keeps end-to-end response time under one second.

## Common pitfalls

- **Wrong sample rate**: 16kHz audio will work but degrades quality; stick to 24kHz.
- **Not handling `response.function_call_arguments.done`**: you will miss tool calls.
- **Pushing audio faster than realtime**: the API expects near-realtime ingest; bursty pushes confuse server VAD (see the pacing sketch after this list).
- **Ignoring `response.done`**: you lose the end-of-turn signal.
- **No reconnect logic**: the socket will drop eventually; plan for it.
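
For the pacing pitfall, a simple illustration: when replaying a recorded PCM16 file during testing, throttle the appends to roughly wall-clock speed. 24kHz mono PCM16 is 48,000 bytes per second, so 4,800 bytes is about 100ms of audio. The file name below is hypothetical:

```typescript
import { readFileSync } from "fs";

// Pace appends at roughly realtime speed: 4,800 bytes ≈ 100ms of 24kHz PCM16.
const pcm = readFileSync("sample-call.raw"); // hypothetical test recording
const CHUNK_BYTES = 4800;
let offset = 0;

const pacer = setInterval(() => {
  if (offset >= pcm.length) {
    clearInterval(pacer);
    return;
  }
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm.subarray(offset, offset + CHUNK_BYTES).toString("base64"),
  }));
  offset += CHUNK_BYTES;
}, 100);
```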

## FAQ

### Can I use this with a phone number?

Yes — bridge Twilio Media Streams to your WebSocket server and forward audio in both directions.
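
One detail that simplifies the bridge: Twilio Media Streams carries 8kHz G.711 μ-law audio, and the Realtime API accepts that format directly, so the bridge can forward payloads without resampling. A sketch of the format switch, assuming the rest of the session config stays as in step 1:

```typescript
// Telephony bridge: match the session's audio format to what Twilio sends.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    input_audio_format: "g711_ulaw",
    output_audio_format: "g711_ulaw",
  },
}));
```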

### What is the difference between server VAD and client VAD?

Server VAD runs on OpenAI's side and generates `speech_started` events automatically. Client VAD lets you control turn-taking manually. Start with server VAD.
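
If you do take over turn-taking, the sketch below disables server VAD and commits the audio buffer yourself — `endTurn` is an illustrative name for whatever your end-of-turn trigger is (push-to-talk release, a local VAD):

```typescript
// Disable server VAD so the model waits for an explicit commit.
ws.send(JSON.stringify({
  type: "session.update",
  session: { turn_detection: null },
}));

// Call this when your own logic decides the caller's turn has ended.
function endTurn() {
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
}
```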

### How do I change the voice mid-call?

Send another `session.update` with the new voice name. Do it between turns, not during a response.
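
For example, switching to the `verse` voice between turns might look like this:

```typescript
// Swap the voice for subsequent responses; send between turns, not mid-response.
ws.send(JSON.stringify({
  type: "session.update",
  session: { voice: "verse" },
}));
```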

### Does it support streaming function outputs back?

Yes — once you send the `function_call_output` item, the model picks it up and continues speaking.

### Can I use multiple tools in one turn?

Yes. The model can emit multiple tool calls, and you should respond to each before calling `response.create`.
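
A sketch of one way to handle that — `pendingOutputs` and `runTool` are illustrative names, not part of the API — is to queue each result as its arguments complete, then flush them after `response.done`:

```typescript
// Declared once, outside the message handler.
const pendingOutputs: { call_id: string; output: string }[] = [];

// Inside the async message handler from step 4:
if (evt.type === "response.function_call_arguments.done") {
  const result = await runTool(evt.name, JSON.parse(evt.arguments)); // runTool = your dispatcher
  pendingOutputs.push({ call_id: evt.call_id, output: JSON.stringify(result) });
}

if (evt.type === "response.done" && pendingOutputs.length > 0) {
  for (const out of pendingOutputs) {
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: { type: "function_call_output", ...out },
    }));
  }
  pendingOutputs.length = 0;
  ws.send(JSON.stringify({ type: "response.create" }));
}
```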

## Next steps

Want to see a full Realtime API deployment in production? [Book a demo](https://callsphere.tech/contact), explore the [technology page](https://callsphere.tech/technology), or browse [pricing](https://callsphere.tech/pricing).

#CallSphere #OpenAIRealtime #VoiceAI #Tutorial #WebSocket #FunctionCalling #AIVoiceAgents

---

Source: https://callsphere.ai/blog/openai-realtime-api-voice-agents-tutorial
