# Building OpenAI Realtime Voice Agents with an Eval Pipeline (2026)
Build a working voice agent with the OpenAI Realtime API + Agents SDK, then bolt on an eval pipeline that catches barge-in failures, hallucinated grounding, and latency regressions.
## TL;DR
The OpenAI Realtime API has matured enough in 2026 that you can ship a real voice agent in an afternoon. The hard part is the next two weeks — convincing yourself it actually works. This post walks through a working voice agent on gpt-realtime-2025-08-28 using the @openai/agents SDK 0.9.0 (the realtime path), and then bolts on an offline eval pipeline that replays recorded audio sessions to catch the three failure modes that bite voice agents in production: barge-in regressions, hallucinated grounding, and latency creep. Real TypeScript code from our voice agent platform, real numbers from ~280k monthly sessions, no hand-waving.
## The Realtime Stack in 2026 — What Actually Changed
The 2025-era pattern of stitching STT + LLM + TTS together with three round trips is dead for any latency-sensitive use case. The pinned model snapshot we run today is gpt-realtime-2025-08-28, which speaks audio in and audio out over a single bidirectional session. The relevant primitives:
- Transport. WebRTC for browser/native clients (sub-200 ms median glass-to-glass), WebSocket for server-side bridges (Twilio, SIP). WebRTC's built-in jitter buffer and forward error correction make it the default for anything talking to a human ear; WebSocket is the right choice when you're already inside an audio bridge that handles loss for you.
- Server-side VAD. OpenAI's `server_vad` mode replaced our hand-rolled `webrtcvad` Python wrapper. Set `turn_detection.type = "server_vad"`, tune `silence_duration_ms` (we run 320 ms), and the model itself decides when the user stopped talking.
- Native interruption. When a user starts speaking while the model is mid-utterance, the SDK fires `audio_interrupted` and you call `response.cancel` to stop generation. Done correctly, the user never hears the model talk over them.
- Tool calls inline. Function calling works mid-response, so `get_appointment_slots()` fires while the model is still saying "let me check that for you."
If you've been running the cascaded pattern, the cognitive shift is that the model is the conversation loop, not a request/response endpoint. You connect once per call and keep the socket open for the duration.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
## A Minimal Working Voice Agent
Here's the actual TypeScript we run on the server side of CallSphere, trimmed of platform-specific glue. The full version handles SIP audio framing, but the SDK shape is the same:
```ts
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";
import { tool } from "@openai/agents";
import { z } from "zod";

const getSlots = tool({
  name: "get_appointment_slots",
  description: "Return the next 5 open slots for a given clinic.",
  parameters: z.object({
    clinic_id: z.string(),
    date: z.string().describe("ISO date YYYY-MM-DD"),
  }),
  execute: async ({ clinic_id, date }) => {
    return await db.slots.findOpen({ clinic_id, date, limit: 5 });
  },
});

const receptionist = new RealtimeAgent({
  name: "Receptionist",
  instructions: `You are a warm, concise medical-office receptionist.
    Always confirm the patient's name and date of birth before scheduling.
    Never quote prices unless the tool returned a price.`,
  tools: [getSlots],
});

const session = new RealtimeSession(receptionist, {
  model: "gpt-realtime-2025-08-28",
  config: {
    audio: {
      input: { format: "pcm16", sampleRate: 24000 },
      output: { format: "pcm16", voice: "marin" },
    },
    turnDetection: {
      type: "server_vad",
      threshold: 0.55,
      silenceDurationMs: 320,
      prefixPaddingMs: 200,
    },
  },
});

await session.connect({ apiKey: process.env.OPENAI_API_KEY! });

session.on("audio_interrupted", () => session.cancelResponse());
session.on("history_updated", (h) => persistTurn(callId, h.at(-1)));
session.on("error", (e) => log.error({ callId, e }, "realtime error"));
```
Three things to notice. First, `turnDetection` is where most "the bot keeps talking over me" bugs live — `silenceDurationMs` below 250 ms causes premature cutoffs, above 400 ms causes awkward dead air. Second, the `audio_interrupted` handler is non-negotiable; without it, barge-in feels broken even though the model technically supports it. Third, we persist every turn to Postgres keyed by `callId` so the eval pipeline (next section) has something to chew on.
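For completeness, here is roughly what that persistence hook can look like. This is a minimal sketch assuming a plain `pg` pool and a `call_turns` table with a JSONB column; the table name and column shape are illustrative, not our exact schema.

```ts
// Minimal sketch of the persistTurn hook referenced above. Table name and
// column shape are illustrative, not the exact production schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function persistTurn(callId: string, turn: unknown) {
  // One row per turn; the raw history item goes into JSONB so the replay
  // runner can reconstruct the conversation without coupling to SDK versions.
  await pool.query(
    "INSERT INTO call_turns (call_id, created_at, item) VALUES ($1, now(), $2)",
    [callId, JSON.stringify(turn)]
  );
}
```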
## Multi-Agent Handoffs Over Realtime
Real production calls are not one agent. A tier-1 receptionist hands off to a billing specialist when the user mentions a balance, or to a human when intent confidence drops. The Agents SDK realtime path supports this via `handoff()`:
```ts
import { handoff } from "@openai/agents";

const billing = new RealtimeAgent({
  name: "BillingSpecialist",
  instructions: "Handle balance, copay, and payment plan questions only.",
  tools: [getBalance, takePayment],
});

const receptionistWithHandoff = new RealtimeAgent({
  name: "Receptionist",
  instructions: `...
    If the caller asks about a balance, copay, or payment, hand off to BillingSpecialist.`,
  tools: [getSlots],
  handoffs: [handoff(billing)],
});
```
The handoff is a tool call under the hood — the model decides, the SDK rewires the session to the target agent's instructions and tools without dropping the audio socket. Latency penalty: ~120 ms median for the swap. Worth knowing because if you have four agents in a chain, those add up.
## The Runtime Loop Plus the Parallel Eval Lane
```mermaid
flowchart LR
U[Caller audio] -->|WebRTC/WebSocket| RT[Realtime Session]
RT --> M[gpt-realtime-2025-08-28]
M -->|tool call| T[Tools/DB/RAG]
T --> M
M -->|audio out| U
RT -->|persist| P[(Postgres turns + audio refs)]
P --> R[Eval Replay Runner]
R --> S[STT WER]
R --> G[Grounding judge]
R --> B[Barge-in checker]
R --> L[Latency p50/p95]
S & G & B & L --> D[Eval Dashboard]
D -->|regression| AL[Slack alert]
style M fill:#dff
style D fill:#ffd
style AL fill:#fcc
```
*Figure 1 — Live runtime is the caller-to-model loop; everything downstream of the Postgres persist is offline replay against persisted audio.*
The crucial design choice: **evals never run in the hot path**. The realtime session writes audio chunks and turn metadata to S3 + Postgres, and a batch worker replays them later. This keeps the call-quality budget intact while still giving you a 24-hour feedback loop on quality regressions.
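The write side is the `persistTurn` hook shown earlier; the read side is a sampler that picks yesterday's sessions for replay. A minimal sketch, assuming a `sessions` table and a 5% sample rate (both illustrative):

```ts
// Sampling sketch: select yesterday's sessions for offline replay.
// Table/column names and the 5% default sample rate are assumptions.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function sampleSessionsForEval(sampleRate = 0.05): Promise<string[]> {
  const { rows } = await pool.query<{ session_id: string }>(
    `SELECT session_id FROM sessions
     WHERE started_at > now() - interval '24 hours'
       AND random() < $1`,
    [sampleRate]
  );
  // These IDs feed the replay runner described below, never the live call path.
  return rows.map((r) => r.session_id);
}
```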
## The Eval Pipeline — Four Things It Has to Catch
Voice agents fail in ways text agents don't. Our eval suite is structured around the four highest-cost failure modes we've seen on real production traffic:
| Failure mode | Symptom | Detector | Cost when missed |
|---|---|---|---|
| STT mishears | Agent confidently answers the wrong question | WER vs. human transcript | High — wrong appointments booked |
| Hallucinated grounding | Agent quotes a price/policy not in the tool result | RAG groundedness judge | Very high — legal/PR risk |
| Barge-in failure | Agent talks over the user | Audio overlap detector | Medium — drives drop-off |
| Latency creep | Time-to-first-audio creeps past 800 ms | Span timing on `response.created` | High — abandonment |
### Replaying Recorded Sessions
The replay runner is a Node script that reads a session ID, pulls the caller-side audio from S3, and pipes it back into a fresh `RealtimeSession` running the *current* agent build. Because the audio is bit-identical to what production heard, the only variable is the agent code.
```ts
import { RealtimeSession } from "@openai/agents/realtime";
import { buildAgent } from "../src/agents/receptionist";
import { readPcmFromS3 } from "./s3";

export async function replay(sessionId: string) {
  const audio = await readPcmFromS3(`sessions/${sessionId}/caller.pcm`);
  const refTranscript = await loadHumanTranscript(sessionId);

  const session = new RealtimeSession(buildAgent(), {
    model: "gpt-realtime-2025-08-28",
    config: { audio: { input: { format: "pcm16", sampleRate: 24000 } } },
  });

  const startedAt = Date.now();
  let firstAudioMs: number | null = null;
  session.on("response.audio.delta", () => {
    firstAudioMs ??= Date.now() - startedAt;
  });

  await session.connect({ apiKey: process.env.OPENAI_API_KEY! });
  await session.sendAudio(audio); // streams the recorded caller audio

  const out = await session.waitForCompletion();
  return {
    transcript: out.transcript,
    audioOut: out.audioBuffer,
    firstAudioMs,
    refTranscript,
  };
}
```
### Computing WER On the Replay
```ts
import { wordErrorRate } from "./metrics/wer";
const { transcript, refTranscript } = await replay(id);
const wer = wordErrorRate(refTranscript, transcript.user);
// On our regression suite, WER < 0.06 passes; > 0.10 blocks the release
```
We use the same `wer` implementation against both the user-side transcript (catches STT regressions) and the agent-side transcript (catches TTS phoneme drift after voice changes). On 220 replayed sessions, our current build sits at WER 0.041 user-side, 0.012 agent-side.
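If you don't already have a WER helper, a word-level Levenshtein distance is about twenty lines. Below is a minimal sketch of what a `wordErrorRate` implementation looks like; the normalization here is deliberately naive, and in practice you want number and date normalization before comparing spoken transcripts.

```ts
// Word error rate = word-level edit distance / reference length.
// Normalization is minimal here (lowercase, strip punctuation).
export function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string) =>
    s.toLowerCase().replace(/[^\w\s']/g, "").split(/\s+/).filter(Boolean);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
  const dp = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = dp[i - 1][j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      dp[i][j] = Math.min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1);
    }
  }
  return dp[ref.length][hyp.length] / ref.length;
}
```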
### The Grounding Judge
This is the most expensive evaluator and the one that has saved us the most. We run a separate `gpt-4o-2024-08-06` judge over each turn where the agent quoted a number, a date, or a policy:
```ts
const judgePrompt = `
You are auditing a voice agent transcript. For each agent statement that asserts a fact (price, date, policy), check whether the supporting tool result is in the provided context. Output JSON: { grounded: boolean, evidence: string }.
`;
```
A grounding rate below 0.95 blocks merge. The judge agreement with humans, calibrated quarterly, sits at 0.91 — high enough to trust, low enough that we still spot-check.
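The judge call itself is a plain chat-completions request with JSON output. A sketch, assuming you pass it the agent statement and the raw tool results for that turn; the `judgeTurn` wrapper, its argument shape, and the import path for the prompt are illustrative.

```ts
// Per-turn grounding check. Wrapper name, argument shape, and JSON parsing
// are illustrative; the model and prompt come from the text above.
import OpenAI from "openai";
import { judgePrompt } from "./judge-prompt"; // the prompt constant from the previous snippet (path illustrative)

const openai = new OpenAI();

interface JudgeVerdict {
  grounded: boolean;
  evidence: string;
}

export async function judgeTurn(agentStatement: string, toolResults: string): Promise<JudgeVerdict> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: judgePrompt },
      {
        role: "user",
        content: `Agent statement:\n${agentStatement}\n\nTool results (context):\n${toolResults}`,
      },
    ],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}") as JudgeVerdict;
}
```

The session-level grounding rate is simply grounded turns divided by judged turns; below 0.95 the gate blocks the merge.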
### Barge-In Checker
The detector is dead simple but easy to forget. We diff the user audio energy timeline against the agent audio energy timeline; any window where both exceed a VAD threshold for >150 ms is an overlap. Acceptable overlap rate per call: <2%. Above that, the model is not respecting the `audio_interrupted` event — usually because someone tweaked `silenceDurationMs` without re-running the suite.
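Here's a sketch of that overlap check over the two PCM streams. The 20 ms frame size and the 0.02 RMS threshold are assumptions you would calibrate against your own VAD settings; the 150 ms window and the 2% gate come from above.

```ts
// Barge-in overlap check: RMS energy per 20 ms frame on both channels, then
// count frames that sit inside a sustained (>150 ms) double-talk run.
const SAMPLE_RATE = 24000;
const FRAME_SAMPLES = (SAMPLE_RATE * 20) / 1000; // 20 ms of pcm16 samples

function rmsPerFrame(pcm: Int16Array): number[] {
  const frames: number[] = [];
  for (let start = 0; start + FRAME_SAMPLES <= pcm.length; start += FRAME_SAMPLES) {
    let sum = 0;
    for (let i = start; i < start + FRAME_SAMPLES; i++) sum += (pcm[i] / 32768) ** 2;
    frames.push(Math.sqrt(sum / FRAME_SAMPLES));
  }
  return frames;
}

export function overlapRatio(userPcm: Int16Array, agentPcm: Int16Array, energyThreshold = 0.02): number {
  const user = rmsPerFrame(userPcm);
  const agent = rmsPerFrame(agentPcm);
  const frames = Math.min(user.length, agent.length);
  let overlapFrames = 0;
  let run = 0;
  for (let i = 0; i < frames; i++) {
    if (user[i] > energyThreshold && agent[i] > energyThreshold) {
      run += 1;
      if (run * 20 > 150) overlapFrames += 1; // only sustained double-talk counts
    } else {
      run = 0;
    }
  }
  return frames === 0 ? 0 : overlapFrames / frames; // gate: fail the call above 2%
}
```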
### Latency Spans
We capture three timings per turn: `time_to_first_audio`, `time_to_tool_call`, and `time_to_response_complete`. Pin them to the trace ID. Our budgets:
| Span | p50 | p95 | Hard ceiling |
|---|---|---|---|
| time_to_first_audio | 480 ms | 780 ms | 1200 ms |
| time_to_tool_call | 240 ms | 410 ms | 800 ms |
| time_to_response_complete | 1.6 s | 2.9 s | 4.5 s |
If the hard ceiling is breached on more than 5% of replayed calls (that is, if p95 sits above the ceiling), the gate fails. We started shipping these spans as Datadog metrics last quarter, and the visibility alone caught two regressions before they hit production.
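The gate itself is a few lines over the replay results. A sketch, assuming each replayed call yields the three span timings; the result shape and function names are illustrative.

```ts
// Latency gate: fail the build if any span breaches its hard ceiling on more
// than 5% of replayed calls. Ceilings are the table values above.
interface SpanTimings {
  firstAudioMs: number;
  toolCallMs: number;
  completeMs: number;
}

const HARD_CEILINGS: Record<keyof SpanTimings, number> = {
  firstAudioMs: 1200,
  toolCallMs: 800,
  completeMs: 4500,
};

export function latencyGate(calls: SpanTimings[]): { pass: boolean; breachRates: Record<string, number> } {
  const breachRates: Record<string, number> = {};
  for (const span of Object.keys(HARD_CEILINGS) as (keyof SpanTimings)[]) {
    const breaches = calls.filter((c) => c[span] > HARD_CEILINGS[span]).length;
    breachRates[span] = calls.length > 0 ? breaches / calls.length : 0;
  }
  return { pass: Object.values(breachRates).every((r) => r <= 0.05), breachRates };
}
```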
## Wiring It Into CI
The replay suite runs nightly against the previous 24 hours of sampled traffic plus a fixed regression set of 180 sessions that represent past bugs. The runtime is ~12 minutes on a single Node worker pulling 8 sessions in parallel; cost is roughly $9 in OpenAI Realtime credits per nightly run. PRs that touch agent code trigger a smaller 40-session smoke run that finishes in under 4 minutes.
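The 8-way parallelism is a plain promise pool around the `replay()` runner from earlier; nothing fancier is needed. The `replayAll` name is ours for illustration.

```ts
// Simple promise pool: keeps `concurrency` replays in flight at a time.
import { replay } from "./replay";

export async function replayAll(sessionIds: string[], concurrency = 8) {
  const results: Awaited<ReturnType<typeof replay>>[] = [];
  let cursor = 0;
  await Promise.all(
    Array.from({ length: concurrency }, async () => {
      while (cursor < sessionIds.length) {
        const id = sessionIds[cursor++]; // synchronous read+increment, so no race
        results.push(await replay(id));
      }
    })
  );
  return results;
}
```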
The key trick: **replays use the recorded user audio, not synthesized TTS of a transcript**. Synthesized inputs miss real-world acoustics — background noise, accents, half-words — which is where most STT failures come from. If you only replay synthetic audio, your eval will look great while production keeps breaking.
## What Goes Wrong Without an Eval Pipeline
Before we built this, we shipped a voice change in November 2025 that improved subjective warmth but increased TTS phoneme drift on numbers — "fifteen" was being rendered closer to "fifty" with certain accents. We caught it a week later, only after a clinic complained that two appointments had been booked at the wrong time. Since the eval pipeline shipped, the same class of regression has been caught at PR review three times before merge.
The honest cost of the pipeline: about 12 engineer-days to build, $280/month in OpenAI eval credits at our session volume, and one ongoing review cadence where a clinical SME re-labels 40 reference transcripts a quarter. Cheap compared to one wrong appointment.
## Frequently Asked Questions
### Should I use WebRTC or WebSocket for the realtime session?
WebRTC if the caller is on a browser, mobile app, or anywhere with native WebRTC support — the built-in jitter buffer is worth the integration cost. WebSocket if you're bridging from telephony (Twilio Programmable Voice, SIP, Plivo) where the audio is already PCM frames in a server-to-server context. Mixing them in one product is fine; the SDK abstraction is the same.
### Why not just use the Whisper STT + GPT-4o + TTS cascade?
Latency. The cascaded path adds 600–900 ms of overhead at the joins, which puts you over the perceptual threshold for natural conversation (~700 ms total). The unified Realtime model gets to ~480 ms p50 time-to-first-audio because the audio encoder and the LLM decoder share a forward pass.
### How do I evaluate the audio quality of the model's voice itself?
We run a MOS proxy using a separate audio-quality model on a sampled 50 turns per night. It's noisy but trends are useful — when we changed voices to `marin`, the proxy shifted from 4.1 to 4.3 on the 1–5 scale, which matched user-survey feedback. See our companion piece on [voice agent quality metrics](/blog/voice-agent-quality-metrics-wer-latency-grounding) for the full metric breakdown.
### What about PII in the recorded audio?
Recordings are stored in a customer-isolated S3 prefix with KMS encryption, and the replay runner never leaves our VPC. Reference transcripts get PII tokens replaced (`<PHONE>`, `<DOB>`) before they're checked into the dataset repo. The judge sees the redacted version.
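The redaction pass is mostly regex over the reference transcripts before they hit the repo. A minimal sketch; the patterns below are illustrative, US-centric, and deliberately coarse (any ISO date gets tokenized), so treat this as a starting point rather than a compliance control.

```ts
// Replace obvious PII with tokens before reference transcripts are committed.
// Patterns are deliberately coarse: any ISO or US-style date becomes <DOB>.
export function redactPII(transcript: string): string {
  return transcript
    // US phone numbers, e.g. (555) 123-4567, 555-123-4567, +1 555 123 4567
    .replace(/(\+1[\s.-]?)?(\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}/g, "<PHONE>")
    // Dates in MM/DD/YYYY or YYYY-MM-DD form
    .replace(/\b\d{1,2}\/\d{1,2}\/\d{4}\b/g, "<DOB>")
    .replace(/\b\d{4}-\d{2}-\d{2}\b/g, "<DOB>");
}
```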
### How do I version the model when OpenAI ships a new realtime snapshot?
Pin the snapshot ID (`gpt-realtime-2025-08-28`) in code, not the floating alias. When a new snapshot drops, run the full eval suite against it on a branch, write down the score deltas per evaluator, and decide as a team whether to upgrade. We've upgraded twice this year; once we accepted a 0.4-point dip in tone score in exchange for a 90 ms latency improvement, and once we held off because grounding regressed by 1.2 points.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.