---
title: "Building OpenAI Realtime Voice Agents with an Eval Pipeline (2026)"
description: "Build a working voice agent with the OpenAI Realtime API + Agents SDK, then bolt on an eval pipeline that catches barge-in failures, hallucinated grounding, and latency regressions."
canonical: https://callsphere.ai/blog/openai-realtime-voice-agents-eval-pipeline-2026
category: "Agentic AI"
tags: ["Voice Agents", "OpenAI Realtime API", "Agent Evaluation", "WER", "Production AI", "Conversational AI"]
author: "CallSphere Team"
published: 2026-05-06T00:00:00.000Z
updated: 2026-05-06T07:06:01.618Z
---

# Building OpenAI Realtime Voice Agents with an Eval Pipeline (2026)

> Build a working voice agent with the OpenAI Realtime API + Agents SDK, then bolt on an eval pipeline that catches barge-in failures, hallucinated grounding, and latency regressions.

## TL;DR

The OpenAI Realtime API has matured enough in 2026 that you can ship a real voice agent in an afternoon. The hard part is the next two weeks — convincing yourself it actually works. This post walks through a working voice agent on `gpt-realtime-2025-08-28` using the [`@openai/agents`](https://github.com/openai/openai-agents-js) SDK 0.9.0 (the realtime path), and then bolts on an offline eval pipeline that replays recorded audio sessions to catch the three failure modes that bite voice agents in production: **barge-in regressions, hallucinated grounding, and latency creep**. Real TypeScript code from our [voice agent platform](/products), real numbers from ~280k monthly sessions, no hand-waving.

## The Realtime Stack in 2026 — What Actually Changed

The 2025-era pattern of stitching STT + LLM + TTS together with three round trips is dead for any latency-sensitive use case. The pinned model snapshot we run today is `gpt-realtime-2025-08-28`, which speaks audio in and audio out over a single bidirectional session. The relevant primitives:

- **Transport.** WebRTC for browser/native clients (sub-200 ms median glass-to-glass), WebSocket for server-side bridges (Twilio, SIP). WebRTC's built-in jitter buffer and forward error correction make it the default for anything talking to a human ear; WebSocket is the right choice when you're already inside an audio bridge that handles loss for you.
- **Server-side VAD.** OpenAI's `server_vad` mode replaced our hand-rolled `webrtcvad` Python wrapper. Set `turn_detection.type = "server_vad"`, tune `silence_duration_ms` (we run 320 ms), and the model itself decides when the user has stopped talking (wire-level sketch after this list).
- **Native interruption.** When a user starts speaking while the model is mid-utterance, the SDK fires `audio_interrupted` and you cancel the in-flight response (`response.cancel` on the wire) so generation stops. Done correctly, the user never hears the model talk over them.
- **Tool calls inline.** Function calling works mid-response, so `get_appointment_slots()` fires while the model is still saying "let me check that for you."
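
If you're driving the WebSocket transport directly instead of going through the SDK, that VAD tuning is a single `session.update` event. A minimal wire-level sketch; the endpoint URL and auth header follow the standard Realtime API shape, so adjust to your deployment:

```ts
import WebSocket from "ws";

// Raw session.update over the WebSocket transport. The SDK's camelCase
// turnDetection config maps onto these snake_case wire fields.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime-2025-08-28",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.55,
          prefix_padding_ms: 200,
          silence_duration_ms: 320, // <250 ms cuts callers off; >400 ms reads as dead air
        },
      },
    })
  );
});
```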

If you've been running the cascaded pattern, the cognitive shift is that **the model is the conversation loop**, not a request/response endpoint. You connect once per call and keep the socket open for the duration.

## A Minimal Working Voice Agent

Here's the actual TypeScript we run on the server side of [CallSphere](/products), trimmed of platform-specific glue. The full version handles SIP audio framing, but the SDK shape is the same:

```ts
import { RealtimeAgent, RealtimeSession } from "@openai/agents/realtime";
import { tool } from "@openai/agents";
import { z } from "zod";

const getSlots = tool({
  name: "get_appointment_slots",
  description: "Return the next 5 open slots for a given clinic.",
  parameters: z.object({
    clinic_id: z.string(),
    date: z.string().describe("ISO date YYYY-MM-DD"),
  }),
  execute: async ({ clinic_id, date }) => {
    return await db.slots.findOpen({ clinic_id, date, limit: 5 });
  },
});

const receptionist = new RealtimeAgent({
  name: "Receptionist",
  instructions: `You are a warm, concise medical-office receptionist.
Always confirm the patient's name and date of birth before scheduling.
Never quote prices unless the tool returned a price.`,
  tools: [getSlots],
});

const session = new RealtimeSession(receptionist, {
  model: "gpt-realtime-2025-08-28",
  config: {
    audio: {
      input: { format: "pcm16", sampleRate: 24000 },
      output: { format: "pcm16", voice: "marin" },
    },
    turnDetection: {
      type: "server_vad",
      threshold: 0.55,
      silenceDurationMs: 320,
      prefixPaddingMs: 200,
    },
  },
});

await session.connect({ apiKey: process.env.OPENAI_API_KEY! });

session.on("audio_interrupted", () => session.cancelResponse());
session.on("history_updated", (h) => persistTurn(callId, h.at(-1)));
session.on("error", (e) => log.error({ callId, e }, "realtime error"));
```

Three things to notice. First, `turnDetection` is where most "the bot keeps talking over me" bugs live — `silenceDurationMs` below 250 ms causes premature cutoffs, above 400 ms causes awkward dead air. Second, the `audio_interrupted` handler is non-negotiable; without it, barge-in feels broken even though the model technically supports it. Third, we persist every turn to Postgres keyed by `callId` so the eval pipeline (next section) has something to chew on.
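
`persistTurn` is part of the platform glue we trimmed out. A minimal sketch of what it can look like, assuming a node-postgres pool and a `turns` table (both illustrative, not our actual schema):

```ts
import { Pool } from "pg";

const pool = new Pool(); // connection config comes from the standard PG* env vars

// Illustrative schema: turns(call_id text, payload jsonb, created_at timestamptz)
export async function persistTurn(callId: string, turn: unknown): Promise<void> {
  await pool.query(
    "INSERT INTO turns (call_id, payload, created_at) VALUES ($1, $2, now())",
    [callId, JSON.stringify(turn)]
  );
}
```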

## Multi-Agent Handoffs Over Realtime

Real production calls are not one agent. A tier-1 receptionist hands off to a billing specialist when the user mentions a balance, or to a human when intent confidence drops. The Agents SDK realtime path supports this via `handoff()`:

```ts
import { handoff } from "@openai/agents";

const billing = new RealtimeAgent({
  name: "BillingSpecialist",
  instructions: "Handle balance, copay, and payment plan questions only.",
  tools: [getBalance, takePayment],
});

const receptionistWithHandoff = new RealtimeAgent({
  name: "Receptionist",
  instructions: `...
If the caller asks about a balance, copay, or payment, hand off to BillingSpecialist.`,
  tools: [getSlots],
  handoffs: [handoff(billing)],
});
```

The handoff is a tool call under the hood — the model decides, the SDK rewires the session to the target agent's instructions and tools without dropping the audio socket. Latency penalty: ~120 ms median for the swap. Worth knowing because if you have four agents in a chain, those add up.

## The Runtime Loop Plus the Parallel Eval Lane

```mermaid
flowchart LR
  U[Caller audio] -->|WebRTC/WebSocket| RT[Realtime Session]
  RT --> M[gpt-realtime-2025-08-28]
  M -->|tool call| T[Tools/DB/RAG]
  T --> M
  M -->|audio out| U
  RT -->|persist| P[(Postgres turns + audio refs)]
  P --> R[Eval Replay Runner]
  R --> S[STT WER]
  R --> G[Grounding judge]
  R --> B[Barge-in checker]
  R --> L[Latency p50/p95]
  S & G & B & L --> D[Eval Dashboard]
  D -->|regression| AL[Slack alert]
  style M fill:#dff
  style D fill:#ffd
  style AL fill:#fcc
```

*Figure 1 — Live runtime is the left-hand loop; everything downstream of the Postgres store is offline replay against persisted audio.*

The crucial design choice: **evals never run in the hot path**. The realtime session writes audio chunks and turn metadata to S3 + Postgres, and a batch worker replays them later. This keeps the call-quality budget intact while still giving you a 24-hour feedback loop on quality regressions.
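
The batch worker itself can be as boring as a nightly cron job. A sketch, with `listSessionsSince`, `runEvaluators`, and `saveScores` as illustrative stand-ins for your own storage and evaluator wiring:

```ts
import { replay } from "./replay";

// Illustrative stand-ins; wire these to your own Postgres/S3 and evaluators.
declare function listSessionsSince(since: Date): Promise<string[]>;
declare function runEvaluators(
  result: Awaited<ReturnType<typeof replay>>
): Promise<Record<string, number>>;
declare function saveScores(id: string, scores: Record<string, number>): Promise<void>;

// Nightly worker: replays yesterday's sessions against the current agent
// build and scores them, entirely outside the call hot path.
export async function runNightlyEvals(): Promise<void> {
  const since = new Date(Date.now() - 24 * 60 * 60 * 1000);
  for (const id of await listSessionsSince(since)) {
    const result = await replay(id);
    const scores = await runEvaluators(result); // WER, grounding, barge-in, latency
    await saveScores(id, scores);
  }
}
```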

## The Eval Pipeline — Four Things It Has to Catch

Voice agents fail in ways text agents don't. Our eval suite is structured around the four highest-cost failure modes we've seen on real production traffic:

| Failure mode | Symptom | Detector | Cost when missed |
|---|---|---|---|
| STT mishears | Agent confidently answers the wrong question | WER vs. human transcript | High — wrong appointments booked |
| Hallucinated grounding | Agent quotes a price/policy not in the tool result | RAG groundedness judge | Very high — legal/PR risk |
| Barge-in failure | Agent talks over the user | Audio overlap detector | Medium — drives drop-off |
| Latency creep | Time-to-first-audio creeps past 800 ms | Span timing on `response.created` | High — abandonment |

### Replaying Recorded Sessions

The replay runner is a Node script that reads a session ID, pulls the caller-side audio from S3, and pipes it back into a fresh `RealtimeSession` running the *current* agent build. Because the audio is bit-identical to what production heard, the only variable is the agent code.

```ts
import { RealtimeSession } from "@openai/agents/realtime";
import { buildAgent } from "../src/agents/receptionist";
import { readPcmFromS3 } from "./s3";

export async function replay(sessionId: string) {
  const audio = await readPcmFromS3(`sessions/${sessionId}/caller.pcm`);
  const refTranscript = await loadHumanTranscript(sessionId);

  const session = new RealtimeSession(buildAgent(), {
    model: "gpt-realtime-2025-08-28",
    config: { audio: { input: { format: "pcm16", sampleRate: 24000 } } },
  });

  const startedAt = Date.now();
  let firstAudioMs: number | null = null;

  session.on("response.audio.delta", () => {
    firstAudioMs ??= Date.now() - startedAt;
  });

  await session.connect({ apiKey: process.env.OPENAI_API_KEY! });
  await session.sendAudio(audio); // streams the recorded caller audio
  const out = await session.waitForCompletion();

  return {
    transcript: out.transcript,
    audioOut: out.audioBuffer,
    firstAudioMs,
    refTranscript,
  };
}
```

### Computing WER On the Replay

```ts
import { wordErrorRate } from "./metrics/wer";

const { transcript, refTranscript } = await replay(id);
const wer = wordErrorRate(refTranscript, transcript.user);
// On our regression suite, WER > 0.10 blocks the release
```

We use the same `wer` implementation against both the user-side transcript (catches STT regressions) and the agent-side transcript (catches TTS phoneme drift after voice changes). On 220 replayed sessions, our current build sits at WER 0.041 user-side, 0.012 agent-side.
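
`wordErrorRate` itself is ordinary word-level Levenshtein. A minimal sketch consistent with how we call it above (reference first, hypothesis second):

```ts
// metrics/wer.ts (sketch): word-level edit distance divided by reference length.
export function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  // prev[j] holds the edit distance between the first i ref words and first j hyp words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1) // substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```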

### The Grounding Judge

This is the most expensive evaluator and the one that has saved us the most. We run a separate `gpt-4o-2024-08-06` judge over each turn where the agent quoted a number, a date, or a policy:

```ts
const judgePrompt = `
You are auditing a voice agent transcript. For each agent statement that asserts a fact (price, date, policy), check whether the supporting tool result is in the provided context. Output JSON: { grounded: boolean, evidence: string }.
`;
```

A grounding rate below 0.95 blocks merge. The judge's agreement with humans, calibrated quarterly, sits at 0.91 — high enough to trust, low enough that we still spot-check.
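
Wiring the judge is a plain, non-realtime chat call per flagged turn. A sketch using the standard `openai` Node client; the `toolResults` argument and the prompt import are assumptions about how you store turns and prompts:

```ts
import OpenAI from "openai";
import { judgePrompt } from "./judge-prompt"; // the audit prompt shown above

const client = new OpenAI();

type Verdict = { grounded: boolean; evidence: string };

// Judge one agent statement against the tool results that were in scope.
export async function judgeTurn(agentText: string, toolResults: string): Promise<Verdict> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    response_format: { type: "json_object" },
    messages: [
      { role: "system", content: judgePrompt },
      { role: "user", content: `Context:\n${toolResults}\n\nAgent said:\n${agentText}` },
    ],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}") as Verdict;
}
```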

### Barge-In Checker

The detector is dead simple but easy to forget. We diff the user audio energy timeline against the agent audio energy timeline; any window where both exceed a VAD threshold for >150 ms counts as an overlap, and we hold each build to an acceptable overlap rate per call.
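
A minimal sketch of that overlap check, assuming 24 kHz mono PCM16 buffers for both sides; the 20 ms frame size and the default RMS threshold are illustrative, not tuned values:

```ts
// Flags windows where caller and agent audio are simultaneously above an
// energy threshold for longer than 150 ms.
const SAMPLE_RATE = 24_000;
const FRAME_MS = 20;
const FRAME_SAMPLES = (SAMPLE_RATE * FRAME_MS) / 1000; // 480 samples per frame

// RMS energy of one PCM16 frame.
function rms(frame: Int16Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

// Total milliseconds of overlap, counting only runs longer than 150 ms.
export function overlapMs(
  user: Int16Array,
  agent: Int16Array,
  threshold = 500 // illustrative RMS threshold, not a tuned value
): number {
  const frames = Math.floor(Math.min(user.length, agent.length) / FRAME_SAMPLES);
  let total = 0;
  let run = 0;
  for (let i = 0; i < frames; i++) {
    const start = i * FRAME_SAMPLES;
    const end = start + FRAME_SAMPLES;
    const bothHot =
      rms(user.subarray(start, end)) > threshold &&
      rms(agent.subarray(start, end)) > threshold;
    if (bothHot) {
      run += FRAME_MS;
    } else {
      if (run > 150) total += run;
      run = 0;
    }
  }
  if (run > 150) total += run;
  return total;
}
```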

### Latency p50/p95

The replay runner already captures `firstAudioMs` per session; the dashboard aggregates p50/p95 time-to-first-audio across the suite and flags any build that pushes p95 past the 800 ms budget from the failure-mode table.

## FAQ

### How do I handle PII in recorded sessions?

Recordings and transcripts are redacted before they're checked into the dataset repo. The judge sees the redacted version.
### How do I version the model when OpenAI ships a new realtime snapshot?

Pin the snapshot ID (`gpt-realtime-2025-08-28`) in code, not the floating alias. When a new snapshot drops, run the full eval suite against it on a branch, write down the score deltas per evaluator, and decide as a team whether to upgrade. We've upgraded twice this year; once we accepted a 0.4-point dip in tone score in exchange for a 90 ms latency improvement, and once we held off because grounding regressed by 1.2 points.
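
One pattern that keeps the pin auditable is a single constants module that the runtime, the replay runner, and CI all import. A trivial sketch:

```ts
// model.ts: the only place a snapshot ID is allowed to appear. Upgrades
// land as a one-line PR whose description carries the eval score deltas.
export const REALTIME_MODEL = "gpt-realtime-2025-08-28" as const;
export const JUDGE_MODEL = "gpt-4o-2024-08-06" as const;
```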

