---
title: "Headless WebRTC Testing for AI Voice Agents: Playwright vs Selenium (2026)"
description: "How to test an AI voice agent end-to-end with a headless browser, fake audio, and a deterministic harness in 2026. Playwright wins; here is the production setup."
canonical: https://callsphere.ai/blog/vw3e-headless-webrtc-testing-playwright-selenium-2026
category: "AI Engineering"
tags: ["WebRTC", "Testing", "Playwright", "Selenium", "Voice AI"]
author: "CallSphere Team"
published: 2026-03-27T00:00:00.000Z
updated: 2026-05-07T09:59:24.690Z
---

# Headless WebRTC Testing for AI Voice Agents: Playwright vs Selenium (2026)

> You cannot ship an AI voice agent that is only tested by humans. Headless browsers with fake-audio capture and deterministic media injection are the 2026 way to run thousands of voice scenarios in CI.

## Why headless tests for AI voice

A full-stack voice agent test exercises: ephemeral token mint, ICE gathering, SDP exchange, DataChannel function calls, audio playback, and end-of-turn detection. None of that runs reliably against an HTTP mock. You need a real browser, real WebRTC, and a real way to feed audio in and capture audio out.

In 2026 the practical answer is Playwright with Chromium. Selenium still works (and Selenium's WebDriver BiDi has caught up), but Playwright's auto-wait, tracing, and built-in browser launch flags make it 5–10x less flaky for media tests. New greenfield projects in 2026 should not start on Selenium unless your team has heavy Java or C# investment, or Safari is in your test matrix and you need legacy WebDriver paths.

The tooling has shifted twice in two years: first away from Puppeteer (which Playwright outgrew on cross-browser support), then toward MCP-driven test harnesses where Claude or Cursor invokes browser tools directly. Both still rely on the same fake-mic primitives.

## Architecture pattern

```mermaid
flowchart LR
  CI[CI runner] -- launch headless --> Chromium
  Chromium -- fake mic --> WAV[(scripted audio)]
  Chromium -- WebRTC --> Agent[AI voice agent]
  Agent -- audio --> Chromium
  Chromium -- captured WAV --> Asserts[Whisper / golden checks]
```

The browser is launched with two key Chromium flags: `--use-fake-ui-for-media-stream` (auto-grants mic permission) and `--use-file-for-fake-audio-capture=/path/to/in.wav` (substitutes the mic with a WAV). Outbound audio is captured by recording a remote `MediaStreamTrack` to a `MediaRecorder`.
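
Because every scenario needs exactly the same launch arguments apart from the fixture path, it helps to centralize them. A minimal sketch (the helper name and paths are illustrative, not part of any API):

```typescript
// Assemble the Chromium flags for deterministic fake-audio tests.
// Only the fixture WAV path varies per scenario; everything else is constant.
function fakeAudioArgs(fixtureWav: string): string[] {
  return [
    "--use-fake-ui-for-media-stream",                  // auto-grant the mic permission prompt
    "--use-fake-device-for-media-stream",              // swap real devices for fake ones
    `--use-file-for-fake-audio-capture=${fixtureWav}`, // play this WAV as the mic input
    "--autoplay-policy=no-user-gesture-required",      // let agent audio play unattended
  ];
}

console.log(fakeAudioArgs("fixtures/in.wav")[2]);
// --use-file-for-fake-audio-capture=fixtures/in.wav
```

Passing the helper's output to `chromium.launch({ args })` keeps the flag set identical across every test file.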

## CallSphere implementation

CallSphere runs ~600 headless voice scenarios per night across the six verticals (real estate, healthcare, behavioral health, legal, salon, insurance). The pipeline:

- Playwright matrix runs Chromium + Firefox; one Chromium worker per vertical.
- Each test injects a vertical-specific WAV ("I would like to schedule a tour at 222 Main"), runs the conversation through OpenAI Realtime, and captures the agent's audio.
- We transcribe captured audio with Whisper and assert keywords plus a tool-call event log captured from the DataChannel.
- For Real Estate (OneRoof, [/industries/real-estate](/industries/real-estate)), the test additionally verifies that the Pion gateway (Go 1.23) emitted the right NATS event and that the 6-container pod (CRM, MLS, calendar, SMS, audit, transcript) created a CRM record.
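
The keyword-assertion step above can be sketched as a small pure function. This is illustrative, not CallSphere's actual harness; matching is case-insensitive so minor LLM wording drift does not break the test:

```typescript
// Check a Whisper transcript for every expected keyword or phrase.
// Returns the list of keywords NOT found, so [] means the assertion passes.
function missingKeywords(transcript: string, keywords: string[]): string[] {
  const haystack = transcript.toLowerCase();
  return keywords.filter((k) => !haystack.includes(k.toLowerCase()));
}

const transcript = "Sure, I can schedule a tour at 222 Main for Saturday.";
console.log(missingKeywords(transcript, ["schedule", "222 Main"])); // []
console.log(missingKeywords(transcript, ["insurance"]));            // [ 'insurance' ]
```

A test then asserts `missingKeywords(...)` is empty rather than comparing full transcripts.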

Across 37 agents, 90+ tools, and 115+ database tables, this has caught regressions in tool registries, latency budgets, and SDP munging within hours of merge. SOC 2 and HIPAA test fixtures use synthetic data only. Pricing is $149/$499/$1,499 with a 14-day trial; affiliates earn 22% — see [/affiliate](/affiliate).

## Code snippet (Playwright)

```ts
import { test, expect, chromium } from "@playwright/test";

test("real-estate agent schedules tour", async () => {
  const browser = await chromium.launch({
    args: [
      "--use-fake-ui-for-media-stream",             // auto-grant the mic permission prompt
      "--use-fake-device-for-media-stream",         // swap real devices for fake ones
      "--use-file-for-fake-audio-capture=fixtures/realestate-schedule.wav",
      "--autoplay-policy=no-user-gesture-required", // let agent audio play without a click
    ],
  });
  try {
    const ctx = await browser.newContext();
    const page = await ctx.newPage();
    await page.goto("https://callsphere.ai/demo?vertical=real-estate");
    await page.click("button[data-test=start-call]");
    // wait until the DataChannel emits a confirmed tool call
    const event = await page.waitForFunction(() => (window as any).__lastToolCall);
    expect(await event.jsonValue()).toMatchObject({ name: "schedule_showing" });
  } finally {
    await browser.close(); // avoid leaking PeerConnections across CI workers
  }
});
```
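
For the Chromium + Firefox matrix described earlier, the runner setup can live in `playwright.config.ts`. A minimal sketch; the worker count, trace policy, and the `WEEKLY` environment variable are illustrative assumptions, not CallSphere's actual config:

```typescript
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  workers: 6,                            // one Chromium worker per vertical
  use: { trace: "retain-on-failure" },   // keep traces as postmortem artifacts
  projects: [
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    // Firefox runs on the weekly drift pass only, gated behind an env var.
    ...(process.env.WEEKLY
      ? [{ name: "firefox", use: { ...devices["Desktop Firefox"] } }]
      : []),
  ],
});
```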

## Build steps

1. Use Playwright (default) or Selenium 4 with WebDriver BiDi if you must.
2. Launch Chromium with fake-mic flags; never let CI runners depend on real hardware.
3. Pre-record fixture WAVs at 16 kHz mono — that is what most STT pipelines use.
4. Capture the remote audio track via `new MediaRecorder(remoteStream)`; dump WebM into a tmp file.
5. Transcribe captured audio with a small Whisper model and assert phrases, not exact text.
6. Snapshot the DataChannel event stream into a JSON log; diff against a golden log per scenario.
7. Run the same matrix on Firefox once a week to catch browser drift.
8. Upload the captured audio + DataChannel log as CI artifacts; engineers need them for postmortems.
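
The golden-log diff in step 6 can be sketched as a pure comparison that ignores nondeterministic fields. The `ToolCall` shape and the decision to compare only names and arguments are assumptions for illustration:

```typescript
// Compare a captured DataChannel tool-call log against a golden log.
// Only deterministic fields (tool name, ordered args) are compared;
// timestamps and call IDs are deliberately ignored.
type ToolCall = { name: string; args: Record<string, unknown> };

function diffToolCalls(actual: ToolCall[], golden: ToolCall[]): string[] {
  const diffs: string[] = [];
  const n = Math.max(actual.length, golden.length);
  for (let i = 0; i < n; i++) {
    const a = actual[i], g = golden[i];
    if (!a || !g) { diffs.push(`call ${i}: missing in ${a ? "golden" : "actual"}`); continue; }
    if (a.name !== g.name) diffs.push(`call ${i}: ${a.name} != ${g.name}`);
    else if (JSON.stringify(a.args) !== JSON.stringify(g.args))
      diffs.push(`call ${i}: args differ for ${a.name}`);
  }
  return diffs;
}

const golden: ToolCall[] = [{ name: "schedule_showing", args: { address: "222 Main" } }];
console.log(diffToolCalls(golden, golden)); // []
```

An empty diff passes; a non-empty diff is attached to the CI failure alongside the captured WebM.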

## Common pitfalls

- **Headed mode in CI** — slow, flaky, and depends on a display server. Always headless.
- **Real microphone in CI** — picks up runner-host noise; deterministic fixtures only.
- **Comparing exact LLM text** — temperature drift breaks tests. Assert tool calls (deterministic) or keyword presence.
- **Skipping audio capture** — without it, you only test that the agent did not crash, not that it actually spoke.
- **Forgetting to teardown** — leaked PeerConnections accumulate and OOM long-running CI workers.

## FAQ

**Why not curl the model directly?** That misses ICE, SDP, DataChannel ordering, and audio jitter — exactly the layers that break in production.

**Does WebRTC work in headless Chrome?** Yes — JavaScript execution, WebRTC, and service workers all behave the same as in headed mode.

**What about Safari?** WebKit's headless support lags. Run a manual cross-browser pass weekly; do not block CI on Safari.

**How do I avoid LLM nondeterminism?** Use `temperature: 0` for tests, lock prompts, and assert on tool calls (deterministic) rather than wording.

**Can I run hundreds in parallel?** Yes — one Chromium per worker, ~250 MB RAM each. We run 32 in parallel on a single c6i.8xlarge.

**Does Selenium support fake mic?** Yes via Chromium options — but Playwright handles them more cleanly.

**What about MCP-driven tests?** Playwright MCP is great for ad-hoc exploration; for repeatable CI use the standard Playwright runner.

**Can I record real users?** Not without consent. For QA fixtures, synthesize with a TTS pass instead of recording prospects.

## Production playbook for AI voice teams in 2026

Three rules from running 600 nightly scenarios:

1. **Per-vertical fixtures.** A real-estate scenario and a healthcare scenario need different audio, different tools, different golden logs. Share the runner, not the data.
2. **Whisper assertions, not transcript equality.** Assert that the agent said the right tool name and a few keywords; never demand exact wording, because the model will drift.
3. **Snapshot the WebM.** When a test fails, the captured audio is the smoking gun 80% of the time. Cheap to keep, expensive to lack.

We also run a weekly chaos pass: random jitter and packet loss injected via `tc netem` on the runner. Catches degradation we would otherwise only see in P99 customer reports.
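
As a minimal sketch of the chaos pass, a helper can build the `tc netem` command from per-run parameters (the function name and the specific delay/loss values are illustrative; the command must run as root on the CI runner):

```typescript
// Build a `tc netem` command that injects latency, jitter, and packet loss
// on the given network interface for a chaos-pass run.
function netemCommand(iface: string, delayMs: number, jitterMs: number, lossPct: number): string {
  return `tc qdisc add dev ${iface} root netem delay ${delayMs}ms ${jitterMs}ms loss ${lossPct}%`;
}

console.log(netemCommand("eth0", 80, 30, 2));
// tc qdisc add dev eth0 root netem delay 80ms 30ms loss 2%
```

Tear the qdisc down after the run (`tc qdisc del dev eth0 root`) so subsequent jobs see a clean network.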

## Watch list 2026

- **WebDriver BiDi** is finally cross-browser as of 2026; Selenium 4.20+ and Playwright both speak it. For media tests, BiDi removes the last reason to keep CDP-only paths.
- **MCP-driven Playwright** lets agents drive the browser. We use it for exploratory QA but not CI; the determinism is not there yet.
- **Headless Safari** is graduating from "experimental" but still missing fake-mic flags; do not bet CI on it.
- **WebRTC.ventures' QA framework** for AI voice — the only public reference architecture for this class of test as of 2026.

## Sources

- [https://playwright.dev/](https://playwright.dev/)
- [https://webrtc.org/getting-started/testing](https://webrtc.org/getting-started/testing)
- [https://webrtc.ventures/2026/03/qa-testing-for-ai-voice-agents/](https://webrtc.ventures/2026/03/qa-testing-for-ai-voice-agents/)
- [https://antmedia.io/webrtc-testing-with-selenium/](https://antmedia.io/webrtc-testing-with-selenium/)
- [https://helpmetest.com/blog/headless-chrome/](https://helpmetest.com/blog/headless-chrome/)
- [https://www.daily.co/blog/how-to-make-a-headless-robot-to-test-webrtc-in-your-daily-app/](https://www.daily.co/blog/how-to-make-a-headless-robot-to-test-webrtc-in-your-daily-app/)

Try the production agent on [/demo](/demo), check [/pricing](/pricing), or start a [/trial](/trial).

---

Source: https://callsphere.ai/blog/vw3e-headless-webrtc-testing-playwright-selenium-2026
