AI Engineering

Headless WebRTC Testing for AI Voice Agents: Playwright vs Selenium (2026)

How to test an AI voice agent end-to-end with a headless browser, fake audio, and a deterministic harness in 2026. Playwright wins; here is the production setup.

You cannot ship an AI voice agent that is only tested by humans. Headless browsers with fake-audio capture and deterministic media injection are the 2026 way to run thousands of voice scenarios in CI.

Why headless tests for AI voice

A full-stack voice agent test exercises: ephemeral token mint, ICE gathering, SDP exchange, DataChannel function calls, audio playback, and end-of-turn detection. None of that runs reliably against an HTTP mock. You need a real browser, real WebRTC, and a real way to feed audio in and capture audio out.

In 2026 the practical answer is Playwright with Chromium. Selenium still works (and Selenium's WebDriver BiDi has caught up), but Playwright's auto-wait, tracing, and built-in browser launch flags make it 5–10x less flaky for media tests. New greenfield projects in 2026 should not start on Selenium unless your team has heavy Java or C# investment, or Safari is in your test matrix and you need legacy WebDriver paths.

The model has shifted twice in two years: first away from Puppeteer (which Playwright outgrew on cross-browser support), then toward MCP-driven test harnesses where Claude or Cursor invoke browser tools directly. Both still rely on the same fake-mic primitives.

Architecture pattern

```mermaid
flowchart LR
  CI[CI runner] -- launch headless --> Chromium
  Chromium -- fake mic --> WAV[(scripted audio)]
  Chromium -- WebRTC --> Agent[AI voice agent]
  Agent -- audio --> Chromium
  Chromium -- captured WAV --> Asserts[Whisper / golden checks]
```

The browser is launched with two key Chromium flags: `--use-fake-ui-for-media-stream` (auto-grants mic permission) and `--use-file-for-fake-audio-capture=/path/to/in.wav` (substitutes the mic with a WAV). Outbound audio is captured by recording a remote `MediaStreamTrack` to a `MediaRecorder`.
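Because each scenario supplies its own fixture WAV, the flag list gets rebuilt per test. A small helper keeps it in one place — `fakeAudioArgs` is a hypothetical local function, not a Playwright API:

```typescript
// Build the Chromium launch flags for one scenario's fixture WAV.
// fakeAudioArgs is a local helper (illustrative), not part of Playwright.
function fakeAudioArgs(wavPath: string): string[] {
  return [
    "--use-fake-ui-for-media-stream", // auto-grant mic permission
    "--use-fake-device-for-media-stream", // expose a fake capture device
    `--use-file-for-fake-audio-capture=${wavPath}`, // play the WAV as the mic
    "--autoplay-policy=no-user-gesture-required", // let agent audio play back
  ];
}

const launchArgs = fakeAudioArgs("fixtures/in.wav");
console.log(launchArgs.join("\n"));
```

The array plugs straight into `chromium.launch({ args: launchArgs })`.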


CallSphere implementation

CallSphere runs ~600 headless voice scenarios per night across the six verticals (real estate, healthcare, behavioral health, legal, salon, insurance). The pipeline:

  • Playwright matrix runs Chromium + Firefox; one Chromium worker per vertical.
  • Each test injects a vertical-specific WAV ("I would like to schedule a tour at 222 Main"), runs the conversation through OpenAI Realtime, and captures the agent's audio.
  • We transcribe captured audio with Whisper and assert keywords plus a tool-call event log captured from the DataChannel.
  • For real estate (OneRoof, /industries/real-estate), the test additionally verifies that the Pion Go gateway (1.23) emitted the right NATS event and that the 6-container pod (CRM, MLS, calendar, SMS, audit, transcript) created a CRM record.
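The keyword-assertion step in that pipeline reduces to a pure string check. A sketch — `assertKeywords` and the sample transcript are illustrative, not CallSphere's actual code:

```typescript
// Return the keywords missing from a Whisper transcript (case-insensitive).
// Asserting on keywords, not exact text, survives LLM wording drift.
function assertKeywords(transcript: string, keywords: string[]): string[] {
  const lower = transcript.toLowerCase();
  return keywords.filter((k) => !lower.includes(k.toLowerCase()));
}

const transcript = "Sure, I can schedule a tour at 222 Main Street tomorrow at 3 PM.";
const missing = assertKeywords(transcript, ["schedule", "tour", "222 Main"]);
console.log(missing.length === 0 ? "PASS" : `MISSING: ${missing.join(", ")}`);
```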

Across 37 agents, 90+ tools, and 115+ database tables, this has caught regressions in tool registries, latency budgets, and SDP munging within hours of a merge. SOC 2 and HIPAA test fixtures use synthetic data only. Pricing is $149/$499/$1,499 with a 14-day trial; affiliates earn 22% — see /affiliate.

Code snippet (Playwright)

```ts
import { test, expect, chromium } from "@playwright/test";

test("real-estate agent schedules tour", async () => {
  const browser = await chromium.launch({
    args: [
      "--use-fake-ui-for-media-stream",
      "--use-fake-device-for-media-stream",
      "--use-file-for-fake-audio-capture=fixtures/realestate-schedule.wav",
      "--autoplay-policy=no-user-gesture-required",
    ],
  });
  const ctx = await browser.newContext();
  const page = await ctx.newPage();
  await page.goto("https://callsphere.ai/demo?vertical=real-estate");
  await page.click("button[data-test=start-call]");
  // wait until the DataChannel emits a confirmed tool call
  const event = await page.waitForFunction(() => (window as any).__lastToolCall);
  expect(await event.jsonValue()).toMatchObject({ name: "schedule_showing" });
  await browser.close();
});
```

Build steps

  1. Use Playwright (default) or Selenium 4 with WebDriver BiDi if you must.
  2. Launch Chromium with fake-mic flags; never let CI runners depend on real hardware.
  3. Pre-record fixture WAVs at 16 kHz mono — that is what most STT pipelines use.
  4. Capture the remote audio track via `new MediaRecorder(remoteStream)`; dump WebM into a tmp file.
  5. Transcribe captured audio with a small Whisper model and assert phrases, not exact text.
  6. Snapshot the DataChannel event stream into a JSON log; diff against a golden log per scenario.
  7. Run the same matrix on Firefox once a week to catch browser drift.
  8. Upload the captured audio + DataChannel log as CI artifacts; engineers need them for postmortems.
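Step 6 — diffing the DataChannel event stream against a golden log — is a deterministic comparison once nondeterministic fields are stripped. A sketch, where the `ToolEvent` type and field names are assumptions (real logs will differ):

```typescript
// Diff a captured DataChannel event log against a golden log,
// ignoring nondeterministic fields (timestamps, ids, arg values).
type ToolEvent = { name: string; args?: Record<string, unknown>; ts?: number; id?: string };

// Keep only the deterministic shape: tool name + sorted argument keys.
function normalize(events: ToolEvent[]): string[] {
  return events.map((e) => `${e.name}(${Object.keys(e.args ?? {}).sort().join(",")})`);
}

function diffAgainstGolden(captured: ToolEvent[], golden: ToolEvent[]): string[] {
  const a = normalize(captured);
  const b = normalize(golden);
  const mismatches: string[] = [];
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    if (a[i] !== b[i]) mismatches.push(`event ${i}: got ${a[i] ?? "<none>"}, want ${b[i] ?? "<none>"}`);
  }
  return mismatches;
}

const golden: ToolEvent[] = [
  { name: "lookup_listing", args: { address: "" } },
  { name: "schedule_showing", args: { address: "", time: "" } },
];
const captured: ToolEvent[] = [
  { name: "lookup_listing", args: { address: "222 Main" }, ts: 1712, id: "a1" },
  { name: "schedule_showing", args: { address: "222 Main", time: "3pm" }, ts: 1713, id: "a2" },
];
console.log(diffAgainstGolden(captured, golden).length === 0 ? "MATCH" : "DIFF");
```

Argument values stay out of the normalized form because the model phrases them differently run to run; the tool name and argument shape are stable.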

Common pitfalls

  • Headed mode in CI — slow, flaky, and depends on a display server. Always headless.
  • Real microphone in CI — picks up runner-host noise; deterministic fixtures only.
  • Comparing exact LLM text — temperature drift breaks tests. Assert tool calls (deterministic) or keyword presence.
  • Skipping audio capture — without it, you only test that the agent did not crash, not that it actually spoke.
  • Skipping teardown — leaked PeerConnections accumulate and eventually OOM long-running CI workers.
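One way to make teardown hard to forget is a cleanup stack: register a callback as each resource is created, then unwind in reverse order after every test. This is an illustrative sketch — real teardowns like `browser.close()` are async, kept synchronous here for brevity:

```typescript
// Register cleanups as resources are created; unwind LIFO after each test
// so pages close before contexts, and contexts before browsers.
const cleanups: Array<() => void> = [];

function onCleanup(fn: () => void): void {
  cleanups.push(fn);
}

function runCleanups(): void {
  while (cleanups.length) cleanups.pop()!();
}

// Simulated resources, registered in creation order.
const closed: string[] = [];
onCleanup(() => closed.push("browser"));
onCleanup(() => closed.push("context"));
onCleanup(() => closed.push("page"));
runCleanups();
console.log(closed.join(" -> "));
```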

FAQ

Why not curl the model directly? That misses ICE, SDP, DataChannel ordering, and audio jitter — exactly the layers that break in production.

Does WebRTC work in headless Chrome? Yes — JavaScript execution, WebRTC, and service workers all behave the same as in headed mode.

What about Safari? WebKit's headless support lags. Run a manual cross-browser pass weekly; do not block CI on Safari.


How do I avoid LLM nondeterminism? Use `temperature: 0` for tests, lock prompts, and assert on tool calls (deterministic) rather than wording.

Can I run hundreds in parallel? Yes — one Chromium per worker, ~250 MB RAM each. We run 32 in parallel on a single c6i.8xlarge.
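A minimal `playwright.config.ts` for that setup might look like the following — the worker count, timeout, and trace policy are assumptions to tune per runner, not CallSphere's published config:

```ts
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  workers: 32, // one Chromium per worker, ~250 MB each
  fullyParallel: true,
  timeout: 120_000, // voice turns are slow; give each scenario two minutes
  use: {
    ...devices["Desktop Chrome"],
    launchOptions: {
      args: [
        "--use-fake-ui-for-media-stream",
        "--use-fake-device-for-media-stream",
      ],
    },
    trace: "retain-on-failure", // keep traces as CI artifacts for postmortems
  },
});
```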

Does Selenium support fake mic? Yes via Chromium options — but Playwright handles them more cleanly.

What about MCP-driven tests? Playwright MCP is great for ad-hoc exploration; for repeatable CI use the standard Playwright runner.

Can I record real users? Not without consent. For QA fixtures, synthesize with a TTS pass instead of recording prospects.

Production playbook for AI voice teams in 2026

Three rules from running 600 nightly scenarios:

  1. Per-vertical fixtures. A real-estate scenario and a healthcare scenario need different audio, different tools, different golden logs. Share the runner, not the data.
  2. Whisper assertions, not transcript equality. Assert that the agent said the right tool name and a few keywords; never demand exact wording — the model will drift.
  3. Snapshot the WebM. When a test fails, the captured audio is the smoking gun 80% of the time. Cheap to keep, expensive to lack.

We also run a weekly chaos pass: random jitter and packet loss injected via `tc netem` on the runner. Catches degradation we would otherwise only see in P99 customer reports.

Watch list 2026

  • WebDriver BiDi is finally cross-browser as of 2026; Selenium 4.20+ and Playwright both speak it. For media tests, BiDi removes the last reason to keep CDP-only paths.
  • MCP-driven Playwright lets agents drive the browser. We use it for exploratory QA but not CI; the determinism is not there yet.
  • Headless Safari is graduating from "experimental" but still missing fake-mic flags; do not bet CI on it.
  • WebRTC.ventures' QA framework for AI voice — the only public reference architecture for this class of test as of 2026.

Try the production agent on /demo, check /pricing, or start a /trial.
