AI Infrastructure

Synthetic Monitoring for Voice Agents: Checkly, Datadog, and Building Your Own

Real users generate noise. Synthetic checks generate signal. Here's how to run a fake voice call against your agent every minute and catch regressions before customers do.

TL;DR — Real-traffic SLOs detect regressions late. A 1-minute synthetic call detects them in 60 seconds. Combine both.

What goes wrong

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

CallSphere reference architecture

Synthetic monitoring is well-understood for HTTP — Datadog Synthetics and Checkly let you run a Playwright script every minute and alert on failure. The same idea applied to voice is rarer, because nobody ships an "audio Playwright." A real synthetic voice check has to: place a phone call (or open a WebRTC peer), play a pre-recorded utterance, score the agent's response, and report metrics.
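The four responsibilities of a voice check (connect, play, score, report) can be sketched as a small result record. This is illustrative only; the class and field names here are hypothetical, not CallSphere's actual schema:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """One synthetic voice check. Field names are illustrative."""
    vertical: str
    connected: bool      # did the call set up (PSTN answered / WebRTC negotiated)?
    first_token_ms: int  # time until the agent first spoke back
    intent_ok: bool      # did the reply match the expected intent?
    cost_usd: float      # STT + judge spend for this single check

    def passed(self, max_ftl_ms: int = 1200, max_cost: float = 0.25) -> bool:
        # A check passes only if every leg succeeded within budget.
        return (self.connected and self.intent_ok
                and self.first_token_ms <= max_ftl_ms
                and self.cost_usd <= max_cost)
```

Whatever shape you pick, keeping latency and cost as first-class fields (rather than logs) is what lets you alert on them later.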

Without it, your first signal of a regression is a real customer call — at which point the bad experience is already shipped.

How to monitor

A synthetic voice check should test:

  1. Connect path — phone number rings, call is answered, audio negotiates.
  2. First-token latency — how long until the agent speaks back.
  3. Intent match — does the agent's first reply match the expected intent for the test utterance.
  4. Transactional path — can the agent complete a known booking/transfer flow.
  5. Cost — do not exceed N tokens or M cents per check.

Run one synthetic per vertical every minute. Run a longer transactional check every 15 minutes. Page on three consecutive failures.
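The "page on three consecutive failures" rule is easy to get subtly wrong (paging on every failure after the third, or never resetting). A minimal sketch of the gate, in plain Python with hypothetical names:

```python
from collections import defaultdict

class FailureGate:
    """Page once when a vertical hits N consecutive failures; reset on pass."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.streaks = defaultdict(int)  # vertical -> current failure streak

    def record(self, vertical: str, passed: bool) -> bool:
        """Return True exactly when this result should trigger a page."""
        if passed:
            self.streaks[vertical] = 0
            return False
        self.streaks[vertical] += 1
        # Fire only as the streak crosses the threshold, not on every
        # subsequent failure -- the incident is already open.
        return self.streaks[vertical] == self.threshold
```

Firing only at the exact threshold crossing means an ongoing outage produces one page, not one per minute.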

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

CallSphere stack

CallSphere built its own synthetic harness because off-the-shelf tools still don't do voice well in 2026. Architecture:

  • Caller bot in Go using Pion WebRTC and a pre-recorded Opus utterance.
  • STT scoring via Deepgram (cheap and fast for synthetic).
  • Intent classifier via gpt-4o-mini judging "did the response match expected intent."
  • Result posted to a Postgres synthetic_results table; metrics scraped by Prometheus.

We run six synthetics every minute (one per vertical) plus three transactional flows every 15 minutes:

  • Healthcare — synthetic calls 555-0100 (the FastAPI bridge on :8084), says "I need to verify my insurance," expects intent insurance_verification.
  • Real Estate — synthetic asks "do you have a 3-bedroom listing in Austin?" expects intent property_search and a successful tool call to the listings DB.
  • Sales — synthetic plays the pricing question; checks that the agent quotes $149 / $499 / $1499 from /pricing.
  • After-hours Bull/Redis queue — synthetic schedules a callback and verifies the queued job exists.
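One way to keep checks like the four above maintainable is a table-driven registry rather than one script per vertical. A sketch, mirroring the examples in this post (the dict structure and `check_for` helper are illustrative, not CallSphere's actual code):

```python
# Each entry: which vertical, what to dial, which fixture to play,
# and what the judge should expect back.
CHECKS = [
    {"vertical": "healthcare", "dial": "555-0100",
     "fixture": "fixtures/insurance_q.opus",
     "expect_intent": "insurance_verification"},
    {"vertical": "real_estate", "dial": None,  # WebRTC leg, no PSTN number
     "fixture": "fixtures/listing_q.opus",
     "expect_intent": "property_search", "expect_tool": "listings_db"},
    {"vertical": "sales", "dial": None,
     "fixture": "fixtures/pricing_q.opus",
     "expect_intent": "pricing",
     "expect_phrases": ["$149", "$499", "$1499"]},
]

def check_for(vertical: str) -> dict:
    """Look up the check definition for a vertical."""
    return next(c for c in CHECKS if c["vertical"] == vertical)
```

Adding a vertical then means adding a row and recording a fixture, not writing a new bot.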

Costs: ~$3.20/day per vertical for STT + gpt-4o-mini judging. Cheap enough to run forever.

We expose the synthetic dashboard publicly at status.callsphere.ai. $1499 enterprise tier gets per-tenant synthetics. Try the 14-day trial.

Implementation

  1. Caller bot in Go opening a WebRTC peer to your edge:

```go
// Open a WebRTC peer and stream a pre-recorded Opus fixture at the agent.
pc, _ := webrtc.NewPeerConnection(cfg)
audioTrack, _ := webrtc.NewTrackLocalStaticSample(
	webrtc.RTPCodecCapability{MimeType: webrtc.MimeTypeOpus},
	"audio", "synthetic",
)
pc.AddTrack(audioTrack)
go playOpus(audioTrack, "fixtures/insurance_q.opus")
```
  2. Capture agent audio, hand to Deepgram, score:

```python
# Transcribe the agent's reply, then have gpt-4o-mini judge the intent.
text = deepgram.transcribe(agent_audio)
verdict = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Does this response answer 'insurance verification'? "
                   f"Reply yes or no.\n\n{text}",
    }],
)
intent_ok = verdict.choices[0].message.content.strip().lower().startswith("yes")
```
  3. Persist + alert:

```sql
INSERT INTO synthetic_results (vertical, ftl_ms, intent_ok, ts)
VALUES ('healthcare', 720, true, NOW());
```
  4. Alertmanager alerts on 3 consecutive failures or FTL p95 > 1200ms.


  5. Replay on regression. Every failed synthetic auto-creates a Linear ticket with the audio and the trace.
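The two alert conditions in step 4 (three consecutive intent failures, or first-token-latency p95 over budget) translate to a few lines of logic. In production this lives in Alertmanager/PromQL; this Python version is just a sketch of the predicate over recent `synthetic_results` rows:

```python
def should_page(results: list[dict], ftl_budget_ms: int = 1200) -> bool:
    """results: oldest-to-newest rows with 'intent_ok' and 'ftl_ms' keys."""
    recent = results[-3:]
    three_fails = len(recent) == 3 and all(not r["intent_ok"] for r in recent)
    # Nearest-rank p95 over the window.
    ftls = sorted(r["ftl_ms"] for r in results)
    p95 = ftls[min(len(ftls) - 1, int(0.95 * len(ftls)))]
    return three_fails or p95 > ftl_budget_ms
```

Evaluating p95 over the whole window (not the last sample) keeps a single slow call from paging anyone.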

FAQ

Q: Can I use Datadog Synthetics for voice? A: Their browser tests can hit a WebRTC page, but they're not a clean fit for SIP/PSTN. We use Datadog Synthetics for our HTTP APIs and our homemade harness for voice.

Q: How realistic should the test utterance be? A: Use real recorded voices, not TTS — TTS hits the model differently and gives misleadingly high scores.

Q: Won't synthetics inflate my OpenAI bill? A: We see ~$0.15/check on gpt-4o-realtime. Six verticals × 1440 checks/day × $0.15 ≈ $1,300/day across all. Worth it.

Q: How do I keep synthetics out of business metrics? A: Tag every synthetic call with x-synthetic: true on the SIP INVITE; filter from analytics rollups.

Q: What about Checkly? A: Great for HTTP/Playwright API checks (we use it for our /api/admin/* routes). Not voice.



Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.