Synthetic Monitoring for Voice Agents: Checkly, Datadog, and Building Your Own
Real users generate noise. Synthetic checks generate signal. Here's how to run a fake voice call against your agent every minute and catch regressions before customers do.
TL;DR — Real-traffic SLOs detect regressions late. A 1-minute synthetic call detects them in 60 seconds. Combine both.
What goes wrong
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]
Synthetic monitoring is well understood for HTTP: Datadog Synthetics and Checkly let you run a Playwright script every minute and alert on failure. The same idea applied to voice is rarer, because nobody ships an "audio Playwright." A real synthetic voice check has to place a phone call (or open a WebRTC peer), play a pre-recorded utterance, score the agent's response, and report metrics.
Without it, your first signal of a regression is a real customer call — at which point the bad experience is already shipped.
How to monitor
A synthetic voice check should test:
- Connect path — phone number rings, call is answered, audio negotiates.
- First-token latency — how long until the agent speaks back.
- Intent match — does the agent's first reply match the expected intent for the test utterance.
- Transactional path — can the agent complete a known booking/transfer flow.
- Cost — do not exceed N tokens or M cents per check.
Run one synthetic per vertical every minute. Run a longer transactional check every 15 minutes. Page on three consecutive failures.
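The five dimensions above can be folded into a single verdict per run. What follows is a minimal sketch, not CallSphere's schema: `CheckResult`, its field names, and the 20-cent cost cap are hypothetical placeholders for your own N and M, and the 1200 ms latency budget simply reuses the FTL threshold quoted in the Implementation section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckResult:
    """One synthetic run. Field names are illustrative only."""
    connected: bool            # call was answered and audio negotiated
    ftl_ms: Optional[float]    # first-token latency; None if the agent never spoke
    intent_ok: bool            # first reply matched the expected intent
    flow_ok: Optional[bool]    # transactional flow result; None when not exercised
    cost_cents: float          # spend attributed to this check


def failures(r: CheckResult, max_ftl_ms: float = 1200.0,
             max_cost_cents: float = 20.0) -> List[str]:
    """Return the failed dimensions; an empty list means the check passed."""
    if not r.connected:
        # Nothing downstream is meaningful if the call never connected.
        return ["connect"]
    failed = []
    if r.ftl_ms is None or r.ftl_ms > max_ftl_ms:
        failed.append("first_token_latency")
    if not r.intent_ok:
        failed.append("intent")
    if r.flow_ok is False:
        failed.append("transaction")
    if r.cost_cents > max_cost_cents:
        failed.append("cost")
    return failed
```

Returning the list of failed dimensions, rather than a bare boolean, lets the alert say which part of the call path broke.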
CallSphere stack
CallSphere built its own synthetic harness because off-the-shelf doesn't do voice well in 2026. Architecture:
- Caller bot in Go using Pion WebRTC and a pre-recorded Opus utterance.
- STT scoring via Deepgram (cheap and fast for synthetic).
- Intent classifier via gpt-4o-mini judging "did the response match expected intent."
- Result posted to a Postgres synthetic_results table; metrics scraped by Prometheus.
We run six synthetics every minute (one per vertical) plus three transactional flows every 15 minutes:
- Healthcare (FastAPI :8084): synthetic calls 555-0100, says "I need to verify my insurance," expects intent insurance_verification.
- Real Estate: synthetic asks "do you have a 3-bedroom listing in Austin?" expects intent property_search and a successful tool call to the listings DB.
- Sales: synthetic plays the pricing question; checks that the agent quotes $149 / $499 / $1499 from /pricing.
- After-hours (Bull/Redis queue): synthetic schedules a callback and verifies the queued job exists.
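The sales check comes down to comparing dollar amounts in the transcript against the expected price list. A rough sketch, assuming the transcript is plain text with prices written as digits (STT output may spell numbers out instead, which this does not handle); `extract_dollar_amounts` and `quotes_expected_prices` are hypothetical helper names:

```python
import re
from typing import Iterable, Set

EXPECTED_PRICES = {149, 499, 1499}  # tiers named in the sales check above


def extract_dollar_amounts(transcript: str) -> Set[int]:
    """Pull whole-dollar amounts such as $149 or $1,499 out of a transcript."""
    amounts = set()
    # Try comma-grouped numbers first, then fall back to a plain digit run.
    for match in re.findall(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)", transcript):
        amounts.add(int(match.replace(",", "")))
    return amounts


def quotes_expected_prices(transcript: str,
                           expected: Iterable[int] = EXPECTED_PRICES) -> bool:
    """True only if every expected price appears somewhere in the transcript."""
    return set(expected) <= extract_dollar_amounts(transcript)
```

A deterministic check like this is cheaper than an LLM judge for facts that must match exactly, and it fails loudly when the agent quotes a stale price.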
Costs: ~$3.20/day per vertical for STT + gpt-4o-mini judging. Cheap enough to run forever.
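As a sanity check on that figure, the per-check and fleet-wide scoring costs follow directly from the numbers in the text. The only assumptions here are one check per minute around the clock and a 30-day month:

```python
CHECKS_PER_DAY = 24 * 60           # one check per minute
COST_PER_VERTICAL_PER_DAY = 3.20   # USD, STT + gpt-4o-mini judging (figure from the text)
VERTICALS = 6

per_check = COST_PER_VERTICAL_PER_DAY / CHECKS_PER_DAY
per_day_all = COST_PER_VERTICAL_PER_DAY * VERTICALS
per_month_all = per_day_all * 30   # assuming a 30-day month

print(f"per check: ${per_check:.4f}")
print(f"all verticals, per day: ${per_day_all:.2f}")
print(f"all verticals, 30 days: ${per_month_all:.2f}")
```

That works out to roughly $0.0022 per check, $19.20 per day, and $576.00 per 30 days across six verticals. This covers scoring only, not the model spend on the call itself.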
We expose the synthetic dashboard publicly at status.callsphere.ai. The $1499 enterprise tier adds per-tenant synthetics. Try the 14-day trial.
Implementation
- Caller bot in Go opening a WebRTC peer to your edge.
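// Errors are discarded here for brevity; check them in real code.
pc, _ := webrtc.NewPeerConnection(cfg) // cfg is a webrtc.Configuration (ICE servers, etc.)
audioTrack, _ := webrtc.NewTrackLocalStaticSample(...) // Opus codec capability, track ID, stream ID
pc.AddTrack(audioTrack)
// playOpus is your own helper that paces Opus samples from the fixture onto the track.
go playOpus(audioTrack, "fixtures/insurance_q.opus")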
- Capture agent audio, hand to Deepgram, score:
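import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

text = deepgram.transcribe(agent_audio)  # stand-in for your Deepgram STT wrapper, not a real SDK call
verdict = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"Does this response answer 'insurance verification'? Reply yes or no.\n\n{text}"}],
)
# The API returns a response object; extract the reply text and reduce it to a boolean.
intent_ok = verdict.choices[0].message.content.strip().lower().startswith("yes")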
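The capture step is also where the first-token latency persisted as ftl_ms has to be measured. One way to do it, sketched under assumptions: agent audio arrives as timestamped 16-bit little-endian mono PCM frames, and "the agent is speaking" means RMS energy above a threshold. The frame shape, the threshold of 500, and both function names are illustrative, not part of any library.

```python
import math
import struct
from typing import Iterable, Optional, Tuple


def rms_pcm16(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def first_token_latency_ms(
    utterance_end_ms: float,
    frames: Iterable[Tuple[float, bytes]],
    threshold: float = 500.0,
) -> Optional[float]:
    """Time from the end of the test utterance to the first non-silent agent frame.

    frames yields (timestamp_ms, pcm16_bytes) in arrival order.
    Returns None if the agent never rises above the threshold.
    """
    for ts_ms, frame in frames:
        if ts_ms >= utterance_end_ms and rms_pcm16(frame) >= threshold:
            return ts_ms - utterance_end_ms
    return None
```

An energy threshold is crude (line noise or comfort noise can trip it), so a proper voice-activity detector may be worth swapping in once the basic check is running.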
- Persist + alert.
INSERT INTO synthetic_results (vertical, ftl_ms, intent_ok, ts)
VALUES ('healthcare', 720, true, NOW());
Prometheus alerting rules fire on 3 consecutive failures or FTL p95 > 1200 ms; Alertmanager routes the resulting page.
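The two conditions are easy to state in plain code, which helps when unit-testing the logic before encoding it as an alerting rule. This is a sketch in Python rather than Prometheus rule syntax, assuming run outcomes arrive oldest-first and using a nearest-rank p95 over whatever window you pass in; the function names are hypothetical.

```python
from typing import Sequence


def consecutive_failures(outcomes: Sequence[bool]) -> int:
    """Count failures at the tail of the run history (True = pass), newest last."""
    count = 0
    for ok in reversed(outcomes):
        if ok:
            break
        count += 1
    return count


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def should_page(outcomes: Sequence[bool], ftl_ms: Sequence[float],
                max_failures: int = 3, ftl_p95_budget_ms: float = 1200.0) -> bool:
    """Page on N consecutive failures, or when FTL p95 exceeds the budget."""
    if consecutive_failures(outcomes) >= max_failures:
        return True
    return bool(ftl_ms) and p95(ftl_ms) > ftl_p95_budget_ms
```

Requiring consecutive failures keeps a single dropped call or carrier blip from paging anyone.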
Replay on regression. Every failed synthetic auto-creates a Linear ticket with the audio and the trace.
FAQ
Q: Can I use Datadog Synthetics for voice?
A: Their browser test can hit a WebRTC page; not a clean fit for SIP/PSTN. We use Datadog Synthetics for our HTTP APIs and homemade for voice.
Q: How realistic should the test utterance be?
A: Use real recorded voices, not TTS. TTS hits the model differently and gives misleadingly high scores.
Q: Won't synthetics inflate my OpenAI bill?
A: We see ~$0.15/check on gpt-4o-realtime. Six verticals × 1440 checks/day = 8,640 checks/day, or ~$1,300/day at that rate. Worth it.
Q: How do I keep synthetics out of business metrics?
A: Tag every synthetic call with x-synthetic: true on the SIP INVITE; filter from analytics rollups.
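On the analytics side, the filter is short once the tag survives into the call record. A sketch assuming each record is a dict that keeps the SIP headers it arrived with; the `sip_headers` key and the record shape are hypothetical. Header names are compared case-insensitively, since SIP header field names are case-insensitive.

```python
from typing import Dict, Iterable, Iterator


def is_synthetic(call: Dict) -> bool:
    """True if the call carried x-synthetic: true on its INVITE."""
    headers = {
        k.lower(): str(v).strip().lower()
        for k, v in call.get("sip_headers", {}).items()
    }
    return headers.get("x-synthetic") == "true"


def real_calls(calls: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only customer calls, for use in business-metric rollups."""
    return (c for c in calls if not is_synthetic(c))
```

Filtering at rollup time, rather than dropping synthetic records at ingest, keeps the raw data available for debugging the synthetics themselves.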
Q: What about Checkly?
A: Great for HTTP/Playwright API checks (we use it for our /api/admin/* routes). Not voice.