TL;DR — A voice agent that's "up" but takes 2.4s to start speaking is worse than one that's down. Pick SLIs that capture conversational quality, not just HTTP 200s.

What goes wrong

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]

CallSphere reference architecture

The classic SRE playbook says: pick a few SLIs, set targets, track error budgets. For a stateless API, "request success rate plus p99 latency under 200 ms" works. For a voice agent, that single SLI hides every failure that matters. We've watched calls succeed at the HTTP layer while the agent stayed silent for 4 seconds, then hallucinated a price, then hung up. Every response was 200 OK.

The root cause is that voice agents have multiple layers of "success." The WebSocket can be healthy while the model is slow. The model can be fast while the speech-to-text is wrong. STT can be perfect while the TTS voice picks the wrong language. Picking one SLI hides the other three failure modes. ITU-T G.114 recommends keeping one-way delay at or below 150 ms for conversation to feel transparent and treats 400 ms as the upper planning limit; in our experience, anything past 300 ms already breaks the human-perceptible turn-taking loop. Yet most teams never measure first-token-out at the audio frame level.

How to monitor

A production voice agent needs at least four SLIs, each with its own SLO and error budget:

First-token latency (FTL) — milliseconds from end-of-user-speech to first audio frame from the agent. Target: p95 < 800ms, p99 < 1500ms. This is the single most user-visible metric.
Conversational success rate — percent of calls that reach a defined "completion" event (booking confirmed, transfer succeeded, intent resolved). Target: 95% over a rolling 28-day window.
Intent accuracy — percent of utterances where the agent's chosen tool/intent matches the human-graded ground truth. Target: 92% on a sampled 1% of production traffic, judged by an LLM-as-judge plus weekly human spot-check.
Audio uptime — percent of session-seconds with continuous bidirectional audio (no >500ms gaps). Target: 99.5%.
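
As a rough illustration of the audio uptime definition above, the sketch below measures the share of session time not lost to gaps longer than 500 ms between consecutive audio frames. The function name `audio_uptime`, the list-of-timestamps input, and the choice to count an over-threshold gap as downtime in full are assumptions made for the example, not CallSphere's implementation.

```python
from typing import Sequence

GAP_THRESHOLD_S = 0.5  # 500 ms, the gap threshold from the SLI definition


def audio_uptime(frame_times_s: Sequence[float],
                 gap_threshold_s: float = GAP_THRESHOLD_S) -> float:
    """Return the fraction of session time not spent in over-threshold gaps.

    frame_times_s: sorted arrival times (seconds) of audio frames, one direction.
    Assumption: a gap longer than the threshold counts as downtime in full.
    """
    if len(frame_times_s) < 2:
        return 0.0  # no measurable audio stream
    session_s = frame_times_s[-1] - frame_times_s[0]
    if session_s <= 0:
        return 0.0
    downtime_s = 0.0
    for prev, cur in zip(frame_times_s, frame_times_s[1:]):
        gap = cur - prev
        if gap > gap_threshold_s:
            downtime_s += gap
    return 1.0 - downtime_s / session_s


# Hypothetical session: frames every 20 ms for 10 s, with one ~2 s dropout.
times = [i * 0.02 for i in range(200)]          # 0.00 .. 3.98 s
times += [6.0 + i * 0.02 for i in range(201)]   # 6.00 .. 10.00 s
print(f"{audio_uptime(times):.3f}")
```

Because the SLI is defined on bidirectional audio, one option is to run this once per direction and report the lower of the two values.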

Latency SLIs should be measured at percentiles, not means. LLM cost and latency distributions are right-skewed: a handful of long calls drags the mean up, so the mean misrepresents both the 80% of users who had a great experience and the tail that did not.
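
A small worked example may make the point concrete. The latencies below are invented for illustration, and `percentile` is a simple nearest-rank helper written for this sketch rather than a library call.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical first-token latencies in ms: 90 fast turns and 10 slow ones.
ftl_ms = [400] * 90 + [3000] * 10

mean_ms = statistics.mean(ftl_ms)   # pulled up by the slow turns
p50_ms = percentile(ftl_ms, 50)     # what the typical turn looked like
p95_ms = percentile(ftl_ms, 95)     # what the tail looked like
print(mean_ms, p50_ms, p95_ms)
```

With this sample the mean sits between the typical turn and the tail and matches neither, which is why an alert on the mean would pass a p95 < 800ms target that the p95 itself fails.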

CallSphere stack

CallSphere runs six vertical voice agents on a single k3s cluster behind Cloudflare Tunnel. We track all four SLIs per vertical because the targets differ wildly:

Healthcare (FastAPI on :8084) — FTL p95 must be under 700ms because clinicians barge-in fast; intent accuracy floor is 96% because a wrong med name is a P1.
Real Estate — 6-container pod with NATS for tool-calling fan-out; FTL relaxes to 1000ms because lead-qualification calls tolerate it, but conversational success has to clear 92% to hit our 22% affiliate payout SLA.
Sales — WebSocket gateway on PM2 with 8 workers; intent accuracy is the dominant SLI because mis-quoting a price violates our 14-day trial guarantee.
After-hours — Bull/Redis queue, async by design, so the SLI shifts to "voicemail processed within 60s" instead of FTL.

With 37 agents and 90+ tools across 115+ DB tables, we keep a per-agent SLO file in Postgres and emit gauges to Prometheus on every span. Customers on /pricing plans ($149 / $499 / $1499) get visibility into their own SLO dashboard; agency partners on the /affiliate plan get aggregated rollups.

Implementation

Define your SLI dictionary in code. One YAML file per vertical, version-controlled.

# slos/healthcare.yaml
slis:
  ftl_ms:
    type: latency
    objective: p95_lt_700
    window: 28d
  conv_success:
    type: ratio
    numerator: events.completion_ok
    denominator: events.call_started
    objective: 95
  intent_accuracy:
    type: ratio
    sample_rate: 0.01
    judge: gpt-4o-mini
    objective: 96
  audio_uptime:
    type: availability
    gap_threshold_ms: 500
    objective: 99.5

Emit one OpenTelemetry span per turn — gen_ai.agent.turn — with attributes gen_ai.usage.input_tokens, callsphere.ftl_ms, callsphere.intent_match. Use the OTel GenAI semconv where it exists; namespace your custom ones.
Compute SLIs in a 1-minute rollup job. Don't compute on the read path — Prometheus will OOM. Use a Postgres CTE or a Materialize view. CallSphere uses a Postgres scheduled function that writes to a sli_rollups table.
Wire SLOs into deploys. Block a deploy when less than 25% of the rolling 7-day error budget remains. We enforce this with an OPA policy in our k3s admission controller.

Show the user. A live SLO board on /admin/sre is non-optional — your team will only respect SLOs they can see.
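
The deploy gate in the steps above comes down to simple error-budget arithmetic. The sketch below shows that arithmetic for a ratio SLI such as conv_success. It is written in Python for readability and is not the OPA policy the text mentions; `RatioWindow`, `budget_remaining`, and `deploy_allowed` are hypothetical names, and the counts are made up.

```python
from dataclasses import dataclass


@dataclass
class RatioWindow:
    good: int    # e.g. events.completion_ok counted over the window
    total: int   # e.g. events.call_started counted over the window


def budget_remaining(window: RatioWindow, objective_pct: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if window.total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = window.total * (1 - objective_pct / 100)
    actual_bad = window.total - window.good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


def deploy_allowed(window: RatioWindow, objective_pct: float,
                   min_remaining: float = 0.25) -> bool:
    """Allow a deploy only while at least min_remaining of the budget is left."""
    return budget_remaining(window, objective_pct) >= min_remaining


# 10,000 calls at a 95% objective leaves a budget of 500 failed calls.
print(deploy_allowed(RatioWindow(good=9_600, total=10_000), 95))  # 400 failures spent
print(deploy_allowed(RatioWindow(good=9_700, total=10_000), 95))  # 300 failures spent
```

In the first case 80% of the budget is spent, leaving about 20%, so the gate blocks; in the second about 40% remains and the deploy goes through.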

FAQ

Q: Why p95 instead of p99 for first-token latency? A: p99 in voice is dominated by network tail (mobile-radio re-attach, ICE restart). Track it, but alert on p95 — it's the more honest signal of your code.

Q: Can I use only an HTTP 5xx rate? A: No. We've seen agents return 200 with empty audio for 30 seconds. Use a turn-level success ratio instead.

Q: How do I sample for intent accuracy without leaking PII? A: Sample 1% of turns, redact PII at the trace exporter (we use Microsoft Presidio), and run an LLM-as-judge with the redacted text. Spot-check 10/week with a human.
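
One way to pick the 1% sample is to hash a stable turn identifier, so the same turn is always in or out of the sample regardless of which worker sees it. This is a sketch under assumptions: the `turn_id` format and the `in_sample` helper are hypothetical, and PII redaction (Presidio in the answer above) would still happen separately at the exporter.

```python
import hashlib


def in_sample(turn_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministically select roughly sample_rate of turns by hashing the turn ID."""
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)     # keep the lowest sample_rate share


# Roughly 1% of 100,000 hypothetical turn IDs should be selected.
picked = sum(in_sample(f"turn-{i}") for i in range(100_000))
print(picked)
```

Hashing instead of calling a random number generator keeps the decision reproducible, which helps when a human spot-checker needs to find the same turn the judge scored.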

Q: What about TTS voice quality? A: It's a real SLI, but it's hard to measure cheaply. Use synthetic monitoring (see the synthetic post) with MOS-style scoring.

Q: Do I need separate SLOs per customer? A: For $1499 enterprise tier, yes. We carve out per-tenant SLOs. $149 starter gets the global SLO. Try it on the 14-day trial.
