By Sagar Shankaran, Founder of CallSphere
Picking the right SLIs for a voice agent is harder than picking SLIs for a REST API. Here's how CallSphere defines first-token latency, intent accuracy, and call-success rate across six verticals.
Key takeaways
TL;DR — A voice agent that's "up" but takes 2.4s to start speaking is worse than one that's down. Pick SLIs that capture conversational quality, not just HTTP 200s.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]

The classic SRE playbook says: pick a few SLIs, set targets, track error budgets. For a stateless API, "request success rate at p99 < 200ms" works. For a voice agent, that single SLI hides every failure that matters. We've watched calls succeed at the HTTP layer while the agent stayed silent for 4 seconds, then hallucinated a price, then hung up. Every byte was 200 OK.
The root cause is that voice agents have multiple layers of "success." The WebSocket can be healthy while the model is slow. The model can be fast while the speech-to-text is wrong. STT can be perfect while the TTS voice picks the wrong language. Picking one SLI hides three other failure modes. ITU-T G.114 recommends keeping one-way delay at or below 150 ms for interactive speech, and conversational turn-taking degrades noticeably past roughly 300 ms. Yet most teams never measure first-token-out at the audio-frame level.
A production voice agent needs at least four SLIs, each with its own SLO and error budget: first-token latency (FTL), conversation success rate, intent accuracy, and audio uptime.
Latency SLIs should be measured at percentiles, not means. LLM cost and latency distributions are right-skewed: a handful of long calls drags the mean up until it no longer reflects the 80% of users who had a great experience.
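A small worked example makes the skew concrete. The sketch below uses made-up latencies (16 fast turns and 4 slow ones, not CallSphere data) and a nearest-rank percentile helper that is an illustrative choice, not necessarily how any particular metrics backend computes percentiles.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical first-token latencies in ms: 80% of turns are fast, 20% are slow.
ftl_ms = [400] * 16 + [3000] * 4

mean_ms = statistics.mean(ftl_ms)  # pulled up by the four slow turns
p50_ms = percentile(ftl_ms, 50)    # what the typical caller experienced
p95_ms = percentile(ftl_ms, 95)    # what the slow tail experienced

print(mean_ms, p50_ms, p95_ms)
```

Here the mean lands at 920 ms, a latency no caller actually saw, while p50 (400 ms) and p95 (3000 ms) describe the two real populations.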
CallSphere runs six vertical voice agents on a single k3s cluster behind Cloudflare Tunnel. We track all four SLIs per vertical because the targets differ wildly:
:8084): FTL p95 must be under 700 ms because clinicians barge in fast; the intent accuracy floor is 96% because a wrong med name is a P1.

With 37 agents and 90+ tools across 115+ DB tables, we keep a per-agent SLO file in Postgres and emit gauges to Prometheus on every span. Customers on /pricing plans ($149 / $499 / $1499) get visibility into their own SLO dashboard; agency partners on the /affiliate plan get aggregated rollups.
# slos/healthcare.yaml
slis:
  ftl_ms:
    type: latency
    objective: p95_lt_700
    window: 28d
  conv_success:
    type: ratio
    numerator: events.completion_ok
    denominator: events.call_started
    objective: 95
  intent_accuracy:
    type: ratio
    sample_rate: 0.01
    judge: gpt-4o-mini
    objective: 96
  audio_uptime:
    type: availability
    gap_threshold_ms: 500
    objective: 99.5
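One way to read the objectives in the SLO file above is sketched here. It assumes that `p95_lt_700` means "95th percentile strictly below 700 ms" and that a ratio objective of `95` means "at least 95% good events". Both readings, and the helper names `latency_met` and `ratio_met`, are assumptions for illustration; the file's real evaluator is not shown in this post.

```python
import math
import re

def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_met(objective, samples_ms):
    """Check an objective written as p<percentile>_lt_<milliseconds> (assumed grammar)."""
    match = re.fullmatch(r"p(\d+)_lt_(\d+)", objective)
    if match is None:
        raise ValueError(f"unrecognised latency objective: {objective!r}")
    pct, limit_ms = int(match.group(1)), int(match.group(2))
    return percentile(samples_ms, pct) < limit_ms

def ratio_met(objective_pct, good, total):
    """Check a ratio objective given as a percentage, e.g. 95."""
    if total == 0:
        return True  # assumption: no traffic means nothing was violated
    return 100 * good / total >= objective_pct

# Hypothetical measurements for one window.
print(latency_met("p95_lt_700", [300] * 19 + [900]))  # one slow turn in twenty
print(ratio_met(95, good=960, total=1000))
```

Keeping the objective as data rather than code is what lets the same evaluator run against every vertical's file with different targets.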
Emit one OpenTelemetry span per turn — gen_ai.agent.turn — with attributes gen_ai.usage.input_tokens, callsphere.ftl_ms, callsphere.intent_match. Use the OTel GenAI semconv where it exists; namespace your custom ones.
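A minimal sketch of the per-turn span follows. The span name and attribute keys come from the step above; the helper names, the value types (for example, treating `callsphere.intent_match` as a boolean) and the tracer name `voice-agent-sketch` are assumptions. The OpenTelemetry import is optional so the snippet still runs where the package is not installed.

```python
def turn_attributes(input_tokens, ftl_ms, intent_match):
    """Build the attribute set for one gen_ai.agent.turn span."""
    return {
        "gen_ai.usage.input_tokens": int(input_tokens),  # OTel GenAI semconv key
        "callsphere.ftl_ms": float(ftl_ms),              # custom, namespaced
        "callsphere.intent_match": bool(intent_match),   # custom; boolean type is assumed
    }

def emit_turn_span(attributes):
    """Emit the span if the OpenTelemetry API is available; return whether it was emitted."""
    try:
        from opentelemetry import trace
    except ImportError:
        return False  # OTel not installed; a caller could fall back to logging
    tracer = trace.get_tracer("voice-agent-sketch")
    # A real agent would wrap the turn's work in this block so the span
    # duration covers the turn; here the span is opened and closed immediately.
    with tracer.start_as_current_span("gen_ai.agent.turn", attributes=attributes):
        pass
    return True

attrs = turn_attributes(input_tokens=812, ftl_ms=640, intent_match=True)
emit_turn_span(attrs)
```

Without an SDK and exporter configured, the OpenTelemetry API hands back a no-op tracer, so the call is safe but nothing is exported.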
Compute SLIs in a 1-minute rollup job. Don't compute on the read path — Prometheus will OOM. Use a Postgres CTE or a Materialize view. CallSphere uses a Postgres scheduled function that writes to a sli_rollups table.
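The rollup logic itself is small. The sketch below does it in memory in Python rather than in a Postgres function, purely to show the shape of a row; the tuple layout `(epoch_seconds, ftl_ms, ok)` and the output field names are hypothetical, not the actual `sli_rollups` schema.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def rollup_by_minute(turns):
    """Group (epoch_seconds, ftl_ms, ok) tuples into one SLI row per minute."""
    buckets = defaultdict(list)
    for ts, ftl_ms, ok in turns:
        minute_start = int(ts) // 60 * 60  # truncate to the start of the minute
        buckets[minute_start].append((ftl_ms, ok))
    rows = {}
    for minute_start, items in sorted(buckets.items()):
        latencies = [ftl for ftl, _ in items]
        rows[minute_start] = {
            "turns": len(items),
            "ok_turns": sum(1 for _, ok in items if ok),
            "ftl_p95_ms": percentile(latencies, 95),
        }
    return rows

# Hypothetical turns: three in the first minute, one in the second.
turns = [(0, 300, True), (30, 500, True), (59, 900, False), (60, 400, True)]
rows = rollup_by_minute(turns)
print(rows)
```

Storing raw counts (`turns`, `ok_turns`) rather than a precomputed ratio lets longer windows be summed exactly. Percentiles do not combine that way, so a 28-day p95 has to be recomputed from the samples or from a histogram, not averaged from per-minute p95 values.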
Wire SLOs into deploys. Block deploys when the rolling 7-day error budget is < 25% remaining. We use an OPA policy in our k3s admission controller.
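The gate reduces to one number: the fraction of error budget left. The following is a Python sketch of that arithmetic for a ratio SLO, not the OPA policy mentioned above; the function names are hypothetical and only the 25% threshold comes from the text.

```python
def budget_remaining(objective_pct, good, total):
    """Fraction of the error budget left for a ratio SLO over one window (0.0 to 1.0)."""
    if total == 0:
        return 1.0  # assumption: no traffic consumes no budget
    allowed_bad = total * (100 - objective_pct) / 100
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0  # a 100% objective has no budget to spend
    return max(0.0, 1 - actual_bad / allowed_bad)

def deploy_allowed(objective_pct, good, total, min_remaining=0.25):
    """Block the deploy when less than min_remaining of the budget is left."""
    return budget_remaining(objective_pct, good, total) >= min_remaining

# Hypothetical 7-day window: 10,000 calls against a 95% objective allows 500 failures.
print(budget_remaining(95, good=9800, total=10000))  # 200 of 500 failures used
print(deploy_allowed(95, good=9600, total=10000))    # 400 of 500 failures used
```

With 200 failures the budget is about 60% intact and the deploy proceeds; with 400 failures only about 20% is left, which falls under the 25% threshold and blocks it.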
Show the user. A live SLO board on /admin/sre is non-optional — your team will only respect SLOs they can see.
Q: Why p95 instead of p99 for first-token latency? A: p99 in voice is dominated by network tail (mobile-radio re-attach, ICE restart). Track it, but alert on p95 — it's the more honest signal of your code.
Q: Can I use only an HTTP 5xx rate? A: No. We've seen agents return 200 with empty audio for 30 seconds. Use a turn-level success ratio instead.
Q: How do I sample for intent accuracy without leaking PII? A: Sample 1% of turns, redact PII at the trace exporter (we use Microsoft Presidio), and run an LLM-as-judge with the redacted text. Spot-check 10/week with a human.
Q: What about TTS voice quality? A: It's a real SLI, but it's hard to measure cheaply. Use synthetic monitoring (see the synthetic post) with MOS-style scoring.
Q: Do I need separate SLOs per customer? A: For the $1499 enterprise tier, yes. We carve out per-tenant SLOs. The $149 starter tier gets the global SLO. Try it on the 14-day trial.
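For the 1% intent-accuracy sampling described in the answers above, one option is to sample deterministically by hashing a turn identifier, so the same turn is always in or out of the sample no matter which process checks. This is a sketch of that idea only; the `turn_id` format is hypothetical, and PII redaction (done with Microsoft Presidio in the answer above) is deliberately left out.

```python
import hashlib

def sampled(turn_id, rate=0.01):
    """Return True for a stable, pseudo-random fraction `rate` of turn IDs."""
    digest = hashlib.sha256(turn_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)           # integer compare avoids float edge cases

# Hypothetical turn IDs; the decision for a given ID never changes between calls.
picked = [f"turn-{i}" for i in range(100_000) if sampled(f"turn-{i}")]
print(len(picked))
```

Hash-based sampling also means the judge job and the human spot-check can agree on which turns were selected without sharing any state.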
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.