---
title: "SLO and SLI Definitions for AI Voice Agents: Latency, Accuracy, Uptime"
description: "Picking the right SLIs for a voice agent is harder than picking SLIs for a REST API. Here's how CallSphere defines first-token latency, intent accuracy, and call-success rate across six verticals."
canonical: https://callsphere.ai/blog/vw3c-slo-sli-definitions-ai-voice-agents-latency-accuracy
category: "AI Engineering"
tags: ["SRE", "SLO", "SLI", "Voice AI", "Reliability"]
author: "CallSphere Team"
published: 2026-03-15T00:00:00.000Z
updated: 2026-05-07T09:59:38.156Z
---

# SLO and SLI Definitions for AI Voice Agents: Latency, Accuracy, Uptime

> Picking the right SLIs for a voice agent is harder than picking SLIs for a REST API. Here's how CallSphere defines first-token latency, intent accuracy, and call-success rate across six verticals.

> **TL;DR** — A voice agent that's "up" but takes 2.4s to start speaking is worse than one that's down. Pick SLIs that capture *conversational* quality, not just HTTP 200s.

## What goes wrong

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

*CallSphere reference architecture*

The classic SRE playbook says: pick a few SLIs, set targets, track error budgets. For a stateless API, "request success rate" plus "p99 latency" mostly covers it. A voice agent needs SLIs a load balancer can't see. CallSphere tracks four:

- **First-token latency (FTL)** — time from end of user speech to the first audio byte back, measured per turn.
- **Conversational success rate** — calls that reach a completion event over calls started.
- **Intent accuracy** — sampled turns where the agent's resolved intent matches the caller's, scored offline.
- **Audio uptime** — fraction of call time with continuous audio (no > 500ms gaps). Target: 99.5%.

Latency SLIs should be measured at percentiles, not means. LLM cost and latency distributions are right-skewed — a handful of long calls drag the mean up and hide the 80% of users who had a great experience.
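A quick numeric illustration of that skew, with made-up numbers:

```python
# Right-skewed latency distribution: a small tail of slow calls drags the
# mean far above what most callers experienced. Values are illustrative.
from statistics import mean, quantiles

latencies_ms = [300] * 81 + [400] * 15 + [6000] * 4  # 4% pathological calls

avg = mean(latencies_ms)
p95 = quantiles(latencies_ms, n=100, method="inclusive")[94]

print(round(avg))  # 543  -- inflated by the 4% tail
print(p95)         # 400.0 -- what the 95th-percentile caller actually saw
```

Four percent of calls nearly doubled the mean while p95 barely moved; alerting on the mean would page you for a tail problem, not a typical-caller problem.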

## CallSphere stack

CallSphere runs six vertical voice agents on a single k3s cluster behind Cloudflare Tunnel. We track all four SLIs per vertical because the targets differ wildly:

- **Healthcare** (FastAPI on `:8084`) — FTL p95 must be under 700ms because clinicians barge-in fast; intent accuracy floor is 96% because a wrong med name is a P1.
- **Real Estate** — 6-container pod with NATS for tool-calling fan-out; FTL relaxes to 1000ms because lead-qualification calls tolerate it, but conversational success has to clear 92% to hit our 22% affiliate payout SLA.
- **Sales** — WebSocket gateway on PM2 with 8 workers; intent accuracy is the dominant SLI because mis-quoting a price violates our 14-day trial guarantee.
- **After-hours** — Bull/Redis queue, async by design, so the SLI shifts to "voicemail processed within 60s" instead of FTL.

37 agents and 90+ tools across 115+ DB tables means we keep a per-agent SLO file in Postgres and emit gauges to Prometheus on every span. Customers on `/pricing` plans ($149 / $499 / $1499) get visibility into their own SLO dashboard; agency partners on the `/affiliate` plan get aggregated rollups.

## Implementation

1. **Define your SLI dictionary in code.** One YAML file per vertical, version-controlled.

```yaml
# slos/healthcare.yaml
slis:
  ftl_ms:
    type: latency
    objective: p95_lt_700
    window: 28d
  conv_success:
    type: ratio
    numerator: events.completion_ok
    denominator: events.call_started
    objective: 95
  intent_accuracy:
    type: ratio
    sample_rate: 0.01
    judge: gpt-4o-mini
    objective: 96
  audio_uptime:
    type: availability
    gap_threshold_ms: 500
    objective: 99.5
```
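A minimal sketch of how a dictionary like this could be checked against one rollup window. The `p95_lt_700`-style objective-string parsing mirrors the YAML above but is an assumption about this illustrative format, not CallSphere's actual loader:

```python
# Minimal SLI check for one rollup window. Objective strings ("p95_lt_700")
# follow the YAML sketch above; the evaluator itself is illustrative.
from statistics import quantiles

def p95(samples_ms):
    # 95th percentile via inclusive interpolation
    return quantiles(samples_ms, n=100, method="inclusive")[94]

def latency_ok(samples_ms, objective):
    pct, op, bound = objective.split("_")  # e.g. ("p95", "lt", "700")
    assert pct == "p95" and op == "lt"
    return p95(samples_ms) < float(bound)

def ratio_ok(numerator, denominator, objective_pct):
    # Ratio SLIs (conv_success, intent_accuracy) compare against a percent floor
    return 100.0 * numerator / denominator >= objective_pct

# One window for the healthcare vertical: FTL samples and event counts
ftl = [420, 460, 480, 505, 510, 550, 580, 610, 640, 690]
print(latency_ok(ftl, "p95_lt_700"))  # True: p95 is at most 690ms, under 700ms
print(ratio_ok(962, 1000, 95))        # conv_success 96.2% >= 95 -> True
```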

2. **Emit one OpenTelemetry span per turn** — `gen_ai.agent.turn` — with attributes `gen_ai.usage.input_tokens`, `callsphere.ftl_ms`, `callsphere.intent_match`. Use the OTel GenAI semconv where it exists; namespace your custom ones.
3. **Compute SLIs in a 1-minute rollup job.** Don't compute on the read path — Prometheus will OOM. Use a Postgres CTE or a Materialize view. CallSphere uses a Postgres scheduled function that writes to a `sli_rollups` table.
4. **Wire SLOs into deploys.** Block deploys when the rolling 7-day error budget is < 25% remaining. We use an OPA policy in our k3s admission controller.
5. **Show the user.** A live SLO board on `/admin/sre` is non-optional — your team will only respect SLOs they can see.
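The budget arithmetic behind the deploy gate in step 4 can be sketched as follows. The 25% floor matches the policy above; the event counts are illustrative, and the real check lives in the OPA admission policy, not application code:

```python
# Error-budget deploy gate sketch. Given an availability-style objective and
# a window of good/total events, compute how much budget is left and gate on it.

def budget_remaining(objective_pct, good_events, total_events):
    """Fraction of the window's error budget still unspent, clamped at 0."""
    allowed_bad = total_events * (1 - objective_pct / 100.0)
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return max(0.0, 1.0 - actual_bad / allowed_bad)

def deploy_allowed(objective_pct, good, total, floor=0.25):
    return budget_remaining(objective_pct, good, total) >= floor

# 7-day window: 10,000 calls at a 99.5% objective -> 50 bad calls allowed
print(deploy_allowed(99.5, 9970, 10000))  # 30 bad of 50 -> ~40% left -> True
print(deploy_allowed(99.5, 9960, 10000))  # 40 bad of 50 -> ~20% left -> False
```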

## FAQ

**Q: Why p95 instead of p99 for first-token latency?**
A: p99 in voice is dominated by network tail (mobile-radio re-attach, ICE restart). Track it, but alert on p95 — it's the more honest signal of your code.

**Q: Can I use only an HTTP 5xx rate?**
A: No. We've seen agents return 200 with empty audio for 30 seconds. Use a turn-level success ratio instead.

**Q: How do I sample for intent accuracy without leaking PII?**
A: Sample 1% of turns, redact PII at the trace exporter (we use Microsoft Presidio), and run an LLM-as-judge with the redacted text. Spot-check 10/week with a human.
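One sketch of the 1% sampling step, assuming you want it deterministic: hash the turn id so re-processing the same trace makes the same keep/skip decision every time (the `turn_id` format is hypothetical):

```python
# Deterministic 1% sampling: hash the turn id into [0, 1) and compare to the
# rate, so the same turn is always sampled (or not) regardless of worker.
import hashlib

def sampled(turn_id: str, rate: float = 0.01) -> bool:
    digest = hashlib.sha256(turn_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

print(sampled("call-123/turn-7"))  # stable across runs for the same id
```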

**Q: What about TTS voice quality?**
A: It's a real SLI, but it's hard to measure cheaply. Use synthetic monitoring (see the synthetic post) with MOS-style scoring.

**Q: Do I need separate SLOs per customer?**
A: For $1499 enterprise tier, yes. We carve out per-tenant SLOs. $149 starter gets the global SLO. Try it on the [14-day trial](/trial).

## Sources

- [Datadog — Best practices for managing your SLOs](https://www.datadoghq.com/blog/define-and-manage-slos/)
- [Grafana Cloud — SLI examples for latency](https://grafana.com/docs/grafana-cloud/alerting-and-irm/slo/sli-examples/latency/)
- [New Relic — SLOs, SLIs, and SLAs](https://newrelic.com/blog/observability/what-are-slos-slis-slas)
- [Atlassian — SLA vs SLO vs SLI](https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli)

