---
title: "Voice Agent Error Budgets: Defining Acceptable Failure"
description: "An error budget is the unreliability you allow yourself in exchange for shipping. For voice agents, the budget is dollars and minutes — not just nines. Here's how CallSphere computes one."
canonical: https://callsphere.ai/blog/vw3c-voice-agent-error-budgets-acceptable-failure
category: "AI Engineering"
tags: ["Error Budget", "SRE", "Voice AI", "Reliability"]
author: "CallSphere Team"
published: 2026-04-02T00:00:00.000Z
updated: 2026-05-07T09:59:38.166Z
---

# Voice Agent Error Budgets: Defining Acceptable Failure

> An error budget is the unreliability you allow yourself in exchange for shipping. For voice agents, the budget is dollars and minutes — not just nines. Here's how CallSphere computes one.

> **TL;DR** — Stop using 99.9% availability as your only error budget. Add a "model-regression budget," a "cost burn budget," and a "user-perceived-latency budget." Burn any of them and the next deploy is blocked.

## What goes wrong

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

CallSphere reference architecture

The classic Google SRE error budget — 1 minus your SLO target — was designed for stateless services where failure is binary. A voice agent fails in shades of gray. The call connected but the agent stalled for 4 seconds. The agent answered correctly but quoted last month's price. Token cost blew past forecast. None of these violate "availability" but all of them are expensive.
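
To make the classic definition concrete, here is a minimal sketch of the "1 minus SLO" arithmetic. The 99.9% target and the 30-day window are illustrative assumptions, not CallSphere's actual SLO, and `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for an availability SLO (budget = 1 - SLO)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Illustrative numbers only: a 99.9% SLO measured over 30 days.
budget = error_budget_minutes(0.999, window_days=30)
print(f"{budget:.1f} minutes of allowed downtime")
```

A single number of minutes like this captures outages, but it has no way to register a 4-second stall or a stale price, which is the gap the rest of this post addresses.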

If you only track availability, you'll burn through your real budget without any alarm firing. Then you'll ship a model swap that drops accuracy from 95% to 91% and not notice for two weeks.

## How to monitor

Run multiple parallel error budgets, each with its own burn-rate alert:

1. **Availability budget** — 100% minus your audio-uptime SLO. Standard.
2. **Conversational success budget** — 100% minus your conv-success SLO. Burns when too many calls fail to complete.
3. **Latency budget** — fraction of turns above the FTL threshold. Burns when speed degrades.
4. **Quality budget** — fraction of turns where intent accuracy fell below threshold (sampled + LLM-as-judge). Burns on prompt or model regressions.
5. **Cost budget** — dollars spent on tokens above a forecast band. Burns on token-burn outliers.
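
The event-based budgets in the list above share one calculation, sketched below under stated assumptions: the `Budget` class, its field names, and the sample counts are all hypothetical, and the rule "block the deploy when any budget is exhausted" is taken from the TL;DR. The cost budget is measured in dollars rather than event counts, so it would need its own calculation.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    slo_target: float  # fraction of events that must be good, e.g. 0.95
    good_events: int   # events that met the SLI in the window
    total_events: int  # all events observed in the window

    def remaining_fraction(self) -> float:
        """Share of budget left: 1.0 is untouched, 0.0 or below is exhausted."""
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_events == 0:
            return 1.0  # no traffic in the window, so nothing was burned
        allowed_bad = (1.0 - self.slo_target) * self.total_events
        actual_bad = self.total_events - self.good_events
        return 1.0 - actual_bad / allowed_bad


def deploy_allowed(budgets: list[Budget]) -> bool:
    """Block the next deploy as soon as any single budget is exhausted."""
    return all(b.remaining_fraction() > 0.0 for b in budgets)


# Made-up counts for illustration.
budgets = [
    Budget("availability", 0.999, good_events=99_950, total_events=100_000),
    Budget("quality", 0.95, good_events=9_400, total_events=10_000),
]
for b in budgets:
    print(f"{b.name}: {b.remaining_fraction():+.0%} of budget remaining")
print("deploy allowed:", deploy_allowed(budgets))
```

In this made-up example availability still has budget left, but the quality budget is overspent, so the gate closes even though no availability alarm would have fired.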

The decision rule: if any budget is exhausted, the next deploy is blocked.

```sql
... NOW() - INTERVAL '7 days';
```

1. **Multi-burn-rate alerts.**

```yaml
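# Recording-rule names are placeholders; Prometheus metric names cannot start with a digit.
- alert: ConvSuccessBudgetBurn
  expr: |
    (conv_success:burn_rate1h > 14.4 and conv_success:burn_rate5m > 14.4)
    or
    (conv_success:burn_rate6h > 6 and conv_success:burn_rate30m > 6)
  labels: { severity: page, alert_type: model }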
```
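
The thresholds in the alert expression above are easier to reason about once burn rate is written out. The sketch below assumes a 30-day SLO period, which the post does not state; under that assumption a burn rate of 14.4 sustained for one hour consumes 2% of the budget, and a burn rate of 6 sustained for six hours consumes 5%. The function names and the sample rates are hypothetical.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float,
                    slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the whole budget used by burning at `rate` for `window_hours`."""
    return rate * window_hours / slo_period_hours


def should_page(rates: dict[str, float]) -> bool:
    """Same shape as the alert: long and short windows must both exceed the threshold."""
    fast = rates["1h"] > 14.4 and rates["5m"] > 14.4
    slow = rates["6h"] > 6 and rates["30m"] > 6
    return fast or slow


# Made-up burn rates per window, for illustration.
rates = {"5m": 20.0, "30m": 9.0, "1h": 15.0, "6h": 4.0}
print("page:", should_page(rates))
print(f"1h at 14.4x uses {budget_consumed(14.4, 1):.0%} of a 30-day budget")
```

Requiring the short window to agree with the long one means the alert stops firing soon after the problem is fixed, instead of paging for the rest of the hour.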

1. **Block deploys via OPA.** Admission webhook checks remaining budget on the relevant SLO before accepting pod manifests.
2. **Show the team.** A Grafana panel per budget on the SRE dashboard. Engineers will only respect what they see.
3. **Forgive intentionally.** A planned drill or migration consumes budget on purpose — log a "planned burn" event so the post-mortem doesn't blame anyone.
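
One way to picture the "planned burn" idea from the last step is a burn log in which every event carries a flag. This is a sketch only: `BurnEvent`, its fields, and the sample events are hypothetical, not CallSphere's schema. Planned burns still count against the budget here; the flag only separates them for the post-mortem.

```python
from dataclasses import dataclass


@dataclass
class BurnEvent:
    description: str
    budget_fraction: float  # share of the budget this event consumed, 0.0 to 1.0
    planned: bool = False   # True for drills and migrations logged ahead of time


def summarize(events: list[BurnEvent]) -> dict[str, float]:
    """Split consumption into planned and unplanned, and report what is left."""
    planned = sum(e.budget_fraction for e in events if e.planned)
    unplanned = sum(e.budget_fraction for e in events if not e.planned)
    return {
        "planned": planned,
        "unplanned": unplanned,
        "remaining": 1.0 - planned - unplanned,
    }


# Made-up events for illustration.
events = [
    BurnEvent("failover drill", 0.10, planned=True),
    BurnEvent("prompt regression", 0.25),
]
for key, value in summarize(events).items():
    print(f"{key}: {value:.0%}")
```

A dashboard panel could show the planned and unplanned shares side by side, so a drill does not read as an incident.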

## FAQ

**Q: How tight should the cost budget be?**
A: ±15% of a 14-day rolling baseline is a sensible default. Tighter than that fires too often; looser misses spikes.
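
As a rough sketch of that default, the band can be computed from daily token spend. Using the plain mean of the last 14 days as the "rolling baseline" is an assumption, as are the function names and the sample figures.

```python
from statistics import mean


def cost_band(daily_costs: list[float], tolerance: float = 0.15,
              window: int = 14) -> tuple[float, float]:
    """Return (low, high) around the mean of the most recent `window` days."""
    if len(daily_costs) < window:
        raise ValueError(f"need at least {window} days of history")
    baseline = mean(daily_costs[-window:])
    return baseline * (1 - tolerance), baseline * (1 + tolerance)


def outside_band(today: float, band: tuple[float, float]) -> bool:
    """True when today's spend falls outside the forecast band on either side."""
    low, high = band
    return today < low or today > high


# Made-up history: a flat $100/day for two weeks.
history = [100.0] * 14
band = cost_band(history)
print("band:", tuple(round(x, 2) for x in band))
print("$120 today is an outlier:", outside_band(120.0, band))
```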

**Q: What if I exhaust the budget mid-week?**
A: Stop shipping risky changes. Use the rest of the week for reliability work. That's the entire point.

**Q: Should error budgets affect compensation?**
A: Indirectly — through the team's deploy velocity. Don't tie individual bonuses to a budget; you'll get gaming.

**Q: How do I forecast?**
A: Time-series forecasting on the burn rate. Even a simple Holt-Winters from Postgres + cron beats not forecasting.
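
A minimal sketch of what such a forecast could look like follows. It uses Holt's linear-trend smoothing, a simpler relative of Holt-Winters without the seasonal component, so it is a stand-in rather than the method named above. The smoothing parameters, the function names, and the sample series are assumptions; the input is the cumulative fraction of budget consumed, one value per day.

```python
from typing import Optional


def holt_forecast(series: list[float], horizon: int,
                  alpha: float = 0.5, beta: float = 0.3) -> list[float]:
    """Holt's linear-trend smoothing: forecast `horizon` steps past the end of `series`."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


def days_until_exhausted(consumed: list[float], max_days: int = 90) -> Optional[int]:
    """First forecast day on which cumulative consumption reaches 1.0, if any."""
    for day, value in enumerate(holt_forecast(consumed, max_days), start=1):
        if value >= 1.0:
            return day
    return None


# Made-up history: cumulative budget consumed over four days.
consumed = [0.10, 0.17, 0.24, 0.31]
print("days until exhausted:", days_until_exhausted(consumed))
```

Run daily from cron against the budget table, a forecast like this turns "we have 69% left" into "we run out in about a week and a half", which is the number that changes planning decisions.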

**Q: Can I auto-escalate when a budget is < 10%?**
A: Yes. Page the engineering manager. We handle this at SEV2, with a Slack channel created automatically.

## Sources

- [Netdata — Understanding error budgets](https://www.netdata.cloud/academy/error-budget/)
- [BuildMVPFast — AI Agent Error Budgets](https://www.buildmvpfast.com/blog/ai-agent-error-budget-sre-reliability-autonomous-2026)
- [Agent Factory — SRE Foundations: SLIs, SLOs, error budgets](https://agentfactory.panaversity.org/docs/AI-Cloud-Native-Development/observability-cost-engineering/sre-foundations-slis-slos-error-budgets)
- [Nova AI Ops — AI SRE in 2026](https://novaaiops.com/ai-sre)

