---
title: "Cost Monitoring for Token-Burn Outliers in Voice and Chat Agents"
description: "Mean token cost lies. Cost distributions are right-skewed and a single runaway agent can blow your monthly budget. Z-score and IQR alerts in 2026 catch the spike at minute 5, not month-end."
canonical: https://callsphere.ai/blog/vw3c-cost-monitoring-token-burn-outliers
category: "AI Engineering"
tags: ["Cost Monitoring", "FinOps", "Anomaly Detection", "LLM"]
author: "CallSphere Team"
published: 2026-04-15T00:00:00.000Z
updated: 2026-05-07T09:59:38.174Z
---

# Cost Monitoring for Token-Burn Outliers in Voice and Chat Agents

> Mean token cost lies. Cost distributions are right-skewed and a single runaway agent can blow your monthly budget. Z-score and IQR alerts in 2026 catch the spike at minute 5, not month-end.

> **TL;DR** — Set up a 5-minute Z-score or IQR check against a 14-day rolling baseline. Threshold at 3.5σ. You'll catch every runaway agent before it costs you a thousand dollars.

## What goes wrong

```mermaid
flowchart LR
  Browser["Browser / Phone"] -- "WebSocket /ws" --> LB["Load Balancer<br/>sticky session"]
  LB --> Pod1["Node A · Socket.IO"]
  LB --> Pod2["Node B · Socket.IO"]
  Pod1 -- "pub/sub" --> Redis[("Redis cluster")]
  Pod2 -- "pub/sub" --> Redis
  Pod1 --> AI["AI Worker · OpenAI Realtime"]
  Pod2 --> AI
```

CallSphere reference architecture

LLM cost distributions are right-skewed: most calls are cheap, but a small fraction are extremely expensive. The arithmetic mean is misleading because those outliers drag it upward. The classic failure mode is a feedback loop: an agent calls a tool that returns a stale result, retries, retries again, and burns 200k tokens on a single user. By the time finance notices the monthly bill, you've spent thousands.

In 2026 the standard fix is statistical anomaly detection on token velocity: compare the current 5-minute window to a 14-day rolling baseline at the same hour of day, and fire on deviations of 3σ to 3.5σ. Auto-allowlist approved models so that any new model name is also an alert.
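The windowed check can be sketched in a few lines. This is a minimal illustration, assuming you can already query spend per 5-minute window; `is_anomalous` and the baseline values are stand-ins for real query results:

```python
from statistics import mean, stdev

def is_anomalous(current_spend: float, baseline: list[float], threshold: float = 3.5) -> bool:
    """Flag the current 5-minute window if it deviates more than
    `threshold` sigma from the 14-day same-hour baseline."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current_spend > mu  # flat baseline: any increase is suspect
    return (current_spend - mu) / sigma > threshold

# 14 days of same-hour windows hovering around $12/5min, then a spike
baseline = [11.8, 12.1, 12.4, 11.9, 12.2, 12.0, 11.7, 12.3]
assert not is_anomalous(12.6, baseline)  # normal drift
assert is_anomalous(55.0, baseline)      # runaway agent
```

In production this runs against the rollup table rather than an in-memory list, but the arithmetic is identical.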

## How to monitor

Three layers of cost monitoring:

1. **Per-call cap** — hard limit per call (we use 8000 tokens for voice, 16000 for chat). Agent stops on cap; user sees graceful exit.
2. **Per-customer rate limit** — daily token budget per tenant. Returns 429 to API; voice calls degrade to a smaller model.
3. **Anomaly alerts** — Z-score on global token velocity; per-vertical and per-tool-call.
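Layer 2 can be sketched as a budget gate. Everything here is illustrative (tier names, storage, and the function itself are assumptions, not CallSphere's code); the tiers mirror the pricing later in this post:

```python
# Tokens/day per plan tier; the top tier is unlimited. Names are hypothetical.
DAILY_BUDGETS = {"starter": 100_000, "growth": 1_000_000}

def check_tenant_budget(used_today: int, requested: int, tier: str) -> tuple[bool, int]:
    """Return (allowed, http_status). Over-budget API calls get a 429;
    a voice path would instead degrade to a smaller model."""
    budget = DAILY_BUDGETS.get(tier)
    if budget is None:  # unlimited tier
        return True, 200
    if used_today + requested > budget:
        return False, 429
    return True, 200

assert check_tenant_budget(99_000, 500, "starter") == (True, 200)
assert check_tenant_budget(99_900, 500, "starter") == (False, 429)
```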

Use percentiles, not averages, on dashboards. p95 token cost per call is the metric to watch.
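A quick synthetic example of why the mean misleads on right-skewed costs (numbers invented for illustration):

```python
# 99 cheap calls plus one runaway: a typical right-skewed cost distribution.
costs = [0.09] * 99 + [45.0]

mean_cost = sum(costs) / len(costs)
p95 = sorted(costs)[int(0.95 * len(costs)) - 1]  # simple nearest-rank p95

# One outlier drags the mean to ~$0.54, 6x the typical call;
# p95 still reads $0.09 and reflects what most users actually cost.
assert abs(mean_cost - 0.5391) < 1e-6
assert p95 == 0.09
```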

## CallSphere stack

CallSphere computes cost metrics in a Postgres rollup every 60 seconds. Each agent emits a span with `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and `gen_ai.request.model`; the OTel collector forwards to both Langfuse and a Postgres exporter. A scheduled SQL function aggregates by 5-minute window and computes Z-score against the same window in the prior 14 days.

- **Healthcare FastAPI `:8084`** — hard cap 8000 tokens per call (gpt-4o-realtime). Tail-call cap kicks in via a system message that asks the agent to summarize.
- **Real Estate 6-container NATS pod** — per-tool cost cap (no single tool call > 1500 tokens of context).
- **Sales WebSocket + PM2** — per-customer daily limit synced to plan tier ($149 = 100k tokens/day, $499 = 1M, $1499 = unlimited).
- **After-hours Bull/Redis queue** — cost per job hard cap; over-cap jobs route to gpt-4o-mini fallback.
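The over-cap fallback in the last bullet is just a routing decision. A minimal sketch, with the function name and cap as assumptions:

```python
def pick_model(estimated_tokens: int, cap: int = 8_000) -> str:
    """Route jobs whose estimated burn exceeds the cap to the cheap fallback."""
    return "gpt-4o-mini" if estimated_tokens > cap else "gpt-4o"

assert pick_model(12_000) == "gpt-4o-mini"  # over-cap job takes the fallback
assert pick_model(3_000) == "gpt-4o"
```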

Real numbers: median voice call is $0.087; p95 is $0.31; p99 is $0.94. Anything above $3 fires a per-call alert. Try it on the [14-day trial](/trial); see costs broken down on [/pricing](/pricing).

## Implementation

1. **Tag every span** with cost.

```python
span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
# $5 per 1M input tokens, $15 per 1M output tokens
span.set_attribute("callsphere.cost_usd", input_tokens * 0.000005 + output_tokens * 0.000015)
```

2. **Rollup table.**

```sql
-- Refreshed on a 60-second schedule (REFRESH MATERIALIZED VIEW cost_5m)
CREATE MATERIALIZED VIEW cost_5m AS
SELECT
  date_trunc('minute', ts) - (date_part('minute', ts)::int % 5) * INTERVAL '1 minute' AS bucket,
  vertical,
  SUM(cost_usd) AS spend
FROM agent_spans
GROUP BY 1, 2;
```

3. **Z-score alert.**

```sql
SELECT vertical, spend,
  (spend - avg_baseline) / nullif(stddev_baseline, 0) AS z
FROM (
  SELECT c.vertical, c.spend,
    AVG(b.spend) AS avg_baseline,
    STDDEV(b.spend) AS stddev_baseline
  FROM cost_5m c
  JOIN cost_5m b ON b.bucket BETWEEN c.bucket - INTERVAL '14 days' AND c.bucket - INTERVAL '5 minutes'
    AND extract(hour from b.bucket) = extract(hour from c.bucket)
  WHERE c.bucket = (SELECT MAX(bucket) FROM cost_5m)
  GROUP BY 1, 2
) sub
-- the SELECT alias `z` can't be referenced in WHERE, so repeat the expression
WHERE (spend - avg_baseline) / nullif(stddev_baseline, 0) > 3.5;
```

4. **Allowlist models.** Any `gen_ai.request.model` not in our approved list pages immediately. Catches accidental `gpt-4-32k` shipments.
5. **Per-call cap** as a hard system instruction + token counter in the agent loop.
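Steps 4 and 5 reduce to one decision per agent turn. A minimal sketch, with the allowlist contents, cap, and function name as illustrative assumptions:

```python
APPROVED_MODELS = {"gpt-4o-realtime", "gpt-4o-mini"}  # illustrative allowlist

def run_agent_turn(call_tokens_used: int, turn_tokens: int, model: str,
                   cap: int = 8_000) -> str:
    """Decide what the agent loop does after each turn."""
    if model not in APPROVED_MODELS:
        return "page_oncall"      # unknown model name: alert immediately
    if call_tokens_used + turn_tokens >= cap:
        return "graceful_exit"    # inject the summarize-and-wrap-up message
    return "continue"

assert run_agent_turn(7_900, 200, "gpt-4o-realtime") == "graceful_exit"
assert run_agent_turn(1_000, 200, "gpt-4o-realtime") == "continue"
assert run_agent_turn(0, 100, "gpt-4-32k") == "page_oncall"
```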

## FAQ

**Q: Why 3.5σ?**
A: 3σ at 12 windows/hour fires too often. 3.5σ catches the real spikes; we tune to 4σ during marketing pushes.

**Q: How do I tell a real spike from a viral signup?**
A: Combine the Z-score with an absolute floor (e.g., the spike must also exceed $50/5min). That filters out false alarms from low-volume windows where the baseline spread is tiny.

**Q: Should I auto-throttle?**
A: Yes, at the per-customer level. Return a 429 with a user-friendly message. Don't auto-throttle globally without human approval.

**Q: Cost as an SLO?**
A: Yes — we treat it as a budget. See the error budget post for how it gates deploys.

**Q: What about embedding/vector costs?**
A: Roll them in; pgvector embedding calls hit the same OpenAI bill. Tag with `callsphere.op=embed`.

## Sources

- [Dev Journal — Detect LLM Cost Spikes with Statistical Anomaly Detection](https://earezki.com/ai-news/2026-04-02-your-llm-costs-spiked-400-last-night-heres-how-to-catch-it-in-one-api-call/)
- [OpenObserve — LLM Cost Monitoring](https://openobserve.ai/blog/llm-cost-monitoring/)
- [Silicon Data — LLM Cost Per Token 2026 Practical Guide](https://www.silicondata.com/blog/llm-cost-per-token)
- [Agile Leadership Day — Why Native LLM Token Cost Optimization Tools Fail](https://agileleadershipdayindia.org/blogs/agentic-ai-cost-finops/llm-token-cost-optimization-tools.html)

