---
title: "Alert Routing for AI Agent Failures: PagerDuty, Opsgenie, and Beyond"
description: "Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents."
canonical: https://callsphere.ai/blog/vw3c-alert-routing-ai-agent-failures-pagerduty-opsgenie
category: "AI Engineering"
tags: ["Alerting", "PagerDuty", "Opsgenie", "Incident Response"]
author: "CallSphere Team"
published: 2026-03-30T00:00:00.000Z
updated: 2026-05-07T09:59:38.164Z
---

# Alert Routing for AI Agent Failures: PagerDuty, Opsgenie, and Beyond

> Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents.

> **TL;DR** — AI-agent alerts come from three sources: infra, model, and tool. Each routes to a different team. The biggest mistake is sending model-quality alerts to the platform on-call.

## What goes wrong

```mermaid
flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
```

*CallSphere reference architecture*

Atlassian announced in March 2025 that Opsgenie would be absorbed into Jira Service Management and Compass — most teams in 2026 are mid-migration. PagerDuty added an SRE Agent that auto-investigates incidents. Both still struggle with AI-agent alerts because the failure surface is *not* what their routing trees were built for.

Three categories of alert that need different humans:

1. **Infra failures** — pod crashloop, Postgres replica lag, Redis OOM. Page the platform on-call. Same as classic SRE.
2. **Model regressions** — intent accuracy fell 4% after a prompt change. Page the AI engineer who owns the prompt.
3. **Tool failures** — CRM API rate-limited, calendar webhook timing out. Page the integration owner.

A noisy alert lands in the wrong queue, gets ack'd by someone who can't fix it, and rots while customers churn. The classic Opsgenie failure mode — conflicting routing rules, multi-team duplication — gets worse with AI because the same symptom (low conversational success) can have all three causes.

## How to monitor

Build a routing tree where every alert has a *type tag* before it leaves Prometheus or Langfuse:

- `alert_type=infra` → platform team
- `alert_type=model` → AI engineering
- `alert_type=tool` → integration owner (auto-routed by tool name)

Use Alertmanager for the first hop (it's cheap and reliable) and PagerDuty / Opsgenie for the on-call rotation, severity escalation, and status-page integration. Avoid the temptation to push everything through PagerDuty's UI — version-controlled YAML beats clickops every time.

## CallSphere stack

CallSphere uses Alertmanager → PagerDuty for production, with Slack as the secondary channel. Six routing trees, one per vertical:

- **Healthcare** (FastAPI `:8084`) — has the strictest SLO, so any FTL p95 breach pages immediately. Routes to the AI engineering rotation, with escalation to me (CEO) at SEV1.
- **Real Estate** (6-container NATS pod) — tool-call alerts auto-route by tool name. CRM webhook 5xx → CRM owner; calendar API 429 → calendar owner.
- **Sales** (WebSocket + PM2) — backpressure alerts route to platform team because they're infra symptoms.
- **After-hours** (Bull/Redis queue) — queue-depth alerts route to platform; per-job model errors route to AI eng.
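The Real Estate tool-call routing above can be sketched as an Alertmanager sub-route that fans out on a `tool_name` label. This is a minimal sketch; the receiver names and `tool_name` values are assumptions, not our exact production config:

```yaml
# Sketch: auto-route tool alerts to their integration owner.
# Receiver names and tool_name values are illustrative.
- matchers: [alert_type = tool, vertical = realestate]
  receiver: pd-integration-owners   # fallback when no tool sub-route matches
  routes:
    - matchers: [tool_name = crm_webhook]
      receiver: pd-crm-owner
    - matchers: [tool_name = calendar_api]
      receiver: pd-calendar-owner
```

The fallback receiver matters: a new tool added without a sub-route still pages *someone* instead of vanishing into the default queue.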

The AI eng team has a 24/7 rotation only at the $1499 enterprise tier; $499 has business-hours pager; $149 trial gets email-only. We publish status at status.callsphere.ai. Try the routing on the [14-day trial](/trial).

## Implementation

1. **Tag every alert at source.**

```yaml
# prometheus/rules.yaml
- alert: HealthcareFTLP95High
  # Quantile over the recent rate, not raw cumulative bucket counters.
  expr: histogram_quantile(0.95, sum by (le) (rate(callsphere_ftl_ms_bucket{vertical="healthcare"}[5m]))) > 800
  for: 5m
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare
```
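For comparison, an infra-tagged and a tool-tagged rule follow the same pattern. Metric names and thresholds below are illustrative, not CallSphere's actual metrics:

```yaml
# Illustrative companions: same tagging pattern, different destinations.
- alert: RedisMemoryHigh
  expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
  for: 10m
  labels:
    severity: page
    alert_type: infra        # routes to the platform receiver
    team: platform

- alert: CRMWebhook5xxRate
  expr: rate(callsphere_tool_errors_total{tool_name="crm_webhook", code=~"5.."}[5m]) > 0.1
  for: 5m
  labels:
    severity: page
    alert_type: tool         # grouped and routed by tool_name
    tool_name: crm_webhook
```

Note the `tool_name` label on the tool alert: it is what makes per-tool grouping and auto-routing in Alertmanager possible.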

2. **Route in Alertmanager.**

```yaml
route:
  receiver: default
  routes:
    # `matchers` supersedes the deprecated `match` syntax (Alertmanager >= 0.22)
    - matchers: [alert_type = infra]
      receiver: pd-platform
    - matchers: [alert_type = model]
      receiver: pd-ai-eng
    - matchers: [alert_type = tool]
      receiver: pd-integration-owners
      group_by: [tool_name]
```
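The receivers those routes point at are defined once and wired to PagerDuty's Events API v2. A minimal sketch, with placeholder integration keys and an assumed Slack channel:

```yaml
receivers:
  - name: default
    slack_configs:
      - channel: "#alerts-triage"    # requires a slack_api_url in global config
  - name: pd-platform
    pagerduty_configs:
      - routing_key: <platform-integration-key>    # PagerDuty Events API v2 key
  - name: pd-ai-eng
    pagerduty_configs:
      - routing_key: <ai-eng-integration-key>
  - name: pd-integration-owners
    pagerduty_configs:
      - routing_key: <integrations-key>
```

One routing key per team-level PagerDuty service keeps this list short, which is the point of step 3 below.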

3. **One PagerDuty service per team**, not one per alert. Reuse escalation policies.
4. **Auto-resolve fast.** A model alert that recovers in 2 minutes shouldn't wake anyone; set `resolve_timeout: 3m` in Alertmanager's `global` block.
5. **Run a weekly alert-noise review.** Prefer burn-rate alerts over raw threshold alerts; multi-window, multi-burn-rate is the standard.
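A multi-window, multi-burn-rate pair for a 99% conversational-success SLO might look like the following. The SLO target, metric name, and window choices are assumptions; the 14.4x and 6x factors follow the common SRE-workbook convention for fast and slow burn:

```yaml
# Page only when the error budget burns fast over both a long and a short window.
- alert: SuccessSLOFastBurn
  expr: >
    (1 - avg_over_time(callsphere_call_success_ratio[1h])) > 14.4 * 0.01
    and
    (1 - avg_over_time(callsphere_call_success_ratio[5m])) > 14.4 * 0.01
  labels:
    severity: page
    alert_type: model

# Slow burn gets a ticket, not a page.
- alert: SuccessSLOSlowBurn
  expr: >
    (1 - avg_over_time(callsphere_call_success_ratio[6h])) > 6 * 0.01
    and
    (1 - avg_over_time(callsphere_call_success_ratio[30m])) > 6 * 0.01
  labels:
    severity: ticket
    alert_type: model
```

The short window makes the alert reset quickly once the regression is fixed, which is what kills most of the noise a plain threshold alert generates.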

## FAQ

**Q: Should I move from Opsgenie to JSM now?**
A: If you're already a JSM customer, yes; Atlassian is migrating you anyway. If not, evaluate PagerDuty, incident.io, Rootly, or FireHydrant.

**Q: Should an AI SRE agent auto-remediate?**
A: For known runbook actions (restart pod, fail over Redis), yes. For prompt regressions, never — keep a human in the loop.

**Q: How do I avoid pager fatigue from cost-spike alerts?**
A: Batch them at hourly granularity unless they exceed 5x baseline. Most cost spikes are not page-worthy.
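Hourly batching can be expressed as a dedicated Alertmanager sub-route with a long `group_interval`. This sketch assumes cost alerts carry an `alert_type=cost` label and that the >5x-baseline case is tagged `severity: page` at the source:

```yaml
# Sketch: cost alerts batch to Slack hourly; only tagged spikes page.
- matchers: [alert_type = cost]
  receiver: slack-finops
  group_wait: 10m
  group_interval: 1h       # repeats batch hourly instead of firing each alert
  routes:
    - matchers: [severity = page]   # pre-tagged >5x-baseline spikes
      receiver: pd-platform
```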

**Q: Do I need separate rotations for AI eng vs platform?**
A: Yes, once you have more than ~3 engineers. Skills don't overlap enough.

**Q: How do customer-facing alerts work?**
A: Status page at status.callsphere.ai. $1499 plan customers get a private status page; affiliates on [/affiliate](/affiliate) get aggregate views.

## Sources

- [PagerDuty — AI-First Operations](https://www.pagerduty.com/)
- [Aurora — Opsgenie 2026 Features, Pricing, EOL](https://www.arvoai.ca/blog/opsgenie-complete-guide-2026)
- [Rootly — AI driven platforms outperform PagerDuty 2026](https://rootly.com/sre/ai-driven-platforms-outperform-pagerduty-2026)
- [CloudQA — Integrating Synthetic Alerts with Opsgenie PagerDuty Slack](https://cloudqa.io/integrating-synthetic-alerts-with-opsgenie-pagerduty-and-slack/)

---

Source: https://callsphere.ai/blog/vw3c-alert-routing-ai-agent-failures-pagerduty-opsgenie
