Alert Routing for AI Agent Failures: PagerDuty, Opsgenie, and Beyond
Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents.
TL;DR — AI-agent alerts come from three sources: infra, model, and tool. Each routes to a different team. The biggest mistake is sending model-quality alerts to the platform on-call.
What goes wrong
First, the pipeline these alerts come from:

flowchart TD
    Client[Client] --> Edge[Cloudflare Worker]
    Edge -->|WS upgrade| DO[Durable Object]
    DO --> AI[(OpenAI Realtime WS)]
    AI --> DO
    DO --> Client
    DO -.hibernation.-> Storage[(Persisted state)]

Atlassian announced in March 2025 that Opsgenie would be absorbed into Jira Service Management and Compass — most teams in 2026 are mid-migration. PagerDuty added an SRE Agent that auto-investigates incidents. Both still struggle with AI-agent alerts because the failure surface is not what their routing trees were built for.
Three categories of alert that need different humans:
- Infra failures — pod crashloop, Postgres replica lag, Redis OOM. Page the platform on-call. Same as classic SRE.
- Model regressions — intent accuracy fell 4% after a prompt change. Page the AI engineer who owns the prompt.
- Tool failures — CRM API rate-limited, calendar webhook timing out. Page the integration owner.
A misrouted alert lands in the wrong queue, gets ack'd by someone who can't fix it, and rots while customers churn. The classic Opsgenie failure mode — conflicting routing rules, multi-team duplication — gets worse with AI because the same symptom (low conversational success) can have all three causes.
How to monitor
Build a routing tree where every alert has a type tag before it leaves Prometheus or Langfuse:
- alert_type=infra → platform team
- alert_type=model → AI engineering
- alert_type=tool → integration owner (auto-routed by tool name)
Use Alertmanager for the first hop (it's cheap and reliable) and PagerDuty / Opsgenie for the on-call rotation, severity escalation, and status-page integration. Avoid the temptation to push everything through PagerDuty's UI — version-controlled YAML beats clickops every time.
CallSphere stack
CallSphere uses Alertmanager → PagerDuty for production, with Slack as the secondary channel. Six routing trees, one per vertical; four examples:
- Healthcare (FastAPI :8084) — has the strictest SLO, so any FTL p95 breach pages immediately. Routes to the AI engineering rotation, with escalation to me (CEO) at SEV1.
- Real Estate (6-container NATS pod) — tool-call alerts auto-route by tool name (sketch after this list). CRM webhook 5xx → CRM owner; calendar API 429 → calendar owner.
- Sales (WebSocket + PM2) — backpressure alerts route to platform team because they're infra symptoms.
- After-hours (Bull/Redis queue) — queue-depth alerts route to platform; per-job model errors route to AI eng.
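A minimal sketch of that per-tool fan-out as Alertmanager sub-routes, assuming each tool alert carries a tool_name label set at source (the receiver names here are illustrative, not our literal config):

# sub-routes under the alert_type: tool branch
- match: { alert_type: tool, tool_name: crm_webhook }
  receiver: pd-crm-owner
- match: { alert_type: tool, tool_name: calendar_api }
  receiver: pd-calendar-owner
# anything tagged tool but not matched above falls through
- match: { alert_type: tool }
  receiver: pd-integration-owners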
The AI eng team has a 24/7 rotation only at the $1499 enterprise tier; $499 has business-hours pager; $149 trial gets email-only. We publish status at status.callsphere.ai. Try the routing on the 14-day trial.
Implementation
- Tag every alert at source.
# prometheus/rules.yaml
- alert: HealthcareFTLP95High
  expr: >
    histogram_quantile(0.95,
      sum by (le) (rate(callsphere_ftl_ms_bucket{vertical="healthcare"}[5m]))
    ) > 800
  for: 5m
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare
- Route in Alertmanager.
# alertmanager/alertmanager.yml
route:
  receiver: default
  routes:
    - match: { alert_type: infra }
      receiver: pd-platform
    - match: { alert_type: model }
      receiver: pd-ai-eng
    - match: { alert_type: tool }
      receiver: pd-integration-owners
      group_by: [tool_name]
- One PagerDuty service per team, not one per alert. Reuse escalation policies.
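On the receiver side that means one pagerduty_configs entry per team. A sketch, assuming Events API v2 routing keys kept out of the repo (routing_key_file requires a reasonably recent Alertmanager; the paths are placeholders):

receivers:
  - name: pd-platform
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pd-platform.key
  - name: pd-ai-eng
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pd-ai-eng.key
  - name: pd-integration-owners
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pd-integration.key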
- Auto-resolve fast. A model alert that recovers in 2 minutes shouldn't wake anyone — set resolve_timeout: 3m in Alertmanager's global config.
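That is a two-line change in the top-level config. Note it only applies to alerts that arrive without their own end timestamp:

global:
  resolve_timeout: 3m
  # Prometheus sets EndsAt on the alerts it sends, so keep rule `for:`
  # windows short too; this mainly covers webhook sources like Langfuse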
- Run a weekly alert-noise review. Burn-rate alerts, not threshold alerts; multi-window multi-burn-rate is the standard.
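A sketch of the fast-burn page in that pattern, assuming a 99.9% conversational-success SLO over 30 days and hypothetical callsphere_calls_* counters:

# 14.4x burn rate: 2% of the 30-day error budget gone in one hour,
# confirmed by the short 5m window so it stops paging once the burn ends
- alert: HealthcareSuccessBurnRateFast
  expr: >
    (
      sum(rate(callsphere_calls_failed_total{vertical="healthcare"}[1h]))
      /
      sum(rate(callsphere_calls_total{vertical="healthcare"}[1h]))
    ) > (14.4 * 0.001)
    and
    (
      sum(rate(callsphere_calls_failed_total{vertical="healthcare"}[5m]))
      /
      sum(rate(callsphere_calls_total{vertical="healthcare"}[5m]))
    ) > (14.4 * 0.001)
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare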
FAQ
Q: Should I move from Opsgenie to JSM now? A: If you're a JSM customer, yes — Atlassian is migrating you anyway. If not, evaluate PagerDuty, incident.io, Rootly, FireHydrant.
Q: Should an AI SRE agent auto-remediate? A: For known runbook actions (restart pod, fail over Redis), yes. For prompt regressions, never — keep a human in the loop.
Q: How do I avoid pager fatigue from cost-spike alerts? A: Batch them at hourly granularity unless they exceed 5x baseline. Most cost spikes are not page-worthy.
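One way to wire that batching in Alertmanager, assuming cost alerts get their own hypothetical alert_type: cost tag (a fourth tag beyond the three in the tree above) and a non-paging Slack receiver; the 5x-baseline exception stays a severity: page rule on the Prometheus side:

- match: { alert_type: cost }
  receiver: slack-finops      # hypothetical non-paging receiver
  group_wait: 10m
  group_interval: 1h          # batch cost noise at hourly granularity
  repeat_interval: 12h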
Q: Do I need separate rotations for AI eng vs platform? A: Yes, once you have more than ~3 engineers. Skills don't overlap enough.
Q: How do customer-facing alerts work? A: Status page at status.callsphere.ai. $1499 plan customers get a private status page; affiliates on /affiliate get aggregate views.