AI Engineering · 10 min read

Alert Routing for AI Agent Failures: PagerDuty, Opsgenie, and Beyond

Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents.

TL;DR — AI-agent alerts come from three sources: infra, model, and tool. Each routes to a different team. The biggest mistake is sending model-quality alerts to the platform on-call.

What goes wrong

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
CallSphere reference architecture

Atlassian announced in March 2025 that Opsgenie would be absorbed into Jira Service Management and Compass — most teams in 2026 are mid-migration. PagerDuty added an SRE Agent that auto-investigates incidents. Both still struggle with AI-agent alerts because the failure surface is not what their routing trees were built for.

Three categories of alert that need different humans:

  1. Infra failures — pod crashloop, Postgres replica lag, Redis OOM. Page the platform on-call. Same as classic SRE.
  2. Model regressions — intent accuracy fell 4% after a prompt change. Page the AI engineer who owns the prompt.
  3. Tool failures — CRM API rate-limited, calendar webhook timing out. Page the integration owner.

A noisy alert lands in the wrong queue, gets ack'd by someone who can't fix it, and rots while customers churn. The classic Opsgenie failure mode — conflicting routing rules, multi-team duplication — gets worse with AI because the same symptom (low conversational success) can have all three causes.

How to monitor

Build a routing tree where every alert has a type tag before it leaves Prometheus or Langfuse:

  • alert_type=infra → platform team
  • alert_type=model → AI engineering
  • alert_type=tool → integration owner (auto-routed by tool name)

Use Alertmanager for the first hop (it's cheap and reliable) and PagerDuty / Opsgenie for the on-call rotation, severity escalation, and status-page integration. Avoid the temptation to push everything through PagerDuty's UI — version-controlled YAML beats clickops every time.
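For context, here is a minimal sketch of what those PagerDuty receivers can look like on the Alertmanager side, assuming Events API v2 routing keys; the key values and the Slack channel are placeholders, and the receiver names match the routing config in the Implementation section below.

# alertmanager.yml (sketch)
receivers:
  - name: pd-platform
    pagerduty_configs:
      - routing_key: "<platform-events-v2-key>"      # placeholder
        severity: critical
  - name: pd-ai-eng
    pagerduty_configs:
      - routing_key: "<ai-eng-events-v2-key>"        # placeholder
  - name: pd-integration-owners
    pagerduty_configs:
      - routing_key: "<integrations-events-v2-key>"  # placeholder
  - name: slack-secondary
    slack_configs:
      - channel: "#alerts-secondary"                 # placeholder; Slack API URL set globally
        send_resolved: true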

CallSphere stack

CallSphere uses Alertmanager → PagerDuty for production, with Slack as the secondary channel. Six routing trees, one per vertical; four examples:

  • Healthcare (FastAPI :8084) — has the strictest SLO, so any FTL p95 breach pages immediately. Routes to the AI engineering rotation, with escalation to me (CEO) at SEV1.
  • Real Estate (6-container NATS pod) — tool-call alerts auto-route by tool name. CRM webhook 5xx → CRM owner; calendar API 429 → calendar owner (sketched just after this list).
  • Sales (WebSocket + PM2) — backpressure alerts route to platform team because they're infra symptoms.
  • After-hours (Bull/Redis queue) — queue-depth alerts route to platform; per-job model errors route to AI eng.
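A sketch of that per-tool fan-out for the Real Estate tree, assuming tool failures carry a tool_name label; the tool names and owner receivers here are illustrative, not the actual CallSphere config.

# alertmanager.yml (sketch) -- nested under the main route
- match: { alert_type: tool, vertical: real-estate }
  receiver: pd-integration-owners        # fallback when no tool-specific owner matches
  group_by: [tool_name]
  routes:
    - match: { tool_name: crm_webhook }  # hypothetical tool_name value
      receiver: pd-crm-owner
    - match: { tool_name: calendar_api } # hypothetical tool_name value
      receiver: pd-calendar-owner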

The AI eng team has a 24/7 rotation only at the $1499 enterprise tier; $499 has business-hours pager; $149 trial gets email-only. We publish status at status.callsphere.ai. Try the routing on the 14-day trial.
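If you want the tier split to live in version-controlled YAML rather than only in PagerDuty schedules, Alertmanager time intervals (available since 0.24) can express it. The tier label below is an assumption for illustration, not a label the CallSphere rules are documented to emit.

# alertmanager.yml (sketch)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "18:00"
route:
  routes:
    - match: { alert_type: model, tier: standard }    # hypothetical label: $499 plan, pages business hours only
      receiver: pd-ai-eng
      active_time_intervals: [business-hours]
    - match: { alert_type: model, tier: enterprise }  # hypothetical label: $1499 plan, pages 24/7
      receiver: pd-ai-eng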

Implementation

  1. Tag every alert at source.
# prometheus/rules.yaml
- alert: HealthcareFTLP95High
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(callsphere_ftl_ms_bucket{vertical="healthcare"}[5m]))
    ) > 800
  for: 5m
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare
  2. Route in Alertmanager.
route:
  receiver: default
  routes:
    - match: { alert_type: infra }
      receiver: pd-platform
    - match: { alert_type: model }
      receiver: pd-ai-eng
    - match: { alert_type: tool }
      receiver: pd-integration-owners
      group_by: [tool_name]
  3. One PagerDuty service per team, not one per alert. Reuse escalation policies.

  4. Auto-resolve fast. A model alert that recovers in 2 minutes shouldn't wake anyone: the for: window on the rule keeps short blips from firing at all, send_resolved on the PagerDuty receiver (the default) closes the incident when the alert clears, and a short resolve_timeout (e.g. 3m) in Alertmanager catches sources that never send an explicit resolve.

  5. Run a weekly alert-noise review. Prefer burn-rate alerts to raw threshold alerts; multi-window multi-burn-rate is the standard approach, sketched below.
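A minimal fast-burn sketch of that pattern, assuming a 99% conversation-success SLO and hypothetical callsphere_conversation_errors_total / callsphere_conversations_total counters; the 14.4x factor and the 1h/5m window pair follow the common multi-window multi-burn-rate recipe.

# prometheus/rules.yaml (sketch)
- alert: ConversationSLOFastBurn
  expr: |
    (
      sum(rate(callsphere_conversation_errors_total[1h]))
        / sum(rate(callsphere_conversations_total[1h])) > 14.4 * 0.01
    and
      sum(rate(callsphere_conversation_errors_total[5m]))
        / sum(rate(callsphere_conversations_total[5m])) > 14.4 * 0.01
    )
  labels:
    severity: page
    alert_type: model
    team: ai-eng
# pair this with a slower 6x burn over 6h/30m windows that files a ticket instead of paging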

FAQ

Q: Should I move from Opsgenie to JSM now? A: If you're a JSM customer, yes — Atlassian is migrating you anyway. If not, evaluate PagerDuty, incident.io, Rootly, FireHydrant.

Q: Should an AI SRE agent auto-remediate? A: For known runbook actions (restart pod, fail over Redis), yes. For prompt regressions, never — keep a human in the loop.

Q: How do I avoid pager fatigue from cost-spike alerts? A: Batch them at hourly granularity unless they exceed 5x baseline. Most cost spikes are not page-worthy.
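One way to encode that rule of thumb in Prometheus, assuming a hypothetical callsphere_llm_cost_usd_total counter: compare the last hour against the average hourly spend over the previous week and only open a ticket when it exceeds 5x.

# prometheus/rules.yaml (sketch)
- alert: LLMCostSpike
  expr: |
    sum by (vertical) (increase(callsphere_llm_cost_usd_total[1h]))
      > 5 * (sum by (vertical) (increase(callsphere_llm_cost_usd_total[7d])) / 168)
  for: 1h
  labels:
    severity: ticket     # batched review, not a page
    alert_type: model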

Q: Do I need separate rotations for AI eng vs platform? A: Yes, once you have more than ~3 engineers. Skills don't overlap enough.

Q: How do customer-facing alerts work? A: Status page at status.callsphere.ai. $1499 plan customers get a private status page; affiliates on /affiliate get aggregate views.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.
