AI Engineering · 10 min read

Alert Routing for AI Agent Failures: PagerDuty, Opsgenie, and Beyond

Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents.

TL;DR — AI-agent alerts come from three sources: infra, model, and tool. Each routes to a different team. The biggest mistake is sending model-quality alerts to the platform on-call.

What goes wrong

flowchart TD
  Client[Client] --> Edge[Cloudflare Worker]
  Edge -->|WS upgrade| DO[Durable Object]
  DO --> AI[(OpenAI Realtime WS)]
  AI --> DO
  DO --> Client
  DO -.hibernation.-> Storage[(Persisted state)]
CallSphere reference architecture

Atlassian announced in March 2025 that Opsgenie would be absorbed into Jira Service Management and Compass — most teams in 2026 are mid-migration. PagerDuty added an SRE Agent that auto-investigates incidents. Both still struggle with AI-agent alerts because the failure surface is not what their routing trees were built for.

Three categories of alert that need different humans:

  1. Infra failures — pod crashloop, Postgres replica lag, Redis OOM. Page the platform on-call. Same as classic SRE.
  2. Model regressions — intent accuracy fell 4% after a prompt change. Page the AI engineer who owns the prompt.
  3. Tool failures — CRM API rate-limited, calendar webhook timing out. Page the integration owner.

A noisy alert lands in the wrong queue, gets ack'd by someone who can't fix it, and rots while customers churn. The classic Opsgenie failure mode — conflicting routing rules, multi-team duplication — gets worse with AI because the same symptom (low conversational success) can have all three causes.

How to monitor

Build a routing tree where every alert has a type tag before it leaves Prometheus or Langfuse:

  • alert_type=infra → platform team
  • alert_type=model → AI engineering
  • alert_type=tool → integration owner (auto-routed by tool name)

Use Alertmanager for the first hop (it's cheap and reliable) and PagerDuty / Opsgenie for the on-call rotation, severity escalation, and status-page integration. Avoid the temptation to push everything through PagerDuty's UI — version-controlled YAML beats clickops every time.
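For context, here is a minimal sketch of what those PagerDuty receivers can look like on the Alertmanager side, assuming Events API v2 routing keys; the key values and the Slack channel are placeholders, and the receiver names match the routing config in the Implementation section below.

# alertmanager.yml (sketch)
receivers:
  - name: pd-platform
    pagerduty_configs:
      - routing_key: "<platform-events-v2-key>"      # placeholder
        severity: critical
  - name: pd-ai-eng
    pagerduty_configs:
      - routing_key: "<ai-eng-events-v2-key>"        # placeholder
  - name: pd-integration-owners
    pagerduty_configs:
      - routing_key: "<integrations-events-v2-key>"  # placeholder
  - name: slack-secondary
    slack_configs:
      - channel: "#alerts-secondary"                 # placeholder; Slack API URL set globally
        send_resolved: true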

CallSphere stack

CallSphere uses Alertmanager → PagerDuty for production, with Slack as the secondary channel. Six routing trees, one per vertical; four examples:

  • Healthcare (FastAPI :8084) — has the strictest SLO, so any FTL p95 breach pages immediately. Routes to the AI engineering rotation, with escalation to me (CEO) at SEV1.
  • Real Estate (6-container NATS pod) — tool-call alerts auto-route by tool name. CRM webhook 5xx → CRM owner; calendar API 429 → calendar owner (sketched just after this list).
  • Sales (WebSocket + PM2) — backpressure alerts route to platform team because they're infra symptoms.
  • After-hours (Bull/Redis queue) — queue-depth alerts route to platform; per-job model errors route to AI eng.
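A sketch of that per-tool fan-out for the Real Estate tree, assuming tool failures carry a tool_name label; the tool names and owner receivers here are illustrative, not the actual CallSphere config.

# alertmanager.yml (sketch) -- nested under the main route
- match: { alert_type: tool, vertical: real-estate }
  receiver: pd-integration-owners        # fallback when no tool-specific owner matches
  group_by: [tool_name]
  routes:
    - match: { tool_name: crm_webhook }  # hypothetical tool_name value
      receiver: pd-crm-owner
    - match: { tool_name: calendar_api } # hypothetical tool_name value
      receiver: pd-calendar-owner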

The AI eng team has a 24/7 rotation only at the $1499 enterprise tier; $499 has business-hours pager; $149 trial gets email-only. We publish status at status.callsphere.ai. Try the routing on the 14-day trial.
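If you want the tier split to live in version-controlled YAML rather than only in PagerDuty schedules, Alertmanager time intervals (available since 0.24) can express it. The tier label below is an assumption for illustration, not a label the CallSphere rules are documented to emit.

# alertmanager.yml (sketch)
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ["monday:friday"]
        times:
          - start_time: "09:00"
            end_time: "18:00"
route:
  routes:
    - match: { alert_type: model, tier: standard }    # hypothetical label: $499 plan, pages business hours only
      receiver: pd-ai-eng
      active_time_intervals: [business-hours]
    - match: { alert_type: model, tier: enterprise }  # hypothetical label: $1499 plan, pages 24/7
      receiver: pd-ai-eng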

Implementation

  1. Tag every alert at source.
# prometheus/rules.yaml
- alert: HealthcareFTLP95High
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(callsphere_ftl_ms_bucket{vertical="healthcare"}[5m]))
    ) > 800
  for: 5m
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare
  2. Route in Alertmanager.
route:
  receiver: default
  routes:
    - match: { alert_type: infra }
      receiver: pd-platform
    - match: { alert_type: model }
      receiver: pd-ai-eng
    - match: { alert_type: tool }
      receiver: pd-integration-owners
      group_by: [tool_name]
  3. One PagerDuty service per team, not one per alert. Reuse escalation policies.

  4. Auto-resolve fast. A model alert that recovers in 2 minutes shouldn't wake anyone: the for: window on the rule keeps short blips from firing at all, send_resolved on the PagerDuty receiver (the default) closes the incident when the alert clears, and a short resolve_timeout (e.g. 3m) in Alertmanager catches sources that never send an explicit resolve.

  5. Run a weekly alert-noise review. Prefer burn-rate alerts to raw threshold alerts; multi-window multi-burn-rate is the standard approach, sketched below.
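A minimal fast-burn sketch of that pattern, assuming a 99% conversation-success SLO and hypothetical callsphere_conversation_errors_total / callsphere_conversations_total counters; the 14.4x factor and the 1h/5m window pair follow the common multi-window multi-burn-rate recipe.

# prometheus/rules.yaml (sketch)
- alert: ConversationSLOFastBurn
  expr: |
    (
      sum(rate(callsphere_conversation_errors_total[1h]))
        / sum(rate(callsphere_conversations_total[1h])) > 14.4 * 0.01
    and
      sum(rate(callsphere_conversation_errors_total[5m]))
        / sum(rate(callsphere_conversations_total[5m])) > 14.4 * 0.01
    )
  labels:
    severity: page
    alert_type: model
    team: ai-eng
# pair this with a slower 6x burn over 6h/30m windows that files a ticket instead of paging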

FAQ

Q: Should I move from Opsgenie to JSM now? A: If you're a JSM customer, yes — Atlassian is migrating you anyway. If not, evaluate PagerDuty, incident.io, Rootly, FireHydrant.

Q: Should an AI SRE agent auto-remediate? A: For known runbook actions (restart pod, fail over Redis), yes. For prompt regressions, never — keep a human in the loop.

Q: How do I avoid pager fatigue from cost-spike alerts? A: Batch them at hourly granularity unless they exceed 5x baseline. Most cost spikes are not page-worthy.
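One way to encode that rule of thumb in Prometheus, assuming a hypothetical callsphere_llm_cost_usd_total counter: compare the last hour against the average hourly spend over the previous week and only open a ticket when it exceeds 5x.

# prometheus/rules.yaml (sketch)
- alert: LLMCostSpike
  expr: |
    sum by (vertical) (increase(callsphere_llm_cost_usd_total[1h]))
      > 5 * (sum by (vertical) (increase(callsphere_llm_cost_usd_total[7d])) / 168)
  for: 1h
  labels:
    severity: ticket     # batched review, not a page
    alert_type: model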

Q: Do I need separate rotations for AI eng vs platform? A: Yes, once you have more than ~3 engineers. Skills don't overlap enough.

Q: How do customer-facing alerts work? A: Status page at status.callsphere.ai. $1499 plan customers get a private status page; affiliates on /affiliate get aggregate views.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.
