By Sagar Shankaran, Founder of CallSphere
Routing the right alert to the right human at 3am is hard. Routing AI-agent alerts is harder — half are model regressions, not infra. Here's a working routing tree for voice and chat agents.
Key takeaways
TL;DR — AI-agent alerts come from three sources: infra, model, and tool. Each routes to a different team. The biggest mistake is sending model-quality alerts to the platform on-call.
flowchart TD
Client[Client] --> Edge[Cloudflare Worker]
Edge -->|WS upgrade| DO[Durable Object]
DO --> AI[(OpenAI Realtime WS)]
AI --> DO
DO --> Client
DO -.hibernation.-> Storage[(Persisted state)]
Atlassian announced in March 2025 that Opsgenie would be absorbed into Jira Service Management and Compass; most teams in 2026 are mid-migration. PagerDuty added an SRE Agent that auto-investigates incidents. Both still struggle with AI-agent alerts because the failure surface is not what their routing trees were built for.
Three categories of alert that need different humans: infra alerts, model alerts, and tool alerts.
A noisy alert lands in the wrong queue, gets ack'd by someone who can't fix it, and rots while customers churn. The classic Opsgenie failure mode — conflicting routing rules, multi-team duplication — gets worse with AI because the same symptom (low conversational success) can have all three causes.
Build a routing tree where every alert has a type tag before it leaves Prometheus or Langfuse:
alert_type=infra → platform team
alert_type=model → AI engineering
alert_type=tool → integration owner (auto-routed by tool name)
Use Alertmanager for the first hop (it's cheap and reliable) and PagerDuty / Opsgenie for the on-call rotation, severity escalation, and status-page integration. Avoid the temptation to push everything through PagerDuty's UI; version-controlled YAML beats clickops every time.
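The type-tag routing above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how Alertmanager evaluates routes internally; the receiver names are borrowed from the Alertmanager config in this post, and the `receiver:tool_name` suffix for tool alerts is an assumption made for the example.

```python
# Receiver names mirror the Alertmanager config in this post.
RECEIVERS = {
    "infra": "pd-platform",
    "model": "pd-ai-eng",
    "tool": "pd-integration-owners",
}
DEFAULT_RECEIVER = "default"


def route_alert(labels: dict[str, str]) -> str:
    """Pick a receiver from the alert's alert_type label."""
    alert_type = labels.get("alert_type")
    receiver = RECEIVERS.get(alert_type, DEFAULT_RECEIVER)
    # Hypothetical convention: tool alerts are split per tool so each
    # integration owner only sees alerts for their own tool.
    if alert_type == "tool" and "tool_name" in labels:
        return f"{receiver}:{labels['tool_name']}"
    return receiver


print(route_alert({"alert_type": "model", "vertical": "healthcare"}))
print(route_alert({"alert_type": "tool", "tool_name": "calendar"}))
print(route_alert({"severity": "page"}))  # no type tag falls through to default
```

An alert with no `alert_type` label falls through to the default receiver, which is exactly the queue where alerts tend to rot, so the tag is worth enforcing at the source.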
CallSphere uses Alertmanager → PagerDuty for production, with Slack as the secondary channel. Six routing trees, one per vertical:
:8084) has the strictest SLO, so any FTL p95 breach pages immediately. Routes to the AI engineering rotation, with escalation to me (CEO) at SEV1.
The AI eng team has a 24/7 rotation only at the $1499 enterprise tier; $499 has business-hours pager; $149 trial gets email-only. We publish status at status.callsphere.ai. Try the routing on the 14-day trial.
# prometheus/rules.yaml
- alert: HealthcareFTLP95High
  # histogram_quantile needs per-bucket rates aggregated by le, not raw cumulative counters
  expr: histogram_quantile(0.95, sum by (le) (rate(callsphere_ftl_ms_bucket{vertical="healthcare"}[5m]))) > 800
  for: 5m
  labels:
    severity: page
    alert_type: model
    team: ai-eng
    vertical: healthcare

# Alertmanager routing tree: one child route per alert_type
route:
  receiver: default
  routes:
    - match: { alert_type: infra }
      receiver: pd-platform
    - match: { alert_type: model }
      receiver: pd-ai-eng
    - match: { alert_type: tool }
      receiver: pd-integration-owners
      group_by: [tool_name]
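One way to make sure every alert carries a type tag before it leaves Prometheus is a small lint step in CI. The sketch below assumes the rules have already been parsed into plain dictionaries (for example with a YAML loader); the function name and the second rule are hypothetical and exist only to show a failure.

```python
ALLOWED_TYPES = {"infra", "model", "tool"}


def untagged_alerts(rules: list[dict]) -> list[str]:
    """Return names of alerting rules whose alert_type label is missing or unknown."""
    bad = []
    for rule in rules:
        if "alert" not in rule:  # recording rules have no alert name; skip them
            continue
        alert_type = rule.get("labels", {}).get("alert_type")
        if alert_type not in ALLOWED_TYPES:
            bad.append(rule["alert"])
    return bad


rules = [
    {"alert": "HealthcareFTLP95High",
     "labels": {"severity": "page", "alert_type": "model"}},
    # Hypothetical rule with no type tag, included to show what the lint catches.
    {"alert": "HypotheticalUntaggedAlert", "labels": {"severity": "page"}},
]
print(untagged_alerts(rules))  # ['HypotheticalUntaggedAlert']
```

Failing the build when this list is non-empty keeps untyped alerts from ever reaching the default receiver.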
One PagerDuty service per team, not one per alert. Reuse escalation policies.
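Auto-resolve fast. A model alert that recovers in 2 minutes shouldn't wake anyone. The for: 5m clause on the rule already holds the page until the condition has persisted. Setting resolve_timeout: 3m in Alertmanager only affects alerts that arrive without an end time, which Prometheus-sourced alerts always carry, so it matters mainly for alerts pushed from other sources.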
Run a weekly alert-noise review. Burn-rate alerts, not threshold alerts; multi-window multi-burn-rate is the standard.
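For readers new to burn-rate alerting, here is a minimal sketch of the multi-window, multi-burn-rate check. The window pairs and thresholds (14.4 over 1h and 5m, 6 over 6h and 30m) are the commonly cited values from the Google SRE Workbook for a 30-day SLO window; the function names and the input shape are assumptions for illustration, and in practice this logic usually lives in PromQL rather than application code.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than budget the error budget is being spent."""
    return error_ratio / (1.0 - slo)


def should_page(error_ratios: dict[str, float], slo: float = 0.999) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    # (long window, short window, burn-rate threshold)
    checks = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]
    for long_w, short_w, threshold in checks:
        if (burn_rate(error_ratios[long_w], slo) > threshold
                and burn_rate(error_ratios[short_w], slo) > threshold):
            return True
    return False


# Sustained 2% errors against a 99.9% SLO: burn rate is about 20 in both windows.
print(should_page({"1h": 0.02, "5m": 0.02, "6h": 0.004, "30m": 0.02}))  # True
# The short window has already recovered, so nobody is paged.
print(should_page({"1h": 0.02, "5m": 0.0005, "6h": 0.004, "30m": 0.0005}))  # False
```

The short window is what lets the alert stop firing quickly once the problem is over, which is the property a plain threshold alert lacks.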
Q: Should I move from Opsgenie to JSM now? A: If you're a JSM customer, yes — Atlassian is migrating you anyway. If not, evaluate PagerDuty, incident.io, Rootly, FireHydrant.
Q: Should an AI SRE agent auto-remediate? A: For known runbook actions (restart pod, fail over Redis), yes. For prompt regressions, never — keep a human in the loop.
Q: How do I avoid pager fatigue from cost-spike alerts? A: Batch them at hourly granularity unless they exceed 5x baseline. Most cost spikes are not page-worthy.
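The cost-spike rule in that answer reduces to a single comparison. A minimal sketch, assuming you already track an hourly cost figure and a baseline; the function name and return values are hypothetical:

```python
def cost_alert_action(hourly_cost: float, baseline: float,
                      page_multiple: float = 5.0) -> str:
    """Page only when cost exceeds page_multiple x baseline; otherwise batch hourly."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if hourly_cost > page_multiple * baseline:
        return "page"
    return "batch_hourly"


print(cost_alert_action(hourly_cost=30.0, baseline=10.0))  # batch_hourly
print(cost_alert_action(hourly_cost=60.0, baseline=10.0))  # page
```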
Q: Do I need separate rotations for AI eng vs platform? A: Yes, once you have more than ~3 engineers. Skills don't overlap enough.
Q: How do customer-facing alerts work? A: Status page at status.callsphere.ai. $1499 plan customers get a private status page; affiliates on /affiliate get aggregate views.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.