A Postmortem Template for AI Agent Incidents
Standard SRE postmortems miss the half of an AI incident that matters most: why the agent decided what it did. Here's the template CallSphere has run for 11 production incidents in 12 months.
TL;DR — A good AI postmortem has eight sections. The one most teams skip is "Why didn't we detect this sooner?" Median time-to-detect for agent incidents is 14 days, not 14 minutes.
What goes wrong
```mermaid
flowchart LR
    Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
    Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
    OAI --> Bridge
    Bridge --> Twilio
    Bridge --> Logs[(structured logs · OTel)]
```

Teams run agent incidents through their old Google-style postmortem template. They find the bug in the code or in a prompt, write up "we shipped a bad change, we'll add a regression test," and move on. They miss two things:
- The detection chain — agent failures look identical to authorized activity in audit logs. Without a tripwire designed for agent-distinct patterns, the incident shows up only when a customer notices.
- Model behavior root cause — the model is a non-deterministic dependency. "Why did the model choose this tool?" is part of the incident, not a footnote.
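One tripwire that does catch agent-distinct patterns is distribution drift in tool usage — every individual call looks authorized, but the mix changes. A minimal sketch (tool names, counts, and threshold are all hypothetical):

```python
from collections import Counter

def tool_mix_drift(baseline: Counter, window: Counter) -> float:
    """L1 distance between two tool-usage distributions (0 = identical, 2 = disjoint)."""
    tools = set(baseline) | set(window)
    b_total = sum(baseline.values()) or 1
    w_total = sum(window.values()) or 1
    return sum(abs(baseline[t] / b_total - window[t] / w_total) for t in tools)

# Trailing week of normal traffic vs. the last hour (hypothetical counts)
baseline = Counter({"lookup_plan": 900, "book_appointment": 80, "escalate": 20})
window = Counter({"lookup_plan": 40, "book_appointment": 5, "refund": 55})

DRIFT_THRESHOLD = 0.5  # tune against a week of known-good traffic
if tool_mix_drift(baseline, window) > DRIFT_THRESHOLD:
    print("ALERT: agent tool-usage mix drifted; page on-call")
```

The point isn't this particular metric — it's that the alert fires on the agent's aggregate behavior, not on any single (apparently authorized) action.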
In 2025, one widely shared postmortem covered an agent that burned $4,200 over 63 hours before anyone noticed; the detection was a credit-card alert. That's the classic AI-agent failure mode: the first signal arrives from outside your observability stack.
How to monitor
Adopt a postmortem template with these eight sections:
- Summary & impact — what happened, who was affected, dollar/customer impact.
- Timeline — UTC timestamps from first symptom to resolution.
- Detection chain — how did we find out, and what would have to change for the next instance to be caught in 4 hours, not 14 days.
- Root cause — both code/config AND model behavior cause if applicable.
- What went well.
- What went wrong.
- Action items — owner, due date, blast-radius lever (test, runbook, alert, code, prompt, eval).
- Blameless lessons.
Publish the postmortem to a public repo or wiki. Read it at the next all-hands. Track action item completion in Linear/Jira.
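A CI lint can enforce the template before a postmortem merges. A sketch, assuming postmortems are Markdown files whose headings match the eight section names above (the helper name is ours):

```python
import re

REQUIRED_SECTIONS = [
    "Summary & impact", "Timeline", "Detection chain", "Root cause",
    "What went well", "What went wrong", "Action items", "Blameless lessons",
]

def missing_sections(markdown: str) -> list[str]:
    """Return the required sections absent from a postmortem's headings."""
    headings = {m.strip() for m in re.findall(r"^#{1,3}\s+(.+)$", markdown, re.MULTILINE)}
    return [s for s in REQUIRED_SECTIONS if s not in headings]

draft = "## Summary & impact\n## Timeline\n## Root cause\n"
print(missing_sections(draft))  # flags "Detection chain" and four other missing sections
```

Wire it into the PR check so a postmortem without a detection chain section can't merge.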
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
CallSphere stack
We've run 11 postmortems in 12 months across our six verticals. The template lives in /docs/postmortems/ in the monorepo as Markdown — every postmortem is a PR, and a senior engineer reviews it within 48 hours. Action items become Linear tickets with the postmortem URL in the description.
- Healthcare FastAPI :8084 — biggest incident was a prompt regression that increased hallucination of insurance plan names. Detection was a customer email; we now run an LLM-as-judge eval daily on a fixed test set.
- Real Estate 6-container NATS pod — message-loss incident when NATS upgraded; we added queue-depth alerts and a chaos drill.
- Sales WebSocket / PM2 — restart storm when memory leaked; capped worker memory and added rolling restarts.
- After-hours Bull/Redis queue — Redis OOM during a backlog; added queue size budgets.
Every postmortem ends with a published detection_chain_minutes field. Median across 11 incidents went from 47 hours (first 5) to 38 minutes (last 6) once we made detection a first-class outcome.
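Because detection_chain_minutes is a published field, the trend is a two-liner to compute. A sketch with illustrative per-incident values chosen to be consistent with those medians:

```python
from statistics import median

# detection_chain_minutes pulled from each published postmortem (illustrative values)
first_five = [3900, 2100, 5200, 2820, 1100]  # early incidents, in minutes
last_six = [22, 41, 35, 169, 12, 55]

print(f"first 5 median: {median(first_five) / 60:.0f} h")  # 47 h
print(f"last 6 median: {median(last_six):.0f} min")        # 38 min
```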
Customers on the $1,499 enterprise tier (see /pricing) get a copy of any postmortem that affects them within 72 hours. Try it on the 14-day trial.
Implementation
Template lives in Git. New incident → `cp template.md YYYY-MM-DD-short-title.md`. PR. Review. The detection chain section is mandatory.
```markdown
## Detection chain

- 2026-04-12 14:02 UTC — first impacted call
- 2026-04-12 16:51 UTC — automated alert fires (FTL p95 > 1500ms for 30m)
- 2026-04-12 16:53 UTC — on-call ack
- detection_chain_minutes: 169

### What would have detected this in <30 minutes?

- A 5-minute window FTL p95 alert (we had only 30m)
- An LLM-as-judge eval on a fresh sample (we ran daily, should be hourly)
```
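detection_chain_minutes is just the span from first impact to the alert, computed from the UTC timestamps:

```python
from datetime import datetime, timezone

first_impact = datetime(2026, 4, 12, 14, 2, tzinfo=timezone.utc)
alert_fired = datetime(2026, 4, 12, 16, 51, tzinfo=timezone.utc)

detection_chain_minutes = int((alert_fired - first_impact).total_seconds() // 60)
print(detection_chain_minutes)  # 169
```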
Action items have owners and dates. No "we should consider…"
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Root cause is two-pronged.
```markdown
## Root cause

### Code

- Prompt change in PR #4421 added a new tool description that overlapped with two existing tools.

### Model

- gpt-4o-realtime preferred the new tool 38% of the time even when the old tool was correct, because the new description matched the user phrase more literally.
```
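A figure like that 38% comes from replaying a fixed eval set against the model and counting misroutes. A toy sketch (tool names and cases are hypothetical; real cases come from production traces):

```python
# Each eval case records which tool the model picked vs. which was correct.
cases = [
    {"picked": "verify_benefits_v2", "correct": "verify_benefits"},     # misrouted
    {"picked": "verify_benefits", "correct": "verify_benefits"},
    {"picked": "verify_benefits_v2", "correct": "verify_benefits_v2"},
    {"picked": "verify_benefits_v2", "correct": "verify_benefits"},     # misrouted
]

misrouted = sum(1 for c in cases if c["picked"] != c["correct"])
print(f"misroute rate: {misrouted / len(cases):.0%}")  # 50% on this toy set
```

Run the same set before and after the prompt change; the delta is your model-behavior root cause, quantified.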
Track meta-metrics across postmortems: median detection time, repeated incident classes, and action-item completion rate. Aim for >90% of action items closed within 30 days.
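The completion-rate meta-metric falls out of a tracker export. A sketch with hypothetical dates:

```python
from datetime import date, timedelta

pm_date = date(2026, 4, 12)  # postmortem publish date (hypothetical)
# Closed date per action item; None = still open (hypothetical export)
closed = [date(2026, 4, 20), date(2026, 5, 30), date(2026, 4, 25), None]

on_time = sum(1 for c in closed if c and c - pm_date <= timedelta(days=30))
print(f"{on_time / len(closed):.0%} closed within 30 days")  # 50% — below the >90% target
```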
FAQ
Q: Should the model vendor be in the postmortem? A: Yes if their behavior was a contributing factor. We've named OpenAI in two postmortems.
Q: How do I keep it blameless? A: Focus on systems, not people. "Our deploy process didn't catch this" not "Alice missed it."
Q: Are postmortems public? A: Internally always; externally for SEV1 with customer impact. We publish redacted versions.
Q: How long is too long for a postmortem? A: Aim for 1500 words. Longer ones don't get read. Link to the trace and the eval results.
Q: Should I use an AI to draft the postmortem? A: A summarizer that pulls from incident channel, traces, and PRs is fine — a human writes the lessons.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.