AI Engineering

A Postmortem Template for AI Agent Incidents

Standard SRE postmortems miss the half of an AI incident that matters most: why the agent decided to do what it did. Here's the template CallSphere has run for 11 production incidents in 12 months.

TL;DR — A good AI postmortem has eight sections. The one most teams skip is "Why didn't we detect this sooner?" Median time-to-detect for agent incidents is 14 days, not 14 minutes.

What goes wrong

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

CallSphere reference architecture

Teams run agent incidents through their old Google-style postmortem template. They find the bug in code or a prompt, write up "we shipped a bad change, we'll add a regression test," and move on. They miss two things:

  1. The detection chain — agent failures look identical to authorized activity in audit logs. Without a tripwire designed for agent-distinct patterns, the incident shows up only when a customer notices.
  2. Model behavior root cause — the model is a non-deterministic dependency. "Why did the model choose this tool?" is part of the incident, not a footnote.

In 2025, one widely shared postmortem covered an agent that burned $4,200 over 63 hours before anyone noticed. The detection mechanism was a credit-card alert. That's the classic AI-agent failure mode: the billing system noticed before the monitoring did.

How to monitor

Adopt a postmortem template with these eight sections:

  1. Summary & impact — what happened, who was affected, dollar/customer impact.
  2. Timeline — UTC timestamps from first symptom to resolution.
  3. Detection chain — how did we find out, and what would have to change for the next instance to be caught in 4 hours, not 14 days.
  4. Root cause — both code/config AND model behavior cause if applicable.
  5. What went well.
  6. What went wrong.
  7. Action items — owner, due date, blast-radius lever (test, runbook, alert, code, prompt, eval).
  8. Blameless lessons.

Publish the postmortem to a public repo or wiki. Read it at the next all-hands. Track action item completion in Linear/Jira.


CallSphere stack

We've run 11 postmortems in 12 months across our six verticals. The template lives in /docs/postmortems/ in the monorepo as Markdown — every postmortem is a PR. Senior engineers review each one within 48 hours. Action items become Linear tickets with the postmortem URL in the description.

  • Healthcare FastAPI :8084 — biggest incident was a prompt regression that increased hallucination of insurance plan names. Detection was a customer email; we now run an LLM-as-judge eval daily on a fixed test set.
  • Real Estate 6-container NATS pod — message-loss incident when NATS upgraded; we added queue-depth alerts and a chaos drill.
  • Sales WebSocket / PM2 — restart storm when memory leaked; capped worker memory and added rolling restarts.
  • After-hours Bull/Redis queue — Redis OOM during a backlog; added queue size budgets.

Every postmortem ends with a published detection_chain_minutes field. Median across 11 incidents went from 47 hours (first 5) to 38 minutes (last 6) once we made detection a first-class outcome.
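To make `detection_chain_minutes` a tracked number rather than a prose claim, a small script can scrape it out of the postmortem files. A sketch, assuming the field appears as a plain `detection_chain_minutes: N` line somewhere in each Markdown file (the repo layout is from the description above; the function names are ours):

```python
import re
import statistics
from pathlib import Path

# Matches a line like "- detection_chain_minutes: 169" anywhere in the file.
FIELD = re.compile(r"detection_chain_minutes:\s*(\d+)")

def detection_minutes(root: Path) -> list[int]:
    """Collect the detection_chain_minutes value from every postmortem under root."""
    values = []
    for pm in sorted(root.glob("*.md")):
        m = FIELD.search(pm.read_text())
        if m:
            values.append(int(m.group(1)))
    return values

def median_detection(root: Path) -> float:
    """Median detection time across all postmortems, in minutes."""
    return statistics.median(detection_minutes(root))
```

Run it in CI on every merged postmortem PR and the 47-hours-to-38-minutes trend becomes a chart instead of an anecdote.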

Enterprise-tier customers ($1,499 on /pricing) get a copy of any postmortem that affects them within 72 hours. Try it on the 14-day trial.

Implementation

  1. Template lives in Git. New incident → cp template.md YYYY-MM-DD-short-title.md. PR. Review.
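The `cp` step can be wrapped so filenames stay consistent. A hypothetical helper (not CallSphere's actual tooling), assuming the /docs/postmortems/ layout described above:

```python
from datetime import date
from pathlib import Path
import shutil

def scaffold(short_title: str, root: Path = Path("docs/postmortems")) -> Path:
    """Copy template.md to YYYY-MM-DD-short-title.md and return the new path."""
    slug = short_title.lower().replace(" ", "-")
    dest = root / f"{date.today():%Y-%m-%d}-{slug}.md"
    if dest.exists():
        raise FileExistsError(dest)  # one postmortem per incident, no overwrites
    shutil.copy(root / "template.md", dest)
    return dest
```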

  2. Detection chain section is mandatory.

```markdown
## Detection chain

- 2026-04-12 14:02 UTC — first impacted call
- 2026-04-12 16:51 UTC — automated alert fires (FTL p95 > 1500ms for 30m)
- 2026-04-12 16:53 UTC — on-call ack
- detection_chain_minutes: 169

### What would have detected this in <30 minutes?
- A 5-minute window FTL p95 alert (we had only 30m)
- An LLM-as-judge eval on a fresh sample (we ran daily, should be hourly)
```
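The `detection_chain_minutes` field is just the gap between first impact and the alert. A quick sketch of the arithmetic from the example timeline:

```python
from datetime import datetime, timezone

first_impact = datetime(2026, 4, 12, 14, 2, tzinfo=timezone.utc)   # first impacted call
alert_fired = datetime(2026, 4, 12, 16, 51, tzinfo=timezone.utc)   # automated alert fires

detection_chain_minutes = int((alert_fired - first_impact).total_seconds() // 60)
print(detection_chain_minutes)  # 169
```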
  3. Action items have owners and due dates. No "we should consider…"


  4. Root cause is two-pronged.

```markdown
## Root cause

### Code
- Prompt change in PR #4421 added a new tool description that overlapped with two existing tools.

### Model
- gpt-4o-realtime preferred the new tool 38% of the time even when the old tool was correct, because the new description matched the user phrase more literally.
```
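A regression eval for this class of failure can be a counting loop over logged tool choices. A sketch, under the assumption that you log (user phrase, expected tool, chosen tool) for each eval case; the function and tool names are illustrative, not CallSphere's:

```python
from collections import Counter

def wrong_tool_rate(cases: list[tuple[str, str, str]]) -> float:
    """cases: (user_phrase, expected_tool, chosen_tool) triples.
    Returns the fraction of cases where the model picked the wrong tool."""
    if not cases:
        return 0.0
    wrong = sum(1 for _, expected, chosen in cases if chosen != expected)
    return wrong / len(cases)

def stolen_traffic(cases: list[tuple[str, str, str]]) -> Counter:
    """Which wrong tool is cannibalizing which expected tool's traffic."""
    stolen = Counter()
    for _, expected, chosen in cases:
        if chosen != expected:
            stolen[(expected, chosen)] += 1
    return stolen
```

Gate prompt PRs on `wrong_tool_rate` staying under a threshold on a fixed test set, and a 38% preference shift becomes a failed CI check instead of a postmortem.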
  5. Track meta-metrics. Median detection time. Repeated incident classes. Action item completion rate. Aim for >90% closed within 30 days.
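The completion-rate meta-metric is a one-liner once action items carry created/closed dates. A sketch with an illustrative ticket shape (dicts with `created` and `closed` date fields), not Linear's actual API:

```python
from datetime import timedelta

def closed_within(tickets: list[dict], days: int = 30) -> float:
    """Fraction of action items closed within `days` of creation.
    Open tickets (closed is None) count against the rate."""
    if not tickets:
        return 0.0
    on_time = sum(
        1 for t in tickets
        if t.get("closed") and t["closed"] - t["created"] <= timedelta(days=days)
    )
    return on_time / len(tickets)
```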

FAQ

Q: Should the model vendor be in the postmortem? A: Yes if their behavior was a contributing factor. We've named OpenAI in two postmortems.

Q: How do I keep it blameless? A: Focus on systems, not people. "Our deploy process didn't catch this" not "Alice missed it."

Q: Are postmortems public? A: Internally always; externally for SEV1 with customer impact. We publish redacted versions.

Q: How long is too long for a postmortem? A: Aim for 1500 words. Longer ones don't get read. Link to the trace and the eval results.

Q: Should I use an AI to draft the postmortem? A: A summarizer that pulls from incident channel, traces, and PRs is fine — a human writes the lessons.

