---
title: "A Postmortem Template for AI Agent Incidents"
description: "Standard SRE postmortems miss the half of an AI incident that matters: why the agent decided what it did. Here's the template CallSphere has run across 11 production incidents in 12 months."
canonical: https://callsphere.ai/blog/vw3c-postmortem-template-ai-agent-incidents
category: "AI Engineering"
tags: ["Postmortem", "Incident Response", "AI Agents", "Reliability"]
author: "CallSphere Team"
published: 2026-04-05T00:00:00.000Z
updated: 2026-05-07T09:59:38.168Z
---

# A Postmortem Template for AI Agent Incidents

> Standard SRE postmortems miss the half of an AI incident that matters: why the agent decided what it did. Here's the template CallSphere has run across 11 production incidents in 12 months.

> **TL;DR** — A good AI postmortem has eight sections. The one most teams skip is "Why didn't we detect this sooner?" Median time-to-detect for agent incidents is 14 days, not 14 minutes.

## What goes wrong

```mermaid
flowchart LR
  Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
  Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
  OAI --> Bridge
  Bridge --> Twilio
  Bridge --> Logs[(structured logs · OTel)]
```

*CallSphere reference architecture*

Teams run agent incidents through their old Google-style postmortem template. They find the bug in code or a prompt, write up "we shipped a bad change, we'll add a regression test," and move on. They miss two things:

1. **The detection chain** — agent failures look identical to authorized activity in audit logs. Without a tripwire designed for *agent-distinct* patterns, the incident shows up only when a customer notices.
2. **Model behavior root cause** — the model is a non-deterministic dependency. "Why did the model choose this tool?" is part of the incident, not a footnote.
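
A tripwire for agent-distinct patterns doesn't have to be sophisticated: a per-session volume check against the historical median already catches runaway tool use that looks authorized call-by-call. A minimal sketch (the function name and 3x threshold are illustrative, not CallSphere's actual tripwire):

```python
from statistics import median

def agent_tripwire(tool_calls_per_session: list[int],
                   current_session_calls: int,
                   factor: float = 3.0) -> bool:
    """Flag a session whose tool-call volume is far above the
    historical median -- a pattern audit logs alone won't surface,
    because every individual call looks authorized."""
    if not tool_calls_per_session:
        return False  # no baseline yet: stay quiet rather than page
    baseline = median(tool_calls_per_session)
    return current_session_calls > factor * max(baseline, 1)
```

With a baseline of `[4, 6, 5, 7]` calls per session, a session making 40 tool calls trips the wire while a session making 8 does not.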

In 2025, one widely-shared postmortem covered an agent that burned $4,200 in 63 hours before anyone noticed. The detection was a credit-card alert. That's the classic AI-agent failure mode.

## How to monitor

Adopt a postmortem template with these eight sections:

1. **Summary & impact** — what happened, who was affected, dollar/customer impact.
2. **Timeline** — UTC timestamps from first symptom to resolution.
3. **Detection chain** — *how* did we find out, and what would have to change for the next instance to be caught in 4 hours, not 14 days.
4. **Root cause** — both code/config AND model behavior cause if applicable.
5. **What went well**.
6. **What went wrong**.
7. **Action items** — owner, due date, blast-radius lever (test, runbook, alert, code, prompt, eval).
8. **Blameless lessons**.
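
Those eight headings are easy to scaffold mechanically so no incident starts from a blank page. A minimal sketch (the function names are ours, not part of any CallSphere tooling):

```python
from datetime import date

# The eight sections of the template, in order.
SECTIONS = [
    "Summary & impact",
    "Timeline",
    "Detection chain",
    "Root cause",
    "What went well",
    "What went wrong",
    "Action items",
    "Blameless lessons",
]

def scaffold_postmortem(short_title: str) -> str:
    """Render an empty postmortem body containing all eight sections."""
    lines = [f"# Postmortem: {short_title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)

def postmortem_filename(short_title: str, today: date) -> str:
    """Match the YYYY-MM-DD-short-title.md naming convention."""
    return f"{today.isoformat()}-{short_title}.md"
```

Committing the rendered file as a PR gives the review step a diff to anchor on.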

Publish the postmortem to a public repo or wiki. Read it at the next all-hands. Track action item completion in Linear/Jira.

## CallSphere stack

We've run 11 postmortems in 12 months across our six verticals. The template lives in `/docs/postmortems/` in the monorepo as Markdown — every PM is a PR. Senior engineers review every PM within 48 hours. Action items become Linear tickets with the postmortem URL in the description.

- **Healthcare FastAPI `:8084`** — biggest incident was a prompt regression that increased hallucination of insurance plan names. Detection was a customer email; we now run an LLM-as-judge eval daily on a fixed test set.
- **Real Estate 6-container NATS pod** — message-loss incident when NATS upgraded; we added queue-depth alerts and a chaos drill.
- **Sales WebSocket / PM2** — restart storm when memory leaked; capped worker memory and added rolling restarts.
- **After-hours Bull/Redis queue** — Redis OOM during a backlog; added queue size budgets.
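
The daily eval behind the healthcare fix can be sketched as a fixed test set plus a judge. Here the judge is stubbed with exact-match scoring so the harness runs without an API key; in practice you would swap in an LLM-as-judge call (all names are illustrative):

```python
from typing import Callable

def exact_match_judge(expected: str, actual: str) -> bool:
    # Stand-in for an LLM-as-judge call; swap in a real judge prompt here.
    return expected.strip().lower() == actual.strip().lower()

def run_daily_eval(test_set: list[tuple[str, str]],
                   agent: Callable[[str], str],
                   judge: Callable[[str, str], bool] = exact_match_judge,
                   alert_threshold: float = 0.95) -> dict:
    """Score the agent on a fixed test set and flag regressions."""
    passed = sum(judge(expected, agent(prompt))
                 for prompt, expected in test_set)
    rate = passed / len(test_set)
    return {"pass_rate": rate, "alert": rate < alert_threshold}
```

The alert threshold turns a silent quality regression into a page instead of a customer email.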

Every postmortem ends with a published `detection_chain_minutes` field. Median across 11 incidents went from 47 hours (first 5) to 38 minutes (last 6) once we made detection a first-class outcome.
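
Once every postmortem carries the field, tracking that median is trivial. A sketch that pulls `detection_chain_minutes` out of the Markdown files (the regex and directory layout are assumptions, not CallSphere's actual tooling):

```python
import re
from pathlib import Path
from statistics import median

# Matches the published field, e.g. "detection_chain_minutes: 169"
FIELD = re.compile(r"detection_chain_minutes:\s*(\d+)")

def median_detection_minutes(postmortem_dir: str) -> float:
    """Collect detection_chain_minutes from every postmortem
    Markdown file in a directory and return the median."""
    values = []
    for path in Path(postmortem_dir).glob("*.md"):
        m = FIELD.search(path.read_text())
        if m:
            values.append(int(m.group(1)))
    return median(values)
```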

The $1,499 enterprise tier on [/pricing](/pricing) gets customers a copy of any postmortem that affects them within 72 hours. Try it on the [14-day trial](/trial).

## Implementation

1. **Template lives in Git.** New incident → `cp template.md YYYY-MM-DD-short-title.md`. PR. Review.
2. **Detection chain section is mandatory.**

```markdown
## Detection chain

- 2026-04-12 14:02 UTC — first impacted call
- 2026-04-12 16:51 UTC — automated alert fires (FTL p95 > 1500ms for 30m)
- 2026-04-12 16:53 UTC — on-call ack
- detection_chain_minutes: 169

### What would have detected this sooner?
```

3. **Track action items.** Target: 90% of action items closed within 30 days.

## FAQ

**Q: Should the model vendor be in the postmortem?**
A: Yes if their behavior was a contributing factor. We've named OpenAI in two postmortems.

**Q: How do I keep it blameless?**
A: Focus on systems, not people. "Our deploy process didn't catch this" not "Alice missed it."

**Q: Are postmortems public?**
A: Internally always; externally for SEV1 with customer impact. We publish redacted versions.

**Q: How long is too long for a postmortem?**
A: Aim for 1500 words. Longer ones don't get read. Link to the trace and the eval results.

**Q: Should I use an AI to draft the postmortem?**
A: A summarizer that pulls from incident channel, traces, and PRs is fine — a human writes the lessons.

## Sources

- [AgentMode — Agent Incident Runbook](https://agentmodeai.com/resources/agent-incident-runbook/)
- [Medium — The Agent Incident Postmortem Template I reuse weekly](https://medium.com/@1nick1patel1/the-agent-incident-postmortem-template-i-reuse-every-week-a41c554e8fc8)
- [Medium — $4,200 in 63 Hours: a Production AI Postmortem](https://medium.com/@sattyamjain96/the-agent-that-burned-4-200-in-63-hours-a-production-ai-postmortem-d38fd9586a85)
- [incident.io — Best incident postmortem software 2026](https://incident.io/blog/best-incident-postmortem-software-2026-guide)

---

Source: https://callsphere.ai/blog/vw3c-postmortem-template-ai-agent-incidents
