By Sagar Shankaran, Founder of CallSphere
Standard SRE postmortems miss the half of an AI incident that matters: why did the agent decide that? Here's the template CallSphere has run for 11 production incidents in 12 months.
Key takeaways
TL;DR — A good AI postmortem has eight sections. The one most teams skip is "Why didn't we detect this sooner?" Median time-to-detect for agent incidents is 14 days, not 14 minutes.
flowchart LR
Twilio["Twilio Media Streams"] -- "WS · μlaw 8kHz" --> Bridge["FastAPI Bridge :8084"]
Bridge -- "PCM16 24kHz" --> OAI["OpenAI Realtime"]
OAI --> Bridge
Bridge --> Twilio
Bridge --> Logs[(structured logs · OTel)]

Teams run agent incidents through their old Google-style postmortem template. They find the bug in code or a prompt, write up "we shipped a bad change, we'll add a regression test," and move on. They miss two things: why the agent made the decision it did, and why nobody detected the problem sooner.
In 2025, one widely shared postmortem covered an agent that burned $4,200 in 63 hours before anyone noticed. The detection was a credit-card alert. That's a classic AI-agent failure mode.
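A spend-rate check is one way to catch this kind of runaway before the card statement does. The sketch below is a minimal illustration, not the setup from that postmortem: `first_breach_hour` and the `hourly_budget_usd` threshold are hypothetical names, the per-hour costs are assumed to come from your own billing export, and the even spread of spend across the 63 hours is an assumption made only to show the arithmetic.

```python
from typing import Optional, Sequence


def first_breach_hour(hourly_costs_usd: Sequence[float],
                      hourly_budget_usd: float) -> Optional[int]:
    """Return the 1-based hour in which spend first exceeds the budget, or None."""
    for hour, cost in enumerate(hourly_costs_usd, start=1):
        if cost > hourly_budget_usd:
            return hour
    return None


# Assumption for illustration: $4,200 spread evenly over 63 hours.
average_burn = 4200 / 63  # roughly $66.67 per hour
costs = [average_burn] * 63
```

With any hourly budget below the average burn, `first_breach_hour(costs, ...)` returns 1, which is the point: a per-hour threshold turns a 63-hour detection into a one-hour detection.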
Adopt a postmortem template with these eight sections:
Publish the postmortem to a public repo or wiki. Read it at the next all-hands. Track action item completion in Linear/Jira.
We've run 11 postmortems in 12 months across our six verticals. The template lives in /docs/postmortems/ in the monorepo as Markdown — every PM is a PR. Senior engineers review every PM within 48 hours. Action items become Linear tickets with the postmortem URL in the description.
On :8084, the biggest incident was a prompt regression that increased hallucination of insurance plan names. Detection was a customer email; we now run an LLM-as-judge eval daily on a fixed test set.

Every postmortem ends with a published detection_chain_minutes field. Median across 11 incidents went from 47 hours (first 5) to 38 minutes (last 6) once we made detection a first-class outcome.
Customers on the $1499 enterprise tier (see /pricing) get a copy of any postmortem that affects them within 72 hours. Try it on the 14-day trial.
Template lives in Git. New incident → cp template.md YYYY-MM-DD-short-title.md. PR. Review.
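The copy step can be scripted so every file follows the same naming pattern. This is a rough sketch under stated assumptions: `postmortem_filename` and `new_postmortem` are hypothetical helpers, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is one reasonable reading of "short-title", not a documented convention.

```python
import re
import shutil
from datetime import date
from pathlib import Path


def postmortem_filename(incident_date: date, title: str) -> str:
    """Build a YYYY-MM-DD-short-title.md filename from a date and a free-text title."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{incident_date.isoformat()}-{slug}.md"


def new_postmortem(template: Path, incident_date: date, title: str) -> Path:
    """Copy the template next to itself under the dated name; refuse to overwrite."""
    target = template.parent / postmortem_filename(incident_date, title)
    if target.exists():
        raise FileExistsError(target)
    shutil.copyfile(template, target)
    return target
```

Refusing to overwrite an existing file keeps a second incident on the same day from silently clobbering the first write-up.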
Detection chain section is mandatory.
## Detection chain
- 2026-04-12 14:02 UTC — first impacted call
- 2026-04-12 16:51 UTC — automated alert fires (FTL p95 > 1500ms for 30m)
- 2026-04-12 16:53 UTC — on-call ack
- detection_chain_minutes: 169
### What would have detected this in <30 minutes?
- A 5-minute window FTL p95 alert (we had only 30m)
- An LLM-as-judge eval on a fresh sample (we ran daily, should be hourly)
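The detection_chain_minutes value can be derived from the timeline instead of typed by hand. A minimal sketch, assuming timestamps in the `YYYY-MM-DD HH:MM` UTC form used above and assuming the clock stops at the first detection event (the automated alert) and not at the on-call ack; the function name is illustrative.

```python
from datetime import datetime, timezone


def detection_chain_minutes(first_impact: str, detected: str) -> int:
    """Whole minutes between the first impacted call and the first detection event."""
    fmt = "%Y-%m-%d %H:%M"
    # Timestamps in the template are UTC, so attach the timezone explicitly.
    start = datetime.strptime(first_impact, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(detected, fmt).replace(tzinfo=timezone.utc)
    if end < start:
        raise ValueError("detection cannot precede first impact")
    return int((end - start).total_seconds() // 60)


print(detection_chain_minutes("2026-04-12 14:02", "2026-04-12 16:51"))  # 169
```

Measuring to the alert gives the 169 in the example; measuring to the 16:53 ack would give 171, so it helps to fix the definition once and keep it the same across postmortems.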
Action items have owners and dates. No "we should consider…"
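That rule is easy to enforce in review with a small lint. The sketch below is an assumption-heavy illustration: the `ActionItem` shape, the `lint_action_item` helper, and the list of vague phrases are all hypothetical and not part of the template described here.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical list of hedging phrases that signal a non-committal action item.
VAGUE_PHRASES = re.compile(
    r"\b(we should consider|look into|maybe|at some point)\b", re.IGNORECASE
)


@dataclass
class ActionItem:
    text: str
    owner: Optional[str] = None
    due: Optional[date] = None


def lint_action_item(item: ActionItem) -> List[str]:
    """Return a list of problems; an empty list means the item passes."""
    problems = []
    if not item.owner:
        problems.append("missing owner")
    if item.due is None:
        problems.append("missing due date")
    if VAGUE_PHRASES.search(item.text):
        problems.append("vague wording")
    return problems
```

Run against each action item when the postmortem PR is opened, a non-empty result is a reason to send the PR back before review.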
Root cause is two-pronged.
## Root cause
### Code
- Prompt change in PR #4421 added a new tool description that overlapped with two existing tools.
### Model
- gpt-4o-realtime preferred the new tool 38% of the time even when the old tool was correct, because the new description matched the user phrase more literally.
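A figure like the 38% above can be measured from labeled eval cases. The following is a sketch only, assuming each case records the tool a reviewer expected and the tool the model actually chose; `misroute_rate` and the tool names in the usage note are hypothetical, not the ones from the incident.

```python
from typing import Iterable, Tuple


def misroute_rate(calls: Iterable[Tuple[str, str]],
                  old_tool: str, new_tool: str) -> float:
    """Fraction of cases expecting `old_tool` where the model chose `new_tool`.

    Each item in `calls` is an (expected_tool, chosen_tool) pair.
    """
    relevant = [chosen for expected, chosen in calls if expected == old_tool]
    if not relevant:
        raise ValueError("no cases expect the old tool")
    return sum(1 for chosen in relevant if chosen == new_tool) / len(relevant)
```

For example, with four cases that expect a hypothetical `lookup_plan` tool, two of which were routed to a hypothetical `search_plans` tool, the function returns 0.5. Tracking this number before and after a prompt change separates the code half of the root cause from the model half.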
Q: Should the model vendor be in the postmortem? A: Yes, if their behavior was a contributing factor. We've named OpenAI in two postmortems.
Q: How do I keep it blameless? A: Focus on systems, not people. Write "Our deploy process didn't catch this," not "Alice missed it."
Q: Are postmortems public? A: Internally always; externally for SEV1 with customer impact. We publish redacted versions.
Q: How long is too long for a postmortem? A: Aim for 1500 words. Longer ones don't get read. Link to the trace and the eval results.
Q: Should I use an AI to draft the postmortem? A: A summarizer that pulls from incident channel, traces, and PRs is fine — a human writes the lessons.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.