---
title: "Post-Mortem Analysis for AI Agent Failures: Learning from Production Incidents"
description: "Build systematic post-mortem processes for AI agent failures including incident classification, automated root cause analysis, action item tracking, and a knowledge base that prevents recurring issues."
canonical: https://callsphere.ai/blog/post-mortem-analysis-ai-agent-failures-learning-production-incidents
category: "Learn Agentic AI"
tags: ["Post-Mortem", "Incident Analysis", "Root Cause Analysis", "AI Agents", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.184Z
---

# Post-Mortem Analysis for AI Agent Failures: Learning from Production Incidents

> Build systematic post-mortem processes for AI agent failures including incident classification, automated root cause analysis, action item tracking, and a knowledge base that prevents recurring issues.

## Failures Are Data, Not Just Problems

Every AI agent failure carries information about system weaknesses, edge cases, and assumptions that do not hold in production. Teams that treat failures as one-off bugs to squash miss the pattern. Teams that run structured post-mortems build increasingly resilient systems because each incident reduces the probability of the next.

For AI agents specifically, post-mortems are even more valuable because the failure modes are novel — hallucinations, prompt injection, tool misuse, and multi-step reasoning failures do not appear in traditional software engineering playbooks.

## Incident Classification Framework

Not every error deserves a post-mortem. A classification system triages failures by severity and novelty.

```mermaid
flowchart LR
    INC(["Production incident"])
    DETECT["Detect
alerts plus user reports"]
    MIT["Mitigate
rollback or feature flag"]
    RES["Resolve"]
    DOC["Timeline doc
events plus actions"]
    RCA{"5 whys plus
causal graph"}
    AI["Action items
owner plus due date"]
    SHARE(["Blameless review"])
    LEARN[("Runbook plus
eval added")]
    INC --> DETECT --> MIT --> RES --> DOC --> RCA --> AI --> SHARE --> LEARN
    style RCA fill:#4f46e5,stroke:#4338ca,color:#fff
    style LEARN fill:#059669,stroke:#047857,color:#fff
```

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class IncidentSeverity(Enum):
    SEV1 = "sev1"  # Complete service outage or data loss
    SEV2 = "sev2"  # Major feature broken, many users affected
    SEV3 = "sev3"  # Minor feature broken, workaround exists
    SEV4 = "sev4"  # Cosmetic or low-impact issue

class IncidentCategory(Enum):
    LLM_HALLUCINATION = "llm_hallucination"
    LLM_REFUSAL = "llm_refusal"
    TOOL_FAILURE = "tool_failure"
    PROMPT_INJECTION = "prompt_injection"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    DATA_CORRUPTION = "data_corruption"
    BUSINESS_LOGIC = "business_logic"
    INFRASTRUCTURE = "infrastructure"

@dataclass
class Incident:
    id: str
    title: str
    severity: IncidentSeverity
    category: IncidentCategory
    description: str
    timeline: list[dict] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[dict] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.utcnow)
    resolved_at: Optional[datetime] = None
    post_mortem_completed: bool = False
```

## Automated Incident Capture

Instead of relying on engineers to manually file incidents, instrument the agent pipeline to automatically capture and classify failures.

```python
import traceback
import uuid

class IncidentCapture:
    def __init__(self):
        self.incidents: list[Incident] = []

    def capture(
        self, error: Exception, context: dict, severity: Optional[IncidentSeverity] = None,
    ) -> Incident:
        category = self._classify_error(error, context)
        if severity is None:
            severity = self._estimate_severity(error, category, context)

        incident = Incident(
            id=str(uuid.uuid4())[:8],
            title=f"{category.value}: {type(error).__name__}",
            severity=severity,
            category=category,
            description=str(error),
            timeline=[
                {
                    "time": datetime.utcnow().isoformat(),
                    "event": "incident_detected",
                    "details": {
                        "error_type": type(error).__name__,
                        "error_message": str(error),
                        "stack_trace": traceback.format_exc(),
                        "context": context,
                    },
                }
            ],
        )
        self.incidents.append(incident)
        return incident

    def _classify_error(self, error: Exception, context: dict) -> IncidentCategory:
        error_str = str(error).lower()

        if "rate limit" in error_str or "429" in error_str:
            return IncidentCategory.RATE_LIMIT
        if "timeout" in error_str or isinstance(error, TimeoutError):
            return IncidentCategory.TIMEOUT
        if context.get("tool_name"):
            return IncidentCategory.TOOL_FAILURE
        if "refused" in error_str or "cannot assist" in error_str:
            return IncidentCategory.LLM_REFUSAL
        return IncidentCategory.INFRASTRUCTURE

    def _estimate_severity(
        self, error: Exception, category: IncidentCategory, context: dict,
    ) -> IncidentSeverity:
        if category == IncidentCategory.DATA_CORRUPTION:
            return IncidentSeverity.SEV1
        if category in (IncidentCategory.PROMPT_INJECTION, IncidentCategory.BUSINESS_LOGIC):
            return IncidentSeverity.SEV2
        if context.get("user_facing", False):
            return IncidentSeverity.SEV3
        return IncidentSeverity.SEV4
```

## Structured Root Cause Analysis

The "5 Whys" technique works well for AI agent failures. Automate the template to ensure consistent analysis.

```python
@dataclass
class RootCauseAnalysis:
    incident_id: str
    whys: list[str] = field(default_factory=list)
    root_cause: str = ""
    is_novel: bool = False
    similar_incidents: list[str] = field(default_factory=list)

class RCAEngine:
    def __init__(self, knowledge_base: "IncidentKnowledgeBase"):
        self.kb = knowledge_base

    def create_rca(self, incident: Incident) -> RootCauseAnalysis:
        similar = self.kb.find_similar(incident)
        rca = RootCauseAnalysis(
            incident_id=incident.id,
            similar_incidents=[s.id for s in similar],
            is_novel=len(similar) == 0,
        )
        return rca

    def complete_rca(self, rca: RootCauseAnalysis, whys: list[str], root_cause: str):
        rca.whys = whys
        rca.root_cause = root_cause
```
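The why-chain maps directly onto `RootCauseAnalysis.whys`. Here is a minimal sketch of what a completed chain looks like, with the final why treated as the root cause by convention — the incident details are hypothetical, purely for illustration:

```python
# A hypothetical why-chain for an LLM hallucination incident.
whys = [
    "The agent quoted a refund policy that does not exist.",
    "The retrieval step returned no policy document, and the model answered anyway.",
    "The prompt does not tell the model to refuse when retrieved context is empty.",
    "The prompt template predates the retrieval pipeline and was never revisited.",
    "Prompt templates are not reviewed when the pipeline architecture changes.",
]

def five_whys_summary(whys: list[str]) -> str:
    """Number each why; by convention the final why is the root cause."""
    lines = [f"{i}. {why}" for i, why in enumerate(whys, 1)]
    lines.append(f"Root cause: {whys[-1]}")
    return "\n".join(lines)
```

Note that the root cause here is a process gap, not a model defect — a common outcome when the chain is pushed past the first or second why.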

## Action Item Tracking

Post-mortems without follow-through are theater. Track action items with owners and deadlines.

```python
@dataclass
class ActionItem:
    id: str
    incident_id: str
    description: str
    owner: str
    priority: str  # P0, P1, P2
    deadline: Optional[datetime] = None
    status: str = "open"  # open, in_progress, completed
    completed_at: Optional[datetime] = None

class ActionTracker:
    def __init__(self):
        self.items: list[ActionItem] = []

    def add(self, incident_id: str, description: str,
            owner: str, priority: str, deadline: Optional[datetime] = None) -> ActionItem:
        item = ActionItem(
            id=str(uuid.uuid4())[:8],
            incident_id=incident_id,
            description=description,
            owner=owner,
            priority=priority,
            deadline=deadline,
        )
        self.items.append(item)
        return item

    def overdue(self) -> list[ActionItem]:
        now = datetime.utcnow()
        return [
            item for item in self.items
            if item.status == "open"
            and item.deadline
            and item.deadline < now
        ]

    def completion_rate(self) -> float:
        if not self.items:
            return 0.0
        completed = sum(1 for i in self.items if i.status == "completed")
        return completed / len(self.items)
```
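Beyond flagging overdue items, it helps to surface items that are *about to* slip. A standalone sketch of that escalation logic — the three-day "due soon" window and the bucket names are illustrative choices, not part of the tracker above:

```python
from datetime import datetime, timedelta

def bucket_deadlines(
    items: list[tuple[str, datetime]],
    now: datetime,
    soon: timedelta = timedelta(days=3),
) -> dict[str, list[str]]:
    """Split open action items into overdue / due-soon / on-track buckets."""
    buckets: dict[str, list[str]] = {"overdue": [], "due_soon": [], "on_track": []}
    for description, deadline in items:
        if deadline < now:
            buckets["overdue"].append(description)
        elif deadline - now <= soon:
            buckets["due_soon"].append(description)
        else:
            buckets["on_track"].append(description)
    return buckets
```

Wiring this into a weekly digest (overdue items first, due-soon second) is usually enough to keep follow-through visible without building a full ticketing integration.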

## Incident Knowledge Base

The knowledge base stores past incidents and enables pattern matching to detect recurring issues.

```python
class IncidentKnowledgeBase:
    def __init__(self):
        self.incidents: list[Incident] = []
        self.patterns: dict[str, list[str]] = {}

    def add_incident(self, incident: Incident):
        self.incidents.append(incident)
        key = f"{incident.category.value}:{incident.severity.value}"
        if key not in self.patterns:
            self.patterns[key] = []
        self.patterns[key].append(incident.id)

    def find_similar(self, incident: Incident) -> list[Incident]:
        return [
            i for i in self.incidents
            if i.category == incident.category
            and i.id != incident.id
        ]

    def recurring_patterns(self, min_occurrences: int = 3) -> list[dict]:
        recurring = []
        for key, ids in self.patterns.items():
            if len(ids) >= min_occurrences:
                category, severity = key.split(":")
                recurring.append({
                    "category": category,
                    "severity": severity,
                    "count": len(ids),
                    "incident_ids": ids,
                })
        return sorted(recurring, key=lambda x: x["count"], reverse=True)

    def stats(self) -> dict:
        from collections import Counter
        categories = Counter(i.category.value for i in self.incidents)
        severities = Counter(i.severity.value for i in self.incidents)
        return {
            "total": len(self.incidents),
            "by_category": dict(categories),
            "by_severity": dict(severities),
            "recurring_patterns": len(self.recurring_patterns()),
        }
```
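`find_similar` above matches on exact category only, so every tool failure looks equally similar to every other. One cheap refinement — an assumption on my part, not part of the knowledge base design — is to rank category matches by word overlap between incident titles, for example with a dependency-free Jaccard score:

```python
def title_similarity(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets: 0.0 (disjoint) to 1.0 (identical)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)
```

Sorting category matches by this score pushes near-duplicates ("timeout on checkout" vs "timeout on search") to the top of the similar-incidents list; embedding-based similarity is the natural upgrade once the corpus grows.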

## Generating Post-Mortem Reports

Combine all the components into a structured, readable report.

```python
def generate_post_mortem(
    incident: Incident,
    rca: RootCauseAnalysis,
    actions: list[ActionItem],
) -> str:
    report = f"""# Post-Mortem: {incident.title}

**Incident ID:** {incident.id}
**Severity:** {incident.severity.value}
**Category:** {incident.category.value}
**Created:** {incident.created_at.isoformat()}
**Resolved:** {incident.resolved_at.isoformat() if incident.resolved_at else "Ongoing"}

## Description
{incident.description}

## Timeline
"""
    for event in incident.timeline:
        report += f"- **{event['time']}**: {event['event']}\n"

    report += f"""
## Root Cause Analysis (5 Whys)
"""
    for i, why in enumerate(rca.whys, 1):
        report += f"{i}. {why}\n"

    report += f"""
**Root Cause:** {rca.root_cause}
**Novel incident:** {"Yes" if rca.is_novel else "No"}
**Similar past incidents:** {', '.join(rca.similar_incidents) or "None"}

## Action Items
"""
    for item in actions:
        status_marker = "x" if item.status == "completed" else " "
        report += f"- [{status_marker}] [{item.priority}] {item.description} (Owner: {item.owner})\n"

    return report
```

## FAQ

### How do I decide which incidents warrant a full post-mortem?

Run full post-mortems for all SEV1 and SEV2 incidents, all novel failure modes regardless of severity, and any incident that a customer reported. For SEV3 and SEV4 incidents that match existing patterns, a lightweight review (verify the pattern, confirm existing action items are progressing) is sufficient.
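This policy is mechanical enough to encode directly, so the triage decision never depends on who is on call that week. A sketch using the severity strings from `IncidentSeverity`:

```python
def needs_full_post_mortem(
    severity: str, is_novel: bool, customer_reported: bool
) -> bool:
    """Full post-mortem for SEV1/SEV2, any novel failure mode,
    or anything a customer reported; lightweight review otherwise."""
    return severity in ("sev1", "sev2") or is_novel or customer_reported
```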

### How do I prevent post-mortems from becoming blame sessions?

Establish a blameless culture by focusing the analysis on system factors, not individual decisions. Use language like "the system allowed" instead of "the engineer caused." The 5 Whys technique naturally shifts focus toward systemic root causes. Document the process, not the person — future readers need to understand what the system did, not who was on call.

### Should AI agent post-mortems differ from traditional software post-mortems?

Yes, in two key ways. First, add a "model behavior" section that captures what the LLM said or did that was unexpected — this data improves prompts and guardrails. Second, track whether the failure was deterministic (it will always happen with this input) or probabilistic (it happens some percentage of the time). Probabilistic failures require statistical testing to verify fixes, not just a single successful test run.
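One rough way to make "statistical testing" concrete: rerun the failing scenario N times and accept the fix only if an upper confidence bound on the failure rate falls below your target. The bound below uses the rule of three for zero observed failures and a normal approximation otherwise — both are standard approximations, and the 1% target rate is an example, not a recommendation:

```python
import math

def failure_rate_upper_bound(failures: int, trials: int, z: float = 1.96) -> float:
    """Approximate 95% upper confidence bound on the true failure rate.
    Uses the rule of three when zero failures were observed (the normal
    approximation degenerates to 0 there), otherwise a normal approximation."""
    if failures == 0:
        return 3.0 / trials
    p = failures / trials
    return p + z * math.sqrt(p * (1 - p) / trials)

def fix_verified(failures: int, trials: int, target_rate: float) -> bool:
    """Accept a fix only when the upper bound sits below the target rate."""
    return failure_rate_upper_bound(failures, trials) < target_rate
```

The practical consequence: zero failures in 100 reruns is *not* enough to claim a sub-1% failure rate — you need on the order of 300+ clean runs before the bound clears that bar.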

---

#PostMortem #IncidentAnalysis #RootCauseAnalysis #AIAgents #Python #AgenticAI #LearnAI #AIEngineering

