---
title: "Building an Incident Response Agent: Automated Triage, Diagnosis, and Remediation"
description: "Learn how to build an AI agent that ingests alerts from monitoring systems, triages severity, runs diagnostic playbooks, attempts automated remediation, and generates post-incident reports."
canonical: https://callsphere.ai/blog/building-incident-response-agent-automated-triage-diagnosis-remediation
category: "Learn Agentic AI"
tags: ["Incident Response", "DevOps", "SRE", "Automation", "Python", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.960Z
---

# Building an Incident Response Agent: Automated Triage, Diagnosis, and Remediation

> Learn how to build an AI agent that ingests alerts from monitoring systems, triages severity, runs diagnostic playbooks, attempts automated remediation, and generates post-incident reports.

## Why Incident Response Needs an Agent

Traditional incident response relies on a human being woken at 3 AM, reading an alert, opening a runbook, copying commands, and deciding whether the fix worked. Every step introduces latency and human error. An AI incident response agent compresses this cycle from minutes to seconds by automating triage, diagnosis, and first-pass remediation while keeping humans in the loop for high-risk actions.

The core loop is simple: **Ingest alert, classify severity, run diagnostics, attempt fix, escalate if needed, document everything.**

## Architecture Overview

An incident response agent sits between your alerting system (PagerDuty, Opsgenie, Prometheus Alertmanager) and your infrastructure. It receives webhook payloads, enriches them with context, and decides what to do.

```mermaid
flowchart LR
    INC(["Production incident"])
    DETECT["Detect
alerts plus user reports"]
    MIT["Mitigate
rollback or feature flag"]
    RES["Resolve"]
    DOC["Timeline doc
events plus actions"]
    RCA{"5 whys plus
causal graph"}
    AI["Action items
owner plus due date"]
    SHARE(["Blameless review"])
    LEARN[("Runbook plus
eval added")]
    INC --> DETECT --> MIT --> RES --> DOC --> RCA --> AI --> SHARE --> LEARN
    style RCA fill:#4f46e5,stroke:#4338ca,color:#fff
    style LEARN fill:#059669,stroke:#047857,color:#fff
```

```yaml
# alert-webhook-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    receivers:
      - name: incident-agent
        webhook_configs:
          - url: "http://incident-agent:8080/api/alerts"
            send_resolved: true
    route:
      receiver: incident-agent
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
```

## Building the Alert Ingestion Layer

The agent needs to normalize alerts from different sources into a common format before it can reason about them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str  # "prometheus", "pagerduty", "cloudwatch"
    title: str
    description: str
    severity: Severity
    service: str
    namespace: str
    labels: dict = field(default_factory=dict)
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    raw_payload: dict = field(default_factory=dict)

def normalize_prometheus_alert(payload: dict) -> NormalizedAlert:
    """Convert Prometheus Alertmanager webhook to normalized format."""
    alert = payload["alerts"][0]  # Alertmanager sends alerts in batches; this handles the first
    labels = alert.get("labels", {})

    severity_map = {
        "critical": Severity.CRITICAL,
        "warning": Severity.HIGH,
        "info": Severity.LOW,
    }

    return NormalizedAlert(
        alert_id=alert["fingerprint"],
        source="prometheus",
        title=labels.get("alertname", "Unknown Alert"),
        description=alert.get("annotations", {}).get("summary", ""),
        severity=severity_map.get(labels.get("severity", "info"), Severity.MEDIUM),
        service=labels.get("service", labels.get("job", "unknown")),
        namespace=labels.get("namespace", "default"),
        labels=labels,
        raw_payload=payload,
    )
```
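Because Alertmanager delivers alerts in batches while the normalizer above handles one alert at a time, the webhook handler should fan out over the whole payload first. A minimal, framework-free sketch (field names follow the Alertmanager v4 webhook schema; `expand_alertmanager_batch` is an illustrative helper, not part of the code above):

```python
def expand_alertmanager_batch(payload: dict) -> list[dict]:
    """Split one webhook payload into per-alert records, skipping resolved alerts."""
    records = []
    for alert in payload.get("alerts", []):
        if alert.get("status") == "resolved":
            continue  # resolved notifications close incidents rather than open them
        records.append({
            "fingerprint": alert["fingerprint"],
            "labels": alert.get("labels", {}),
            "annotations": alert.get("annotations", {}),
        })
    return records
```

Each record can then be passed through `normalize_prometheus_alert`-style conversion, and `send_resolved: true` notifications can be routed to an incident-closing path instead.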

## The Triage and Diagnosis Engine

The agent uses an LLM to classify the alert and select the right diagnostic runbook. This is where the AI reasoning happens.

```python
import openai
import json

TRIAGE_PROMPT = """You are an SRE incident triage agent. Given the alert below,
determine:
1. The likely root cause category (one of: resource_exhaustion, network,
   application_crash, certificate_expiry, disk_pressure, database, config_drift)
2. The diagnostic commands to run (return as a list)
3. Whether automated remediation is safe (true/false)
4. The escalation urgency (immediate, 15min, 1hr, next_business_day)

Alert: {alert_title}
Description: {alert_description}
Service: {service}
Severity: {severity}
Labels: {labels}

Respond in JSON with keys: root_cause_category, diagnostic_commands,
safe_to_auto_remediate, escalation_urgency, reasoning.
"""

async def triage_alert(alert: NormalizedAlert) -> dict:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior SRE."},
            {"role": "user", "content": TRIAGE_PROMPT.format(
                alert_title=alert.title,
                alert_description=alert.description,
                service=alert.service,
                severity=alert.severity.value,
                labels=json.dumps(alert.labels),
            )},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)
```
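The triage result includes `diagnostic_commands`, but LLM output should never be executed verbatim. One option is an allowlist filter that only lets read-only commands through before anything touches a cluster — a sketch, with the specific prefixes as an assumption to tune for your environment:

```python
# Read-only kubectl verbs that are safe to run unattended (an assumption — adjust).
READ_ONLY_PREFIXES = (
    "kubectl get", "kubectl describe", "kubectl logs", "kubectl top",
)

def filter_safe_diagnostics(commands: list[str]) -> list[str]:
    """Keep only commands with an allowlisted read-only prefix; drop everything else."""
    return [c for c in commands if c.strip().startswith(READ_ONLY_PREFIXES)]
```

Commands that survive the filter can be run with `subprocess.run(cmd.split(), capture_output=True, text=True, timeout=30)` and their output fed back to the model for the remediation decision; anything dropped is logged rather than executed.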

## Automated Remediation with Safety Gates

The critical design principle: **never auto-remediate without a safety gate.** The agent below checks severity and an allowlist of known-safe fixes before taking action; production deployments should also gate on blast radius and time of day.

```python
import subprocess
from typing import Optional

# Allowlisted, low-risk fixes only. Anything outside this map escalates to a human.
SAFE_REMEDIATIONS = {
    "resource_exhaustion": "kubectl rollout restart deployment/{service} -n {namespace}",
    "disk_pressure": "kubectl exec -n {namespace} deploy/{service} -- find /tmp -mtime +7 -delete",
    # Assumes cert-manager (or similar) re-issues the secret after deletion.
    "certificate_expiry": "kubectl delete secret {service}-tls -n {namespace}",
}

async def attempt_remediation(
    alert: NormalizedAlert,
    triage: dict,
) -> Optional[str]:
    category = triage["root_cause_category"]
    if not triage["safe_to_auto_remediate"]:
        return None

    if alert.severity == Severity.CRITICAL:
        # Critical alerts always need human approval first
        return None

    template = SAFE_REMEDIATIONS.get(category)
    if not template:
        return None

    command = template.format(
        service=alert.service,
        namespace=alert.namespace,
    )
    try:
        result = subprocess.run(
            command.split(), capture_output=True, text=True, timeout=60
        )
    except subprocess.TimeoutExpired:
        return f"Timed out after 60s: {command}"
    return (
        f"Executed: {command}\nExit code: {result.returncode}\n"
        f"Output: {result.stdout}\nErrors: {result.stderr}"
    )
```

## Post-Incident Report Generation

After every incident, the agent generates a structured report for the team.

```python
async def generate_postincident_report(
    alert: NormalizedAlert,
    triage: dict,
    remediation_result: Optional[str],
    timeline: list[dict],
) -> str:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate a post-incident report:

Alert: {alert.title} ({alert.severity.value})
Service: {alert.service}
Root Cause Category: {triage['root_cause_category']}
Reasoning: {triage['reasoning']}
Auto-remediation Applied: {remediation_result or 'None (escalated to human)'}
Timeline: {json.dumps(timeline, default=str)}

Format as markdown with: Summary, Timeline, Root Cause, Remediation, Action Items."""
        }],
    )
    return response.choices[0].message.content
```

## FAQ

### How do I prevent the agent from causing more damage during remediation?

Implement a blast radius limiter. Track which services the agent has touched in the last hour. If it has already remediated the same service twice, force escalation to a human. Also keep all remediations behind a dry-run mode that you enable first in staging.

### Should the agent handle alert storms where hundreds of alerts fire at once?

Yes, but with deduplication and grouping. Use the Alertmanager `group_by` configuration to batch related alerts. The agent should deduplicate by fingerprint and prioritize the highest-severity alert in each group rather than processing them individually.
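Once alerts carry a fingerprint and a severity label, deduplication and prioritization reduce to a few lines — a sketch over plain dicts (`primary_alert` is an illustrative helper):

```python
# Lower rank = higher priority; unknown severities sort last.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def primary_alert(batch: list[dict]) -> dict:
    """Deduplicate a storm by fingerprint, then surface the highest-severity alert."""
    unique = {a["fingerprint"]: a for a in batch}  # later duplicates overwrite earlier ones
    return min(
        unique.values(),
        key=lambda a: SEVERITY_RANK.get(a["severity"], len(SEVERITY_RANK)),
    )
```

The remaining alerts in the group stay attached to the incident as context rather than spawning separate triage runs.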

### What monitoring should I put on the incident response agent itself?

Treat it like any critical service. Monitor its webhook endpoint latency, LLM API error rates, remediation success/failure ratios, and escalation counts. Set up a separate alert path that bypasses the agent so you get notified if the agent itself goes down.

