Learn Agentic AI

Building an Incident Response Agent: Automated Triage, Diagnosis, and Remediation

Learn how to build an AI agent that ingests alerts from monitoring systems, triages severity, runs diagnostic playbooks, attempts automated remediation, and generates post-incident reports.

Why Incident Response Needs an Agent

Traditional incident response relies on a human being woken at 3 AM, reading an alert, opening a runbook, copying commands, and deciding whether the fix worked. Every step introduces latency and human error. An AI incident response agent compresses this cycle from minutes to seconds by automating triage, diagnosis, and first-pass remediation while keeping humans in the loop for high-risk actions.

The core loop is simple: ingest the alert, classify its severity, run diagnostics, attempt a fix, escalate if needed, and document everything.
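That loop can be sketched end to end with stubbed stages. Every name below is illustrative; the real triage and remediation steps are built out in the sections that follow, with an LLM standing where the simple rule is here:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    title: str
    severity: str  # "critical", "high", "medium", "low"

def triage(incident: Incident) -> dict:
    # In the real agent this is an LLM call; a simple rule stands in here.
    return {"safe_to_auto_remediate": incident.severity != "critical"}

def remediate(incident: Incident, triage_result: dict):
    if not triage_result["safe_to_auto_remediate"]:
        return None  # escalate to a human instead of acting
    return f"restarted {incident.title}"

def handle(incident: Incident) -> dict:
    triage_result = triage(incident)
    fix = remediate(incident, triage_result)
    status = "auto-remediated" if fix else "escalated"
    return {"status": status, "fix": fix}

print(handle(Incident("api-gateway", "high")))      # auto-remediated
print(handle(Incident("payments-db", "critical")))  # escalated
```

The key design choice is that the remediation stage can always return `None`, which forces the escalation path rather than guessing.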

Architecture Overview

An incident response agent sits between your alerting system (PagerDuty, Opsgenie, Prometheus Alertmanager) and your infrastructure. It receives webhook payloads, enriches them with context, and decides what to do.

A typical setup points Alertmanager at the agent's webhook endpoint:
# alert-webhook-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    receivers:
      - name: incident-agent
        webhook_configs:
          - url: "http://incident-agent:8080/api/alerts"
            send_resolved: true
    route:
      receiver: incident-agent
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

Building the Alert Ingestion Layer

The agent needs to normalize alerts from different sources into a common format before it can reason about them.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str  # "prometheus", "pagerduty", "cloudwatch"
    title: str
    description: str
    severity: Severity
    service: str
    namespace: str
    labels: dict = field(default_factory=dict)
    # datetime.utcnow is deprecated; use an explicit timezone-aware timestamp.
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    raw_payload: dict = field(default_factory=dict)

def normalize_prometheus_alert(payload: dict) -> NormalizedAlert:
    """Convert a Prometheus Alertmanager webhook payload to the normalized format."""
    # Alertmanager delivers alerts in batches; a production handler should
    # iterate over payload["alerts"] rather than taking only the first entry.
    alert = payload["alerts"][0]
    labels = alert.get("labels", {})

    severity_map = {
        "critical": Severity.CRITICAL,
        "warning": Severity.HIGH,
        "info": Severity.LOW,
    }

    return NormalizedAlert(
        alert_id=alert["fingerprint"],
        source="prometheus",
        title=labels.get("alertname", "Unknown Alert"),
        description=alert.get("annotations", {}).get("summary", ""),
        severity=severity_map.get(labels.get("severity", "info"), Severity.MEDIUM),
        service=labels.get("service", labels.get("job", "unknown")),
        namespace=labels.get("namespace", "default"),
        labels=labels,
        raw_payload=payload,
    )
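For reference, here is a trimmed example of the Alertmanager payload shape the function consumes, with the same lookups it performs shown inline (field values are illustrative, not from a real cluster):

```python
# Trimmed Alertmanager webhook payload containing only the fields the
# normalization step reads (values are illustrative).
payload = {
    "alerts": [{
        "fingerprint": "a1b2c3d4",
        "labels": {
            "alertname": "HighMemoryUsage",
            "severity": "warning",
            "service": "api-gateway",
            "namespace": "prod",
        },
        "annotations": {"summary": "Memory usage above 90% for 5 minutes"},
    }]
}

# The same lookups normalize_prometheus_alert performs:
alert = payload["alerts"][0]
labels = alert.get("labels", {})
print(alert["fingerprint"])                                 # a1b2c3d4
print(labels.get("alertname", "Unknown Alert"))             # HighMemoryUsage
print(labels.get("service", labels.get("job", "unknown")))  # api-gateway
```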

The Triage and Diagnosis Engine

The agent uses an LLM to classify the alert and select the right diagnostic runbook. This is where the AI reasoning happens.


import openai
import json

TRIAGE_PROMPT = """You are an SRE incident triage agent. Given the alert below,
determine:
1. The likely root cause category (one of: resource_exhaustion, network,
   application_crash, certificate_expiry, disk_pressure, database, config_drift)
2. The diagnostic commands to run (return as a list)
3. Whether automated remediation is safe (true/false)
4. The escalation urgency (immediate, 15min, 1hr, next_business_day)

Alert: {alert_title}
Description: {alert_description}
Service: {service}
Severity: {severity}
Labels: {labels}

Respond in JSON with keys: root_cause_category, diagnostic_commands,
safe_to_auto_remediate, escalation_urgency, reasoning.
"""

async def triage_alert(alert: NormalizedAlert) -> dict:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior SRE."},
            {"role": "user", "content": TRIAGE_PROMPT.format(
                alert_title=alert.title,
                alert_description=alert.description,
                service=alert.service,
                severity=alert.severity.value,
                labels=json.dumps(alert.labels),
            )},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)

Automated Remediation with Safety Gates

The critical design principle: never auto-remediate without a safety gate. The agent checks severity, blast radius, and time-of-day before taking action.

import asyncio
import subprocess
from typing import Optional

SAFE_REMEDIATIONS = {
    "resource_exhaustion": "kubectl rollout restart deployment/{service} -n {namespace}",
    "disk_pressure": "kubectl exec -n {namespace} deploy/{service} -- find /tmp -mtime +7 -delete",
    # Deleting the TLS secret assumes a controller (e.g. cert-manager) will reissue it.
    "certificate_expiry": "kubectl delete secret {service}-tls -n {namespace}",
}

async def attempt_remediation(
    alert: NormalizedAlert,
    triage: dict,
) -> Optional[str]:
    category = triage["root_cause_category"]
    if not triage["safe_to_auto_remediate"]:
        return None

    if alert.severity == Severity.CRITICAL:
        # Critical alerts always need human approval first
        return None

    template = SAFE_REMEDIATIONS.get(category)
    if not template:
        return None

    command = template.format(
        service=alert.service,
        namespace=alert.namespace,
    )
    # subprocess.run blocks; run it in a worker thread so the event loop stays free.
    result = await asyncio.to_thread(
        subprocess.run, command.split(), capture_output=True, text=True, timeout=60
    )
    return f"Executed: {command}\nOutput: {result.stdout}\nErrors: {result.stderr}"

Post-Incident Report Generation

After every incident, the agent generates a structured report for the team.

async def generate_postincident_report(
    alert: NormalizedAlert,
    triage: dict,
    remediation_result: Optional[str],
    timeline: list[dict],
) -> str:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate a post-incident report:

Alert: {alert.title} ({alert.severity.value})
Service: {alert.service}
Root Cause Category: {triage['root_cause_category']}
Reasoning: {triage['reasoning']}
Auto-remediation Applied: {remediation_result or 'None (escalated to human)'}
Timeline: {json.dumps(timeline, default=str)}

Format as markdown with: Summary, Timeline, Root Cause, Remediation, Action Items."""
        }],
    )
    return response.choices[0].message.content

FAQ

How do I prevent the agent from causing more damage during remediation?

Implement a blast radius limiter. Track which services the agent has touched in the last hour. If it has already remediated the same service twice, force escalation to a human. Also keep all remediations behind a dry-run mode that you enable first in staging.
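The blast-radius limiter can be sketched with an in-memory counter; the two-strikes-per-hour threshold mirrors the rule above, and a production version would persist state in Redis or a database rather than process memory:

```python
import time
from collections import defaultdict

class BlastRadiusLimiter:
    """Force human escalation once a service has been touched too often."""

    def __init__(self, max_actions: int = 2, window_seconds: int = 3600):
        self.max_actions = max_actions
        self.window = window_seconds
        self._history = defaultdict(list)  # service -> remediation timestamps

    def allow(self, service: str) -> bool:
        now = time.time()
        # Drop timestamps outside the sliding window.
        recent = [t for t in self._history[service] if now - t < self.window]
        self._history[service] = recent
        if len(recent) >= self.max_actions:
            return False  # over the limit: escalate to a human
        self._history[service].append(now)
        return True

limiter = BlastRadiusLimiter()
print(limiter.allow("api-gateway"))  # True
print(limiter.allow("api-gateway"))  # True
print(limiter.allow("api-gateway"))  # False -> escalate
```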

Should the agent handle alert storms where hundreds of alerts fire at once?

Yes, but with deduplication and grouping. Use the Alertmanager group_by configuration to batch related alerts. The agent should deduplicate by fingerprint and prioritize the highest-severity alert in each group rather than processing them individually.
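Deduplication and prioritization can be sketched as follows, assuming alerts with `alert_id`, `service`, and `severity` fields as in the normalized format above:

```python
# Severity rank: lower number = more urgent.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

def dedupe_and_prioritize(alerts: list[dict]) -> dict:
    # Drop duplicate fingerprints, keeping the first occurrence.
    seen = {}
    for a in alerts:
        seen.setdefault(a["alert_id"], a)
    # Group by service, then keep only the highest-severity alert per group.
    groups = {}
    for a in seen.values():
        groups.setdefault(a["service"], []).append(a)
    return {svc: min(g, key=lambda a: SEVERITY_ORDER[a["severity"]])
            for svc, g in groups.items()}

storm = [
    {"alert_id": "f1", "service": "db", "severity": "high"},
    {"alert_id": "f1", "service": "db", "severity": "high"},   # duplicate fingerprint
    {"alert_id": "f2", "service": "db", "severity": "critical"},
    {"alert_id": "f3", "service": "web", "severity": "low"},
]
print(dedupe_and_prioritize(storm))  # one alert per service; "db" keeps f2
```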

What monitoring should I put on the incident response agent itself?

Treat it like any critical service. Monitor its webhook endpoint latency, LLM API error rates, remediation success/failure ratios, and escalation counts. Set up a separate alert path that bypasses the agent so you get notified if the agent itself goes down.
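A minimal in-process view of the remediation success/failure ratio can be kept with a counter; in a real deployment these would be exported through a metrics library, and the names here are illustrative:

```python
from collections import Counter

metrics = Counter()

def record_remediation(success: bool) -> None:
    """Count attempts plus a success/failure breakdown for the agent's own dashboards."""
    metrics["remediation_attempts"] += 1
    metrics["remediation_success" if success else "remediation_failure"] += 1

record_remediation(True)
record_remediation(False)
print(metrics["remediation_attempts"])  # 2
print(metrics["remediation_success"] / metrics["remediation_attempts"])  # 0.5
```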


#IncidentResponse #DevOps #SRE #Automation #Python #AgenticAI #LearnAI #AIEngineering

Written by CallSphere Team