---
title: "AI Agent for Infrastructure Monitoring: Anomaly Detection and Auto-Remediation"
description: "Build an AI agent that continuously ingests infrastructure metrics, detects anomalies using statistical and ML methods, and triggers automated remediation with human approval gates."
canonical: https://callsphere.ai/blog/ai-agent-infrastructure-monitoring-anomaly-detection-auto-remediation
category: "Learn Agentic AI"
tags: ["Infrastructure Monitoring", "Anomaly Detection", "DevOps", "SRE", "Python", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T01:34:16.123Z
---

# AI Agent for Infrastructure Monitoring: Anomaly Detection and Auto-Remediation

> Build an AI agent that continuously ingests infrastructure metrics, detects anomalies using statistical and ML methods, and triggers automated remediation with human approval gates.

## Beyond Threshold-Based Alerting

Traditional monitoring fires an alert when a metric crosses a static threshold: CPU above 90%, memory above 85%, disk above 80%. This approach generates noise. A CPU spike to 92% during a batch job at 2 AM is normal. The same spike at 2 PM during low traffic is concerning. An AI monitoring agent learns what "normal" looks like for each metric at each time of day and raises alerts only when the pattern breaks.
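As a minimal illustration with made-up numbers, a time-aware baseline keyed by hour of day accepts the 2 AM batch spike that a static 90% rule would flag (the per-hour baseline values here are hypothetical):

```python
# Hypothetical per-hour CPU baselines learned from history: hour -> (mean, std).
# A static rule would alert on any reading above 90%.
hourly_baseline = {2: (88.0, 4.0), 14: (35.0, 5.0)}

def is_anomalous(cpu_pct: float, hour: int, z_threshold: float = 3.0) -> bool:
    """Flag a reading only if it deviates from that hour's learned baseline."""
    mean, std = hourly_baseline[hour]
    return abs(cpu_pct - mean) / std > z_threshold

# 92% CPU during the 2 AM batch window: within one std of normal.
print(is_anomalous(92.0, hour=2))   # False
# The same 92% at 2 PM sits more than 11 stds above that hour's baseline.
print(is_anomalous(92.0, hour=14))  # True
```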

## Metrics Ingestion Pipeline

The agent pulls metrics from Prometheus using PromQL and stores them in a time-series buffer for analysis.

```mermaid
flowchart LR
    PROM[("Prometheus
TSDB")]
    QUERY["PromQL
query_range"]
    BUFFER["Time-series
buffer"]
    DETECT["Anomaly detection
z-score plus seasonal"]
    GATE{"Approval
gate"}
    REMED["Execute
remediation"]
    OBS[("Alerts and
audit trail")]
    PROM --> QUERY --> BUFFER --> DETECT --> GATE
    GATE -->|Approved| REMED --> OBS
    GATE -->|Denied| OBS
    DETECT --> OBS
    style DETECT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style REMED fill:#059669,stroke:#047857,color:#fff
```

```python
import httpx
import numpy as np
from datetime import datetime, timedelta
from dataclasses import dataclass

@dataclass
class MetricSeries:
    name: str
    labels: dict
    timestamps: list[float]
    values: list[float]

class PrometheusClient:
    def __init__(self, base_url: str = "http://prometheus:9090"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(timeout=30)

    async def query_range(
        self, query: str, hours_back: int = 24, step: str = "5m"
    ) -> MetricSeries:
        end = datetime.utcnow()
        start = end - timedelta(hours=hours_back)
        resp = await self.client.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": query,
                "start": start.isoformat() + "Z",
                "end": end.isoformat() + "Z",
                "step": step,
            },
        )
        results = resp.json()["data"]["result"]
        if not results:
            raise ValueError(f"no data returned for query: {query}")
        data = results[0]
        timestamps = [float(v[0]) for v in data["values"]]
        values = [float(v[1]) for v in data["values"]]
        return MetricSeries(
            name=query,
            labels=data["metric"],
            timestamps=timestamps,
            values=values,
        )

MONITORED_QUERIES = [
    'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])',
    'container_memory_usage_bytes{namespace="production"}',
    'rate(http_requests_total{namespace="production"}[5m])',
    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
]
```

## Anomaly Detection with Z-Score and Seasonal Decomposition

The agent combines simple statistical methods with time-aware baselines. Z-score catches sudden spikes while seasonal decomposition handles expected daily or weekly patterns.

```python
from scipy import stats
from collections import defaultdict

class AnomalyDetector:
    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self.baselines: dict[str, list[float]] = defaultdict(list)

    def detect_zscore_anomalies(self, series: MetricSeries) -> list[dict]:
        values = np.array(series.values)
        if len(values) < 2 or values.std() == 0:
            return []  # too little data, or a constant series
        anomalies = []
        zscores = np.abs(stats.zscore(values))
        for i, z in enumerate(zscores):
            if z > self.z_threshold:
                anomalies.append({
                    "timestamp": series.timestamps[i],
                    "value": series.values[i],
                    "z_score": float(z),
                    "method": "zscore",
                    "metric": series.name,
                })
        return anomalies

    def detect_seasonal_anomalies(
        self, series: MetricSeries, period_hours: int = 24
    ) -> list[dict]:
        """Compare current values against same time-of-day from previous periods."""
        values = np.array(series.values)
        timestamps = np.array(series.timestamps)
        samples_per_period = period_hours * 12  # 5min intervals

        if len(values) < samples_per_period * 2:
            return []  # need at least two full periods for a baseline

        anomalies = []
        for i in range(samples_per_period, len(values)):
            # Samples at the same offset within each previous period.
            historical = values[i % samples_per_period:i:samples_per_period]
            mean = historical.mean()
            std = historical.std()
            if std == 0:
                continue
            deviation = abs(values[i] - mean) / std
            if deviation > self.z_threshold:
                anomalies.append({
                    "timestamp": float(timestamps[i]),
                    "value": float(values[i]),
                    "expected": float(mean),
                    "deviation": float(deviation),
                    "method": "seasonal",
                    "metric": series.name,
                })
        return anomalies
```

## Human Approval Gate for Remediation

When the agent detects an anomaly, it proposes a remediation and waits for approval on critical actions.

```python
import asyncio
from enum import Enum

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"
    AUTO_APPROVED = "auto_approved"

class ApprovalGate:
    def __init__(self, slack_webhook: str, auto_approve_severity: str = "low"):
        self.slack_webhook = slack_webhook
        self.auto_approve_severity = auto_approve_severity
        self.pending: dict[str, asyncio.Event] = {}
        self.decisions: dict[str, ApprovalStatus] = {}

    async def request_approval(
        self, action_id: str, description: str, severity: str
    ) -> ApprovalStatus:
        if severity == self.auto_approve_severity:
            return ApprovalStatus.AUTO_APPROVED

        event = asyncio.Event()
        self.pending[action_id] = event

        await self._send_slack_message(
            f"*Remediation Approval Required*\n"
            f"Action: {description}\n"
            f"Severity: {severity}\n"
            f"Reply with: /approve {action_id} or /deny {action_id}"
        )

        try:
            await asyncio.wait_for(event.wait(), timeout=300)
            return self.decisions.get(action_id, ApprovalStatus.DENIED)
        except asyncio.TimeoutError:
            return ApprovalStatus.DENIED

    async def _send_slack_message(self, text: str):
        async with httpx.AsyncClient() as client:
            await client.post(self.slack_webhook, json={"text": text})
```
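The gate above blocks on an `asyncio.Event`, but nothing in the original records the operator's decision. A minimal sketch of the missing half, a `resolve` helper a Slack slash-command handler could call (the helper and the trimmed two-member `ApprovalStatus` are assumptions for illustration, not part of the original class):

```python
import asyncio
from enum import Enum
from types import SimpleNamespace

class ApprovalStatus(Enum):
    APPROVED = "approved"
    DENIED = "denied"

def resolve(gate, action_id: str, approved: bool) -> bool:
    """Record a human decision and wake the coroutine blocked in request_approval."""
    event = gate.pending.pop(action_id, None)
    if event is None:
        return False  # unknown or already-resolved action id
    gate.decisions[action_id] = (
        ApprovalStatus.APPROVED if approved else ApprovalStatus.DENIED
    )
    event.set()  # request_approval's wait_for(event.wait(), ...) returns
    return True

async def demo() -> ApprovalStatus:
    # Stand-in for an ApprovalGate with pending/decisions dicts.
    gate = SimpleNamespace(pending={}, decisions={})
    event = asyncio.Event()
    gate.pending["act-1"] = event          # as request_approval would do
    resolve(gate, "act-1", approved=True)  # as the /approve handler would do
    await asyncio.wait_for(event.wait(), timeout=1)
    return gate.decisions["act-1"]

print(asyncio.run(demo()))  # ApprovalStatus.APPROVED
```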

## The Monitoring Agent Loop

Tie everything together in a continuous monitoring loop that polls, detects, and gates remediation on a fixed interval.

```python
async def monitoring_loop(interval_seconds: int = 300):
    prom = PrometheusClient()
    detector = AnomalyDetector(z_threshold=3.0)
    gate = ApprovalGate(slack_webhook="https://hooks.slack.com/...")

    while True:
        for query in MONITORED_QUERIES:
            series = await prom.query_range(query, hours_back=48)
            anomalies = detector.detect_zscore_anomalies(series)
            anomalies += detector.detect_seasonal_anomalies(series)

            for anomaly in anomalies:
                severity = "high" if anomaly.get("z_score", 0) > 5 else "medium"
                status = await gate.request_approval(
                    action_id=f"{query}-{anomaly['timestamp']}",
                    description=f"Scale up pods for {query}",
                    severity=severity,
                )
                if status in (ApprovalStatus.APPROVED, ApprovalStatus.AUTO_APPROVED):
                    await execute_remediation(query, anomaly)

        await asyncio.sleep(interval_seconds)
```
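The loop calls `execute_remediation`, which is left undefined above. One possible shape is a dispatcher that maps metric queries to remediation plans, defaulting to dry-run; every name and mapping below is illustrative, not a real API:

```python
import asyncio
from typing import Callable

# Illustrative registry mapping a metric-query substring to a remediation plan.
REMEDIATIONS: dict[str, Callable[[dict], str]] = {
    "container_cpu": lambda a: f"scale up pods (z={a.get('z_score', 0):.1f})",
    "container_memory": lambda a: "restart the pod under memory pressure",
}

async def execute_remediation(query: str, anomaly: dict, dry_run: bool = True) -> str:
    """Select a remediation for the anomalous metric; only log it when dry_run."""
    for key, plan in REMEDIATIONS.items():
        if key in query:
            action = plan(anomaly)
            if dry_run:
                return f"[dry-run] {action}"
            # Real execution (kubectl scale, cloud API call, ...) would go here.
            return action
    return "[no-op] no remediation registered for this metric"

result = asyncio.run(
    execute_remediation("rate(container_cpu_usage_seconds_total[5m])", {"z_score": 6.2})
)
print(result)  # [dry-run] scale up pods (z=6.2)
```

Keeping dry-run as the default means a misconfigured gate fails safe: the agent logs what it would have done instead of acting.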

## FAQ

### How do I avoid false positives from noisy metrics?

Use a sliding window to require multiple consecutive anomalous data points before triggering. A single spike is noise. Three consecutive 5-minute intervals above the threshold is a real problem. Also tune your z-score threshold per metric since some are naturally more variable than others.
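That debounce can be sketched as a small helper, assuming per-interval anomaly flags arrive as an ordered list of booleans:

```python
def has_sustained_anomaly(flags: list[bool], min_consecutive: int = 3) -> bool:
    """True only if at least min_consecutive adjacent intervals are anomalous."""
    run = 0
    for flagged in flags:
        run = run + 1 if flagged else 0  # reset the streak on any normal point
        if run >= min_consecutive:
            return True
    return False

# A lone spike is ignored; three consecutive flagged intervals fire.
print(has_sustained_anomaly([False, True, False, True]))        # False
print(has_sustained_anomaly([False, True, True, True, False]))  # True
```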

### Should the agent train its own ML model or use statistical methods?

Start with statistical methods like z-score and seasonal decomposition. They are interpretable, require no training data, and work well for most infrastructure metrics. Graduate to ML models (isolation forest, LSTM autoencoders) only when you have metrics with complex non-linear patterns that statistical methods miss.
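When you do graduate, an isolation forest is a common first step. A minimal sketch with scikit-learn on synthetic latency data (assuming scikit-learn is installed; the `contamination` rate is a tuning knob, not a given):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=42)
# 500 normal latency readings around 120 ms, plus two injected outliers.
normal = rng.normal(loc=120.0, scale=5.0, size=500)
readings = np.concatenate([normal, [400.0, 2.0]]).reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(readings)  # 1 = inlier, -1 = outlier

outlier_values = readings[labels == -1].ravel()
print(sorted(outlier_values))  # includes the injected 2.0 and 400.0
```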

### How do I handle the cold-start problem when there is no historical data?

Fall back to static thresholds for the first 48-72 hours of data collection. Once the agent has enough history, automatically switch to anomaly detection. Log a warning when operating in cold-start mode so the team knows alerting quality may be lower.
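A sketch of that fallback, assuming a minimum-history rule of 48 hours at 5-minute resolution (576 samples; both numbers are illustrative):

```python
import warnings

MIN_SAMPLES = 48 * 12  # 48 hours of 5-minute samples

def check_metric(values: list[float], static_threshold: float) -> list[int]:
    """Return indices of anomalous points, using a static threshold until
    enough history has accumulated for baseline-driven detection."""
    if len(values) < MIN_SAMPLES:
        warnings.warn("cold-start: falling back to static threshold")
        return [i for i, v in enumerate(values) if v > static_threshold]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > 3.0]

# With only three samples, 95.0 trips the static 90% threshold.
print(check_metric([50.0, 60.0, 95.0], static_threshold=90.0))  # [2]
```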

---

#InfrastructureMonitoring #AnomalyDetection #DevOps #SRE #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/ai-agent-infrastructure-monitoring-anomaly-detection-auto-remediation
