Learn Agentic AI

Building a Chaos Engineering Agent: AI-Driven Resilience Testing

Learn how to build an AI agent that designs chaos experiments, controls blast radius, injects faults into production systems, observes behavior, and verifies automated recovery.

Why Chaos Engineering Needs Intelligence

Traditional chaos engineering tools inject random faults. Kill a pod. Add latency. Fill a disk. But randomness wastes time testing failures your system already handles well. An AI chaos engineering agent is strategic. It analyzes your architecture, identifies the weakest points, designs experiments that test specific hypotheses, and verifies that recovery actually works.

Experiment Design

The agent designs chaos experiments based on the system architecture and past incident history.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class FaultType(Enum):
    POD_KILL = "pod_kill"
    NETWORK_LATENCY = "network_latency"
    NETWORK_PARTITION = "network_partition"
    CPU_STRESS = "cpu_stress"
    MEMORY_STRESS = "memory_stress"
    DISK_FILL = "disk_fill"
    DNS_FAILURE = "dns_failure"

@dataclass
class ChaosExperiment:
    experiment_id: str
    name: str
    hypothesis: str
    fault_type: FaultType
    target_service: str
    target_namespace: str
    blast_radius: str  # "single_pod", "service", "namespace"
    duration_seconds: int
    parameters: dict = field(default_factory=dict)
    steady_state_checks: list[dict] = field(default_factory=list)
    abort_conditions: list[dict] = field(default_factory=list)

import json
import openai
from datetime import datetime

async def design_experiment(
    architecture: dict, past_incidents: list[dict]
) -> ChaosExperiment:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Design a chaos engineering experiment.

Architecture:
{json.dumps(architecture, indent=2)}

Past incidents (to avoid re-testing known weaknesses already fixed):
{json.dumps(past_incidents[:5], indent=2)}

Design an experiment that:
1. Tests a realistic failure mode
2. Has a clear hypothesis about expected behavior
3. Minimizes blast radius
4. Includes abort conditions to prevent real outages

Return JSON with: name, hypothesis, fault_type (one of: pod_kill,
network_latency, network_partition, cpu_stress, memory_stress,
disk_fill, dns_failure), target_service, blast_radius, duration_seconds,
parameters, steady_state_checks (list of PromQL queries with expected ranges),
abort_conditions (list of PromQL queries that trigger immediate halt)."""
        }],
        response_format={"type": "json_object"},
        temperature=0.3,
    )
    data = json.loads(response.choices[0].message.content)
    return ChaosExperiment(
        experiment_id=f"chaos-{int(datetime.utcnow().timestamp())}",
        name=data["name"],
        hypothesis=data["hypothesis"],
        fault_type=FaultType(data["fault_type"]),
        target_service=data["target_service"],
        target_namespace=architecture.get("namespace", "default"),
        blast_radius=data["blast_radius"],
        duration_seconds=data["duration_seconds"],
        parameters=data.get("parameters", {}),
        steady_state_checks=data.get("steady_state_checks", []),
        abort_conditions=data.get("abort_conditions", []),
    )
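Because the model may drift from the requested schema, a defensive check on the parsed JSON before constructing the `ChaosExperiment` can prevent surprising failures. A minimal sketch (the key names and fault-type list mirror the prompt above; `validate_llm_design` is a hypothetical helper):

```python
# Defensive check on the LLM's JSON before building a ChaosExperiment.
# Required keys and fault types mirror the design prompt above.
REQUIRED_KEYS = {"name", "hypothesis", "fault_type", "target_service",
                 "blast_radius", "duration_seconds"}
VALID_FAULTS = {"pod_kill", "network_latency", "network_partition",
                "cpu_stress", "memory_stress", "disk_fill", "dns_failure"}

def validate_llm_design(data: dict) -> list[str]:
    """Return a list of schema problems; an empty list means usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    if data.get("fault_type") not in VALID_FAULTS:
        problems.append(f"unknown fault_type: {data.get('fault_type')!r}")
    if not isinstance(data.get("duration_seconds"), int):
        problems.append("duration_seconds must be an integer")
    return problems
```

Rejecting a malformed design and re-prompting is far cheaper than discovering the problem mid-injection.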

Blast Radius Control

The most critical part of chaos engineering is preventing experiments from becoming real incidents. The agent enforces strict blast radius limits.

from datetime import datetime

class BlastRadiusController:
    def __init__(self, max_affected_pods: int = 1, excluded_namespaces: list[str] | None = None):
        self.max_affected_pods = max_affected_pods
        self.excluded_namespaces = excluded_namespaces or [
            "kube-system", "monitoring", "istio-system"
        ]
        self.active_experiments: list[str] = []

    def validate_experiment(self, experiment: ChaosExperiment) -> tuple[bool, str]:
        # Never touch system namespaces
        if experiment.target_namespace in self.excluded_namespaces:
            return False, f"Namespace {experiment.target_namespace} is protected"

        # Only one experiment at a time
        if self.active_experiments:
            return False, "Another experiment is already running"

        # Duration limits
        if experiment.duration_seconds > 300:
            return False, "Experiment duration exceeds 5-minute maximum"

        # Blast radius check
        if experiment.blast_radius == "namespace":
            return False, "Namespace-wide fault not permitted in production"

        return True, "Approved"

    def register_experiment(self, experiment_id: str):
        self.active_experiments.append(experiment_id)

    def deregister_experiment(self, experiment_id: str):
        self.active_experiments.remove(experiment_id)
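One gap worth noting: `max_affected_pods` is stored by the controller but never checked. A pure-function sketch of one way to enforce it, mapping `blast_radius` onto a worst-case pod count (`estimated_affected_pods` and `within_pod_budget` are illustrative helpers):

```python
# Sketch: enforce the max_affected_pods limit the controller records but
# does not yet check. blast_radius values follow the ChaosExperiment
# convention ("single_pod", "service", "namespace").
def estimated_affected_pods(blast_radius: str, service_replicas: int) -> int:
    """Worst-case number of pods an experiment can touch."""
    return {"single_pod": 1, "service": service_replicas}.get(
        blast_radius, service_replicas
    )

def within_pod_budget(blast_radius: str, service_replicas: int,
                      max_affected_pods: int = 1) -> bool:
    """True if the experiment stays inside the configured pod budget."""
    return estimated_affected_pods(blast_radius, service_replicas) <= max_affected_pods
```

This slots naturally into `validate_experiment` as one more rejection rule, using the replica count fetched from the cluster at validation time.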

Fault Injection Engine

The agent generates Kubernetes-native fault injection manifests using Chaos Mesh or LitmusChaos CRDs.


import yaml

class FaultInjector:
    def generate_chaos_mesh_manifest(
        self, experiment: ChaosExperiment
    ) -> dict:
        if experiment.fault_type == FaultType.POD_KILL:
            return {
                "apiVersion": "chaos-mesh.org/v1alpha1",
                "kind": "PodChaos",
                "metadata": {
                    "name": experiment.experiment_id,
                    "namespace": experiment.target_namespace,
                },
                "spec": {
                    "action": "pod-kill",
                    "mode": "one",
                    "selector": {
                        "labelSelectors": {
                            "app": experiment.target_service,
                        },
                    },
                    "duration": f"{experiment.duration_seconds}s",
                },
            }

        elif experiment.fault_type == FaultType.NETWORK_LATENCY:
            latency_ms = experiment.parameters.get("latency_ms", 500)
            return {
                "apiVersion": "chaos-mesh.org/v1alpha1",
                "kind": "NetworkChaos",
                "metadata": {
                    "name": experiment.experiment_id,
                    "namespace": experiment.target_namespace,
                },
                "spec": {
                    "action": "delay",
                    "mode": "one",
                    "selector": {
                        "labelSelectors": {
                            "app": experiment.target_service,
                        },
                    },
                    "delay": {
                        "latency": f"{latency_ms}ms",
                        "jitter": "50ms",
                        "correlation": "50",
                    },
                    "duration": f"{experiment.duration_seconds}s",
                },
            }

        elif experiment.fault_type == FaultType.CPU_STRESS:
            return {
                "apiVersion": "chaos-mesh.org/v1alpha1",
                "kind": "StressChaos",
                "metadata": {
                    "name": experiment.experiment_id,
                    "namespace": experiment.target_namespace,
                },
                "spec": {
                    "mode": "one",
                    "selector": {
                        "labelSelectors": {
                            "app": experiment.target_service,
                        },
                    },
                    "stressors": {
                        "cpu": {
                            "workers": experiment.parameters.get("workers", 2),
                            "load": experiment.parameters.get("load", 80),
                        },
                    },
                    "duration": f"{experiment.duration_seconds}s",
                },
            }

        else:
            # Fail loudly rather than returning None for fault types
            # without a manifest template yet.
            raise ValueError(
                f"No manifest template for fault type: {experiment.fault_type}"
            )

    async def inject(self, experiment: ChaosExperiment) -> bool:
        manifest = self.generate_chaos_mesh_manifest(experiment)
        manifest_yaml = yaml.dump(manifest)

        import subprocess
        result = subprocess.run(
            ["kubectl", "apply", "-f", "-"],
            input=manifest_yaml, capture_output=True, text=True,
        )
        return result.returncode == 0

Observation and Recovery Verification

The agent monitors steady-state metrics during the experiment and verifies the system recovers after the fault is removed.

import asyncio
import httpx

class ExperimentObserver:
    def __init__(self, prometheus_url: str):
        self.prom_url = prometheus_url
        self.http = httpx.AsyncClient(timeout=10)

    async def check_steady_state(
        self, checks: list[dict]
    ) -> tuple[bool, list[str]]:
        violations = []
        for check in checks:
            query = check["query"]
            min_val = check.get("min")
            max_val = check.get("max")

            resp = await self.http.get(
                f"{self.prom_url}/api/v1/query",
                params={"query": query},
            )
            result = resp.json()["data"]["result"]
            if not result:
                violations.append(f"No data for: {query}")
                continue

            value = float(result[0]["value"][1])
            if min_val is not None and value < min_val:
                violations.append(f"{query} = {value} (below min {min_val})")
            if max_val is not None and value > max_val:
                violations.append(f"{query} = {value} (above max {max_val})")

        return len(violations) == 0, violations

    async def run_experiment_loop(
        self, experiment: ChaosExperiment, check_interval: int = 10
    ) -> dict:
        results = {"violations": [], "aborted": False, "recovered": False}
        elapsed = 0

        while elapsed < experiment.duration_seconds:
            healthy, violations = await self.check_steady_state(
                experiment.steady_state_checks
            )
            if violations:
                results["violations"].extend(violations)

            _, abort_violations = await self.check_steady_state(
                experiment.abort_conditions
            )
            if abort_violations:
                results["aborted"] = True
                await self._abort_experiment(experiment)
                break

            await asyncio.sleep(check_interval)
            elapsed += check_interval

        # Post-experiment: verify recovery within 60 seconds
        for _ in range(6):
            await asyncio.sleep(10)
            healthy, _ = await self.check_steady_state(
                experiment.steady_state_checks
            )
            if healthy:
                results["recovered"] = True
                break

        return results

    async def _abort_experiment(self, experiment: ChaosExperiment):
        # Delete the chaos CRD regardless of fault type; --ignore-not-found
        # suppresses errors for the kinds this experiment did not create.
        import subprocess
        subprocess.run([
            "kubectl", "delete", "podchaos,networkchaos,stresschaos",
            experiment.experiment_id,
            "-n", experiment.target_namespace,
            "--ignore-not-found",
        ], capture_output=True)
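Putting the pieces together, the full cycle is design, validate, inject, observe, then clean up, with deregistration guaranteed even on failure. A sketch of that control flow, exercised here with stub collaborators since there is no cluster or LLM in the loop (`run_chaos_cycle` and the underscore-prefixed stubs are illustrative):

```python
import asyncio
from types import SimpleNamespace

# End-to-end control flow: design -> validate -> inject -> observe -> cleanup.
# Collaborators are injected, so this runs against stubs below; in production
# they would be the classes defined earlier in this article.
async def run_chaos_cycle(designer, controller, injector, observer, cleanup):
    experiment = await designer()
    approved, reason = controller.validate_experiment(experiment)
    if not approved:
        return {"status": "rejected", "reason": reason}
    controller.register_experiment(experiment.experiment_id)
    try:
        if not await injector.inject(experiment):
            return {"status": "injection_failed"}
        results = await observer.run_experiment_loop(experiment)
    finally:
        await cleanup(experiment)  # always remove the chaos CRD
        controller.deregister_experiment(experiment.experiment_id)
    return {"status": "completed", **results}

# Stub collaborators for a dry run (no cluster or LLM access).
class _Controller:
    def __init__(self): self.active = []
    def validate_experiment(self, exp): return True, "Approved"
    def register_experiment(self, eid): self.active.append(eid)
    def deregister_experiment(self, eid): self.active.remove(eid)

class _Injector:
    async def inject(self, exp): return True

class _Observer:
    async def run_experiment_loop(self, exp):
        return {"violations": [], "aborted": False, "recovered": True}

async def _design():
    return SimpleNamespace(experiment_id="chaos-demo")

async def _cleanup(exp):
    pass

result = asyncio.run(run_chaos_cycle(
    _design, _Controller(), _Injector(), _Observer(), _cleanup
))
```

The `try`/`finally` is the important part: cleanup and deregistration happen even when injection or observation raises, so a crashed cycle cannot leave a fault running or block future experiments.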

FAQ

How do I convince my team to run chaos experiments in production?

Start in staging with the agent in "report-only" mode where it designs experiments but only simulates results. Once the team sees the value of the insights, move to production with strict blast radius controls: single-pod only, 60-second maximum duration, and automatic abort on any user-facing degradation. The agent builds confidence gradually.

What if the abort mechanism itself fails?

Layer your safeguards so no single component is a point of failure. The Chaos Mesh duration field ensures the CRD expires automatically even if the agent crashes. Additionally, run a separate watchdog process that queries for active chaos experiments older than the maximum allowed duration and deletes them unconditionally.
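The watchdog's core decision (which chaos resources are too old?) can be kept as a pure function over `kubectl get ... -o json` output, so it is testable without a cluster. A sketch (`stale_experiments` and `MAX_AGE_SECONDS` are illustrative; the 300-second cap matches the blast radius controller above):

```python
import json
from datetime import datetime, timezone

MAX_AGE_SECONDS = 300  # matches the controller's 5-minute cap

def stale_experiments(kubectl_json: str, now: datetime) -> list[str]:
    """Pick chaos resources older than MAX_AGE_SECONDS from the output of
    `kubectl get podchaos,networkchaos,stresschaos -A -o json`, so the
    watchdog can delete them unconditionally."""
    stale = []
    for item in json.loads(kubectl_json).get("items", []):
        created = datetime.strptime(
            item["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)  # creationTimestamp is RFC 3339 UTC
        if (now - created).total_seconds() > MAX_AGE_SECONDS:
            stale.append(item["metadata"]["name"])
    return stale
```

The watchdog loop itself then only needs to shell out to `kubectl get`, call this function, and `kubectl delete` whatever it returns.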

How does the agent decide which experiments to run next?

The agent maintains a coverage map of tested failure modes per service. It prioritizes untested combinations: if the payment service has been tested for pod-kill but never for network latency, network latency gets priority. It also weighs services by business criticality and recent change frequency since recently modified code is more likely to have resilience gaps.
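A sketch of that prioritization: score each (service, fault) pair by business criticality divided by how often it has been tested, and pick the maximum (`next_experiment` and the abbreviated fault list are illustrative):

```python
from itertools import product

# Fault types to schedule against; mirrors the FaultType enum (abbreviated).
FAULTS = ["pod_kill", "network_latency", "cpu_stress"]

def next_experiment(coverage: dict, criticality: dict) -> tuple[str, str]:
    """Pick the next (service, fault) pair to test.

    coverage maps (service, fault) -> times tested; criticality maps
    service -> business weight. Untested combinations on critical
    services score highest.
    """
    def score(pair):
        service, _ = pair
        return criticality.get(service, 1.0) / (1 + coverage.get(pair, 0))
    return max(product(criticality, FAULTS), key=score)
```

Recent change frequency can be folded in the same way, as another multiplier on the score.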


#ChaosEngineering #Resilience #SRE #FaultInjection #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team
