---
title: "The Circuit Breaker Pattern: Protecting Agent Systems from Cascading Failures"
description: "Implement the Circuit Breaker pattern to protect AI agent systems from cascading failures with automatic failure detection, open/half-open/closed states, and graceful recovery."
canonical: https://callsphere.ai/blog/circuit-breaker-pattern-protecting-agent-systems-cascading-failures
category: "Learn Agentic AI"
tags: ["Agent Design Patterns", "Circuit Breaker", "Python", "Fault Tolerance", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-08T22:13:37.039Z
---

# The Circuit Breaker Pattern: Protecting Agent Systems from Cascading Failures

> Implement the Circuit Breaker pattern to protect AI agent systems from cascading failures with automatic failure detection, open/half-open/closed states, and graceful recovery.

## Why Agent Systems Need Circuit Breakers

AI agents depend on external services — LLM APIs, databases, tool endpoints — that can fail or slow down. Without protection, a failing dependency causes the agent to hang or error repeatedly, consuming resources and potentially bringing down the entire system. The Circuit Breaker pattern detects sustained failures and stops making requests to the failing service, allowing it time to recover while the agent falls back to alternative behavior.

The name comes from electrical engineering: when a circuit experiences an overload, the breaker trips open to prevent damage. Once conditions stabilize, the breaker closes and normal operation resumes.

## The Three States

1. **Closed** — Normal operation. Requests flow through. Failures are counted.
2. **Open** — The breaker has tripped. Every request is rejected immediately without calling the downstream service: the caller receives a fallback response if one is configured, or an error otherwise.
3. **Half-Open** — After a cooldown period, the breaker allows a limited number of test requests through. If they succeed, the breaker closes. If they fail, it opens again.
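
The transitions between these states can be summarized as a small lookup table. This is only an illustrative sketch: the event names (`failure_threshold_reached`, `cooldown_elapsed`, and so on) and the `next_state` helper are hypothetical labels chosen for the example, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are made up for clarity.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "enough_probes_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    # Any pair not listed in the table leaves the state unchanged.
    return TRANSITIONS.get((state, event), state)


state = State.CLOSED
for event in ["failure_threshold_reached", "cooldown_elapsed",
              "probe_failed", "cooldown_elapsed",
              "enough_probes_succeeded"]:
    state = next_state(state, event)
    print(f"{event:28s} -> {state.value}")
```

Note that a failed test request in the half-open state sends the breaker straight back to open, which restarts the cooldown.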

## Implementation

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Callable, Any
import threading

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CircuitStats:
    total_calls: int = 0
    failures: int = 0
    successes: int = 0
    last_failure_time: datetime | None = None
    last_success_time: datetime | None = None

class CircuitBreaker:
    def __init__(
        self,
        name: str,
        failure_threshold: int = 5,
        recovery_timeout: int = 30,
        half_open_max_calls: int = 3,
        success_threshold: int = 2,
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = timedelta(seconds=recovery_timeout)
        self.half_open_max_calls = half_open_max_calls
        self.success_threshold = success_threshold

        self._state = CircuitState.CLOSED
        self._stats = CircuitStats()
        self._half_open_calls = 0
        self._half_open_successes = 0
        self._lock = threading.Lock()
        self._opened_at: datetime | None = None

    @property
    def state(self) -> CircuitState:
        with self._lock:
            if (self._state == CircuitState.OPEN
                    and self._opened_at
                    and datetime.now() - self._opened_at
                    >= self.recovery_timeout):
                self._transition_to(CircuitState.HALF_OPEN)
            return self._state

    def _transition_to(self, new_state: CircuitState):
        old = self._state
        self._state = new_state
        print(f"[{self.name}] Circuit: {old.value} -> {new_state.value}")

        if new_state == CircuitState.OPEN:
            self._opened_at = datetime.now()
        elif new_state == CircuitState.HALF_OPEN:
            self._half_open_calls = 0
            self._half_open_successes = 0
        elif new_state == CircuitState.CLOSED:
            self._stats.failures = 0

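    def call(self, func: Callable, *args,
             fallback: Callable | None = None,
             **kwargs) -> Any:
        # Reading the property promotes OPEN -> HALF_OPEN once the
        # cooldown has elapsed.
        _ = self.state

        with self._lock:
            rejected = self._state == CircuitState.OPEN
            if self._state == CircuitState.HALF_OPEN:
                # Admit only a limited number of trial calls.
                if self._half_open_calls >= self.half_open_max_calls:
                    rejected = True
                else:
                    self._half_open_calls += 1

        if rejected:
            if fallback:
                return fallback(*args, **kwargs)
            # total_seconds() stays correct for timeouts longer than a day,
            # where the .seconds attribute would wrap around.
            raise CircuitOpenError(
                f"Circuit '{self.name}' is rejecting calls. "
                f"Retry after {self.recovery_timeout.total_seconds():.0f}s."
            )

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            if fallback:
                return fallback(*args, **kwargs)
            raise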

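    def _on_success(self):
        with self._lock:
            self._stats.successes += 1
            self._stats.total_calls += 1
            self._stats.last_success_time = datetime.now()

            if self._state == CircuitState.HALF_OPEN:
                self._half_open_successes += 1
                if (self._half_open_successes
                        >= self.success_threshold):
                    self._transition_to(CircuitState.CLOSED)
            elif self._state == CircuitState.CLOSED:
                # Count consecutive failures only: a success clears the
                # tally, so occasional errors spread over a long period
                # do not eventually trip the breaker.
                self._stats.failures = 0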

    def _on_failure(self):
        with self._lock:
            self._stats.failures += 1
            self._stats.total_calls += 1
            self._stats.last_failure_time = datetime.now()

            if self._state == CircuitState.HALF_OPEN:
                self._transition_to(CircuitState.OPEN)
            elif (self._state == CircuitState.CLOSED
                  and self._stats.failures >= self.failure_threshold):
                self._transition_to(CircuitState.OPEN)

class CircuitOpenError(Exception):
    pass
```

## Using Circuit Breakers with AI Agents

```python
import openai

client = openai.OpenAI()

# Create a breaker for the LLM API
llm_breaker = CircuitBreaker(
    name="openai-api",
    failure_threshold=3,
    recovery_timeout=60,
    success_threshold=2,
)

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        timeout=10,
    )
    return response.choices[0].message.content

def cached_fallback(prompt: str) -> str:
    return "[Service temporarily unavailable. Using cached response.]"

# Protected call
result = llm_breaker.call(
    call_llm,
    "Explain quantum computing",
    fallback=cached_fallback,
)
print(result)
print(f"Circuit state: {llm_breaker.state.value}")
```

## Decorator Variant for Cleaner Usage

```python
from functools import wraps

def circuit_protected(breaker: CircuitBreaker,
                      fallback: Callable | None = None):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            return breaker.call(func, *args, fallback=fallback,
                                **kwargs)
        return wrapper
    return decorator

@circuit_protected(llm_breaker, fallback=cached_fallback)
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize concisely."},
            {"role": "user", "content": text},
        ],
        timeout=10,  # fail fast so a hung request counts as a failure
    )
    return response.choices[0].message.content
```

## Monitoring Circuit Health

Expose circuit breaker statistics for observability. Track how often each breaker opens, how long it stays open, and whether half-open test calls are succeeding. These metrics reveal which dependencies are unreliable and help you size `failure_threshold` and `recovery_timeout` appropriately.

```mermaid
flowchart TD
    CALL(["Inbound Call"])
    HEALTH{"Primary<br/>agent healthy?"}
    PRIMARY["Primary agent<br/>LLM provider A"]
    SECONDARY["Hot standby<br/>LLM provider B"]
    QUEUE[("Persisted<br/>call state")]
    HUMAN(["Live human<br/>fallback"])
    DONE(["Caller served"])
    CALL --> HEALTH
    HEALTH -->|Yes| PRIMARY
    HEALTH -->|Timeout or 5xx| SECONDARY
    PRIMARY --> QUEUE
    SECONDARY --> QUEUE
    PRIMARY --> DONE
    SECONDARY --> DONE
    SECONDARY -->|Both fail| HUMAN
    style HEALTH fill:#f59e0b,stroke:#d97706,color:#1f2937
    style PRIMARY fill:#4f46e5,stroke:#4338ca,color:#fff
    style SECONDARY fill:#0ea5e9,stroke:#0369a1,color:#fff
    style HUMAN fill:#dc2626,stroke:#b91c1c,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
```

## FAQ

### How do I choose the right failure threshold and recovery timeout?

Start with a failure threshold of 5 and a recovery timeout of 30-60 seconds. Monitor your system under real traffic and adjust. Services with high latency variance may need higher thresholds to avoid false trips. Services that recover slowly need longer timeouts. Measure the actual mean time to recovery (MTTR) for each dependency and set the timeout slightly above it.
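
As a rough illustration of "slightly above MTTR", the helper below averages observed outage durations and applies a safety margin. The function name, the 20% margin, the 30-second floor, and the sample durations are all assumptions made for this example; tune them against your own measurements.

```python
from statistics import mean


def suggest_recovery_timeout(outage_seconds: list[float],
                             margin: float = 1.2,
                             floor: float = 30.0) -> float:
    """Hypothetical helper: MTTR times a safety margin, with a lower bound."""
    if not outage_seconds:
        # No data yet: fall back to the low end of the suggested range.
        return floor
    mttr = mean(outage_seconds)
    return max(floor, mttr * margin)


# Made-up outage durations (in seconds) for a single dependency.
observed = [40.0, 55.0, 70.0, 35.0]
timeout = suggest_recovery_timeout(observed)
print(f"MTTR={mean(observed):.0f}s, suggested recovery_timeout={timeout:.0f}s")
```

A mean hides outliers, so a dependency with occasional long outages may be better served by a high percentile than by the average.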

### Should each agent have its own circuit breaker or share one?

Use one circuit breaker per downstream dependency, not per agent. If three agents all call the same LLM API, they should share a single breaker for that API. This way, failures detected by one agent protect all agents from hammering a downed service. Store breakers in a shared registry that agents access by dependency name.
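
A shared registry can be as small as a lock-protected dictionary with get-or-create semantics. The sketch below shows one possible shape; `BreakerRegistry` and its `get` method are hypothetical names, and the factory argument stands in for whatever constructs your breaker (a plain dictionary is used here so the example runs on its own).

```python
import threading
from typing import Callable, Generic, TypeVar

B = TypeVar("B")


class BreakerRegistry(Generic[B]):
    """Hypothetical registry: one breaker instance per dependency name."""

    def __init__(self, factory: Callable[[str], B]):
        self._factory = factory  # builds a breaker for a given name
        self._breakers: dict[str, B] = {}
        self._lock = threading.Lock()

    def get(self, dependency: str) -> B:
        # The lock ensures two agents asking at once share one instance.
        with self._lock:
            if dependency not in self._breakers:
                self._breakers[dependency] = self._factory(dependency)
            return self._breakers[dependency]


# Stand-in factory for the demo; with the class from this article it
# could be `lambda name: CircuitBreaker(name=name)`.
registry = BreakerRegistry(lambda name: {"name": name, "failures": 0})

planner_view = registry.get("openai-api")
researcher_view = registry.get("openai-api")
db_view = registry.get("postgres")

planner_view["failures"] += 1  # a failure recorded by one agent...
print(researcher_view["failures"])  # ...is visible to the other
print(planner_view is researcher_view, planner_view is db_view)
```

Keying by dependency name rather than by agent is what lets a failure seen by one agent protect the others.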

### How does the circuit breaker interact with retry logic?

The circuit breaker should wrap the retry logic, not the other way around. Retries happen inside the `func` that the breaker calls. If all retries fail, that counts as one failure for the breaker. This prevents retries from inflating the failure count and tripping the breaker prematurely.
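
The ordering can be illustrated with a tiny standalone example. `with_retries` and `CountingBreaker` below are hypothetical stand-ins (the latter only counts failures and has no states at all); the point is that three failed attempts inside the wrapped function register as a single breaker failure.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(func: Callable[[], T], attempts: int = 3) -> T:
    """Hypothetical retry helper: re-raise only after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            # A real helper would usually back off (sleep) here.
    raise ValueError("attempts must be at least 1")


class CountingBreaker:
    """Stand-in for a breaker: records one failure per failed call."""

    def __init__(self) -> None:
        self.failures = 0

    def call(self, func: Callable[[], T]) -> T:
        try:
            return func()
        except Exception:
            self.failures += 1
            raise


attempts_made = 0


def flaky_tool() -> str:
    # Simulated dependency that is down for the whole demo.
    global attempts_made
    attempts_made += 1
    raise ConnectionError("service unavailable")


breaker = CountingBreaker()
try:
    # Breaker on the outside, retries on the inside.
    breaker.call(lambda: with_retries(flaky_tool, attempts=3))
except ConnectionError:
    pass

print(f"attempts={attempts_made}, breaker failures={breaker.failures}")
```

With the order reversed (retries wrapping the breaker), the same outage would record three breaker failures instead of one.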


---

Source: https://callsphere.ai/blog/circuit-breaker-pattern-protecting-agent-systems-cascading-failures
