
AI Agent Reliability Patterns: Retries, Fallbacks, and Circuit Breakers for Production Agents

How to build reliable AI agents using battle-tested distributed systems patterns: retry strategies, fallback chains, circuit breakers, and graceful degradation.

Agents Fail. The Question Is How Gracefully.

AI agents in production face a constant stream of failures: API rate limits, tool execution errors, malformed LLM outputs, timeouts on external services, and model hallucinations that derail multi-step plans. The difference between a demo agent and a production agent is not capability -- it is reliability engineering.

The good news is that decades of distributed systems engineering have produced patterns that apply directly to agent systems.

Pattern 1: Structured Retries

Not all failures are equal. Your retry strategy should match the failure type:

import logging

from anthropic import AsyncAnthropic, RateLimitError, APITimeoutError
from tenacity import (
    retry, stop_after_attempt, wait_random_exponential,
    retry_if_exception_type, before_sleep_log,
)

logger = logging.getLogger(__name__)
client = AsyncAnthropic()

@retry(
    # Retry only transient failures; invalid requests fail immediately
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    # Exponential backoff with full jitter, capped at 60 seconds
    wait=wait_random_exponential(multiplier=1, max=60),
    stop=stop_after_attempt(5),
    before_sleep=before_sleep_log(logger, logging.WARNING)
)
async def call_llm(messages, tools):
    return await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=messages,
        tools=tools
    )

Key principles:

  • Exponential backoff: Prevents thundering herd on rate limits
  • Jitter: Randomize retry delays so multiple agents do not retry in synchronized waves
  • Selective retry: Only retry transient errors (rate limits, timeouts). Do not retry on invalid requests or authentication failures
  • Maximum attempts: Always cap retries to prevent infinite loops
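If you would rather not depend on a retry library, the principles above are simple enough to implement directly. A minimal sketch (function names are illustrative, not from any library):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt) seconds for retry number `attempt`."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(exc, attempt, max_attempts=5):
    """Retry only transient failures, and never past the attempt cap."""
    transient = isinstance(exc, (TimeoutError, ConnectionError))
    return transient and attempt < max_attempts
```

The `random.uniform(0, ...)` call is the jitter: even if a hundred agents hit a rate limit at the same instant, their retries spread out instead of arriving together.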

Pattern 2: Model Fallback Chains

When your primary model is unavailable or degraded, fall back to alternatives:

MODEL_CHAIN = [
    {"model": "claude-sonnet-4-20250514", "provider": "anthropic"},
    {"model": "gpt-4o", "provider": "openai"},
    {"model": "claude-haiku-4-20250514", "provider": "anthropic"},  # Cheaper, faster, less capable
]

async def resilient_llm_call(messages, tools):
    for model_config in MODEL_CHAIN:
        try:
            return await call_provider(
                model=model_config["model"],
                provider=model_config["provider"],
                messages=messages,
                tools=tools
            )
        except (ServiceUnavailableError, RateLimitError) as e:
            logger.warning(f"Fallback from {model_config['model']}: {e}")
            continue
    raise AllModelsUnavailableError("Exhausted all model fallbacks")

Important considerations:

  • Prompts may need adjustment for different models (tool schemas, system prompt format)
  • Track which model actually served each request for quality monitoring
  • Quality may degrade with fallback models -- alert when the primary model has been unavailable for extended periods
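Tracking which model actually served a request can be as simple as returning the fallback depth alongside the response. A minimal sketch (`LLMResult` and the helper name are illustrative):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class LLMResult:
    content: str
    served_by: str       # model that actually answered
    fallback_depth: int  # 0 = primary model

async def call_with_tracking(chain, call, messages):
    """Walk the fallback chain, recording how far down it we had to go."""
    for depth, cfg in enumerate(chain):
        try:
            content = await call(cfg["model"], messages)
            return LLMResult(content, cfg["model"], depth)
        except Exception:
            continue
    raise RuntimeError("Exhausted all model fallbacks")
```

Alerting on the average `fallback_depth` over a time window is one way to notice that the primary model has been degraded for an extended period.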

Pattern 3: Circuit Breakers

Prevent cascading failures by stopping calls to a failing service:


import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"  # CLOSED = normal, OPEN = blocking, HALF_OPEN = testing
        self.last_failure_time = None

    async def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # allow one trial call through
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = await func(*args, **kwargs)
            self.state = "CLOSED"
            self.failure_count = 0  # any success closes the breaker and resets the count
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise

Use separate circuit breakers for each external dependency (LLM provider, tool APIs, databases).
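Keeping one breaker per dependency is easiest with a small registry that creates breakers lazily by name. A minimal sketch (`BreakerRegistry` is a hypothetical helper, not a library class):

```python
class BreakerRegistry:
    """One circuit breaker per named dependency, created on first use."""

    def __init__(self, factory):
        self._factory = factory   # callable that builds a fresh breaker
        self._breakers = {}

    def get(self, dependency):
        # Same name -> same breaker, so failures accumulate per dependency
        if dependency not in self._breakers:
            self._breakers[dependency] = self._factory()
        return self._breakers[dependency]
```

This way a failing search tool trips only its own breaker: calls to the LLM provider or the database keep flowing.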

Pattern 4: Idempotent Tool Execution

Agent tools must be safe to retry. If a tool call times out, the agent (or retry logic) may call it again. Non-idempotent tools can cause double-charges, duplicate records, or other side effects.

Design principles:

  • Use idempotency keys for operations that create or modify resources
  • Make read operations naturally idempotent
  • Log tool execution results and check for existing results before re-executing
  • Use database transactions with unique constraints to prevent duplicates
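The idempotency-key idea can be sketched with an in-memory cache keyed by a hash of the tool name and arguments; in production the cache would be a database table with a unique constraint on the key. All names here are illustrative:

```python
import hashlib
import json

class IdempotentExecutor:
    """Caches tool results by idempotency key so retries return the
    stored result instead of re-running the side effect."""

    def __init__(self):
        self._results = {}  # production: a table with a UNIQUE key column

    def key(self, tool_name, args):
        # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} match
        canonical = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def execute(self, tool_name, args, fn):
        k = self.key(tool_name, args)
        if k not in self._results:   # side effect runs at most once per key
            self._results[k] = fn(**args)
        return self._results[k]
```

A timed-out `charge_customer` call that gets retried now returns the stored result of the first attempt rather than charging twice.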

Pattern 5: Graceful Degradation

When full functionality is unavailable, provide reduced but useful service:

  • Tool failure: If a search tool fails, the agent can still answer from its parametric knowledge (with appropriate caveats)
  • Context retrieval failure: If RAG retrieval fails, fall back to a general response with a disclaimer
  • Timeout: If the agent cannot complete a complex task within the time budget, return partial results with an explanation
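The tool-failure case above reduces to a try/except that swaps in a caveated answer from the model's parametric knowledge. A minimal sketch, assuming `search_tool` and `llm` are async callables provided by your stack:

```python
import asyncio

async def answer(question, search_tool, llm):
    """Degrade to parametric knowledge, with a caveat, if search fails."""
    try:
        context = await search_tool(question)
        return await llm(question, context=context)
    except Exception:
        # Reduced but useful service: answer without live context
        text = await llm(question, context=None)
        return ("Note: live search was unavailable; this answer is based "
                "on the model's training data.\n\n" + text)
```

The caveat matters as much as the fallback itself: the user learns both the answer and how much to trust it.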

Pattern 6: Checkpointing for Long-Running Agents

Agents that run for minutes or hours should checkpoint their state:

class CheckpointedAgent:
    async def run(self, task):
        checkpoint = await self.load_checkpoint(task.id)

        for step in self.plan(task, resume_from=checkpoint):
            result = await self.execute_step(step)
            await self.save_checkpoint(task.id, step, result)

            if result.failed and not result.retryable:
                return self.partial_result(task.id)

        return self.final_result(task.id)

If the agent crashes or the process restarts, it resumes from the last checkpoint instead of starting over.
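The checkpoint store behind `save_checkpoint`/`load_checkpoint` can start as one JSON file per task, written atomically so a crash mid-write never corrupts the last good checkpoint. A minimal sketch (the class name and file layout are illustrative):

```python
import json
import os

class FileCheckpointStore:
    """Minimal checkpoint store: one JSON file per task, written atomically."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, task_id):
        return os.path.join(self.directory, f"{task_id}.json")

    def save(self, task_id, step, result):
        state = self.load(task_id) or {"completed_steps": []}
        state["completed_steps"].append({"step": step, "result": result})
        tmp = self._path(task_id) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self._path(task_id))  # atomic rename

    def load(self, task_id):
        try:
            with open(self._path(task_id)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None  # no checkpoint yet: start from the beginning
```

Writing to a temporary file and renaming it means readers always see either the old checkpoint or the new one, never a half-written file.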

Measuring Reliability

Track these metrics to quantify agent reliability:

  • Task completion rate: Percentage of tasks completed successfully
  • Mean time to completion: Average wall-clock time per task
  • Retry rate: How often retries are needed (high rates indicate systemic issues)
  • Fallback rate: How often the primary model/tool is unavailable
  • Error categorization: Breakdown of failures by type (rate limit, timeout, parsing, tool error)
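These metrics can be collected with a handful of running counters, one record per completed or failed task. A minimal sketch (class and field names are illustrative):

```python
from collections import Counter

class ReliabilityMetrics:
    """Running counters for task-level reliability metrics."""

    def __init__(self):
        self.tasks = 0
        self.successes = 0
        self.retries = 0
        self.fallbacks = 0
        self.errors = Counter()  # breakdown of failures by type

    def record(self, success, retries=0, fallback=False, error_type=None):
        self.tasks += 1
        self.successes += success      # bool counts as 0 or 1
        self.retries += retries
        self.fallbacks += fallback
        if error_type:
            self.errors[error_type] += 1

    def completion_rate(self):
        return self.successes / self.tasks if self.tasks else 0.0
```

In practice you would export these as gauges/counters to your metrics backend, but even this in-process version is enough to spot a rising retry rate before users notice.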

Sources: Release It! (Michael Nygard) | Anthropic Agent Reliability | AWS Well-Architected Framework

Written by

CallSphere Team
