---
title: "Retry Strategies for LLM API Calls: Exponential Backoff with Jitter and Tenacity"
description: "Implement production-grade retry logic for LLM API calls using exponential backoff, jitter, and the Tenacity library. Learn when to retry, when to stop, and how to avoid the thundering herd problem."
canonical: https://callsphere.ai/blog/retry-strategies-llm-api-calls-exponential-backoff-jitter-tenacity
category: "Learn Agentic AI"
tags: ["Retry Patterns", "Exponential Backoff", "Tenacity", "LLM APIs", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:44.159Z
---

# Retry Strategies for LLM API Calls: Exponential Backoff with Jitter and Tenacity

> Implement production-grade retry logic for LLM API calls using exponential backoff, jitter, and the Tenacity library. Learn when to retry, when to stop, and how to avoid the thundering herd problem.

## The Problem with Naive Retries

LLM API calls fail regularly. Rate limits, server overload, network blips, and cold start latency all cause intermittent errors. The instinct is to wrap the call in a while loop with a sleep, but naive retries create serious problems: they hammer the already-stressed API, synchronize retry storms across clients, and can rack up costs by resending expensive prompts repeatedly.

Production agents need structured retry strategies that maximize success probability while minimizing waste.

## Understanding Backoff Algorithms

### Fixed Delay

The simplest approach — wait a constant duration between retries. This works for isolated scripts but fails in production because all clients retry at the same intervals, creating synchronized load spikes.
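In code, the anti-pattern looks something like this (a minimal sketch; `call_api` stands in for whatever function actually performs the request):

```python
import time

def call_with_fixed_delay(call_api, max_attempts: int = 3, delay_seconds: float = 2.0):
    """Retry with a constant delay between attempts (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Every client that failed at the same moment sleeps the same
            # 2 seconds and retries together -- the synchronized-spike problem.
            time.sleep(delay_seconds)
```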


### Exponential Backoff

Each retry waits exponentially longer: 1s, 2s, 4s, 8s, 16s. This gives the overloaded service time to recover. However, if many clients start failing at the same time, they all retry on the same exponential schedule, so the load spikes remain synchronized even as they spread apart.
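The schedule itself is just the base delay doubled on each attempt and capped at a maximum; a minimal sketch with illustrative defaults:

```python
def exponential_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Pure exponential backoff: 1s, 2s, 4s, 8s, ... capped at max_delay."""
    return min(base_delay * (2 ** attempt), max_delay)
```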

### Exponential Backoff with Jitter

Adding randomness (jitter) to the backoff interval desynchronizes clients. This is the gold standard for distributed systems.

```python
import random
import time
import httpx

def exponential_backoff_with_jitter(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Calculate delay with full jitter strategy."""
    exp_delay = base_delay * (2 ** attempt)
    capped = min(exp_delay, max_delay)
    return random.uniform(0, capped)

def call_llm_with_retry(
    prompt: str,
    max_attempts: int = 5,
    retryable_status_codes: set[int] | None = None,
) -> dict:
    if retryable_status_codes is None:
        retryable_status_codes = {429, 500, 502, 503, 504}

    last_exception = None
    for attempt in range(max_attempts):
        try:
            response = httpx.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
                headers={"Authorization": "Bearer ..."},
                timeout=30.0,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code not in retryable_status_codes:
                raise RuntimeError(f"Non-retryable status: {response.status_code}")

            if attempt < max_attempts - 1:  # don't sleep after the final attempt
                delay = exponential_backoff_with_jitter(attempt)
                print(f"Attempt {attempt + 1} got {response.status_code}, retrying in {delay:.1f}s")
                time.sleep(delay)

        except (httpx.ConnectTimeout, httpx.ReadTimeout) as exc:
            last_exception = exc
            if attempt < max_attempts - 1:
                delay = exponential_backoff_with_jitter(attempt)
                time.sleep(delay)

    raise RuntimeError(f"All {max_attempts} attempts failed") from last_exception
```

## Using Tenacity for Production Retries

The Tenacity library provides a declarative, composable retry framework that eliminates boilerplate.

```python
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
    after_log,
)
import logging
import httpx

logger = logging.getLogger("agent.llm")

class RateLimitError(Exception):
    pass

class ServerOverloadError(Exception):
    pass

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(
        initial=1,
        max=60,
        jitter=5,
    ),
    retry=retry_if_exception_type((RateLimitError, ServerOverloadError, TimeoutError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO),
    reraise=True,
)
async def call_llm(messages: list[dict], model: str = "gpt-4o") -> str:
    """Call LLM with automatic retry on transient failures."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json={"model": model, "messages": messages},
            headers={"Authorization": "Bearer ..."},
            timeout=30.0,
        )
        if resp.status_code == 429:
            raise RateLimitError("Rate limited")
        if resp.status_code >= 500:
            raise ServerOverloadError(f"Server error: {resp.status_code}")
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```
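Since `@retry` wraps the coroutine, callers simply await it and Tenacity handles the waits between attempts. A quick usage sketch (the message content is made up):

```python
import asyncio

async def main() -> None:
    answer = await call_llm([{"role": "user", "content": "Summarize our retry policy."}])
    print(answer)

asyncio.run(main())
```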

## Circuit Breaking: Knowing When to Stop

Retries are only useful when the failure is transient. If the provider is down for an extended period, continuous retries waste resources and increase latency. A circuit breaker stops retries after a threshold of consecutive failures and only allows a test request after a cooldown period.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = healthy, open = failing fast, half-open = probing

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        elapsed = time.time() - self.last_failure_time
        if elapsed >= self.cooldown_seconds:
            self.state = "half-open"
            return True
        return False
```
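Here is one way the breaker might wrap the retrying call from earlier; the fallback behavior (returning `None`) is just an assumption for the sketch:

```python
breaker = CircuitBreaker(failure_threshold=5, cooldown_seconds=30.0)

def guarded_call(prompt: str) -> dict | None:
    """Only attempt the provider when the breaker allows it."""
    if not breaker.can_proceed():
        return None  # e.g. fall back to a secondary provider or a canned response
    try:
        result = call_llm_with_retry(prompt)
        breaker.record_success()
        return result
    except RuntimeError:
        breaker.record_failure()
        return None
```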

## FAQ

### What is jitter and why does it matter?

Jitter adds randomness to retry delays. Without it, hundreds of clients that fail simultaneously will retry at the exact same moments (1s, 2s, 4s), creating synchronized traffic spikes that overwhelm the recovering server. Full jitter picks a random delay between 0 and the calculated backoff, spreading retries evenly over time.

### Should I use the Retry-After header from the API?

Absolutely. When an LLM provider returns a 429 with a Retry-After header, always respect that value as your minimum wait time. Combine it with your backoff strategy by using `max(retry_after_value, calculated_backoff)` to ensure you never retry sooner than the server requests.
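As a sketch, a delay helper that honors the header could look like this, reusing the `exponential_backoff_with_jitter` function defined above (`response` is an `httpx.Response`):

```python
def next_delay(response, attempt: int) -> float:
    """Wait at least as long as the server asks, never less."""
    calculated = exponential_backoff_with_jitter(attempt)
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(float(retry_after), calculated)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; ignored here for brevity
    return calculated
```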

### How many retries are appropriate for LLM calls?

For synchronous user-facing requests, 3 attempts with a maximum total timeout of 30 seconds is typical. For background processing, 5 to 7 attempts with a maximum backoff of 60 seconds works well. Always set an overall deadline so the total retry sequence cannot exceed your request budget.
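With Tenacity, that overall deadline can be expressed by combining stop conditions, so the retrier gives up on whichever limit is hit first. The numbers below are illustrative:

```python
from tenacity import retry, stop_after_attempt, stop_after_delay, wait_exponential_jitter

@retry(
    stop=stop_after_attempt(3) | stop_after_delay(30),  # whichever limit is hit first
    wait=wait_exponential_jitter(initial=1, max=10),
    reraise=True,
)
def fetch_completion() -> str:
    ...  # the actual API call goes here
```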

---

#RetryPatterns #ExponentialBackoff #Tenacity #LLMAPIs #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/retry-strategies-llm-api-calls-exponential-backoff-jitter-tenacity
