Learn Agentic AI · 10 min read

Retry Strategies for LLM API Calls: Exponential Backoff with Jitter and Tenacity

Implement production-grade retry logic for LLM API calls using exponential backoff, jitter, and the Tenacity library. Learn when to retry, when to stop, and how to avoid the thundering herd problem.

The Problem with Naive Retries

LLM API calls fail regularly. Rate limits, server overload, network blips, and cold start latency all cause intermittent errors. The instinct is to wrap the call in a while loop with a sleep, but naive retries create serious problems: they hammer the already-stressed API, synchronize retry storms across clients, and can rack up costs by resending expensive prompts repeatedly.

Production agents need structured retry strategies that maximize success probability while minimizing waste.

Understanding Backoff Algorithms

Fixed Delay

The simplest approach — wait a constant duration between retries. This works for isolated scripts but fails in production because all clients retry at the same intervals, creating synchronized load spikes.
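A minimal sketch of the pattern (the `call_with_fixed_delay` helper is hypothetical, not from any library):

```python
import time

def call_with_fixed_delay(fn, max_attempts: int = 3, delay: float = 2.0):
    """Retry fn with a constant pause between attempts."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts -- propagate the last error
            time.sleep(delay)  # same wait every time: failing clients stay in lockstep
```

Every failing client waits exactly `delay` seconds, so their retries land on the server at the same moments.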

Exponential Backoff

Each retry waits exponentially longer: 1s, 2s, 4s, 8s, 16s. This gives the overloaded service time to recover. However, if many clients start failing at the same time, they all retry at the same exponential intervals.
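The schedule is simply `base * 2**attempt`, capped in practice; a throwaway helper (`exponential_delays`, hypothetical) makes the doubling visible:

```python
def exponential_delays(base: float, attempts: int) -> list[float]:
    """Delay before retry n doubles each time: base * 2**n."""
    return [base * (2 ** attempt) for attempt in range(attempts)]

# With base=1.0 and 5 attempts, the waits are 1, 2, 4, 8, 16 seconds.
```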

Exponential Backoff with Jitter

Adding randomness (jitter) to the backoff interval desynchronizes clients. This is the gold standard for distributed systems.

import random
import time
import httpx

def exponential_backoff_with_jitter(
    attempt: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Calculate delay with full jitter strategy."""
    exp_delay = base_delay * (2 ** attempt)
    capped = min(exp_delay, max_delay)
    return random.uniform(0, capped)

def call_llm_with_retry(
    prompt: str,
    max_attempts: int = 5,
    retryable_status_codes: set[int] | None = None,
) -> dict:
    if retryable_status_codes is None:
        retryable_status_codes = {429, 500, 502, 503, 504}

    last_exception = None
    for attempt in range(max_attempts):
        try:
            response = httpx.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
                headers={"Authorization": "Bearer ..."},
                timeout=30.0,
            )
            if response.status_code == 200:
                return response.json()
            if response.status_code not in retryable_status_codes:
                raise RuntimeError(f"Non-retryable status: {response.status_code}")
            last_exception = RuntimeError(f"Retryable status: {response.status_code}")

        except (httpx.ConnectTimeout, httpx.ReadTimeout) as exc:
            last_exception = exc

        if attempt < max_attempts - 1:  # don't sleep after the final attempt
            delay = exponential_backoff_with_jitter(attempt)
            print(f"Attempt {attempt + 1} failed, retrying in {delay:.1f}s")
            time.sleep(delay)

    raise RuntimeError(f"All {max_attempts} attempts failed") from last_exception

Using Tenacity for Production Retries

The Tenacity library provides a declarative, composable retry framework that eliminates boilerplate.

from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception_type,
    before_sleep_log,
    after_log,
)
import httpx
import logging

logger = logging.getLogger("agent.llm")

class RateLimitError(Exception):
    pass

class ServerOverloadError(Exception):
    pass

@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential_jitter(
        initial=1,
        max=60,
        jitter=5,
    ),
    retry=retry_if_exception_type((RateLimitError, ServerOverloadError, TimeoutError)),
    before_sleep=before_sleep_log(logger, logging.WARNING),
    after=after_log(logger, logging.INFO),
    reraise=True,
)
async def call_llm(messages: list[dict], model: str = "gpt-4o") -> str:
    """Call LLM with automatic retry on transient failures."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.openai.com/v1/chat/completions",
            json={"model": model, "messages": messages},
            headers={"Authorization": "Bearer ..."},
            timeout=30.0,
        )
        if resp.status_code == 429:
            raise RateLimitError("Rate limited")
        if resp.status_code >= 500:
            raise ServerOverloadError(f"Server error: {resp.status_code}")
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

Circuit Breaking: Knowing When to Stop

Retries are only useful when the failure is transient. If the provider is down for an extended period, continuous retries waste resources and increase latency. A circuit breaker stops retries after a threshold of consecutive failures and only allows a test request after a cooldown period.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"  # closed = healthy, open = failing fast, half-open = probing after cooldown

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        elapsed = time.time() - self.last_failure_time
        if elapsed >= self.cooldown_seconds:
            self.state = "half-open"
            return True
        return False
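A usage sketch of the breaker wrapping a provider call. The `guarded_call` wrapper is hypothetical, not a library API, and the block inlines a condensed copy of the class above so it runs standalone:

```python
import time

class CircuitBreaker:  # condensed copy of the class above, for a self-contained demo
    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "closed"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "open"

    def record_success(self):
        self.failure_count = 0
        self.state = "closed"

    def can_proceed(self) -> bool:
        if self.state == "closed":
            return True
        if time.time() - self.last_failure_time >= self.cooldown_seconds:
            self.state = "half-open"
            return True
        return False

def guarded_call(breaker: CircuitBreaker, fn):
    """Fail fast while the breaker is open; a success in half-open re-closes it."""
    if not breaker.can_proceed():
        raise RuntimeError("circuit open -- skipping call")
    try:
        result = fn()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

The breaker sits outside the retry loop: retries handle one flaky request, the breaker handles a provider that is down altogether.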

FAQ

What is jitter and why does it matter?

Jitter adds randomness to retry delays. Without it, hundreds of clients that fail simultaneously will retry at the exact same moments (1s, 2s, 4s), creating synchronized traffic spikes that overwhelm the recovering server. Full jitter picks a random delay between 0 and the calculated backoff, spreading retries evenly over time.

Should I use the Retry-After header from the API?

Absolutely. When an LLM provider returns a 429 with a Retry-After header, always respect that value as your minimum wait time. Combine it with your backoff strategy by using max(retry_after_value, calculated_backoff) to ensure you never retry sooner than the server requests.

How many retries are appropriate for LLM calls?

For synchronous user-facing requests, 3 attempts with a maximum total timeout of 30 seconds is typical. For background processing, 5 to 7 attempts with a maximum backoff of 60 seconds works well. Always set an overall deadline so the total retry sequence cannot exceed your request budget.
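A plain-stdlib sketch of a deadline-aware loop (hypothetical `retry_with_deadline` helper); with Tenacity the same budget can be expressed by combining stop conditions, e.g. `stop=stop_after_attempt(3) | stop_after_delay(30)`:

```python
import random
import time

def retry_with_deadline(fn, max_attempts: int = 3, deadline: float = 30.0,
                        base: float = 1.0, cap: float = 60.0):
    """Retry with full jitter, but never let the whole sequence exceed the deadline."""
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted
            delay = random.uniform(0.0, min(base * (2 ** attempt), cap))
            if time.monotonic() - start + delay > deadline:
                raise  # sleeping would blow the overall budget -- give up now
            time.sleep(delay)
```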


#RetryPatterns #ExponentialBackoff #Tenacity #LLMAPIs #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
