---
title: "SDK Retry and Error Handling: Building Resilient Client Libraries"
description: "Learn how to implement robust retry policies, error classification, timeout configuration, and structured logging in AI agent SDK client libraries for production reliability."
canonical: https://callsphere.ai/blog/sdk-retry-error-handling-resilient-client-libraries
category: "Learn Agentic AI"
tags: ["Retry Logic", "Error Handling", "SDK Design", "Resilience", "Agentic AI", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T20:39:20.239Z
---

# SDK Retry and Error Handling: Building Resilient Client Libraries

> Learn how to implement robust retry policies, error classification, timeout configuration, and structured logging in AI agent SDK client libraries for production reliability.

## Why SDKs Must Handle Retries

Network requests fail. Servers return 500 errors during deployments. Rate limiters throttle bursts. DNS resolution hiccups. TCP connections reset. If your SDK surfaces every transient failure directly to the user, their application becomes fragile. A production-grade SDK retries transient errors automatically so that intermittent infrastructure issues do not cascade into application failures.

The goal is not to mask errors — it is to absorb noise so that when an error reaches the user, it represents a genuine problem that requires their attention.

## Error Classification

The first step is classifying errors into retryable and non-retryable categories. This classification drives the retry engine:

```mermaid
flowchart TD
    FAILREQ(["Request fails"])
    NET{"Network
exception?"}
    KIND{"Connection or
timeout error?"}
    STATUS{"HTTP status
code?"}
    RETRY["Retryable:
back off and retry"]
    RATE["Rate limited:
honor Retry-After"]
    SURFACE["Non-retryable:
surface to caller"]
    FAILREQ --> NET
    NET -->|Yes| KIND
    KIND -->|Yes| RETRY
    KIND -->|No| SURFACE
    NET -->|No| STATUS
    STATUS -->|429| RATE
    STATUS -->|408, 409, 5xx| RETRY
    STATUS -->|400, 401, 403, 404| SURFACE
    style RETRY fill:#059669,stroke:#047857,color:#fff
    style RATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SURFACE fill:#dc2626,stroke:#b91c1c,color:#fff
```

```python
from enum import Enum

class ErrorCategory(Enum):
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non_retryable"
    RATE_LIMITED = "rate_limited"

def classify_error(status_code: int | None, exception: Exception | None) -> ErrorCategory:
    """Classify an error to determine retry behavior."""

    # Connection and timeout errors are retryable; other exceptions are not
    if exception is not None:
        if isinstance(exception, (ConnectionError, TimeoutError)):
            return ErrorCategory.RETRYABLE
        return ErrorCategory.NON_RETRYABLE

    # HTTP status code classification
    if status_code is not None:
        if status_code == 429:
            return ErrorCategory.RATE_LIMITED
        if status_code in (408, 500, 502, 503, 504):
            return ErrorCategory.RETRYABLE
        if status_code == 409:
            return ErrorCategory.RETRYABLE  # Conflict, often transient
        return ErrorCategory.NON_RETRYABLE

    return ErrorCategory.NON_RETRYABLE
```

The critical distinction: 400 (bad request), 401 (unauthorized), 403 (forbidden), and 404 (not found) are never retried. The user must fix their request or credentials. 500, 502, 503, and 504 are retried because they typically indicate transient server issues. 429 (rate limited) is retried with special handling for the `Retry-After` header.
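A few spot checks against the `classify_error` function above make those boundaries concrete:

```python
# Boundaries of the classifier defined earlier
assert classify_error(503, None) is ErrorCategory.RETRYABLE        # transient server error
assert classify_error(429, None) is ErrorCategory.RATE_LIMITED     # throttled; honor Retry-After
assert classify_error(401, None) is ErrorCategory.NON_RETRYABLE    # fix credentials, don't retry
assert classify_error(None, TimeoutError("read timed out")) is ErrorCategory.RETRYABLE
assert classify_error(None, ValueError("bad payload")) is ErrorCategory.NON_RETRYABLE
```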

## Retry Policy Configuration

Users need control over retry behavior. Some applications prefer fast failure; others can tolerate longer wait times for higher reliability:

```python
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    """Configuration for retry behavior."""
    max_retries: int = 3
    initial_delay: float = 0.5       # seconds
    max_delay: float = 30.0          # seconds
    backoff_factor: float = 2.0      # exponential multiplier
    retry_on_status: set[int] | None = None
    retry_on_timeout: bool = True

    def __post_init__(self):
        if self.retry_on_status is None:
            self.retry_on_status = {408, 429, 500, 502, 503, 504}

    def calculate_delay(self, attempt: int, retry_after: float | None = None) -> float:
        """Calculate delay before next retry with exponential backoff."""
        if retry_after is not None:
            return min(retry_after, self.max_delay)

        delay = self.initial_delay * (self.backoff_factor ** attempt)
        return min(delay, self.max_delay)
```

The `calculate_delay` method implements exponential backoff: 0.5s, 1s, 2s, 4s, and so on up to the maximum. When the server sends a `Retry-After` header, the SDK honors it but caps at `max_delay` to prevent unbounded waits.
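A short loop confirms the schedule with the defaults:

```python
policy = RetryPolicy()
for attempt in range(7):
    print(f"attempt {attempt}: wait {policy.calculate_delay(attempt):.1f}s")
# attempt 0: wait 0.5s
# attempt 1: wait 1.0s
# attempt 2: wait 2.0s
# ...
# attempt 6: wait 30.0s  (0.5 * 2**6 = 32s, capped at max_delay)
```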

## The Retry Engine

The retry engine wraps the HTTP request method and orchestrates classification, backoff, and logging:

```python
import logging
import time

logger = logging.getLogger("myagent")

class AgentAPIError(Exception):
    """Raised for non-retryable failures and exhausted retries."""

class RetryableClient:
    def __init__(self, http_client, retry_policy: RetryPolicy | None = None):
        self._http = http_client
        self.retry_policy = retry_policy or RetryPolicy()

    def request_with_retry(self, method: str, url: str, **kwargs):
        """Issue a request, retrying per policy; returns the HTTP client's response."""
        last_exception: Exception | None = None

        for attempt in range(self.retry_policy.max_retries + 1):
            try:
                response = self._http.request(method, url, **kwargs)
                if response.status_code < 400:
                    return response

                category = classify_error(response.status_code, None)
                if category is ErrorCategory.NON_RETRYABLE or attempt == self.retry_policy.max_retries:
                    raise AgentAPIError(
                        f"{method} {url} failed with status {response.status_code} "
                        f"after {attempt + 1} attempt(s)"
                    )

                # Honor Retry-After on 429s; otherwise use exponential backoff
                retry_after = None
                if category is ErrorCategory.RATE_LIMITED:
                    retry_after = self._parse_retry_after(response)
                delay = self.retry_policy.calculate_delay(attempt, retry_after)
                logger.warning(
                    "Retrying %s %s: status=%d attempt=%d delay=%.2fs",
                    method, url, response.status_code, attempt + 1, delay,
                )
                time.sleep(delay)
            except (ConnectionError, TimeoutError) as exc:
                last_exception = exc
                if attempt == self.retry_policy.max_retries:
                    raise
                delay = self.retry_policy.calculate_delay(attempt)
                logger.warning(
                    "Retrying %s %s: error=%s attempt=%d delay=%.2fs",
                    method, url, type(exc).__name__, attempt + 1, delay,
                )
                time.sleep(delay)

        # Defensive: the loop always returns or raises before reaching here
        raise last_exception or AgentAPIError(f"{method} {url}: retries exhausted")

    @staticmethod
    def _parse_retry_after(response) -> float | None:
        header = response.headers.get("Retry-After")
        if header is None:
            return None
        try:
            return float(header)
        except ValueError:
            return None  # HTTP-date form; fall back to exponential backoff
```

## TypeScript Retry Implementation

The same pattern in TypeScript using async/await:

```typescript
class AgentAPIError extends Error {
  constructor(message: string, readonly status: number) {
    super(message);
    this.name = 'AgentAPIError';
  }
}

interface RetryConfig {
  maxRetries: number;
  initialDelay: number;  // milliseconds
  maxDelay: number;      // milliseconds
  backoffFactor: number;
}

const DEFAULT_RETRY: RetryConfig = {
  maxRetries: 3,
  initialDelay: 500,
  maxDelay: 30_000,
  backoffFactor: 2,
};

const RETRYABLE_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

async function fetchWithRetry(
  url: string,
  init: RequestInit,
  config: RetryConfig = DEFAULT_RETRY,
): Promise<Response> {
  let lastError: Error | null = null;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      const response = await fetch(url, init);
      if (response.ok) return response;

      // Non-retryable statuses surface immediately; exhausted retries also fail
      if (!RETRYABLE_STATUSES.has(response.status)) {
        throw new AgentAPIError(`Request failed with status ${response.status}`, response.status);
      }
      if (attempt === config.maxRetries) {
        throw new AgentAPIError(
          `Request failed with status ${response.status} after ${attempt + 1} attempts`,
          response.status,
        );
      }

      // Honor Retry-After (seconds) when present, else exponential backoff; cap at maxDelay
      const retryAfterMs = Number(response.headers.get('Retry-After') ?? NaN) * 1000;
      const delay = Math.min(
        Number.isNaN(retryAfterMs)
          ? config.initialDelay * config.backoffFactor ** attempt
          : retryAfterMs,
        config.maxDelay,
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    } catch (error) {
      if (error instanceof AgentAPIError) throw error;
      lastError = error as Error;

      if (attempt === config.maxRetries) throw lastError;

      const delay = Math.min(
        config.initialDelay * config.backoffFactor ** attempt,
        config.maxDelay,
      );
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError ?? new Error('Retry exhausted');
}
```

## Timeout Configuration

Offer multiple timeout levels — connection timeout, read timeout, and total request timeout:

```python
@dataclass
class TimeoutConfig:
    connect: float = 5.0    # seconds to establish connection
    read: float = 30.0      # seconds to read response
    total: float = 60.0     # total request deadline
```

AI agent runs can take 30+ seconds. The SDK should default to generous timeouts for run operations while keeping shorter timeouts for metadata queries.
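A minimal sketch of how a client might select timeouts per operation; the operation names and helper here are illustrative, not part of any published API:

```python
# Hypothetical per-operation timeout selection
RUN_TIMEOUTS = TimeoutConfig(connect=5.0, read=120.0, total=180.0)     # long-running agent runs
METADATA_TIMEOUTS = TimeoutConfig(connect=5.0, read=10.0, total=15.0)  # fast metadata queries

def timeout_for(operation: str) -> TimeoutConfig:
    """Pick generous timeouts for run operations, tight ones for everything else."""
    return RUN_TIMEOUTS if operation.startswith("runs.") else METADATA_TIMEOUTS

print(timeout_for("runs.create").total)   # 180.0
print(timeout_for("agents.list").total)   # 15.0
```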

## FAQ

### Should I add jitter to the backoff delays?

Yes. Without jitter, retrying clients that failed at the same time will retry at the same time, creating a thundering herd. Add random jitter of up to 25% of the calculated delay: `delay = delay * (0.75 + random.random() * 0.5)`. This spreads retry attempts across time and reduces the chance of synchronized retries overwhelming the server.
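As a sketch, the jitter can be folded in on top of the policy's `calculate_delay`; this wrapper is an illustration, not part of the `RetryPolicy` shown earlier:

```python
import random

def calculate_delay_with_jitter(policy: RetryPolicy, attempt: int,
                                retry_after: float | None = None) -> float:
    """Exponential backoff with up to 25% random jitter in either direction."""
    base = policy.calculate_delay(attempt, retry_after)
    return base * (0.75 + random.random() * 0.5)  # uniform in [0.75, 1.25) * base
```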

### How do I prevent retries from masking genuine outages?

Log every retry at warning level with the attempt count, status code, and delay. If the SDK exhausts all retries, raise the final error with context about how many attempts were made. Users can monitor retry logs to detect degradation before it becomes a total outage.
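If your logging pipeline supports structured fields, the same context can travel as machine-readable attributes rather than interpolated text. A sketch using the standard library's `extra` mechanism, with illustrative field names:

```python
import logging

logger = logging.getLogger("myagent")

def log_retry(attempt: int, status_code: int | None, delay: float) -> None:
    """Emit a retry event with structured fields for log aggregation."""
    logger.warning(
        "retrying request",
        extra={"attempt": attempt, "status_code": status_code, "delay_s": delay},
    )

log_retry(attempt=2, status_code=503, delay=2.0)
```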

### Should the SDK respect Retry-After headers with very large values?

Cap `Retry-After` at your `max_delay` configuration. A server sending a 300-second `Retry-After` header is likely indicating a prolonged outage. Rather than blocking the user's thread for five minutes, respect your timeout policy and fail with a clear error message suggesting the user retry later.

---

#RetryLogic #ErrorHandling #SDKDesign #Resilience #AgenticAI #Python #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/sdk-retry-error-handling-resilient-client-libraries
