---
title: "Error Handling in Agent Workflows: Exceptions, Retries, and Recovery"
description: "Master error handling in the OpenAI Agents SDK. Learn about MaxTurnsExceeded, tool errors, model behavior errors, retry policies, and building resilient agent workflows."
canonical: https://callsphere.ai/blog/openai-agents-sdk-error-handling-exceptions-retries-recovery
category: "Learn Agentic AI"
tags: ["OpenAI", "Error Handling", "Retries", "Production", "Python"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-08T09:11:14.875Z
---

# Error Handling in Agent Workflows: Exceptions, Retries, and Recovery

> Master error handling in the OpenAI Agents SDK. Learn about MaxTurnsExceeded, tool errors, model behavior errors, retry policies, and building resilient agent workflows.

## Production Agents Must Handle Failure

In production, things go wrong. APIs time out. Models hallucinate invalid tool arguments. Rate limits hit at peak traffic. Network connections drop. A production-grade agent system must handle all of these failures gracefully.

The OpenAI Agents SDK provides multiple layers of error handling: exception types for different failure modes, tool error recovery within the agent loop, retry policies for transient failures, and hooks for custom error handling logic.

## Exception Types

The SDK defines several exception types that you should handle in your application code:

```mermaid
flowchart LR
    INPUT(["User input"])
    AGENT["Agent<br/>name plus instructions"]
    HAND{"Handoff to<br/>another agent?"}
    SUB["Sub-agent<br/>specialist"]
    GUARD{"Guardrail<br/>passed?"}
    TOOL["Tool call"]
    SDK[("Tracing<br/>OpenAI dashboard")]
    OUT(["Final output"])
    INPUT --> AGENT --> HAND
    HAND -->|Yes| SUB --> GUARD
    HAND -->|No| GUARD
    GUARD -->|Yes| TOOL --> AGENT
    GUARD -->|Block| OUT
    AGENT --> OUT
    AGENT --> SDK
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

### MaxTurnsExceeded

Raised when the agent loop exceeds the `max_turns` limit without producing a final output:

```python
from agents import Agent, Runner, MaxTurnsExceeded

agent = Agent(
    name="Research Agent",
    instructions="Research the topic using available tools.",
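    tools=[search_tool, analyze_tool],  # tools defined elsewhere in your application
)

# Snippets in this article use bare `await`; run them inside an async function.
try:
    result = await Runner.run(agent, "Research quantum computing", max_turns=5)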
    print(result.final_output)
except MaxTurnsExceeded:
    print("The agent could not complete the task within the turn limit.")
    print("Consider increasing max_turns or simplifying the task.")
```

**When this happens:**

- The task is genuinely complex and requires many tool calls
- The agent is stuck in a loop, calling the same tool repeatedly
- The instructions are ambiguous about when to stop

**How to handle it:**

- Return a graceful error to the user
- Log the partial results for debugging
- Consider retrying with a higher `max_turns` or rephrased input
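
To tell a genuinely complex task apart from a stuck loop, it can help to count identical tool calls in the logged run. The sketch below is a minimal, SDK-independent illustration: `find_repeated_calls` is a hypothetical helper, and the `(tool_name, arguments)` pair shape is an assumption about how you record tool calls in your own logs.

```python
import json
from collections import Counter


def find_repeated_calls(calls, threshold=3):
    """Return (tool_name, arguments) pairs seen at least `threshold` times.

    `calls` is an iterable of (tool_name, arguments_dict) pairs taken from
    your own logs or traces; this shape is an assumption, not an SDK type.
    """
    # Serialize arguments with sorted keys so equal dicts compare equal.
    counts = Counter(
        (name, json.dumps(args, sort_keys=True)) for name, args in calls
    )
    return [
        (name, json.loads(args))
        for (name, args), seen in counts.items()
        if seen >= threshold
    ]


if __name__ == "__main__":
    logged = [("search", {"q": "qubits"})] * 3 + [("analyze", {"id": 1})]
    print(find_repeated_calls(logged))
```

If the same call shows up repeatedly, raising `max_turns` is unlikely to help; tightening the instructions or the tool's error messages is usually the better fix.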

### ModelBehaviorError

Raised when the model produces output that the SDK cannot process. This is rare with OpenAI models but can occur with third-party providers:

```python
from agents import ModelBehaviorError

try:
    result = await Runner.run(agent, "Process this request")
except ModelBehaviorError as e:
    print(f"Model produced unexpected output: {e}")
    # Log and alert — this usually indicates a model or provider issue
```

### UserError

Raised when the SDK detects incorrect usage in your code, such as misconfigured agents or invalid parameters:

```python
from agents import UserError

try:
    # Illustrative only: depending on the SDK version, an invalid output_type
    # may be rejected here or only when the agent runs.
    agent = Agent(name="Test", instructions="Test", output_type="not_a_type")
except UserError as e:
    print(f"Configuration error: {e}")
```

### InputGuardrailTripwireTriggered and OutputGuardrailTripwireTriggered

Raised when input or output guardrails detect content that should not be processed:

```python
from agents import InputGuardrailTripwireTriggered, OutputGuardrailTripwireTriggered

try:
    result = await Runner.run(agent, user_input)
except InputGuardrailTripwireTriggered:
    print("Input was flagged by safety guardrails.")
except OutputGuardrailTripwireTriggered:
    print("Output was flagged by safety guardrails.")
```

## Tool Error Recovery

One of the most powerful features of the agent loop is automatic tool error recovery. When a function tool raises an exception, the run does not crash by default. Instead, the SDK:

1. Catches the exception
2. Converts the error to a string message
3. Sends that message back to the LLM as the tool result

The LLM can then decide how to proceed: retry, try a different approach, or report the error.

```python
import httpx
from agents import function_tool

@function_tool
async def fetch_data(url: str) -> str:
    """Fetch data from a URL.

    Args:
        url: The URL to fetch data from.
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(url, timeout=5)
        # Raises httpx.HTTPStatusError on 4xx/5xx responses
        response.raise_for_status()
        # Truncate so a large page does not flood the model's context
        return response.text[:2000]
```

If the URL is unreachable, the agent sees something like: "Error: Connection timeout after 5 seconds." The agent can then:

- Try a different URL
- Ask the user for a corrected URL
- Report that the data source is unavailable

This self-healing behavior means agents handle many errors without any special error handling code from you.
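
The catch, stringify, and report steps can be modelled in a few lines of plain Python. This is a simplified sketch of the idea rather than the SDK's actual implementation; `call_tool_safely` and `flaky_tool` are hypothetical names used only for illustration.

```python
import asyncio


async def call_tool_safely(tool, *args, **kwargs):
    """Run an async tool; on failure, return the error as text instead of raising."""
    try:
        return await tool(*args, **kwargs)
    except Exception as exc:  # deliberately broad: any failure becomes a tool result
        return f"Error: {type(exc).__name__}: {exc}"


async def flaky_tool(url: str) -> str:
    # Hypothetical tool that always fails, standing in for an unreachable URL.
    raise ConnectionError(f"could not reach {url}")


if __name__ == "__main__":
    # The caller receives a string it can hand back to the model.
    print(asyncio.run(call_tool_safely(flaky_tool, "https://example.invalid")))
```

Because the failure arrives as ordinary tool output, the model can reason about it in the same way it reasons about a successful result.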

### Controlling Tool Error Behavior

You can customize how tool errors are reported by catching exceptions inside the tool and returning descriptive error messages:

```python
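import asyncio

import asyncpg
from agents import function_tool

# get_db_connection and format_rows are application helpers defined elsewhere.

@function_tool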
async def query_database(sql: str) -> str:
    """Execute a read-only SQL query.

    Args:
        sql: The SQL query to execute.
    """
    if not sql.strip().upper().startswith("SELECT"):
        return "Error: Only SELECT queries are allowed for safety."

    try:
        async with get_db_connection() as conn:
            rows = await conn.fetch(sql)
            if not rows:
                return "Query returned no results."
            return format_rows(rows)
    except asyncpg.PostgresError as e:
        return f"Database error: {e}. Please check your query syntax."
    except asyncio.TimeoutError:
        return "Query timed out. Try a simpler query or add LIMIT clause."
```

By catching exceptions and returning clear error messages, you give the LLM the information it needs to self-correct.

## Tool Timeouts

Tools that perform I/O should have timeouts to prevent the agent loop from hanging:

```python
import httpx
from agents import function_tool

@function_tool(timeout=10)
async def call_external_api(endpoint: str) -> str:
    """Call an external API endpoint.

    Args:
        endpoint: The API endpoint path.
    """
    async with httpx.AsyncClient() as client:
        response = await client.get(f"https://api.example.com/{endpoint}")
        return response.text
```

When a tool exceeds its timeout, the SDK raises a `ToolTimeoutError` internally, which is converted to an error message for the LLM. The agent can then decide to retry or skip.
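
The same timeout-to-message conversion can be sketched with the standard library alone, which is useful when you want to enforce a budget inside a tool body. This is an assumption-laden illustration, not the SDK's mechanism: `run_with_timeout` and `slow_tool` are hypothetical names.

```python
import asyncio


async def run_with_timeout(coro, seconds: float) -> str:
    """Await `coro`, returning a readable message if it exceeds `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the pending work before raising the timeout.
        return f"Error: tool timed out after {seconds} seconds."


async def slow_tool() -> str:
    # Hypothetical tool that takes longer than the allowed budget.
    await asyncio.sleep(1.0)
    return "done"


if __name__ == "__main__":
    print(asyncio.run(run_with_timeout(slow_tool(), 0.05)))
```

Returning a message instead of raising keeps the decision with the model, which matches the recovery behavior described above.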

## Retry Policies

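The SDK supports configurable retry policies for transient failures at the LLM API level. These retries happen automatically before the error reaches your application code. Confirm the option and field names against the SDK reference for your installed version before relying on them: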

```python
from agents import Agent, Runner, RunConfig

result = await Runner.run(
    agent,
    "Process this request",
    run_config=RunConfig(
        model_provider_retry_config={
            "max_retries": 3,
            "initial_delay": 1.0,
            "max_delay": 30.0,
            "backoff_factor": 2.0,
        },
    ),
)
```

The retry policy applies to:

- **Network errors**: Connection refused, DNS failures, timeouts
- **HTTP 429**: Rate limit responses (respects Retry-After header)
- **HTTP 500/502/503**: Server-side errors from the provider

The retries use exponential backoff: first retry after 1 second, second after 2 seconds, third after 4 seconds (capped at 30 seconds).
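
The schedule follows directly from the four settings. As a quick check of the arithmetic, here is a small sketch that computes the delays; `backoff_delays` is a hypothetical helper, and its parameter names simply mirror the configuration keys shown above.

```python
def backoff_delays(
    max_retries: int = 3,
    initial_delay: float = 1.0,
    backoff_factor: float = 2.0,
    max_delay: float = 30.0,
) -> list[float]:
    """Return the wait before each retry: initial_delay * factor**i, capped at max_delay."""
    return [
        min(initial_delay * backoff_factor**i, max_delay)
        for i in range(max_retries)
    ]


if __name__ == "__main__":
    print(backoff_delays())               # [1.0, 2.0, 4.0]
    print(backoff_delays(max_retries=7))  # the cap applies from the sixth retry
```

With the defaults above the cap never applies; it only matters once the number of retries is large enough for the uncapped delay to pass 30 seconds.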

## Comprehensive Error Handling Pattern

Here is an error handling pattern that covers the failure modes described above:

```python
import asyncio
import logging
from agents import (
    Agent,
    Runner,
    MaxTurnsExceeded,
    ModelBehaviorError,
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered,
    RunConfig,
)

logger = logging.getLogger(__name__)

agent = Agent(
    name="Production Agent",
    instructions="You are a helpful assistant.",
    tools=[search_tool, database_tool],
)

async def handle_request(user_input: str, user_id: str) -> dict:
    """Handle a user request with comprehensive error handling."""

    try:
        result = await Runner.run(
            agent,
            user_input,
            max_turns=10,  # max_turns is an argument to Runner.run, not a RunConfig field
            run_config=RunConfig(
                workflow_name="customer-request",
            ),
        )

        return {
            "status": "success",
            "response": result.final_output,
            "agent": result.last_agent.name,
        }

    except MaxTurnsExceeded:
        logger.warning(f"Max turns exceeded for user {user_id}", extra={
            "user_id": user_id,
            "input_preview": user_input[:100],
        })
        return {
            "status": "incomplete",
            "response": "I was not able to fully complete your request. Could you try breaking it into smaller questions?",
        }

    except InputGuardrailTripwireTriggered:
        logger.info(f"Input guardrail triggered for user {user_id}")
        return {
            "status": "blocked",
            "response": "I am not able to process that request. Please rephrase your question.",
        }

    except OutputGuardrailTripwireTriggered:
        logger.warning(f"Output guardrail triggered for user {user_id}")
        return {
            "status": "blocked",
            "response": "I generated a response that did not meet our safety guidelines. Please try again.",
        }

    except ModelBehaviorError as e:
        logger.error(f"Model behavior error: {e}", exc_info=True)
        return {
            "status": "error",
            "response": "An unexpected error occurred. Our team has been notified.",
        }

    except Exception as e:
        logger.error(f"Unexpected error for user {user_id}: {e}", exc_info=True)
        return {
            "status": "error",
            "response": "Something went wrong. Please try again later.",
        }
```

## Application-Level Retries

For critical workflows where you need the agent to succeed, implement application-level retries with escalation:

```python
import asyncio
import logging

from agents import (
    Agent,
    Runner,
    MaxTurnsExceeded,
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered,
    UserError,
)

logger = logging.getLogger(__name__)

# Failures that a retry cannot fix are re-raised immediately.
NON_RETRYABLE = (
    InputGuardrailTripwireTriggered,
    OutputGuardrailTripwireTriggered,
    UserError,
)

async def robust_agent_call(
    agent: Agent,
    user_input: str,
    max_attempts: int = 3,
) -> str:
    """Run an agent with application-level retries and escalation."""

    last_error = None

    for attempt in range(1, max_attempts + 1):
        try:
            # Escalate: allow more turns on each attempt (5, 10, 15, ...)
            result = await Runner.run(
                agent,
                user_input,
                max_turns=5 * attempt,
            )
            return result.final_output

        except MaxTurnsExceeded:
            last_error = "exceeded_turns"
            logger.info(f"Attempt {attempt}: max turns exceeded")

        except NON_RETRYABLE:
            raise

        except Exception as e:
            last_error = str(e)
            if attempt < max_attempts:
                wait_time = 2 ** attempt  # exponential backoff: 2s, 4s, ...
                logger.info(f"Attempt {attempt} failed: {e}. Retrying in {wait_time}s")
                await asyncio.sleep(wait_time)

    raise RuntimeError(f"Agent failed after {max_attempts} attempts. Last error: {last_error}")
```

## Best Practices

1. **Always catch MaxTurnsExceeded** in production. It is one of the most common agent-specific errors.
2. **Set appropriate max_turns.** Too low and agents cannot complete complex tasks. Too high and a stuck agent burns through your API budget.
3. **Let tools return error strings** instead of raising exceptions when possible. This gives the LLM a chance to self-correct.
4. **Use tool timeouts for all I/O operations.** A hanging tool blocks the entire agent loop.
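5. **Log the run details on errors.** A run that raises does not return a `RunResult`, so log what the exception carries instead. If your SDK version attaches the partial run to the exception (for example through a `run_data` attribute), its `new_items` list contains the trace of what happened up to the failure, which is invaluable for debugging.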
6. **Implement circuit breakers** for tools that call external services. If a service is down, fail fast rather than burning through retries.
7. **Never expose raw error messages to users.** Map all errors to user-friendly messages.
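
For practice 6, a circuit breaker can be as small as a failure counter plus a cool-down timer. The following is a minimal sketch under stated assumptions: `CircuitBreaker` is a hypothetical class, not part of the SDK, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class CircuitBreaker:
    """Stop calling a failing service until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a call to the service should be attempted."""
        if self._opened_at is None:
            return True
        # Once the cool-down has elapsed, allow calls again until the
        # next result is recorded.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open (or re-open) the circuit
```

Inside a tool, check `allow()` before calling the external service and return a short message such as "Service temporarily unavailable" when it is `False`, so the model can move on instead of retrying.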

---

**Source:** [OpenAI Agents SDK — Error Handling](https://openai.github.io/openai-agents-python/running_agents/)

