
Error Handling in LangGraph: Retry Nodes, Fallback Paths, and Recovery

Build resilient LangGraph workflows with try/except patterns in nodes, fallback conditional edges, configurable retry logic, and dead-end recovery strategies for production agent systems.

Errors Are Inevitable in Agent Systems

Agent workflows interact with external systems — LLM APIs, databases, web services, file systems. Any of these can fail. API rate limits, network timeouts, malformed LLM outputs, and tool execution errors are not edge cases — they are normal operating conditions. Production LangGraph workflows must handle errors gracefully rather than crashing and losing all accumulated state.

Error Handling Inside Nodes

The first line of defense is try/except blocks within node functions:

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

class State(TypedDict):
    messages: Annotated[list, add_messages]
    error: str
    retry_count: int

llm = ChatOpenAI(model="gpt-4o-mini")

def call_llm(state: State) -> dict:
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "error": "",
            "retry_count": state.get("retry_count", 0),
        }
    except Exception as e:
        return {
            "error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

By catching exceptions and writing error information to state, you keep the graph running and let downstream nodes or routing logic decide how to recover.

Fallback Edges Based on Error State

Use conditional edges to route to different nodes depending on whether an error occurred:

from typing import Literal

def check_error(state: State) -> Literal["retry", "fallback", "continue"]:
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

def retry_node(state: State) -> dict:
    """Wait briefly and clear the error for retry."""
    import time
    time.sleep(1)  # Back off before retry
    return {"error": ""}

def fallback_node(state: State) -> dict:
    """Provide a graceful degradation response."""
    return {
        "messages": [AIMessage(
            content="I encountered an issue processing your request. "
            "Here is what I can tell you based on available information."
        )],
        "error": "",
    }

builder = StateGraph(State)
builder.add_node("agent", call_llm)
builder.add_node("retry", retry_node)
builder.add_node("fallback", fallback_node)
builder.add_node("respond", lambda s: s)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", check_error, {
    "retry": "retry",
    "fallback": "fallback",
    "continue": "respond",
})
builder.add_edge("retry", "agent")  # Loop back for retry
builder.add_edge("fallback", END)
builder.add_edge("respond", END)

graph = builder.compile()

This pattern gives the agent three attempts before falling back to a graceful degradation response.
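You can trace the routing decisions without running the full graph. The sketch below re-implements check_error over plain dicts (no LangGraph required) to show how the three-attempt budget plays out:

```python
def check_error(state: dict) -> str:
    """Same routing logic as above, over a plain dict."""
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

# A transient failure on attempts 1 and 2, then success:
print(check_error({"error": "timeout", "retry_count": 1}))  # retry
print(check_error({"error": "timeout", "retry_count": 2}))  # retry
print(check_error({"error": "", "retry_count": 2}))         # continue

# A persistent failure exhausts the budget:
print(check_error({"error": "timeout", "retry_count": 3}))  # fallback
```
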

Exponential Backoff Retry

For more sophisticated retry logic, implement exponential backoff:

import time

def smart_retry(state: State) -> dict:
    count = state.get("retry_count", 0)
    # 2s, 4s, 8s, 16s... capped at 30s (retry_count is already 1
    # on the first retry, so the first delay is 2s)
    delay = min(2 ** count, 30)
    time.sleep(delay)
    return {"error": ""}

This prevents overwhelming a failing service with rapid retries while still recovering quickly from transient errors.
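When many workers retry against the same failing service, plain exponential backoff can synchronize their retries into bursts. A common refinement is "full jitter": sleep a random amount between zero and the capped exponential delay. The sketch below (backoff_delay and jittered_retry are illustrative names, not part of LangGraph) shows one way to do this:

```python
import random
import time

def backoff_delay(retry_count: int, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(2**n, cap)]."""
    return random.uniform(0.0, min(2 ** retry_count, cap))

def jittered_retry(state: dict) -> dict:
    """Retry node variant using jittered backoff."""
    time.sleep(backoff_delay(state.get("retry_count", 0)))
    return {"error": ""}
```
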

Tool Error Recovery

Tools fail frequently — APIs return errors, queries time out, external services go down. Build error handling directly into your tools:

from langchain_core.tools import tool
import httpx

@tool
def fetch_data(url: str) -> str:
    """Fetch data from a URL with error handling."""
    try:
        response = httpx.get(url, timeout=10)
        response.raise_for_status()
        return response.text[:2000]
    except httpx.TimeoutException:
        return "ERROR: Request timed out. The server may be slow or unreachable."
    except httpx.HTTPStatusError as e:
        return f"ERROR: HTTP {e.response.status_code}. The resource may not exist."
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

Returning error strings instead of raising exceptions lets the LLM see the error and decide how to proceed — perhaps by trying a different URL or rephrasing the query.
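If you also want graph-level routing on tool failures, the same "ERROR:" prefix convention used by fetch_data above can drive a conditional edge. This is a minimal sketch (route_tool_result is a hypothetical helper, and the prefix convention is this article's, not a LangChain standard):

```python
def route_tool_result(tool_output: str) -> str:
    """Return 'recover' when a tool reported an error, else 'proceed'."""
    if tool_output.startswith("ERROR:"):
        return "recover"
    return "proceed"

print(route_tool_result("ERROR: Request timed out."))  # recover
print(route_tool_result('{"price": 42}'))              # proceed
```
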

Dead-End Detection

Sometimes the agent gets stuck in a loop without making progress. Detect this by tracking state changes:

def detect_stall(state: State) -> Literal["continue", "abort"]:
    messages = state["messages"]
    if len(messages) < 4:
        return "continue"

    # Check if last 3 AI messages are similar (stuck in a loop)
    recent_ai = [
        m.content for m in messages[-6:]
        if isinstance(m, AIMessage)
    ][-3:]

    if len(recent_ai) == 3 and len(set(recent_ai)) == 1:
        return "abort"
    return "continue"

def abort_node(state: State) -> dict:
    return {
        "messages": [AIMessage(
            content="I appear to be stuck. Let me summarize what I have so far "
            "and suggest a different approach."
        )]
    }

Combining Checkpointing with Error Recovery

Checkpointing and error handling work together for maximum resilience:

from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "resilient-session"}}

try:
    result = graph.invoke(
        {"messages": [HumanMessage(content="Process this complex request")]},
        config,
    )
except Exception:
    # Graph crashed — but state is checkpointed
    # Resume from last successful node
    result = graph.invoke(None, config)

Even if the entire process crashes, the checkpointed state lets you resume from the last successful node rather than restarting the entire workflow.

FAQ

Should I catch all exceptions in every node?

No. Catch exceptions that you can meaningfully handle — API errors, timeouts, validation failures. Let unexpected errors (programming bugs, out-of-memory) propagate so they surface during development rather than being silently swallowed.
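The selective-catch pattern can be sketched as follows. Here safe_step is a hypothetical wrapper, and the tuple of exception types is illustrative — you would list the errors your workflow can actually recover from:

```python
def safe_step(fn, *args) -> dict:
    """Catch only errors we can act on; let bugs surface."""
    try:
        return {"result": fn(*args), "error": ""}
    except (TimeoutError, ConnectionError, ValueError) as e:
        # Recoverable: record the error and let routing decide.
        return {"result": None, "error": f"{type(e).__name__}: {e}"}
    # Anything else (TypeError, KeyError, MemoryError...) propagates.

print(safe_step(int, "42"))        # {'result': 42, 'error': ''}
print(safe_step(int, "not-a-number"))  # error recorded, no exception raised
```
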

How do I log errors without exposing them to the user?

Write errors to a separate state field like error_log that your response formatting node ignores. Alternatively, use Python logging within nodes to send error details to your observability stack while returning user-friendly messages to state.
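A minimal sketch of the split, assuming a node that wraps an external call (flaky_fetch is a hypothetical stand-in that always times out): the full traceback goes to the logger, while state carries only a terse record in error_log.

```python
import logging

logger = logging.getLogger("agent")

def flaky_fetch() -> str:
    """Hypothetical stand-in for an external call that fails."""
    raise TimeoutError("upstream took 31s to respond")

def call_api(state: dict) -> dict:
    try:
        return {"data": flaky_fetch(), "error_log": ""}
    except TimeoutError as e:
        # Full detail (with traceback) goes to the observability stack...
        logger.error("api call failed", exc_info=True)
        # ...while state carries only a terse record the formatter can ignore.
        return {"data": "", "error_log": str(e)}

print(call_api({}))  # {'data': '', 'error_log': 'upstream took 31s to respond'}
```
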

Can I set a global timeout for the entire graph execution?

LangGraph does not have a built-in global timeout. Implement it at the application level by running graph.ainvoke() inside an asyncio.wait_for() with your desired timeout. If the timeout triggers, the checkpointed state is still available for later resumption.
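The asyncio.wait_for pattern looks like this. The sketch replaces graph.ainvoke with a slow stand-in coroutine (run_graph_stub is hypothetical) so the timeout behavior is visible in isolation:

```python
import asyncio

async def run_graph_stub() -> str:
    """Hypothetical stand-in for graph.ainvoke(inputs, config)."""
    await asyncio.sleep(0.2)
    return "done"

async def run_with_timeout(timeout: float) -> str:
    try:
        return await asyncio.wait_for(run_graph_stub(), timeout=timeout)
    except asyncio.TimeoutError:
        # Checkpointed state survives; resume later with graph.invoke(None, config).
        return "timed out"

print(asyncio.run(run_with_timeout(0.05)))  # timed out
print(asyncio.run(run_with_timeout(1.0)))   # done
```
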


#LangGraph #ErrorHandling #RetryLogic #FaultTolerance #Python #AgenticAI #LearnAI #AIEngineering
