
Error Handling in LangGraph: Retry Nodes, Fallback Paths, and Recovery

Build resilient LangGraph workflows with try/except patterns in nodes, fallback conditional edges, configurable retry logic, and dead-end recovery strategies for production agent systems.

Errors Are Inevitable in Agent Systems

Agent workflows interact with external systems — LLM APIs, databases, web services, file systems. Any of these can fail. API rate limits, network timeouts, malformed LLM outputs, and tool execution errors are not edge cases — they are normal operating conditions. Production LangGraph workflows must handle errors gracefully rather than crashing and losing all accumulated state.

Error Handling Inside Nodes

The first line of defense is try/except blocks within node functions:

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

class State(TypedDict):
    messages: Annotated[list, add_messages]
    error: str
    retry_count: int

llm = ChatOpenAI(model="gpt-4o-mini")

def call_llm(state: State) -> dict:
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "error": "",
            "retry_count": state.get("retry_count", 0),
        }
    except Exception as e:
        return {
            "error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

By catching exceptions and writing error information to state, you keep the graph running and let downstream nodes or routing logic decide how to recover.

Fallback Edges Based on Error State

Use conditional edges to route to different nodes depending on whether an error occurred:

from typing import Literal

def check_error(state: State) -> Literal["retry", "fallback", "continue"]:
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

def retry_node(state: State) -> dict:
    """Wait briefly and clear the error for retry."""
    import time
    time.sleep(1)  # Back off before retry
    return {"error": ""}

def fallback_node(state: State) -> dict:
    """Provide a graceful degradation response."""
    return {
        "messages": [AIMessage(
            content="I encountered an issue processing your request. "
            "Here is what I can tell you based on available information."
        )],
        "error": "",
    }

builder = StateGraph(State)
builder.add_node("agent", call_llm)
builder.add_node("retry", retry_node)
builder.add_node("fallback", fallback_node)
builder.add_node("respond", lambda s: s)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", check_error, {
    "retry": "retry",
    "fallback": "fallback",
    "continue": "respond",
})
builder.add_edge("retry", "agent")  # Loop back for retry
builder.add_edge("fallback", END)
builder.add_edge("respond", END)

graph = builder.compile()

This pattern gives the agent three attempts before falling back to a graceful degradation response.
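You can trace the routing decisions without running the full graph. The sketch below re-implements check_error over plain dicts (no LangGraph required) to show how the three-attempt budget plays out:

```python
def check_error(state: dict) -> str:
    """Same routing logic as above, over a plain dict."""
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

# A transient failure on attempts 1 and 2, then success:
print(check_error({"error": "timeout", "retry_count": 1}))  # retry
print(check_error({"error": "timeout", "retry_count": 2}))  # retry
print(check_error({"error": "", "retry_count": 2}))         # continue

# A persistent failure exhausts the budget:
print(check_error({"error": "timeout", "retry_count": 3}))  # fallback
```
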

Exponential Backoff Retry

For more sophisticated retry logic, implement exponential backoff:

import time

def smart_retry(state: State) -> dict:
    count = state.get("retry_count", 0)
    # 2s, 4s, 8s, 16s... capped at 30s (retry_count is already 1
    # on the first retry, so the first delay is 2s)
    delay = min(2 ** count, 30)
    time.sleep(delay)
    return {"error": ""}

This prevents overwhelming a failing service with rapid retries while still recovering quickly from transient errors.
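When many workers retry against the same failing service, plain exponential backoff can synchronize their retries into bursts. A common refinement is "full jitter": sleep a random amount between zero and the capped exponential delay. The sketch below (backoff_delay and jittered_retry are illustrative names, not part of LangGraph) shows one way to do this:

```python
import random
import time

def backoff_delay(retry_count: int, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(2**n, cap)]."""
    return random.uniform(0.0, min(2 ** retry_count, cap))

def jittered_retry(state: dict) -> dict:
    """Retry node variant using jittered backoff."""
    time.sleep(backoff_delay(state.get("retry_count", 0)))
    return {"error": ""}
```
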

Tool Error Recovery

Tools fail frequently — APIs return errors, queries time out, external services go down. Build error handling directly into your tools:

from langchain_core.tools import tool
import httpx

@tool
def fetch_data(url: str) -> str:
    """Fetch data from a URL with error handling."""
    try:
        response = httpx.get(url, timeout=10)
        response.raise_for_status()
        return response.text[:2000]
    except httpx.TimeoutException:
        return "ERROR: Request timed out. The server may be slow or unreachable."
    except httpx.HTTPStatusError as e:
        return f"ERROR: HTTP {e.response.status_code}. The resource may not exist."
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

Returning error strings instead of raising exceptions lets the LLM see the error and decide how to proceed — perhaps by trying a different URL or rephrasing the query.
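If you also want graph-level routing on tool failures, the same "ERROR:" prefix convention used by fetch_data above can drive a conditional edge. This is a minimal sketch (route_tool_result is a hypothetical helper, and the prefix convention is this article's, not a LangChain standard):

```python
def route_tool_result(tool_output: str) -> str:
    """Return 'recover' when a tool reported an error, else 'proceed'."""
    if tool_output.startswith("ERROR:"):
        return "recover"
    return "proceed"

print(route_tool_result("ERROR: Request timed out."))  # recover
print(route_tool_result('{"price": 42}'))              # proceed
```
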

Dead-End Detection

Sometimes the agent gets stuck in a loop without making progress. Detect this by tracking state changes:

def detect_stall(state: State) -> Literal["continue", "abort"]:
    messages = state["messages"]
    if len(messages) < 4:
        return "continue"

    # Check if last 3 AI messages are similar (stuck in a loop)
    recent_ai = [
        m.content for m in messages[-6:]
        if isinstance(m, AIMessage)
    ][-3:]

    if len(recent_ai) == 3 and len(set(recent_ai)) == 1:
        return "abort"
    return "continue"

def abort_node(state: State) -> dict:
    return {
        "messages": [AIMessage(
            content="I appear to be stuck. Let me summarize what I have so far "
            "and suggest a different approach."
        )]
    }

Combining Checkpointing with Error Recovery

Checkpointing and error handling work together for maximum resilience:

from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "resilient-session"}}

try:
    result = graph.invoke(
        {"messages": [HumanMessage(content="Process this complex request")]},
        config,
    )
except Exception:
    # Graph crashed — but state is checkpointed
    # Resume from last successful node
    result = graph.invoke(None, config)

Even if the entire process crashes, the checkpointed state lets you resume from the last successful node rather than restarting the entire workflow.

FAQ

Should I catch all exceptions in every node?

No. Catch exceptions that you can meaningfully handle — API errors, timeouts, validation failures. Let unexpected errors (programming bugs, out-of-memory) propagate so they surface during development rather than being silently swallowed.
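The selective-catch pattern can be sketched as follows. Here safe_step is a hypothetical wrapper, and the tuple of exception types is illustrative — you would list the errors your workflow can actually recover from:

```python
def safe_step(fn, *args) -> dict:
    """Catch only errors we can act on; let bugs surface."""
    try:
        return {"result": fn(*args), "error": ""}
    except (TimeoutError, ConnectionError, ValueError) as e:
        # Recoverable: record the error and let routing decide.
        return {"result": None, "error": f"{type(e).__name__}: {e}"}
    # Anything else (TypeError, KeyError, MemoryError...) propagates.

print(safe_step(int, "42"))        # {'result': 42, 'error': ''}
print(safe_step(int, "not-a-number"))  # error recorded, no exception raised
```
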

How do I log errors without exposing them to the user?

Write errors to a separate state field like error_log that your response formatting node ignores. Alternatively, use Python logging within nodes to send error details to your observability stack while returning user-friendly messages to state.
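A minimal sketch of the split, assuming a node that wraps an external call (flaky_fetch is a hypothetical stand-in that always times out): the full traceback goes to the logger, while state carries only a terse record in error_log.

```python
import logging

logger = logging.getLogger("agent")

def flaky_fetch() -> str:
    """Hypothetical stand-in for an external call that fails."""
    raise TimeoutError("upstream took 31s to respond")

def call_api(state: dict) -> dict:
    try:
        return {"data": flaky_fetch(), "error_log": ""}
    except TimeoutError as e:
        # Full detail (with traceback) goes to the observability stack...
        logger.error("api call failed", exc_info=True)
        # ...while state carries only a terse record the formatter can ignore.
        return {"data": "", "error_log": str(e)}

print(call_api({}))  # {'data': '', 'error_log': 'upstream took 31s to respond'}
```
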

Can I set a global timeout for the entire graph execution?

LangGraph does not have a built-in global timeout. Implement it at the application level by running graph.ainvoke() inside an asyncio.wait_for() with your desired timeout. If the timeout triggers, the checkpointed state is still available for later resumption.
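The asyncio.wait_for pattern looks like this. The sketch replaces graph.ainvoke with a slow stand-in coroutine (run_graph_stub is hypothetical) so the timeout behavior is visible in isolation:

```python
import asyncio

async def run_graph_stub() -> str:
    """Hypothetical stand-in for graph.ainvoke(inputs, config)."""
    await asyncio.sleep(0.2)
    return "done"

async def run_with_timeout(timeout: float) -> str:
    try:
        return await asyncio.wait_for(run_graph_stub(), timeout=timeout)
    except asyncio.TimeoutError:
        # Checkpointed state survives; resume later with graph.invoke(None, config).
        return "timed out"

print(asyncio.run(run_with_timeout(0.05)))  # timed out
print(asyncio.run(run_with_timeout(1.0)))   # done
```
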


#LangGraph #ErrorHandling #RetryLogic #FaultTolerance #Python #AgenticAI #LearnAI #AIEngineering
