Fault Tolerance

LangGraph provides three composable mechanisms for building resilient workflows: retries, timeouts, and error handling. These can be layered together for robust execution.

Overview

Mechanism       Purpose                       Scope
--------------  ----------------------------  --------
Retries         Re-attempt failed operations  Per-node
Timeouts        Kill stalled operations       Per-node
Error Handling  Graceful failure recovery     Per-node

All three compose in a defined order: attempt -> retry_policy -> error_handler.


Retries

RetryPolicy

Configure automatic retry behavior per node:

from langgraph.types import RetryPolicy

builder.add_node(
    "llm_call",
    my_llm_node,
    retry_policy=RetryPolicy(
        max_attempts=3,
        initial_interval=1.0,
        backoff_factor=2.0,
        max_interval=30.0,
        jitter=0.1,
        retry_on=(ConnectionError, TimeoutError)
    )
)

Parameter         Description                                   Default
----------------  --------------------------------------------  ------------
max_attempts      Maximum total attempts (including the first)  3
initial_interval  Wait before first retry (seconds)             0.5
backoff_factor    Exponential multiplier per attempt            2.0
max_interval      Maximum wait between retries (seconds)        60.0
jitter            Random variation (0-1 fraction of interval)   0.0
retry_on          Exception types to retry on                   (Exception,)
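
Under these parameters, the wait before retry n is initial_interval * backoff_factor**(n-1), capped at max_interval, with jitter then adding random variation on top. A quick sketch of the resulting schedule (illustrative only, not the library's internal implementation; jitter omitted):

```python
def backoff_schedule(max_attempts=3, initial_interval=0.5,
                     backoff_factor=2.0, max_interval=60.0):
    # Wait applied before each retry; the first attempt runs immediately,
    # so a policy with max_attempts=N produces N-1 waits.
    return [min(initial_interval * backoff_factor ** n, max_interval)
            for n in range(max_attempts - 1)]

print(backoff_schedule())                                  # [0.5, 1.0]
print(backoff_schedule(max_attempts=6, max_interval=4.0))  # [0.5, 1.0, 2.0, 4.0, 4.0]
```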

Default Retry Behavior

Without an explicit RetryPolicy, nodes have a default of max_attempts=1 (no retries). You must opt in.

Custom Retry Logic

For advanced retry decisions, inspect execution state:

def my_node(state, *, runtime):
    info = runtime.execution_info
    print(f"Attempt {info.get('retry_attempt')} of {info.get('max_attempts')}")
    # Custom logic based on attempt count
    if info.get("retry_attempt", 0) > 1:
        # Degrade gracefully on later attempts
        return fallback_response(state)
    return primary_response(state)

Access retry state via runtime.execution_info which exposes retry_attempt, max_attempts, and timing data.


Timeouts

Timeout Parameter

Set a simple timeout in seconds:

builder.add_node("slow_node", my_node, timeout=30)  # 30 second timeout

TimeoutPolicy

For fine-grained control:

from langgraph.types import TimeoutPolicy

builder.add_node(
    "api_node",
    my_node,
    timeout=TimeoutPolicy(
        run_timeout=60,   # max total execution time
        idle_timeout=10   # max time without progress signal
    )
)

Parameter     Description
------------  --------------------------------------------------
run_timeout   Maximum wall-clock time for the node (seconds)
idle_timeout  Maximum time without a progress signal (heartbeat)
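
The semantics of the two timers differ: run_timeout is a fixed deadline, while idle_timeout is a sliding deadline that each progress signal pushes forward. A toy model of the sliding-deadline behavior (for illustration, not LangGraph's implementation):

```python
import time

class IdleTimer:
    """Sliding deadline: expires only if no heartbeat arrives in time."""
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.heartbeat()

    def heartbeat(self):
        # Each progress signal pushes the deadline forward.
        self.deadline = time.monotonic() + self.idle_timeout

    def expired(self):
        return time.monotonic() > self.deadline

timer = IdleTimer(idle_timeout=1.0)
time.sleep(0.6)
timer.heartbeat()        # progress made: the idle clock restarts
time.sleep(0.6)
print(timer.expired())   # False: 1.2s elapsed, but never 1.0s without progress
```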

Progress Signals / Heartbeat

Long-running nodes can send heartbeats to reset the idle timer:

def long_running_node(state, *, runtime):
    for item in large_dataset:
        process(item)
        runtime.heartbeat()  # reset idle timeout
    return {"done": True}

NodeTimeoutError

When a timeout fires, a NodeTimeoutError is raised:

from langgraph.errors import NodeTimeoutError

def handle_errors(state, error):
    if isinstance(error, NodeTimeoutError):
        return {"status": "timeout", "fallback": True}
    raise error

builder.add_node("node", my_node, timeout=10, error_handler=handle_errors)

Dynamic Timeouts with Send

Adjust timeouts at runtime:

from langgraph.types import Send

graph.invoke(inputs, config, timeout=Send(get_dynamic_timeout))


Error Handling

Error Handler

Attach an error handler to any node:

builder.add_node(
    "risky_node",
    my_node,
    error_handler=my_error_handler
)

def my_error_handler(state, error):
    # error is a NodeError with .node and .error fields
    return {"errors": state.get("errors", []) + [str(error)]}

NodeError Structure

class NodeError:
    node: str         # name of the failing node
    error: Exception  # the original exception
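
Because the handler receives this structure, it can branch on which node failed and on the underlying exception type. A self-contained sketch using a stand-in dataclass that mirrors the fields above (the node names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class NodeError:  # stand-in mirroring the structure shown above
    node: str
    error: Exception

def my_error_handler(state, err):
    # Route timeouts from the LLM node to a dedicated fallback...
    if err.node == "llm_call" and isinstance(err.error, TimeoutError):
        return {"status": "llm_timeout", "fallback": True}
    # ...and record everything else in the state's error channel.
    return {"errors": state.get("errors", []) + [f"{err.node}: {err.error}"]}

print(my_error_handler({}, NodeError("fetch", ValueError("bad input"))))
# {'errors': ['fetch: bad input']}
```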

Saga / Compensation Patterns

Use Command in the error handler to route to a compensation flow:

from langgraph.types import Command

def error_handler(state, error):
    return Command(goto="compensate", update={"failed_node": error.node})

This enables Saga-style rollback: if a node fails, route to a compensation node that undoes prior work.
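
The rollback idea itself is independent of any framework. A minimal pure-Python sketch of the pattern, with hypothetical step names: completed steps are undone in reverse order when a later step fails.

```python
def run_saga(steps, compensations):
    # steps: ordered (name, action) pairs; compensations: name -> undo action.
    done = []
    try:
        for name, action in steps:
            action()
            done.append(name)
    except Exception:
        for name in reversed(done):   # undo completed work, newest first
            compensations[name]()
        return {"status": "compensated", "undone": list(reversed(done))}
    return {"status": "ok", "completed": done}

log = []

def book_hotel():
    raise RuntimeError("no rooms")

steps = [
    ("book_flight", lambda: log.append("flight booked")),
    ("book_hotel", book_hotel),
]
compensations = {"book_flight": lambda: log.append("flight cancelled")}

print(run_saga(steps, compensations))
# {'status': 'compensated', 'undone': ['book_flight']}
print(log)  # ['flight booked', 'flight cancelled']
```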

Subgraph Failures

Errors in subgraphs propagate to the parent graph's error handler. The parent can route to cleanup or retry.

Behavior with interrupt()

If a node has both interrupt() and an error handler:
  - interrupt() pauses the node; no error is raised
  - errors raised during execution after the interrupt trigger the error handler
  - on resume, if an error occurs, the error handler fires


Functional API

The same fault tolerance primitives work with the Functional API:

from langgraph.func import task, entrypoint

@task(retry_policy=RetryPolicy(max_attempts=3), timeout=30)
def fetch_data(input: str) -> dict:
    return call_api(input)

@entrypoint()
def workflow(input: str) -> dict:
    return fetch_data(input).result()

@task and @entrypoint both support timeout and retry_policy parameters.


Compose Order

When multiple mechanisms are set on a node, they execute in this order:

attempt (1st try) -> retry_policy (retries) -> error_handler (if all retries fail)
  1. The node attempts execution
  2. On failure, retry_policy determines if/how many retries
  3. If all retries are exhausted, error_handler fires
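
The control flow above can be modeled as a plain wrapper function, an illustrative sketch rather than the library's internals (backoff sleeps omitted for brevity):

```python
def run_with_fault_tolerance(node, state, max_attempts=3, error_handler=None):
    last_exc = None
    for attempt in range(max_attempts):   # steps 1-2: attempt, then retries
        try:
            return node(state)
        except Exception as exc:
            last_exc = exc
    if error_handler is not None:         # step 3: fires only after exhaustion
        return error_handler(state, last_exc)
    raise last_exc

calls = []

def flaky(state):
    calls.append(1)
    raise ConnectionError("service down")

result = run_with_fault_tolerance(
    flaky, {}, error_handler=lambda s, e: {"status": "fallback"})
print(len(calls), result)  # 3 {'status': 'fallback'}
```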

Limitations

Limitation            Detail
--------------------  -------------------------------------------------------------
Python only           Fault tolerance APIs are Python-only; no JS/TS SDK equivalent
Async-only timeouts   Timeout features require async execution (ainvoke, astream)
One handler per node  Each node can have at most one error_handler
No cross-node retry   RetryPolicy is per-node; no native whole-workflow retry

Related: Durable Execution, Persistence, Interrupts