# Fault Tolerance
LangGraph provides three composable mechanisms for building resilient workflows: retries, timeouts, and error handling. These can be layered together for robust execution.
## Overview
| Mechanism | Purpose | Scope |
|---|---|---|
| Retries | Re-attempt failed operations | Per-node |
| Timeouts | Kill stalled operations | Per-node |
| Error Handling | Graceful failure recovery | Per-node |
All three compose in a defined order: attempt -> `retry_policy` -> `error_handler`.
## Retries

### RetryPolicy

Configure automatic retry behavior per node:

```python
from langgraph.types import RetryPolicy

builder.add_node(
    "llm_call",
    my_llm_node,
    retry_policy=RetryPolicy(
        max_attempts=3,
        initial_interval=1.0,
        backoff_factor=2.0,
        max_interval=30.0,
        jitter=0.1,
        retry_on=(ConnectionError, TimeoutError),
    ),
)
```
| Parameter | Description | Default |
|---|---|---|
| `max_attempts` | Maximum total attempts (including the first) | 3 |
| `initial_interval` | Wait before the first retry (seconds) | 0.5 |
| `backoff_factor` | Exponential multiplier applied each attempt | 2.0 |
| `max_interval` | Maximum wait between retries (seconds) | 60.0 |
| `jitter` | Random variation (0–1 fraction of the interval) | 0.0 |
| `retry_on` | Exception types to retry on | `(Exception,)` |
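How these parameters interact can be sketched as a plain-Python schedule computation. This is an illustrative model of exponential backoff with jitter and a cap, not LangGraph's internal implementation:

```python
import random

def backoff_schedule(max_attempts=3, initial_interval=0.5,
                     backoff_factor=2.0, max_interval=60.0, jitter=0.0,
                     rng=random.random):
    """Compute the wait (in seconds) before each retry.

    Returns max_attempts - 1 intervals: the first attempt runs immediately.
    """
    waits = []
    interval = initial_interval
    for _ in range(max_attempts - 1):
        wait = min(interval, max_interval)  # cap at max_interval
        if jitter:
            # Add up to +/- (jitter * wait) of random variation.
            wait += wait * jitter * (2 * rng() - 1)
        waits.append(wait)
        interval *= backoff_factor
    return waits

# With the defaults above: two retries, waiting 0.5s then 1.0s.
print(backoff_schedule())  # → [0.5, 1.0]
```

Raising `max_interval` matters for long retry chains: with `max_attempts=5` and `max_interval=4.0`, the schedule flattens at the cap instead of growing without bound.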
### Default Retry Behavior

Without an explicit `RetryPolicy`, nodes default to `max_attempts=1` (no retries). You must opt in.
### Custom Retry Logic

For advanced retry decisions, inspect execution state:

```python
def my_node(state, *, runtime):
    info = runtime.execution_info
    print(f"Attempt {info.get('retry_attempt')} of {info.get('max_attempts')}")
    # Custom logic based on attempt count
    if info.get("retry_attempt", 0) > 1:
        # Degrade gracefully on later attempts
        return fallback_response(state)
    return primary_response(state)
```

Access retry state via `runtime.execution_info`, which exposes `retry_attempt`, `max_attempts`, and timing data.
## Timeouts

### Timeout Parameter

Set a simple timeout in seconds:

```python
builder.add_node("slow_node", my_node, timeout=30)  # 30-second timeout
```
### TimeoutPolicy

For fine-grained control:

```python
from langgraph.types import TimeoutPolicy

builder.add_node(
    "api_node",
    my_node,
    timeout=TimeoutPolicy(
        run_timeout=60,   # max total execution time
        idle_timeout=10,  # max time without a progress signal
    ),
)
```
| Parameter | Description |
|---|---|
| `run_timeout` | Maximum wall-clock time for the node |
| `idle_timeout` | Maximum time without a progress signal (heartbeat) |
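The difference between the two timers can be sketched with a plain-Python model driven by an injectable clock. This is a conceptual illustration of run vs. idle semantics, not LangGraph's implementation:

```python
class TimeoutTracker:
    """Tracks a run timer (total elapsed) and an idle timer (since last heartbeat)."""

    def __init__(self, run_timeout, idle_timeout, clock):
        self.run_timeout = run_timeout
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.started = clock()
        self.last_heartbeat = self.started

    def heartbeat(self):
        # A progress signal resets only the idle timer, never the run timer.
        self.last_heartbeat = self.clock()

    def check(self):
        now = self.clock()
        if now - self.started > self.run_timeout:
            return "run_timeout"
        if now - self.last_heartbeat > self.idle_timeout:
            return "idle_timeout"
        return "ok"

# Simulate with a fake clock we advance manually.
t = [0.0]
tracker = TimeoutTracker(run_timeout=60, idle_timeout=10, clock=lambda: t[0])
t[0] = 8; tracker.heartbeat()      # progress before the idle limit
t[0] = 15; print(tracker.check())  # → ok (only 7s since last heartbeat)
t[0] = 30; print(tracker.check())  # → idle_timeout (22s without progress)
```

Note that heartbeats keep a slow-but-alive node running, while `run_timeout` still bounds its total lifetime.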
### Progress Signals / Heartbeat

Long-running nodes can send heartbeats to reset the idle timer:

```python
def long_running_node(state, *, runtime):
    for item in large_dataset:
        process(item)
        runtime.heartbeat()  # reset the idle timeout
    return {"done": True}
```
### NodeTimeoutError

When a timeout fires, a `NodeTimeoutError` is raised:

```python
from langgraph.errors import NodeTimeoutError

def handle_errors(state, error):
    if isinstance(error, NodeTimeoutError):
        return {"status": "timeout", "fallback": True}
    raise error

builder.add_node("node", my_node, timeout=10, error_handler=handle_errors)
```
### Dynamic Timeouts with Send

Adjust timeouts at runtime:

```python
from langgraph.types import Send

graph.invoke(inputs, config, timeout=Send(get_dynamic_timeout))
```
### Requirements

- **Async-only**: Timeout support requires async graph execution (`ainvoke`, `astream`)
- **Version**: Requires `langgraph >= 1.2`
## Error Handling

### Error Handler

Attach an error handler to any node:

```python
def my_error_handler(state, error):
    # error is a NodeError with .node and .error fields
    return {"errors": state.get("errors", []) + [str(error)]}

builder.add_node(
    "risky_node",
    my_node,
    error_handler=my_error_handler,
)
```
### NodeError Structure

```python
class NodeError:
    node: str         # name of the failing node
    error: Exception  # the original exception
```
### Saga / Compensation Patterns

Use `Command` in the error handler to route to a compensation flow:

```python
from langgraph.types import Command

def error_handler(state, error):
    return Command(goto="compensate", update={"failed_node": error.node})
This enables Saga-style rollback: if a node fails, route to a compensation node that undoes prior work.
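The rollback idea itself is independent of LangGraph and can be sketched in plain Python: run steps in order, record a compensation for each completed step, and on failure unwind them in reverse. An illustrative sketch with hypothetical step names, not the library's implementation:

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, run the
    compensations for completed steps in reverse, then re-raise."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

log = []

def fail():
    raise RuntimeError("shipping failed")

steps = [
    (lambda: log.append("reserve_stock"), lambda: log.append("release_stock")),
    (lambda: log.append("charge_card"), lambda: log.append("refund_card")),
    (fail, lambda: log.append("cancel_shipment")),  # third step fails
]
try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)  # → ['reserve_stock', 'charge_card', 'refund_card', 'release_stock']
```

The failed step's own compensation never runs; only completed work is undone, in reverse order.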
### Subgraph Failures

Errors in subgraphs propagate to the parent graph's error handler. The parent can route to cleanup or retry.
### Behavior with interrupt()

If a node has both `interrupt()` and an error handler:

- `interrupt()` pauses the node (no error is raised)
- Errors during execution after the interrupt trigger the error handler
- If an error occurs on resume, the error handler fires
## Functional API

The same fault-tolerance primitives work with the Functional API:

```python
from langgraph.func import task, entrypoint

@task(retry_policy=RetryPolicy(max_attempts=3), timeout=30)
def fetch_data(input: str) -> dict:
    return call_api(input)

@entrypoint
def workflow(input: str) -> dict:
    return fetch_data(input).result()
```

`@task` and `@entrypoint` both support the `timeout` and `retry_policy` parameters.
## Compose Order

When multiple mechanisms are set on a node, they execute in this order:

attempt (1st try) -> `retry_policy` (retries) -> `error_handler` (if all retries fail)

1. The node attempts execution
2. On failure, `retry_policy` determines whether and how many times to retry
3. If all retries are exhausted, `error_handler` fires
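This pipeline can be modeled in a few lines of plain Python — an illustrative composition of the semantics described above, not the library's scheduler:

```python
def run_with_fault_tolerance(node, state, max_attempts=3,
                             retry_on=(Exception,), error_handler=None):
    """Attempt -> retries -> error handler, in that order."""
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return node(state)          # 1. attempt execution
        except retry_on as exc:
            last_error = exc            # 2. retry until attempts are exhausted
    if error_handler is not None:
        return error_handler(state, last_error)  # 3. handler fires last
    raise last_error

calls = {"n": 0}

def flaky(state):
    calls["n"] += 1
    raise ConnectionError("down")

result = run_with_fault_tolerance(
    flaky, {}, max_attempts=3,
    error_handler=lambda state, err: {"status": "failed", "error": str(err)},
)
print(calls["n"], result)  # → 3 {'status': 'failed', 'error': 'down'}
```

Exceptions outside `retry_on` propagate immediately, which mirrors why narrowing `retry_on` matters: retrying is only useful for transient failures.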
## Limitations

| Limitation | Detail |
|---|---|
| Python only | Fault-tolerance APIs are Python-only; no equivalent in the JS/TS SDK |
| Async-only timeouts | Timeout features require async execution (`ainvoke`, `astream`) |
| One handler per node | Each node can have at most one `error_handler` |
| No cross-node retry | `RetryPolicy` applies per node; no native mechanism to retry the entire workflow |
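If the entire workflow does need to be retried, one workaround is a plain wrapper loop around the invocation. A hypothetical helper, sketched with a stub standing in for a compiled graph's `invoke`:

```python
import time

def invoke_with_retry(graph_invoke, inputs, max_attempts=3, base_delay=0.0):
    """Retry a whole workflow invocation with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return graph_invoke(inputs)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

attempts = {"n": 0}

def stub_invoke(inputs):
    # Stands in for graph.invoke: fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return {"ok": True, **inputs}

result = invoke_with_retry(stub_invoke, {"q": "hi"})
print(result)  # → {'ok': True, 'q': 'hi'}
```

Note this re-runs the workflow from the start; combine it with a checkpointer if partially completed work must not be repeated.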
Related: Durable Execution, Persistence, Interrupts