# Fault Tolerance
LangGraph provides three composable mechanisms for building resilient workflows: retries, timeouts, and error handling. These can be layered together for robust execution.
## Overview
| Mechanism | Purpose | Scope |
|---|---|---|
| Retries | Re-attempt failed operations | Per-node |
| Timeouts | Kill stalled operations | Per-node |
| Error Handling | Graceful failure recovery | Per-node |
All three compose in a defined order: attempt -> `retry_policy` -> `error_handler`.
## Retries

### RetryPolicy

Configure automatic retry behavior per node:

```python
from langgraph.types import RetryPolicy

builder.add_node(
    "llm_call",
    my_llm_node,
    retry_policy=RetryPolicy(
        max_attempts=3,
        initial_interval=1.0,
        backoff_factor=2.0,
        max_interval=30.0,
        jitter=0.1,
        retry_on=(ConnectionError, TimeoutError),
    ),
)
```
| Parameter | Description | Default |
|---|---|---|
| `max_attempts` | Maximum total attempts (including the first) | 3 |
| `initial_interval` | Wait before the first retry (seconds) | 0.5 |
| `backoff_factor` | Exponential multiplier applied each attempt | 2.0 |
| `max_interval` | Maximum wait between retries (seconds) | 60.0 |
| `jitter` | Random variation (0–1 fraction of the interval) | 0.0 |
| `retry_on` | Exception types to retry on | `(Exception,)` |
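How these parameters interact can be sketched as a plain-Python schedule computation. This is an illustrative model of exponential backoff with jitter and a cap, not LangGraph's internal implementation:

```python
import random

def backoff_schedule(max_attempts=3, initial_interval=0.5,
                     backoff_factor=2.0, max_interval=60.0, jitter=0.0,
                     rng=random.random):
    """Compute the wait (in seconds) before each retry.

    Returns max_attempts - 1 intervals: the first attempt runs immediately.
    """
    waits = []
    interval = initial_interval
    for _ in range(max_attempts - 1):
        wait = min(interval, max_interval)  # cap at max_interval
        if jitter:
            # Add up to +/- (jitter * wait) of random variation.
            wait += wait * jitter * (2 * rng() - 1)
        waits.append(wait)
        interval *= backoff_factor
    return waits

# With the defaults above: two retries, waiting 0.5s then 1.0s.
print(backoff_schedule())  # → [0.5, 1.0]
```

Raising `max_interval` matters for long retry chains: with `max_attempts=5` and `max_interval=4.0`, the schedule flattens at the cap instead of growing without bound.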
### Default Retry Behavior

Without an explicit `RetryPolicy`, nodes default to `max_attempts=1` (no retries). You must opt in.
### Custom Retry Logic

For advanced retry decisions, inspect execution state:

```python
def my_node(state, *, runtime):
    info = runtime.execution_info
    print(f"Attempt {info.get('retry_attempt')} of {info.get('max_attempts')}")
    # Custom logic based on attempt count
    if info.get("retry_attempt", 0) > 1:
        # Degrade gracefully on later attempts
        return fallback_response(state)
    return primary_response(state)
```

Access retry state via `runtime.execution_info`, which exposes `retry_attempt`, `max_attempts`, and timing data.
## Timeouts

### Timeout Parameter

Set a simple timeout in seconds:

```python
builder.add_node("slow_node", my_node, timeout=30)  # 30-second timeout
```
### TimeoutPolicy

For fine-grained control:

```python
from langgraph.types import TimeoutPolicy

builder.add_node(
    "api_node",
    my_node,
    timeout=TimeoutPolicy(
        run_timeout=60,   # max total execution time
        idle_timeout=10,  # max time without a progress signal
    ),
)
```
| Parameter | Description |
|---|---|
| `run_timeout` | Maximum wall-clock time for the node |
| `idle_timeout` | Maximum time without a progress signal (heartbeat) |
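The difference between the two timers can be sketched with a plain-Python model driven by an injectable clock. This is a conceptual illustration of run vs. idle semantics, not LangGraph's implementation:

```python
class TimeoutTracker:
    """Tracks a run timer (total elapsed) and an idle timer (since last heartbeat)."""

    def __init__(self, run_timeout, idle_timeout, clock):
        self.run_timeout = run_timeout
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.started = clock()
        self.last_heartbeat = self.started

    def heartbeat(self):
        # A progress signal resets only the idle timer, never the run timer.
        self.last_heartbeat = self.clock()

    def check(self):
        now = self.clock()
        if now - self.started > self.run_timeout:
            return "run_timeout"
        if now - self.last_heartbeat > self.idle_timeout:
            return "idle_timeout"
        return "ok"

# Simulate with a fake clock we advance manually.
t = [0.0]
tracker = TimeoutTracker(run_timeout=60, idle_timeout=10, clock=lambda: t[0])
t[0] = 8; tracker.heartbeat()      # progress before the idle limit
t[0] = 15; print(tracker.check())  # → ok (only 7s since last heartbeat)
t[0] = 30; print(tracker.check())  # → idle_timeout (22s without progress)
```

Note that heartbeats keep a slow-but-alive node running, while `run_timeout` still bounds its total lifetime.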
### Progress Signals / Heartbeat

Long-running nodes can send heartbeats to reset the idle timer:

```python
def long_running_node(state, *, runtime):
    for item in large_dataset:
        process(item)
        runtime.heartbeat()  # reset the idle timeout
    return {"done": True}
```
### NodeTimeoutError

When a timeout fires, a `NodeTimeoutError` is raised:

```python
from langgraph.errors import NodeTimeoutError

def handle_errors(state, error):
    if isinstance(error, NodeTimeoutError):
        return {"status": "timeout", "fallback": True}
    raise error

builder.add_node("node", my_node, timeout=10, error_handler=handle_errors)
```
### Dynamic Timeouts with Send

Adjust timeouts at runtime:

```python
from langgraph.types import Send

graph.invoke(inputs, config, timeout=Send(get_dynamic_timeout))
```
### Requirements

- **Async-only**: Timeout support requires async graph execution (`ainvoke`, `astream`)
- **Version**: Requires `langgraph >= 1.2`
## Error Handling

### Error Handler

Attach an error handler to any node:

```python
def my_error_handler(state, error):
    # error is a NodeError with .node and .error fields
    return {"errors": state.get("errors", []) + [str(error)]}

builder.add_node(
    "risky_node",
    my_node,
    error_handler=my_error_handler,
)
```
### NodeError Structure

```python
class NodeError:
    node: str         # name of the failing node
    error: Exception  # the original exception
```
### Saga / Compensation Patterns

Use `Command` in the error handler to route to a compensation flow:

```python
from langgraph.types import Command

def error_handler(state, error):
    return Command(goto="compensate", update={"failed_node": error.node})
This enables Saga-style rollback: if a node fails, route to a compensation node that undoes prior work.
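The rollback idea itself is independent of LangGraph and can be sketched in plain Python: run steps in order, record a compensation for each completed step, and on failure unwind them in reverse. An illustrative sketch with hypothetical step names, not the library's implementation:

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, run the
    compensations for completed steps in reverse, then re-raise."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        raise

log = []

def fail():
    raise RuntimeError("shipping failed")

steps = [
    (lambda: log.append("reserve_stock"), lambda: log.append("release_stock")),
    (lambda: log.append("charge_card"), lambda: log.append("refund_card")),
    (fail, lambda: log.append("cancel_shipment")),  # third step fails
]
try:
    run_saga(steps)
except RuntimeError:
    pass
print(log)  # → ['reserve_stock', 'charge_card', 'refund_card', 'release_stock']
```

The failed step's own compensation never runs; only completed work is undone, in reverse order.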
### Subgraph Failures

Errors in subgraphs propagate to the parent graph's error handler. The parent can route to cleanup or retry.
### Behavior with interrupt()

If a node has both `interrupt()` and an error handler:

- `interrupt()` pauses the node (no error is raised)
- Errors during execution after the interrupt trigger the error handler
- If an error occurs on resume, the error handler fires
## Functional API

The same fault-tolerance primitives work with the Functional API:

```python
from langgraph.func import task, entrypoint

@task(retry_policy=RetryPolicy(max_attempts=3), timeout=30)
def fetch_data(input: str) -> dict:
    return call_api(input)

@entrypoint
def workflow(input: str) -> dict:
    return fetch_data(input).result()
```

`@task` and `@entrypoint` both support the `timeout` and `retry_policy` parameters.
## Compose Order

When multiple mechanisms are set on a node, they execute in this order:

attempt (1st try) -> `retry_policy` (retries) -> `error_handler` (if all retries fail)

1. The node attempts execution
2. On failure, `retry_policy` determines whether and how many times to retry
3. If all retries are exhausted, `error_handler` fires
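This pipeline can be modeled in a few lines of plain Python — an illustrative composition of the semantics described above, not the library's scheduler:

```python
def run_with_fault_tolerance(node, state, max_attempts=3,
                             retry_on=(Exception,), error_handler=None):
    """Attempt -> retries -> error handler, in that order."""
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return node(state)          # 1. attempt execution
        except retry_on as exc:
            last_error = exc            # 2. retry until attempts are exhausted
    if error_handler is not None:
        return error_handler(state, last_error)  # 3. handler fires last
    raise last_error

calls = {"n": 0}

def flaky(state):
    calls["n"] += 1
    raise ConnectionError("down")

result = run_with_fault_tolerance(
    flaky, {}, max_attempts=3,
    error_handler=lambda state, err: {"status": "failed", "error": str(err)},
)
print(calls["n"], result)  # → 3 {'status': 'failed', 'error': 'down'}
```

Exceptions outside `retry_on` propagate immediately, which mirrors why narrowing `retry_on` matters: retrying is only useful for transient failures.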
## Limitations

| Limitation | Detail |
|---|---|
| Python only | Fault-tolerance APIs are Python-only; no equivalent in the JS/TS SDK |
| Async-only timeouts | Timeout features require async execution (`ainvoke`, `astream`) |
| One handler per node | Each node can have at most one `error_handler` |
| No cross-node retry | `RetryPolicy` applies per node; no native mechanism to retry the entire workflow |
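If the entire workflow does need to be retried, one workaround is a plain wrapper loop around the invocation. A hypothetical helper, sketched with a stub standing in for a compiled graph's `invoke`:

```python
import time

def invoke_with_retry(graph_invoke, inputs, max_attempts=3, base_delay=0.0):
    """Retry a whole workflow invocation with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return graph_invoke(inputs)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

attempts = {"n": 0}

def stub_invoke(inputs):
    # Stands in for graph.invoke: fails twice, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return {"ok": True, **inputs}

result = invoke_with_retry(stub_invoke, {"q": "hi"})
print(result)  # → {'ok': True, 'q': 'hi'}
```

Note this re-runs the workflow from the start; combine it with a checkpointer if partially completed work must not be repeated.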
Related: Durable Execution, Persistence, Interrupts