LangGraph Checkpointers in Production: Durable, Resumable Agents with Eval Replay
Use LangGraph's checkpointer to make agents resumable across crashes and human-in-the-loop pauses, then replay any checkpoint into your eval pipeline.
TL;DR
A LangGraph agent without a checkpointer is a toy. The moment your agent runs longer than one HTTP request — a multi-turn voice call, a 20-minute scheduling negotiation, an overnight research task with a human approval in the middle — you need durable state. LangGraph's checkpointer is the persistence layer that turns the graph from an in-memory state machine into a resumable one: every node transition is committed to a backing store, every `thread_id` is a long-lived conversation, every checkpoint is a time-travel anchor you can fork into a new branch, replay into an eval dataset, or rewind to debug. This piece is the production playbook: `MemorySaver` for tests, `SqliteSaver` for single-node, `PostgresSaver` for everything serious, the schema you should plan for, and how we use checkpoint history as the substrate for our entire offline eval pipeline on CallSphere's voice and chat agents. Pinned: `langgraph==0.2.x`, `langgraph-checkpoint-postgres==2.0.x`, model `gpt-4o-2024-08-06`, judge `gpt-4.1-2025-04-14`.
Why Checkpointers Are Load-Bearing
The naive mental model — "I will just put the message history in Redis, what is the big deal" — misses three things that the LangGraph checkpoint protocol gives you for free:
- Per-node atomicity. A checkpoint is written after every node executes successfully. If node 7 crashes, you resume at node 7 with the state that node 6 produced, not at the start of the graph. (Demonstrated in the sketch at the end of this section.)
- Branch and fork semantics. Every checkpoint has a `checkpoint_id`. You can resume from any historical `checkpoint_id` to spawn an alternate timeline — the foundation of time-travel debugging and counterfactual evals.
- Interrupt support. When you `interrupt_before` a node, the checkpointer is what makes "pause indefinitely, resume next Tuesday from the same spot" work without holding a process open.
You cannot reproduce those properties with a plain Redis hash without re-implementing the LangGraph checkpoint protocol. We tried. It cost us six engineer-weeks before we deleted the homegrown version.
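To make the per-node atomicity claim concrete, here is a minimal, self-contained sketch using the in-memory backend: one invocation, then a walk of the history showing one resumable snapshot per node transition.

```python
# minimal demonstration of per-node atomicity with MemorySaver:
# after a single invoke, get_state_history holds one snapshot per transition
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END

class S(TypedDict):
    n: int

builder = StateGraph(S)
builder.add_node("a", lambda s: {"n": s["n"] + 1})
builder.add_node("b", lambda s: {"n": s["n"] * 2})
builder.add_edge(START, "a")
builder.add_edge("a", "b")
builder.add_edge("b", END)
graph = builder.compile(checkpointer=MemorySaver())

cfg = {"configurable": {"thread_id": "demo"}}
graph.invoke({"n": 1}, cfg)

# snapshots come back newest first; each one is a resumable anchor
for snap in graph.get_state_history(cfg):
    print(snap.config["configurable"]["checkpoint_id"], snap.next, snap.values)
```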
The Checkpointer Backends
LangGraph ships three first-party backends and the protocol is open enough that custom backends are a few hundred lines.
| Backend | Throughput | Durability | Multi-node | Best for |
|---|---|---|---|---|
| `MemorySaver` | Highest | None (process-local) | No | Unit tests, ephemeral demos |
| `SqliteSaver` | Medium | Disk | No (file lock) | Single-node prod, local dev, embedded apps |
| `PostgresSaver` | High | Replicated, ACID | Yes | Multi-replica prod, long-running sessions |
| Custom (Mongo/Redis/Dynamo) | Varies | Varies | Yes | When the rest of your stack is already there |
The choice is mostly about your existing infra. If you are running k3s + Postgres like we are, `PostgresSaver` is the default. If you are deploying an embedded agent on a desktop app, `SqliteSaver` is unbeatable.
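The swap itself is a one-line change at compile time. A sketch of the first-party options; note that in recent `langgraph-checkpoint-sqlite` releases `from_conn_string` is a context manager, so check the version you have pinned:

```python
from langgraph.checkpoint.memory import MemorySaver
from langgraph.checkpoint.sqlite import SqliteSaver

# unit tests / ephemeral demos: fastest, gone when the process exits
graph = builder.compile(checkpointer=MemorySaver())

# single-node prod / local dev: durable, one file on disk
with SqliteSaver.from_conn_string("agent_checkpoints.db") as saver:
    graph = builder.compile(checkpointer=saver)

# multi-replica prod: PostgresSaver (full wiring in the next section)
```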
Wiring a Postgres Checkpointer
Below is the actual setup pattern we use across our healthcare, real estate, sales, salon, IT helpdesk, and after-hours agents. Connection pooling, schema bootstrap, and a thread-per-session model.
```python
import os

from psycopg_pool import AsyncConnectionPool
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import StateGraph, START, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

LLM = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0)

POOL = AsyncConnectionPool(
    conninfo=os.environ["AGENT_PG_DSN"],
    min_size=2,
    max_size=20,
    kwargs={"autocommit": True, "prepare_threshold": 0},
)

async def get_graph():
    saver = AsyncPostgresSaver(POOL)
    await saver.setup()  # idempotent — creates checkpoint tables on first run
    builder = StateGraph(AgentState)
    # ... add_node / add_edge / conditional_edges as in the architecture post ...
    return builder.compile(checkpointer=saver)

# in your request handler:
async def handle_turn(thread_id: str, user_message: str):
    graph = await get_graph()  # in production, cache this (see notes below)
    config = {"configurable": {"thread_id": thread_id}}
    return await graph.ainvoke(
        {"messages": [HumanMessage(content=user_message)]}, config
    )
```
A few production-grade details:
- `thread_id` is your conversation primary key. We map our internal `session_id` 1:1 with the `thread_id`. Across pod restarts, an inbound webhook for the same session resumes exactly where it left off.
- `saver.setup()` is idempotent. Run it on app start; it creates `checkpoints`, `checkpoint_blobs`, and `checkpoint_writes` tables. Plan for 2–6KB per checkpoint depending on state size.
- Compile the graph once. `builder.compile(checkpointer=saver)` does meaningful work; cache it for the process lifetime (one way to do that is sketched after this list).
- Use the async saver in async stacks. The sync `PostgresSaver` will block your event loop and tank tail latency.
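A minimal sketch of the "compile once" advice, assuming the `get_graph()` helper above; the lock guards against a thundering herd of first requests racing to build the graph:

```python
import asyncio

_GRAPH = None
_GRAPH_LOCK = asyncio.Lock()

async def get_cached_graph():
    global _GRAPH
    if _GRAPH is None:
        async with _GRAPH_LOCK:
            if _GRAPH is None:  # re-check under the lock
                _GRAPH = await get_graph()
    return _GRAPH
```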
The Resume Cycle, Visualized
```mermaid
flowchart TD
    START([Inbound turn]) --> LOAD[Load latest checkpoint by thread_id]
    LOAD --> RUN[Execute next node]
    RUN --> WRITE[Commit checkpoint id=N+1]
    WRITE --> COND{Interrupt or end?}
    COND -->|interrupt_before next| WAIT[Pause: persist + return to caller]
    COND -->|done| FIN([Return final state])
    COND -->|continue| RUN
    WAIT --> EXT[External event: human approval, webhook, retry]
    EXT --> RESUME[Resume with thread_id]
    RESUME --> LOAD
    CRASH[Process crash mid-node] -.->|next request| LOAD
    REPLAY[Eval pipeline: replay checkpoint_id] -.-> LOAD
    style WAIT fill:#ffd
    style CRASH fill:#fcc
    style REPLAY fill:#cfe
```
Figure 1 — Three different recoveries route through the same load-checkpoint primitive: human-in-the-loop resume, crash recovery, and eval replay. That convergence is the whole point of the checkpointer.
Human-in-the-Loop Pause and Resume
Combine `interrupt_before` with the checkpointer and you get an asynchronous approval workflow that survives restarts.
```python
graph = builder.compile(
    checkpointer=saver,
    interrupt_before=["send_refund"],
)

# 1. caller turn arrives, runs to the interrupt
config = {"configurable": {"thread_id": "session-2841"}}
state = await graph.ainvoke(
    {"messages": [HumanMessage(content="Refund order #99")]}, config
)
# graph paused; the proposed action is in state["messages"][-1]

# 2. operator reviews in your admin UI; eventually approves.
# Either resume as-is, or edit the proposed action first:
await graph.aupdate_state(config, {"approved_by": "ops_user_17"})
final = await graph.ainvoke(None, config)  # resume from the same checkpoint
```
This works the same whether the operator approves in 30 seconds or three days. The checkpoint just sits in Postgres. The pod that handles the resume is not necessarily the pod that paused. Multi-replica safety falls out of Postgres's transactional semantics.
We use this pattern on our voice agent demo for any irreversible action — refunds, calendar bookings, outbound emails — and on the after-hours human-handoff queue.
Time-Travel Debugging
Every checkpoint has an ID. `graph.get_state_history(config)` returns the full list of snapshots, newest first. Pick any one and resume.
```python
config = {"configurable": {"thread_id": "session-2841"}}
history = [s async for s in graph.aget_state_history(config)]

# state at every prior step:
for snap in history:
    print(
        snap.config["configurable"]["checkpoint_id"],
        snap.next,
        snap.values["intent"],
    )

# resume from a specific historical checkpoint as a new branch:
fork_cfg = history[3].config  # 4 steps ago
new_state = await graph.ainvoke(
    {"messages": [HumanMessage(content="Wait, actually...")]}, fork_cfg
)
```
The "fork from a historical checkpoint" property is what makes counterfactual debugging tractable. When a customer says "the agent went off the rails after I asked about returns," we walk the history, fork at the turn before, change one input, and watch the alternate timeline. No mocks, no fakes, real graph, real models, real state — just a different branch.
Resuming After a Tool Failure
The most common production failure is a tool call that times out or returns garbage. Without a checkpointer, the agent loses the partial conversation and starts over. With one, you resume from the last committed checkpoint and retry only the failed step.
```python
import asyncio

try:
    final = await graph.ainvoke(turn_input, config)
except ToolTimeoutError:
    # checkpointer already persisted state through the last successful node.
    # Wait, retry, or escalate — the graph resumes exactly where it failed.
    await asyncio.sleep(2)
    final = await graph.ainvoke(None, config)  # None = continue from checkpoint
```
The semantics: `ainvoke(None, config)` means "resume the existing thread from its latest checkpoint without injecting new input." That is the right call after a transient failure; the wrong call is to re-send the original turn, which would re-run the router and possibly take a different path.
We log the `checkpoint_id` at every retry into our trace so we can prove the resume was on the right state.
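A sketch of that logging, assuming a structlog-style `logger`; `aget_state` returns the latest snapshot, and its config carries the `checkpoint_id` we are about to resume from:

```python
snapshot = await graph.aget_state(config)
logger.info(
    "tool_retry_resume",
    thread_id=config["configurable"]["thread_id"],
    checkpoint_id=snapshot.config["configurable"]["checkpoint_id"],
)
final = await graph.ainvoke(None, config)  # resume on the state we just logged
```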
Replaying Checkpoints into Your Eval Pipeline
Here is the property that justifies the entire checkpointer line item on our infra bill: every historical session is, for free, a candidate eval dataset row.
The pattern:
- A customer reports a bad outcome on `thread_id=session-2841` at `checkpoint_id=cp-17`.
- We pull the state at `cp-17` from Postgres.
- We snapshot the inputs, write a correct reference output (with help from the domain expert who owns the agent), and create a row in our LangSmith regression dataset.
- Every future eval run replays that exact state through the current graph and grades the output against the reference.
```python
from langsmith import Client

client = Client()
config = {"configurable": {"thread_id": "session-2841"}}
history = [s async for s in graph.aget_state_history(config)]
target = next(
    s for s in history
    if s.config["configurable"]["checkpoint_id"] == "cp-17"
)

client.create_example(
    dataset_id=client.read_dataset(dataset_name="voice-regression-suite").id,
    inputs={
        "messages": [m.dict() for m in target.values["messages"]],
        "intent": target.values.get("intent"),
    },
    outputs={"reference_answer": "Refund of $42.10 applied to original card."},
    metadata={
        "thread_id": "session-2841",
        "checkpoint_id": "cp-17",
        "incident_id": "INC-3019",
        "agent_version": "voice-2026.05.06",
    },
)
```
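The replay half (step 4) is the mirror image. A minimal sketch: pull every row, run it through the current graph on a fresh `thread_id` so no old state leaks in, and pair the outputs up for the judge. The scoring step itself is elided here.

```python
async def replay_regression_suite(graph):
    client = Client()
    for example in client.list_examples(dataset_name="voice-regression-suite"):
        # fresh thread per replay: the original session's history stays untouched
        config = {"configurable": {"thread_id": f"eval-{example.id}"}}
        result = await graph.ainvoke(example.inputs, config)
        yield example, result  # hand off to the judge / scoring step
```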
The dataset row is now a permanent regression test. CI replays it on every PR, the gate flags any drop in score, and we cannot ship that bug again. We currently have 412 voice-agent rows and 287 chat-agent rows sourced this way, and the conversion ratio of "production incident" to "permanent test row" is about 87%.
A Schema Worth Knowing About
`AsyncPostgresSaver.setup()` creates three tables. You should know what is in them before you operate this in production:
| Table | What it stores | Hot path? |
|---|---|---|
| `checkpoints` | One row per (thread_id, checkpoint_ns, checkpoint_id) with serialized state | Yes — read on every resume |
| `checkpoint_blobs` | Channel values, versioned so unchanged channels are not rewritten on every checkpoint | Yes — joined on read |
| `checkpoint_writes` | Pending writes from in-flight nodes (recovery log) | Mostly write |
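A quick way to audit state size against the default schema. Table and column names here match what `saver.setup()` creates in `langgraph-checkpoint-postgres` 2.0.x; note that channel values live in `checkpoint_blobs`, so this understates the true total.

```python
SIZE_AUDIT_SQL = """
SELECT thread_id,
       count(*) AS checkpoints,
       pg_size_pretty(sum(pg_column_size(checkpoint))::bigint) AS state_size
FROM checkpoints
GROUP BY thread_id
ORDER BY sum(pg_column_size(checkpoint)) DESC
LIMIT 20;
"""

async def audit_checkpoint_sizes(pool):
    async with pool.connection() as conn:
        cur = await conn.execute(SIZE_AUDIT_SQL)
        return await cur.fetchall()
```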
Operational notes from running this for ~280k sessions/month:
- Index `thread_id` aggressively — the default schema does, but if you customize, do not break it.
- Set a TTL or archival job. We move threads idle for 30 days to a cold table; sessions older than 6 months are dumped to S3 in case we need them for an audit. (A sketch of the archival sweep follows this list.)
- State size matters. Anything over ~50KB per checkpoint kills your read latency. Keep large blobs (uploaded files, audio) by reference, not by value, in the state.
- Vacuum and analyze. Postgres autovacuum defaults are fine on these tables at low traffic; at our volume we tuned `autovacuum_vacuum_scale_factor` down on the checkpoint tables specifically.
- Backups are non-negotiable. A lost checkpoint table is a lost product.
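The archival sweep, sketched under two loudly-labeled assumptions: you keep your own `sessions` table with a `last_activity` column (the default checkpoint schema has no per-row timestamp you should lean on), and `checkpoints_cold` mirrors the `checkpoints` shape. The same `DELETE ... RETURNING` pattern applies to `checkpoint_blobs` and `checkpoint_writes`.

```python
ARCHIVE_SQL = """
WITH stale AS (
    SELECT session_id
    FROM sessions                          -- hypothetical app-level table
    WHERE last_activity < now() - interval '30 days'
),
moved AS (
    DELETE FROM checkpoints c
    USING stale s
    WHERE c.thread_id = s.session_id
    RETURNING c.*
)
INSERT INTO checkpoints_cold               -- hypothetical cold table, same shape
SELECT * FROM moved;
"""

async def archive_stale_threads(pool):
    async with pool.connection() as conn:
        await conn.execute(ARCHIVE_SQL)
```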
Honest Tradeoffs
- Latency. Each checkpoint is one Postgres write. With pooled connections we see ~3–6ms median; under contention p99 spikes to 25ms. For voice agents where every ms matters, this is real but acceptable.
- Cost. ~$40/month of Postgres for our session volume. Nothing compared to the rollback cost it prevents.
- Schema migrations. When LangGraph bumps the checkpoint format (rare but it happens), you run `saver.setup()` and tolerate a backfill window. Pin the version in production.
- Custom backends are a trap. We tried Redis once. It worked until it did not — at scale, the lack of transactional multi-key semantics caused intermittent torn writes. Postgres or SQLite; pick one.
Frequently Asked Questions
Do I need a checkpointer for short, single-turn agents?
No. If your graph completes in one HTTP request and you have no human-in-the-loop step, the checkpointer is overhead. Add it the moment you need any of: multi-turn sessions, interrupts, eval replay, or crash recovery.
Can I have multiple replicas resuming the same thread?
Yes — Postgres handles the locking. We do recommend application-layer idempotency keys on inbound turns so you do not double-invoke if a webhook retries.
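A sketch of that idempotency guard, assuming a redis.asyncio-style `cache` client and the `handle_turn` helper from earlier; `turn_id` stands in for whatever unique delivery id your webhook provider sends:

```python
async def handle_turn_once(thread_id: str, turn_id: str, message: str):
    # NX set: falsy return means the key already exists, i.e. duplicate delivery
    fresh = await cache.set(f"turn:{turn_id}", "1", nx=True, ex=3600)
    if not fresh:
        return await load_last_response(thread_id)  # hypothetical stored-answer replay
    return await handle_turn(thread_id, message)
```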
How does this interact with LangSmith tracing?
They are complementary. The checkpoint is the state; the LangSmith trace is the transitions. We tag every checkpoint with the `run_id` of the LangSmith trace that produced it, so given any historical state we can pull the exact trace that got us there.
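A sketch of that tagging, assuming the 0.2.x behavior of merging `config["metadata"]` into each checkpoint's metadata: the same UUID becomes both the LangSmith trace id and the join key stored on the checkpoint.

```python
import uuid

run_id = uuid.uuid4()
config = {
    "run_id": run_id,  # becomes the LangSmith trace id for this invocation
    "metadata": {"langsmith_run_id": str(run_id)},  # merged into checkpoint metadata
    "configurable": {"thread_id": thread_id},
}
await graph.ainvoke(turn_input, config)
```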
Can I serialize custom Python objects in state?
Yes, with caveats. LangGraph uses msgpack via a serializer protocol; primitive types and LangChain message objects work out of the box. For custom classes, register a serializer or — better — keep state primitive and dereference custom objects from a separate store at node entry.
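A sketch of the "keep state primitive" pattern; `document_store` is a hypothetical stand-in for whatever blob store you already run, and `LLM` is the client from the setup block:

```python
from typing import Annotated, Optional, TypedDict

from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    uploaded_file_id: Optional[str]  # reference into the blob store, not the bytes

async def summarize_upload(state: AgentState):
    # dereference at node entry; the checkpoint only ever stores the id
    blob = await document_store.get(state["uploaded_file_id"])  # hypothetical store
    summary = await LLM.ainvoke(f"Summarize:\n{blob[:4000]}")
    return {"messages": [summary]}
```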
What about checkpointing across graph version upgrades?
We pin every thread to a graph version in metadata. When we change the graph (add a node, change a reducer), in-flight threads keep their old graph; new threads get the new one. Forced migrations are possible but rare; usually we just let old sessions drain.
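A sketch of the version pinning, with everything app-level labeled hypothetical: a registry of compiled graphs keyed by version, and a `sessions` store that stamps each thread at creation.

```python
LATEST = "voice-2026.05.06"
GRAPH_REGISTRY = {  # hypothetical: one compiled graph per shipped version
    "voice-2026.04.12": graph_v1,
    "voice-2026.05.06": graph_v2,
}

async def graph_for_thread(thread_id: str):
    version = await sessions.get_graph_version(thread_id)  # hypothetical store
    return GRAPH_REGISTRY.get(version or LATEST, GRAPH_REGISTRY[LATEST])
```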