By Sagar Shankaran, Founder of CallSphere
Use LangGraph's checkpointer to make agents resumable across crashes and human-in-the-loop pauses, then replay any checkpoint into your eval pipeline.
Key takeaways
A LangGraph agent without a checkpointer is a toy. The moment your agent runs longer than one HTTP request — a multi-turn voice call, a 20-minute scheduling negotiation, an overnight research task with a human approval in the middle — you need durable state. LangGraph's checkpointer is the persistence layer that turns the graph from an in-memory state machine into a resumable one: every node transition is committed to a backing store, every `thread_id` is a long-lived conversation, every checkpoint is a time-travel anchor you can fork into a new branch, replay into an eval dataset, or rewind to debug. This piece is the production playbook: `MemorySaver` for tests, `SqliteSaver` for single-node, `PostgresSaver` for everything serious, the schema you should plan for, and how we use checkpoint history as the substrate for our entire offline eval pipeline on CallSphere's voice and chat agents. Pinned: `langgraph==0.2.x`, `langgraph-checkpoint-postgres==2.0.x`, model `gpt-4o-2024-08-06`, judge `gpt-4.1-2025-04-14`.
The naive mental model — "I will just put the message history in Redis, what is the big deal" — misses three things that the LangGraph checkpoint protocol gives you for free:
You cannot reproduce those properties with a plain Redis hash without re-implementing the LangGraph checkpoint protocol. We tried. It cost us six engineer-weeks before we deleted the homegrown version.
LangGraph ships three first-party backends and the protocol is open enough that custom backends are a few hundred lines.
| Backend | Throughput | Durability | Multi-node | Best for |
|---|---|---|---|---|
| `MemorySaver` | Highest | None (process-local) | No | Unit tests, ephemeral demos |
| `SqliteSaver` | Medium | Disk | No (file lock) | Single-node prod, local dev, embedded apps |
| `PostgresSaver` | High | Replicated, ACID | Yes | Multi-replica prod, long-running sessions |
| Custom (Mongo/Redis/Dynamo) | Varies | Varies | Yes | When the rest of your stack is already there |
The choice is mostly about your existing infra. If you are running k3s + Postgres like we are, `PostgresSaver` is the default. If you are deploying an embedded agent on a desktop app, `SqliteSaver` is unbeatable.
Below is the actual setup pattern we use across our healthcare, real estate, sales, salon, IT helpdesk, and after-hours agents. Connection pooling, schema bootstrap, and a thread-per-session model.
```python
import os

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import StateGraph, START, END
from psycopg_pool import AsyncConnectionPool

LLM = ChatOpenAI(model="gpt-4o-2024-08-06", temperature=0)

POOL = AsyncConnectionPool(
    conninfo=os.environ["AGENT_PG_DSN"],
    min_size=2,
    max_size=20,
    kwargs={"autocommit": True, "prepare_threshold": 0},
)


async def get_graph():
    saver = AsyncPostgresSaver(POOL)
    await saver.setup()  # idempotent: creates checkpoint tables on first run

    builder = StateGraph(AgentState)  # AgentState: the graph's state schema, defined elsewhere
    # ... add_node / add_edge / conditional_edges as in the architecture post ...
    return builder.compile(checkpointer=saver)


async def handle_turn(thread_id: str, user_message: str):
    graph = await get_graph()
    config = {"configurable": {"thread_id": thread_id}}
    return await graph.ainvoke(
        {"messages": [HumanMessage(content=user_message)]}, config
    )
```
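`get_graph()` in the setup code runs `setup()` and recompiles the graph on every turn. That is safe because `setup()` is idempotent, but one option is to build the graph once per process and reuse it. A minimal sketch of that idea, assuming a hypothetical `build` coroutine that wraps the body of `get_graph()`; `OnceAsync` is an illustrative helper, not part of LangGraph:

```python
import asyncio
from typing import Awaitable, Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class OnceAsync(Generic[T]):
    """Run an async factory at most once and cache its result."""

    def __init__(self, factory: Callable[[], Awaitable[T]]) -> None:
        self._factory = factory
        self._value: Optional[T] = None
        self._ready = False
        self._lock = asyncio.Lock()

    async def get(self) -> T:
        if not self._ready:
            async with self._lock:
                if not self._ready:  # re-check after acquiring the lock
                    self._value = await self._factory()
                    self._ready = True
        return self._value  # type: ignore[return-value]


async def _demo() -> tuple:
    calls = 0

    async def build() -> str:  # stand-in for: saver.setup() + builder.compile(...)
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)
        return "compiled-graph"

    cached = OnceAsync(build)
    # Five concurrent turns arrive before the graph has been built.
    results = await asyncio.gather(*(cached.get() for _ in range(5)))
    return calls, results


once_demo_calls, once_demo_results = asyncio.run(_demo())
```

In the demo, five concurrent callers share a single build. Whether this is worth doing depends on how expensive your compile step is; the per-turn version is simpler and still correct.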
A few production-grade details:
```mermaid
flowchart TD
    START([Inbound turn]) --> LOAD[Load latest checkpoint by thread_id]
    LOAD --> RUN[Execute next node]
    RUN --> WRITE[Commit checkpoint id=N+1]
    WRITE --> COND{Interrupt or end?}
    COND -->|interrupt_before next| WAIT[Pause: persist + return to caller]
    COND -->|done| FIN([Return final state])
    COND -->|continue| RUN
    WAIT --> EXT[External event: human approval, webhook, retry]
    EXT --> RESUME[Resume with thread_id]
    RESUME --> LOAD
    CRASH[Process crash mid-node] -.->|next request| LOAD
    REPLAY[Eval pipeline: replay checkpoint_id] -.-> LOAD
    style WAIT fill:#ffd
    style CRASH fill:#fcc
    style REPLAY fill:#cfe
```
Figure 1 — Three different recoveries route through the same load-checkpoint primitive: human-in-the-loop resume, crash recovery, and eval replay. That convergence is the whole point of the checkpointer.
Combine `interrupt_before` with the checkpointer and you get an asynchronous approval workflow that survives restarts.
```python
graph = builder.compile(
    checkpointer=saver,
    interrupt_before=["send_refund"],
)

config = {"configurable": {"thread_id": "session-2841"}}
state = await graph.ainvoke(
    {"messages": [HumanMessage(content="Refund order #99")]}, config
)
# The graph is now paused before `send_refund`; the checkpoint sits in Postgres.

# Later, possibly on another pod, once the operator approves:
await graph.aupdate_state(config, {"approved_by": "ops_user_17"})
final = await graph.ainvoke(None, config)  # resume from the same checkpoint
```
This works the same whether the operator approves in 30 seconds or three days. The checkpoint just sits in Postgres. The pod that handles the resume is not necessarily the pod that paused. Multi-replica safety falls out of Postgres's transactional semantics.
We use this pattern on our voice agent demo for any irreversible action — refunds, calendar bookings, outbound emails — and on the after-hours human-handoff queue.
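The resume half of that workflow usually lives in a webhook or queue handler. A small sketch of such a handler, assuming a graph object that exposes the `aupdate_state` and `ainvoke` calls shown above; `approve_and_resume` and `FakeGraph` are hypothetical names used only for illustration:

```python
import asyncio
from typing import Any, Optional


async def approve_and_resume(graph: Any, thread_id: str, approver: str) -> Any:
    """Record the approval on the paused thread, then resume it."""
    config = {"configurable": {"thread_id": thread_id}}
    await graph.aupdate_state(config, {"approved_by": approver})
    return await graph.ainvoke(None, config)  # None = continue, no new input


class FakeGraph:
    """Test double that records calls; stands in for a compiled graph."""

    def __init__(self) -> None:
        self.calls: list = []
        self.state: dict = {}

    async def aupdate_state(self, config: dict, values: dict) -> None:
        self.calls.append(("aupdate_state", config["configurable"]["thread_id"]))
        self.state.update(values)

    async def ainvoke(self, graph_input: Optional[dict], config: dict) -> dict:
        self.calls.append(("ainvoke", graph_input))
        return {**self.state, "status": "refund_sent"}


fake = FakeGraph()
approval_result = asyncio.run(
    approve_and_resume(fake, "session-2841", "ops_user_17")
)
```

Because the handler only needs the `thread_id`, it can run on any replica that shares the same Postgres checkpointer.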
Every checkpoint has an ID. `graph.get_state_history(config)` returns the thread's full history, most recent checkpoint first (use `aget_state_history` from async code). Pick any snapshot and resume from it.
```python
config = {"configurable": {"thread_id": "session-2841"}}
history = [s async for s in graph.aget_state_history(config)]  # newest first

for snap in history:
    print(
        snap.config["configurable"]["checkpoint_id"],
        snap.next,
        snap.values.get("intent"),  # early checkpoints may not have an intent yet
    )

fork_cfg = history[3].config  # three checkpoints before the latest
new_state = await graph.ainvoke(
    {"messages": [HumanMessage(content="Wait, actually...")]}, fork_cfg
)
```
The "fork from a historical checkpoint" property is what makes counterfactual debugging tractable. When a customer says "the agent went off the rails after I asked about returns," we walk the history, fork at the turn before, change one input, and watch the alternate timeline. No mocks, no fakes, real graph, real models, real state — just a different branch.
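Walking the history to find "the turn before" can be automated. A minimal sketch, assuming snapshots expose `.config` and `.values` as in the history loop above and arrive newest first; `find_fork_point` and the fake snapshots are hypothetical and exist only to illustrate the search:

```python
from types import SimpleNamespace
from typing import Any, Callable, Optional, Sequence


def find_fork_point(
    history: Sequence[Any], went_wrong: Callable[[Any], bool]
) -> Optional[Any]:
    """Return the snapshot just before the earliest one where `went_wrong` is true.

    `history` is expected newest first, so it is reversed before scanning.
    """
    oldest_first = list(reversed(history))
    for i, snap in enumerate(oldest_first):
        if went_wrong(snap):
            return oldest_first[i - 1] if i > 0 else None
    return None


def _snap(checkpoint_id: str, intent: Optional[str]) -> SimpleNamespace:
    return SimpleNamespace(
        config={"configurable": {"checkpoint_id": checkpoint_id}},
        values={"intent": intent},
    )


# Fake newest-first history; IDs and intents are made up for illustration.
fake_history = [
    _snap("cp-4", "cancel_account"),
    _snap("cp-3", "cancel_account"),
    _snap("cp-2", "returns"),
    _snap("cp-1", None),
]

fork = find_fork_point(
    fake_history, lambda s: s.values.get("intent") == "cancel_account"
)
fork_checkpoint_id = fork.config["configurable"]["checkpoint_id"]
```

The returned snapshot's `config` is what you would pass to `ainvoke` to start the alternate branch.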
The most common production failure is a tool call that times out or returns garbage. Without a checkpointer, the agent loses the partial conversation and starts over. With one, the failure is automatically replayable.
```python
import asyncio

try:
    final = await graph.ainvoke(turn_input, config)
except ToolTimeoutError:  # your tool layer's timeout exception
    # The checkpointer already persisted state through the last successful node.
    # Wait, retry, or escalate: the graph resumes exactly where it failed.
    await asyncio.sleep(2)
    final = await graph.ainvoke(None, config)  # None = continue from checkpoint
```
The semantics: `ainvoke(None, config)` means "resume the existing thread from its latest checkpoint without injecting new input." That is the right call after a transient failure; the wrong call is to re-send the original turn, which would re-run the router and possibly take a different path.
We log the `checkpoint_id` at every retry into our trace so we can prove the resume was on the right state.
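The resume-then-log pattern generalizes to a small retry helper. This is a sketch under stated assumptions: the graph exposes `ainvoke` and `aget_state`, the transient error type is whatever your tools raise (builtin `TimeoutError` here as a placeholder), and `invoke_with_resume` and `FlakyGraph` are hypothetical names:

```python
import asyncio
import logging
from types import SimpleNamespace
from typing import Any

log = logging.getLogger("agent.retry")


async def invoke_with_resume(
    graph: Any,
    turn_input: dict,
    config: dict,
    retry_on: tuple = (TimeoutError,),
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> Any:
    """Send the turn once; on transient failure, resume with None, never re-send."""
    graph_input = turn_input
    for attempt in range(max_retries + 1):
        try:
            return await graph.ainvoke(graph_input, config)
        except retry_on:
            if attempt == max_retries:
                raise
            snapshot = await graph.aget_state(config)
            checkpoint_id = snapshot.config["configurable"].get("checkpoint_id")
            log.warning("retry %d from checkpoint %s", attempt + 1, checkpoint_id)
            graph_input = None  # continue from the checkpoint, skip the router
            await asyncio.sleep(base_delay * 2**attempt)  # exponential backoff


class FlakyGraph:
    """Test double: fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.inputs: list = []

    async def ainvoke(self, graph_input: Any, config: dict) -> dict:
        self.inputs.append(graph_input)
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("tool timed out")
        return {"status": "done"}

    async def aget_state(self, config: dict) -> SimpleNamespace:
        return SimpleNamespace(config={"configurable": {"checkpoint_id": "cp-demo"}})


flaky = FlakyGraph(failures=2)
retry_result = asyncio.run(
    invoke_with_resume(
        flaky,
        {"messages": ["hi"]},
        {"configurable": {"thread_id": "session-2841"}},
        base_delay=0.0,
    )
)
```

The helper assumes at least one checkpoint exists by the time a tool fails; a failure before the first checkpoint would need the original input re-sent instead.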
Here is the property that justifies the entire checkpointer line item on our infra bill: every historical session is, for free, a candidate eval dataset row.
The pattern:
```python
from langsmith import Client

client = Client()
config = {"configurable": {"thread_id": "session-2841"}}
history = [s async for s in graph.aget_state_history(config)]
target = next(
    s for s in history if s.config["configurable"]["checkpoint_id"] == "cp-17"
)

client.create_example(
    dataset_id=client.read_dataset(dataset_name="voice-regression-suite").id,
    inputs={
        "messages": [m.dict() for m in target.values["messages"]],
        "intent": target.values.get("intent"),
    },
    outputs={"reference_answer": "Refund of $42.10 applied to original card."},
    metadata={
        "thread_id": "session-2841",
        "checkpoint_id": "cp-17",
        "incident_id": "INC-3019",
        "agent_version": "voice-2026.05.06",
    },
)
```
The dataset row is now a permanent regression test. CI replays it on every PR, the gate flags any drop in score, and we cannot ship that bug again. We currently have 412 voice-agent rows and 287 chat-agent rows sourced this way, and the conversion ratio of "production incident" to "permanent test row" is about 87%.
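The gate itself can be as simple as a per-row comparison against the main branch. A minimal sketch, assuming each row already has a judge score for the baseline and for the PR; `RowScore`, `regressions`, the tolerance, and the scores below are all hypothetical and not the actual CallSphere gate:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowScore:
    example_id: str
    baseline: float  # judge score on the main branch
    candidate: float  # judge score on the PR branch


def regressions(rows: list, tolerance: float = 0.05) -> list:
    """Return the IDs of rows whose score dropped by more than `tolerance`."""
    return [r.example_id for r in rows if r.baseline - r.candidate > tolerance]


# Illustrative scores only; not real evaluation data.
demo_rows = [
    RowScore("row-a", baseline=0.90, candidate=0.92),
    RowScore("row-b", baseline=0.80, candidate=0.78),
    RowScore("row-c", baseline=0.95, candidate=0.60),
]
failed_rows = regressions(demo_rows)
gate_passed = not failed_rows
```

A per-row check like this catches a single incident regressing even when the suite's average score holds steady.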
`AsyncPostgresSaver.setup()` creates three data tables (plus a small `checkpoint_migrations` table that tracks the schema version). You should know what is in them before you operate this in production:
| Table | What it stores | Hot path? |
|---|---|---|
| `checkpoints` | One row per (thread_id, checkpoint_ns, checkpoint_id) with serialized state | Yes — read on every resume |
| `checkpoint_blobs` | Channel values, one row per channel version | Yes — joined on read |
| `checkpoint_writes` | Pending writes from in-flight nodes (recovery log) | Mostly write |
Operational notes from running this for ~280k sessions/month:
Not every agent needs a checkpointer. If your graph completes in one HTTP request and you have no human-in-the-loop step, the checkpointer is overhead. Add it the moment you need any of: multi-turn sessions, interrupts, eval replay, or crash recovery.
Multiple replicas can safely serve the same thread, because Postgres handles the locking. We do recommend application-layer idempotency keys on inbound turns so you do not double-invoke if a webhook retries.
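The idempotency-key recommendation can be sketched in a few lines. This version keeps results in process memory purely for illustration; across replicas you would need a shared store (for example a table with a unique constraint on the key). `TurnDeduper` and the event IDs are hypothetical:

```python
import hashlib
from typing import Callable


class TurnDeduper:
    """Remember which inbound turns were already handled, by idempotency key."""

    def __init__(self) -> None:
        self._results: dict = {}

    @staticmethod
    def key(thread_id: str, event_id: str) -> str:
        return hashlib.sha256(f"{thread_id}:{event_id}".encode()).hexdigest()

    def run_once(
        self, thread_id: str, event_id: str, handler: Callable[[], str]
    ) -> str:
        k = self.key(thread_id, event_id)
        if k not in self._results:
            self._results[k] = handler()  # first delivery: invoke the graph
        return self._results[k]  # redelivery: return the stored result


invocations = []


def fake_handler() -> str:  # stand-in for the real graph invocation
    invocations.append("invoked")
    return "reply-1"


deduper = TurnDeduper()
first = deduper.run_once("session-2841", "evt-001", fake_handler)
retry = deduper.run_once("session-2841", "evt-001", fake_handler)  # webhook retry
other = deduper.run_once("session-2841", "evt-002", fake_handler)  # new turn
```

The key combines the thread with the provider's event ID, so a redelivered webhook maps to the same key while a genuinely new turn does not.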
Checkpoints and LangSmith traces are complementary. The checkpoint is the state; the LangSmith trace is the transitions. We tag every checkpoint with the `run_id` of the LangSmith trace that produced it, so given any historical state we can pull the exact trace that got us there.
Custom objects can live in state, with caveats. LangGraph uses msgpack via a serializer protocol; primitive types and LangChain message objects work out of the box. For custom classes, register a serializer or, better, keep state primitive and dereference custom objects from a separate store at node entry.
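The "keep state primitive" option looks roughly like this. A sketch only: `AgentStateSketch`, `Customer`, `CUSTOMER_STORE`, and `greet_node` are hypothetical, and the dictionary stands in for whatever database or cache holds the real objects:

```python
from typing import TypedDict


class AgentStateSketch(TypedDict):
    customer_id: str  # primitive reference, cheap to serialize
    last_reply: str


class Customer:
    """A rich domain object that stays out of the checkpointed state."""

    def __init__(self, customer_id: str, name: str, tier: str) -> None:
        self.customer_id = customer_id
        self.name = name
        self.tier = tier


# Stand-in for a real database or cache lookup.
CUSTOMER_STORE = {"c-42": Customer("c-42", "Dana", "gold")}


def greet_node(state: AgentStateSketch) -> dict:
    customer = CUSTOMER_STORE[state["customer_id"]]  # dereference at node entry
    return {
        "last_reply": f"Hello {customer.name}, thanks for being a {customer.tier} member."
    }


state_update = greet_node({"customer_id": "c-42", "last_reply": ""})
```

Only the ID and the reply string ever reach the checkpointer, so no custom serializer is needed and old checkpoints stay readable if the `Customer` class changes.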
We pin every thread to a graph version in metadata. When we change the graph (add a node, change a reducer), in-flight threads keep their old graph; new threads get the new one. Forced migrations are possible but rare; usually we just let old sessions drain.
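One way to picture the version pinning is a registry lookup keyed by the version stored with the thread. A minimal sketch, with plain callables standing in for compiled graphs and a dictionary standing in for persisted thread metadata; every name here is hypothetical:

```python
from typing import Callable, Dict

CURRENT_VERSION = "v2"

# Hypothetical registry: each entry stands in for a compiled graph.
GRAPHS: Dict[str, Callable[[str], str]] = {
    "v1": lambda msg: f"v1 handled: {msg}",
    "v2": lambda msg: f"v2 handled: {msg}",
}

# Stand-in for thread metadata persisted alongside the checkpoints.
THREAD_VERSIONS: Dict[str, str] = {"session-old": "v1"}


def graph_for_thread(thread_id: str) -> Callable[[str], str]:
    """Pin new threads to the current version; leave existing threads alone."""
    version = THREAD_VERSIONS.setdefault(thread_id, CURRENT_VERSION)
    return GRAPHS[version]


old_reply = graph_for_thread("session-old")("hi")
new_reply = graph_for_thread("session-new")("hi")
```

Old versions can be dropped from the registry once no active thread references them, which is the "let old sessions drain" approach in code form.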
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.