Agent Memory in LangGraph 2026: Short-Term, Long-Term, and the Patterns That Survive Production
How short-term (thread-scoped) and long-term (cross-thread) memory actually work in LangGraph, with code, schemas, and the eviction policies that keep cost predictable.
TL;DR
Most "memory" code I see in agent repos is one of two things: a list of past messages crammed back into the prompt every turn (which is not memory, that is a sliding window), or a vector store with no schema, no namespace, and no eviction (which is not memory either, that is a swamp). LangGraph in 2026 separates these cleanly: a checkpointer for short-term, thread-scoped state, and a store (`BaseStore`) for long-term, cross-thread memory with namespaces, semantic search, and explicit write semantics. This post is about the long-term half — how to design schemas with Pydantic, when to use `InMemoryStore` vs `PostgresStore`, how to wire embedding-based retrieval into `asearch`, and how to run the write step as a background extractor so it does not blow up your turn latency. With the right structure, a memory-augmented agent on `gpt-4o-2024-11-20` adds about 80–140 ms p50 to a turn and recovers facts at recall@5 of 0.92 on our internal benchmark; without structure, you get unbounded token growth and contradictions within a week.
Two Kinds of Memory, and Why People Keep Conflating Them
Short-term memory is the conversation you are currently having. It dies (or should die) when the user closes the tab. LangGraph models it as thread-scoped state persisted by a checkpointer keyed on `thread_id`. We covered durable resumption, interrupts, and time-travel for that layer in our LangGraph checkpointer post — read it if you have not, because the rest of this post assumes you understand that checkpointers are not the right place to store "the user's spouse's name."
Long-term memory is everything that should outlive the thread: facts ("the user's preferred name is Sam"), episodes ("on April 12 the user complained about double-billing"), procedures ("when this user asks for an appointment, always check the Tuesday block first"). These belong in a separate substrate — LangGraph calls it a `store` — that is keyed by a namespace (typically `(user_id, memory_type)` or `(org_id, agent_id)`) rather than a thread.
The taxonomy we use in production, borrowed loosely from cognitive science:
| Memory type | What it stores | Example | Write trigger |
|---|---|---|---|
| Semantic | Stable facts about entities | "User Sam works at Acme, prefers Tuesdays" | Post-turn extractor finds a fact |
| Episodic | Time-stamped events | "On 2026-04-12, user reported billing bug INC-2841" | Significant event detected |
| Procedural | Learned skills / behaviors | "For this user, always confirm address before booking" | Reinforcement from feedback |
Conflating these three into one bucket is the single most common reason memory systems get worse over time. A user's current preference (semantic) and a year-old event (episodic) should not compete for the same retrieval slot.
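One cheap way to enforce that separation in LangGraph is a namespace per memory type, so the three never share a retrieval slot. A minimal sketch of the convention assumed through the rest of this post (the user ID is illustrative):

```python
# One namespace per (user, memory type): a retrieval against "semantic" can
# never be crowded out by a year-old episodic event, and vice versa.
user_id = "user-7421"

semantic_ns = (user_id, "semantic")      # stable facts
episodic_ns = (user_id, "episodic")      # time-stamped events
procedural_ns = (user_id, "procedural")  # learned behaviors
```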
The LangGraph `BaseStore` Interface
The store API is small on purpose. The four methods that matter:
```python
# pip install langgraph==0.2.55 langgraph-checkpoint-postgres==2.0.13
from langgraph.store.base import BaseStore
from langgraph.store.memory import InMemoryStore
from langgraph.store.postgres import PostgresStore

# `store` below is any configured BaseStore instance
# (InMemoryStore for tests, PostgresStore in production; construction shown later).

# Namespace is a tuple; convention is (user_id, memory_type)
namespace = ("user-7421", "semantic")

# Write
await store.aput(
    namespace,
    key="preferred_name",
    value={"fact": "User prefers to be called Sam, not Samantha", "source": "turn-12"},
)

# Point read
record = await store.aget(namespace, key="preferred_name")
# Semantic search across the namespace
hits = await store.asearch(
    namespace,
    query="what does the user like to be called?",
    limit=5,
)

# Delete
await store.adelete(namespace, key="preferred_name")
```
The semantic search is the part teams misuse most. `asearch` only works if you configured the store with an embedding index at construction time — `InMemoryStore()` with no config does prefix matching, not vector retrieval. You have to explicitly opt in.
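If you do want vector retrieval in a notebook, here is a minimal sketch of opting in, assuming your `langgraph` version supports the `index` argument on `InMemoryStore`. The `index` dict is the same shape the Postgres example below uses, so swapping backends later is mostly a constructor change:

```python
from langchain_openai import OpenAIEmbeddings
from langgraph.store.memory import InMemoryStore

# Without `index`, asearch falls back to prefix/filter matching.
# With it, asearch embeds the query and ranks records by vector similarity.
store = InMemoryStore(
    index={
        "dims": 1536,  # must match the embedding model's output dimension
        "embed": OpenAIEmbeddings(model="text-embedding-3-small"),
        "fields": ["fact"],  # embed only the human-readable text field
    }
)
```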
The Read → Reason → Write Cycle
Here is the loop every memory-aware agent runs per turn. Pin it on a wall.
```mermaid
flowchart TD
    A[User message arrives] --> B[Load short-term state from checkpointer]
    B --> C["asearch(namespace, query=user_msg) into long-term store"]
    C --> D[Inject top-K memories into system prompt]
    D --> E[Agent reasons + calls tools + replies]
    E --> F[Persist short-term state via checkpointer]
    F --> G["Background task: extract facts from turn"]
    G --> H{New, non-duplicate, non-contradictory?}
    H -->|yes| I[aput into store with provenance]
    H -->|no| J[Skip or update existing record]
    style C fill:#e0f2fe
    style G fill:#fef3c7
    style I fill:#dcfce7
```
Figure 1 — The two-track memory cycle. Read happens on the critical path; write happens off the critical path. This split is what keeps p50 latency bounded.
The non-obvious property: the write step runs after the response is sent. If you do extraction synchronously, your user pays for it in latency every turn. We push it onto a background task with `asyncio.create_task` (or a proper queue if you have one), and treat memory writes as eventually consistent. A turn that fails to write a memory is a missed opportunity, not a bug.
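Here is one way to wire that split into a single graph node. This is a sketch, not the only shape it can take: `respond_node`, `agent_llm`, the `user_id` config key, and the `Turn` tuple are illustrative names, `write_memory` is the background extractor defined later in this post, and it assumes the graph is compiled with `store=...` so LangGraph injects the store into nodes that declare it. A real deployment would usually hand the write to a proper task queue rather than `asyncio.create_task`.

```python
import asyncio
from collections import namedtuple

Turn = namedtuple("Turn", ["user", "agent"])  # matches write_memory(turn, ...) below

async def respond_node(state: dict, config: dict, *, store) -> dict:
    user_msg = state["messages"][-1].content
    namespace = (config["configurable"]["user_id"], "semantic")

    # Read path (critical path): retrieve only the top-K relevant memories.
    # Hits are SemanticFact dumps written by the background extractor.
    hits = await store.asearch(namespace, query=user_msg, limit=5)
    memory_block = "\n".join(
        f"- {h.value['subject']} {h.value['predicate']}: {h.value['object']}" for h in hits
    )

    reply = await agent_llm.ainvoke(
        [("system", f"Known facts about this user:\n{memory_block}"), *state["messages"]]
    )

    # Write path (off the critical path): fire-and-forget extraction.
    # A dropped write is a missed memory, not a failed turn.
    asyncio.create_task(
        write_memory(
            Turn(user=user_msg, agent=reply.content),
            store,
            namespace,
            run_id=config["configurable"].get("run_id", "unknown"),
        )
    )
    return {"messages": [reply]}
```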
Schema Design — The Part That Actually Matters
Strings as memories are a trap. They look easy, then they become impossible to deduplicate, contradict, or update. We model every memory as a Pydantic class with explicit fields, then serialize to JSON in the store.
```python
from pydantic import BaseModel, Field
from typing import Literal
from datetime import datetime


class SemanticFact(BaseModel):
    """A stable fact about the user or their world."""
    subject: str = Field(description="Entity the fact is about, e.g. 'user' or 'user.spouse'")
    predicate: str = Field(description="Relation, e.g. 'preferred_name', 'works_at'")
    object: str = Field(description="The value of the relation")
    confidence: float = Field(ge=0.0, le=1.0)
    source_run_id: str
    created_at: datetime
    superseded_by: str | None = None  # key of newer fact, if any


class EpisodicEvent(BaseModel):
    """Something that happened, with a timestamp."""
    event_type: Literal["complaint", "purchase", "appointment", "feedback"]
    summary: str = Field(max_length=400)
    occurred_at: datetime
    related_entities: list[str] = Field(default_factory=list)
    source_run_id: str
```
Two design choices that pay off:
- Triple-shaped semantic facts (`subject, predicate, object`). This makes deduplication trivial: if a new fact has the same `(subject, predicate)` as an existing one, you have a candidate update — either confirm and bump confidence, or supersede. Free-form strings give you no such handle.
- `source_run_id` on every record. When a user complains "your bot keeps thinking I live in Boston, I moved last year," you can trace the bad fact back to the exact LangSmith run that wrote it. Without provenance, every memory bug is unfalsifiable.
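A small payoff on the read path: because records are serialized Pydantic models, you can rehydrate and validate them before they reach the prompt. A minimal sketch, assuming the `subject:predicate` key convention used by the extractor later in this post; the confidence threshold is arbitrary:

```python
# Rehydrate a stored record back into the typed model before trusting it.
record = await store.aget(namespace, key="user:preferred_name")
if record is not None:
    fact = SemanticFact.model_validate(record.value)
    if fact.superseded_by is None and fact.confidence >= 0.6:
        prompt_line = f"{fact.subject} {fact.predicate}: {fact.object}"
```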
Wiring an Embedding Index Into `PostgresStore`
For anything past a toy, use Postgres. `InMemoryStore` is for tests and notebooks. The Postgres store uses pgvector under the hood for `asearch`.
```python
from langgraph.store.postgres import PostgresStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small") # 1536 dims, $0.02/1M tokens
store = PostgresStore.from_conn_string(
    conn_string="postgresql://...",
    index={
        "dims": 1536,
        "embed": embeddings,
        "fields": ["summary", "object"],  # which Pydantic fields to embed
    },
)
await store.setup()  # creates tables + pgvector index
```
`fields` is the parameter people miss. By default the store would embed the entire JSON blob, which is wasteful and noisy. You almost always want to embed just the human-meaningful text fields, not metadata like timestamps or run IDs.
Backend Comparison
| Backend | Use case | Persistence | Vector search | Cost signal |
|---|---|---|---|---|
| `InMemoryStore` (no index) | Unit tests | Process-local | Prefix only | Free |
| `InMemoryStore` (w/ embeddings) | Notebook prototyping | Process-local | Yes | Embedding API only |
| `PostgresStore` | Production default | Durable | Yes (pgvector) | DB + embeddings |
| `RedisStore` (community) | Low-latency, ephemeral-ish | TTL-based | Via RediSearch | Memory-bound |
| Custom (`BaseStore` subclass) | Existing vector DB | Depends | Yes | Whatever you wire |
For most teams the answer is `PostgresStore` — you already have Postgres, pgvector is fine up to ~10M vectors per index, and operational simplicity beats marginal latency wins. Reach for a dedicated vector DB when you actually measure pgvector falling over, not before.
The Background Extractor
This is the function that turns "stuff that happened in the conversation" into structured memory. We run it after the response is streamed.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

extractor_llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)
extractor = extractor_llm.with_structured_output(SemanticFact)

EXTRACT_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Extract at most ONE durable, non-trivial fact about the user from this turn. "
     "Skip greetings, fillers, and ephemeral context. Return null if nothing qualifies."),
    ("human", "User said: {user_msg}\nAgent replied: {agent_msg}"),
])


async def write_memory(turn, store, namespace, run_id):
    fact = await (EXTRACT_PROMPT | extractor).ainvoke({
        "user_msg": turn.user,
        "agent_msg": turn.agent,
    })
    if fact is None:
        return

    # Dedup by (subject, predicate)
existing = await store.asearch(
namespace,
query=f"{fact.subject} {fact.predicate}",
limit=3,
)
if any(_same_relation(e.value, fact) for e in existing):
return # already known — could update confidence, but skip for now
fact.source_run_id = run_id
fact.created_at = datetime.utcnow()
await store.aput(namespace, key=f"{fact.subject}:{fact.predicate}", value=fact.model_dump(mode="json"))
```
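The dedup check above leans on a `_same_relation` helper that is not shown. A minimal sketch of what it needs to do, assuming the stored values are `SemanticFact` dumps:

```python
def _same_relation(existing_value: dict, new_fact: SemanticFact) -> bool:
    """True if a stored record describes the same (subject, predicate) pair.

    Stored values are JSON dumps of SemanticFact, so compare the two fields
    that define the relation; a match means "candidate update", not "identical".
    """
    return (
        existing_value.get("subject") == new_fact.subject
        and existing_value.get("predicate") == new_fact.predicate
    )
```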
A few production notes:
- Use a cheap model for extraction. `gpt-4o-mini-2024-07-18` is plenty for "is there a fact in this turn?" Spending `gpt-4o` budget here is a waste.
- Cap to one fact per turn. Otherwise the extractor pads and you get junk memories.
- Dedup before write. A semantic search against the new `(subject, predicate)` pair is cheap and catches 90% of duplicates.
Eviction and Summarization — Cost Predictability
Memory grows. If you never delete, retrieval quality degrades, latency creeps, and your DB bill climbs. We use three eviction policies in combination:
- TTL on episodic events. A complaint from 18 months ago is rarely relevant. We expire `EpisodicEvent` records after 365 days unless explicitly pinned.
- Supersession for semantic facts. When a new `(subject, predicate)` arrives that contradicts an old one, we set `superseded_by` on the old record and prefer the new one in retrieval (a minimal sketch follows at the end of this section). We keep the old record for audit, not for inference.
- Compaction. Once per user per week, a background job summarizes the top 50 episodic events into 5–10 compressed records, then deletes the originals. Token footprint drops about 6x with minimal loss of recall on our internal eval.
The combination keeps per-user storage roughly bounded around 200–400 records, regardless of how chatty the user is.
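Here is the supersession policy from the list above as a minimal sketch. The exact-match contradiction check and the archived-key convention are assumptions for illustration, not LangGraph API:

```python
async def supersede(store, namespace, new_fact: SemanticFact) -> None:
    key = f"{new_fact.subject}:{new_fact.predicate}"
    old = await store.aget(namespace, key=key)

    if old is not None and old.value.get("object") != new_fact.object:
        # Keep the old record for audit: mark it superseded and move it to a
        # versioned key so the canonical key always holds the current fact.
        old_fact = SemanticFact.model_validate(old.value)
        old_fact.superseded_by = key
        archived_key = f"{key}:superseded:{old_fact.created_at.isoformat()}"
        await store.aput(namespace, key=archived_key, value=old_fact.model_dump(mode="json"))

    # Canonical key is what retrieval prefers.
    await store.aput(namespace, key=key, value=new_fact.model_dump(mode="json"))
```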
What This Buys You
On our voice agent benchmark — 100 users, 12-turn conversations, fact recall test at turn 10 — switching from a naive "stuff the last 20 messages into the prompt" baseline to the structured store improved recall@5 from 0.61 to 0.92, dropped p50 turn latency from 1,840 ms to 1,420 ms (because the prompt got shorter, not longer), and cut output token cost per turn by 38%. The trick is that good memory is not "remember more" — it is "retrieve only what matters and keep the prompt small."
If you want the eval methodology that produced those numbers, jump to the memory eval pipeline post — memory you cannot measure is memory you cannot trust. And if you are building voice or chat agents in production, this two-track pattern (checkpointer + store) is the single biggest architectural bet that pays off in month three, not week one.
What to Build First
Do not try to ship semantic, episodic, and procedural memory in one PR. The order that works:
- Get the checkpointer right. Re-read the checkpointer post if needed.
- Add `PostgresStore` with one namespace and one Pydantic schema for semantic facts.
- Wire the background extractor with a cheap model and aggressive dedup.
- Add an eval (recall@k against a golden multi-turn dataset) before you tune anything; a minimal scorer is sketched after this list.
- Only then add episodic events, supersession, and compaction.
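To make the eval step concrete, here is a minimal recall@k scorer. It assumes a hand-curated golden set of (query, expected record keys) pairs; it is a sketch of the metric itself, not the full harness from the eval post:

```python
async def recall_at_k(store, namespace, golden: list[tuple[str, set[str]]], k: int = 5) -> float:
    """Fraction of golden queries whose expected record key appears in the top-k results."""
    hits = 0
    for query, expected_keys in golden:
        results = await store.asearch(namespace, query=query, limit=k)
        retrieved = {r.key for r in results}
        if retrieved & expected_keys:
            hits += 1
    return hits / len(golden) if golden else 0.0
```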
Ship in that order and you will have a memory system that gets better over time instead of one that quietly accumulates contradictions until someone notices.