TL;DR

Most "memory" code I see in agent repos is one of two things: a list of past messages crammed back into the prompt every turn (which is not memory, that is a sliding window), or a vector store with no schema, no namespace, and no eviction (which is not memory either, that is a swamp). LangGraph in 2026 separates these cleanly: a checkpointer for short-term, thread-scoped state, and a store (`BaseStore`) for long-term, cross-thread memory with namespaces, semantic search, and explicit write semantics. This post is about the long-term half — how to design schemas with Pydantic, when to use `InMemoryStore` vs `PostgresStore`, how to wire embedding-based retrieval into `asearch`, and how to run the write step as a background extractor so it does not blow up your turn latency. With the right structure, a memory-augmented agent on `gpt-4o-2024-11-20` adds about 80–140 ms p50 to a turn and recovers facts at recall@5 of 0.92 on our internal benchmark; without structure, you get unbounded token growth and contradictions within a week.

Two Kinds of Memory, and Why People Keep Conflating Them

Short-term memory is the conversation you are currently having. It dies (or should die) when the user closes the tab. LangGraph models it as thread-scoped state persisted by a checkpointer keyed on `thread_id`. We covered durable resumption, interrupts, and time-travel for that layer in our LangGraph checkpointer post — read it if you have not, because the rest of this post assumes you understand that checkpointers are not the right place to store "the user's spouse's name."

Long-term memory is everything that should outlive the thread: facts ("the user's preferred name is Sam"), episodes ("on April 12 the user complained about double-billing"), procedures ("when this user asks for an appointment, always check the Tuesday block first"). These belong in a separate substrate — LangGraph calls it a `store` — that is keyed by a namespace (typically `(user_id, memory_type)` or `(org_id, agent_id)`) rather than a thread.

The taxonomy we use in production, borrowed loosely from cognitive science:

| Memory type | What it stores | Example | Write trigger |
| --- | --- | --- | --- |
| Semantic | Stable facts about entities | "User Sam works at Acme, prefers Tuesdays" | Post-turn extractor finds a fact |
| Episodic | Time-stamped events | "On 2026-04-12, user reported billing bug INC-2841" | Significant event detected |
| Procedural | Learned skills / behaviors | "For this user, always confirm address before booking" | Reinforcement from feedback |

Conflating these three into one bucket is the single most common reason memory systems get worse over time. A user's current preference (semantic) and a year-old event (episodic) should not compete for the same retrieval slot.
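
One way to keep the three types from competing is to give each its own namespace and its own retrieval budget. The sketch below assumes the `(user_id, memory_type)` namespace convention; `MemoryType`, `namespace_for`, `retrieval_plan`, and the limits in `RETRIEVAL_BUDGET` are hypothetical names and numbers, not part of LangGraph.

```python
from enum import Enum


class MemoryType(str, Enum):
    SEMANTIC = "semantic"
    EPISODIC = "episodic"
    PROCEDURAL = "procedural"


# Hypothetical per-type limits: each type gets its own retrieval slots,
# so a year-old episode cannot push out a current preference.
RETRIEVAL_BUDGET = {
    MemoryType.SEMANTIC: 3,
    MemoryType.EPISODIC: 2,
    MemoryType.PROCEDURAL: 1,
}


def namespace_for(user_id: str, memory_type: MemoryType) -> tuple[str, str]:
    """Build the (user_id, memory_type) namespace tuple."""
    return (user_id, memory_type.value)


def retrieval_plan(user_id: str) -> list[tuple[tuple[str, str], int]]:
    """One (namespace, limit) pair per memory type, each searched separately."""
    return [(namespace_for(user_id, t), k) for t, k in RETRIEVAL_BUDGET.items()]
```

Each pair in the plan would map to one search call against the store, so the budgets can be tuned per type without touching the others.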

The LangGraph `BaseStore` Interface

The store API is small on purpose. The four methods that matter:

```python
# pip install langgraph==0.2.55 langgraph-checkpoint-postgres==2.0.13

from langgraph.store.base import BaseStore
from langgraph.store.memory import InMemoryStore
from langgraph.store.postgres import PostgresStore

# Namespace is a tuple; convention is (user_id, memory_type)
namespace = ("user-7421", "semantic")

# Write
await store.aput(
    namespace,
    key="preferred_name",
    value={"fact": "User prefers to be called Sam, not Samantha", "source": "turn-12"},
)

# Point read
record = await store.aget(namespace, key="preferred_name")

# Semantic search across the namespace
hits = await store.asearch(
    namespace,
    query="what does the user like to be called?",
    limit=5,
)

# Delete
await store.adelete(namespace, key="preferred_name")
```

Semantic search is the part teams misuse most. `asearch` only ranks results by meaning if you configured the store with an embedding index at construction time. `InMemoryStore()` with no config matches on the namespace prefix (plus any filter you pass) and does no vector retrieval. You have to opt in explicitly.

The Read → Reason → Write Cycle

Here is the loop every memory-aware agent runs per turn. Pin it on a wall.

```mermaid
flowchart TD
    A[User message arrives] --> B[Load short-term state from checkpointer]
    B --> C["asearch(namespace, query=user_msg) into long-term store"]
    C --> D[Inject top-K memories into system prompt]
    D --> E[Agent reasons + calls tools + replies]
    E --> F[Persist short-term state via checkpointer]
    F --> G[Background task: extract facts from turn]
    G --> H{New, non-duplicate, non-contradictory?}
    H -->|yes| I[aput into store with provenance]
    H -->|no| J[Skip or update existing record]
    style C fill:#e0f2fe
    style G fill:#fef3c7
    style I fill:#dcfce7
```

Figure 1 — The two-track memory cycle. Read happens on the critical path; write happens off the critical path. This split is what keeps p50 latency bounded.
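
The read track (steps C and D in Figure 1) can be sketched in plain Python. The `Hit` dataclass is a hypothetical stand-in for whatever the store's search returns (a key, a `value` dict, and a relevance score), and `render_memory_block` is an illustrative helper, not a LangGraph API.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for one search result from the store."""
    key: str
    value: dict
    score: float


def render_memory_block(hits: list[Hit], k: int = 5, min_score: float = 0.0) -> str:
    """Format the top-k hits as a compact block for the system prompt."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    kept = [h for h in ranked if h.score >= min_score][:k]
    if not kept:
        return ""  # nothing relevant: add nothing to the prompt
    lines = [f"- {h.value.get('fact', h.key)}" for h in kept]
    return "Known facts about this user:\n" + "\n".join(lines)


hits = [
    Hit("city", {"fact": "User moved from Boston last year"}, 0.41),
    Hit("preferred_name", {"fact": "User prefers to be called Sam"}, 0.87),
]
memory_block = render_memory_block(hits, k=1)
```

Capping at `k` and dropping low-score hits is what keeps the injected block, and therefore the prompt, small.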

The non-obvious property: the write step runs after the response is sent. If you do extraction synchronously, your user pays for it in latency every turn. We push it onto a background task with `asyncio.create_task` (or a proper queue if you have one), and treat memory writes as eventually consistent. A turn that fails to write a memory is a missed opportunity, not a bug.

Schema Design — The Part That Actually Matters

Strings as memories are a trap. They look easy, then they become impossible to deduplicate, contradict, or update. We model every memory as a Pydantic class with explicit fields, then serialize to JSON in the store.

```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field


class SemanticFact(BaseModel):
    """A stable fact about the user or their world."""

    subject: str = Field(description="Entity the fact is about, e.g. 'user' or 'user.spouse'")
    predicate: str = Field(description="Relation, e.g. 'preferred_name', 'works_at'")
    object: str = Field(description="The value of the relation")
    confidence: float = Field(ge=0.0, le=1.0)
    source_run_id: str
    created_at: datetime
    superseded_by: str | None = None  # key of newer fact, if any


class EpisodicEvent(BaseModel):
    """Something that happened, with a timestamp."""

    event_type: Literal["complaint", "purchase", "appointment", "feedback"]
    summary: str = Field(max_length=400)
    occurred_at: datetime
    related_entities: list[str] = []
    source_run_id: str
```

Two design choices that pay off:

- Triple-shaped semantic facts (`subject, predicate, object`). This makes deduplication trivial: if a new fact has the same `(subject, predicate)` as an existing one, you have a candidate update, and you either confirm it and bump confidence or supersede it. Free-form strings give you no such handle.
- `source_run_id` on every record. When a user complains "your bot keeps thinking I live in Boston, I moved last year," you can trace the bad fact back to the exact LangSmith run that wrote it. Without provenance, every memory bug is unfalsifiable.
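
The confirm-or-supersede decision that the triple shape enables can be sketched as a small pure function. `Fact` is a simplified, stdlib-only stand-in for `SemanticFact`, and `reconcile` and its three action names are hypothetical; the case-insensitive comparison of `object` is an assumption you may want to replace with something stricter or fuzzier.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Fact:
    """Simplified stand-in for SemanticFact (hypothetical, stdlib only)."""
    subject: str
    predicate: str
    object: str
    confidence: float


Action = Literal["insert", "confirm", "supersede"]


def reconcile(existing: Optional[Fact], new: Fact) -> Action:
    """Decide what to do with a new fact, given the record already stored
    under the same (subject, predicate), if any."""
    if existing is None:
        return "insert"
    if (existing.subject, existing.predicate) != (new.subject, new.predicate):
        raise ValueError("facts must share (subject, predicate) to be reconciled")
    if existing.object.strip().lower() == new.object.strip().lower():
        return "confirm"    # same value: bump confidence on the old record
    return "supersede"      # different value: point superseded_by at the new key


old = Fact("user", "lives_in", "Boston", 0.8)
print(reconcile(None, old))                                     # insert
print(reconcile(old, Fact("user", "lives_in", "boston", 0.9)))  # confirm
print(reconcile(old, Fact("user", "lives_in", "Denver", 0.9)))  # supersede
```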

Wiring an Embedding Index Into `PostgresStore`

For anything past a toy, use Postgres. `InMemoryStore` is for tests and notebooks. The Postgres store uses pgvector under the hood for `asearch`.

```python
from langchain_openai import OpenAIEmbeddings
from langgraph.store.postgres import AsyncPostgresStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dims, $0.02/1M tokens

# from_conn_string is a context manager. The async variant of PostgresStore
# is used here because setup() and the other store calls are awaited.
async with AsyncPostgresStore.from_conn_string(
    conn_string="postgresql://...",
    index={
        "dims": 1536,
        "embed": embeddings,
        "fields": ["summary", "object"],  # which Pydantic fields to embed
    },
) as store:
    await store.setup()  # creates tables + pgvector index
```

`fields` is the parameter people miss. By default the store would embed the entire JSON blob, which is wasteful and noisy. You almost always want to embed just the human-meaningful text fields, not metadata like timestamps or run IDs.

Backend Comparison

| Backend | Use case | Persistence | Vector search | Cost signal |
| --- | --- | --- | --- | --- |
| `InMemoryStore` (no index) | Unit tests | Process-local | Prefix only | Free |
| `InMemoryStore` (w/ embeddings) | Notebook prototyping | Process-local | Yes | Embedding API only |
| `PostgresStore` | Production default | Durable | Yes (pgvector) | DB + embeddings |
| `RedisStore` (community) | Low-latency, ephemeral-ish | TTL-based | Via RediSearch | Memory-bound |
| Custom (`BaseStore` subclass) | Existing vector DB | Depends | Yes | Whatever you wire |

For most teams the answer is `PostgresStore` — you already have Postgres, pgvector is fine up to ~10M vectors per index, and operational simplicity beats marginal latency wins. Reach for a dedicated vector DB when you actually measure pgvector falling over, not before.
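
For a rough sense of what ~10M vectors means on disk, the arithmetic below uses pgvector's documented storage size for the `vector` type (4 bytes per dimension plus 8 bytes). It covers raw vector data only; index size and row overhead are not modeled, and the helper names are hypothetical.

```python
def raw_vector_bytes(dims: int) -> int:
    """pgvector stores a `vector` as 4 bytes per dimension plus 8 bytes of overhead."""
    return 4 * dims + 8


def raw_storage_gb(num_vectors: int, dims: int) -> float:
    """Raw vector storage in decimal GB, excluding index and row overhead."""
    return num_vectors * raw_vector_bytes(dims) / 1e9


print(raw_vector_bytes(1536))                      # 6152
print(round(raw_storage_gb(10_000_000, 1536), 1))  # 61.5
```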

The Background Extractor

This is the function that turns "stuff that happened in the conversation" into structured memory. We run it after the response is streamed.

```python
from datetime import datetime, timezone

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

extractor_llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)
extractor = extractor_llm.with_structured_output(SemanticFact)

EXTRACT_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Extract at most ONE durable, non-trivial fact about the user from this turn. "
     "Skip greetings, fillers, and ephemeral context. Return null if nothing qualifies."),
    ("human", "User said: {user_msg}\nAgent replied: {agent_msg}"),
])


def _same_relation(stored: dict, fact: SemanticFact) -> bool:
    """Minimal check: does a stored record share (subject, predicate) with the new fact?"""
    return (stored.get("subject"), stored.get("predicate")) == (fact.subject, fact.predicate)


async def write_memory(turn, store, namespace, run_id):
    fact = await (EXTRACT_PROMPT | extractor).ainvoke({
        "user_msg": turn.user,
        "agent_msg": turn.agent,
    })
    if fact is None:
        return

    # Dedup by (subject, predicate)
    existing = await store.asearch(
        namespace,
        query=f"{fact.subject} {fact.predicate}",
        limit=3,
    )
    if any(_same_relation(e.value, fact) for e in existing):
        return  # already known; could update confidence, but skip for now

    # Set provenance here rather than trusting whatever the model filled in
    fact.source_run_id = run_id
    fact.created_at = datetime.now(timezone.utc)
    await store.aput(
        namespace,
        key=f"{fact.subject}:{fact.predicate}",
        value=fact.model_dump(mode="json"),
    )
```

A few production notes:

- Use a cheap model for extraction. `gpt-4o-mini-2024-07-18` is plenty for "is there a fact in this turn?" Spending `gpt-4o` budget here is a waste.
- Cap to one fact per turn. Otherwise the extractor pads and you get junk memories.
- Dedup before write. A semantic search against the new `(subject, predicate)` pair is cheap and catches 90% of duplicates.

Eviction and Summarization — Cost Predictability

Memory grows. If you never delete, retrieval quality degrades, latency creeps, and your DB bill climbs. We use three eviction policies in combination:

- TTL on episodic events. A complaint from 18 months ago is rarely relevant. We expire `EpisodicEvent` records after 365 days unless explicitly pinned.
- Supersession for semantic facts. When a new `(subject, predicate)` arrives that contradicts an old one, we set `superseded_by` on the old record and prefer the new one in retrieval. We keep the old record for audit, not for inference.
- Compaction. Once per user per week, a background job summarizes the top 50 episodic events into 5–10 compressed records, then deletes the originals. Token footprint drops about 6x with minimal loss of recall on our internal eval.

The combination keeps per-user storage roughly bounded around 200–400 records, regardless of how chatty the user is.
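
The first two policies reduce to small predicates over stored records. This is a stdlib-only sketch that assumes records are the JSON dicts produced by `model_dump(mode="json")`; the `pinned` field and the helper names are hypothetical and do not appear in the schemas above.

```python
from datetime import datetime, timedelta, timezone

EPISODIC_TTL = timedelta(days=365)


def is_expired(event: dict, now: datetime) -> bool:
    """TTL check for an episodic record; pinned records never expire."""
    if event.get("pinned", False):  # `pinned` is a hypothetical field
        return False
    # Normalize a trailing "Z" so older Python versions can parse the timestamp.
    occurred_at = datetime.fromisoformat(event["occurred_at"].replace("Z", "+00:00"))
    return now - occurred_at > EPISODIC_TTL


def active_facts(records: list[dict]) -> list[dict]:
    """Drop superseded semantic facts from retrieval; they stay stored for audit."""
    return [r for r in records if r.get("superseded_by") is None]
```

A periodic job would delete records where `is_expired` is true, while `active_facts` is applied at read time so superseded facts never reach the prompt.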

What This Buys You

On our voice agent benchmark — 100 users, 12-turn conversations, fact recall test at turn 10 — switching from a naive "stuff the last 20 messages into the prompt" baseline to the structured store improved recall@5 from 0.61 to 0.92, dropped p50 turn latency from 1,840 ms to 1,420 ms (because the prompt got shorter, not longer), and cut output token cost per turn by 38%. The trick is that good memory is not "remember more" — it is "retrieve only what matters and keep the prompt small."

If you want the eval methodology that produced those numbers, jump to the memory eval pipeline post — memory you cannot measure is memory you cannot trust. And if you are building voice or chat agents in production, this two-track pattern (checkpointer + store) is the single biggest architectural bet that pays off in month three, not week one.
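
This post does not spell out how recall@5 is computed. One common definition, assumed here, is the fraction of relevant memory keys that show up in the top k retrieved keys, averaged over test cases. The function names and the toy cases are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant memory keys that appear in the top-k retrieved keys."""
    if not relevant:
        raise ValueError("need at least one relevant key")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) test cases."""
    return sum(recall_at_k(r, rel, k) for r, rel in cases) / len(cases)


cases = [
    (["preferred_name", "works_at", "city"], {"preferred_name"}),  # 1 of 1 found
    (["city", "works_at"], {"preferred_name", "works_at"}),        # 1 of 2 found
]
print(mean_recall_at_k(cases, k=5))  # 0.75
```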

What to Build First

Do not try to ship semantic, episodic, and procedural memory in one PR. The order that works:

1. Get the checkpointer right. Re-read the checkpointer post if needed.
2. Add `PostgresStore` with one namespace and one Pydantic schema for semantic facts.
3. Wire the background extractor with a cheap model and aggressive dedup.
4. Add an eval (recall@k against a golden multi-turn dataset) before you tune anything.
5. Only then add episodic events, supersession, and compaction.

Ship in that order and you will have a memory system that gets better over time instead of one that quietly accumulates contradictions until someone notices.

Agent Memory in LangGraph 2026: Short-Term, Long-Term, and the Patterns That Survive Production
