By Sagar Shankaran, Founder of CallSphere
How short-term (thread-scoped) and long-term (cross-thread) memory actually work in LangGraph, with code, schemas, and the eviction policies that keep cost predictable.
Key takeaways
Most "memory" code I see in agent repos is one of two things: a list of past messages crammed back into the prompt every turn (which is not memory, that is a sliding window), or a vector store with no schema, no namespace, and no eviction (which is not memory either, that is a swamp). LangGraph in 2026 separates these cleanly: a checkpointer for short-term, thread-scoped state, and a store (`BaseStore`) for long-term, cross-thread memory with namespaces, semantic search, and explicit write semantics. This post is about the long-term half — how to design schemas with Pydantic, when to use `InMemoryStore` vs `PostgresStore`, how to wire embedding-based retrieval into `asearch`, and how to run the write step as a background extractor so it does not blow up your turn latency. With the right structure, a memory-augmented agent on `gpt-4o-2024-11-20` adds about 80–140 ms p50 to a turn and recovers facts at recall@5 of 0.92 on our internal benchmark; without structure, you get unbounded token growth and contradictions within a week.
Short-term memory is the conversation you are currently having. It dies (or should die) when the user closes the tab. LangGraph models it as thread-scoped state persisted by a checkpointer keyed on `thread_id`. We covered durable resumption, interrupts, and time-travel for that layer in our LangGraph checkpointer post — read it if you have not, because the rest of this post assumes you understand that checkpointers are not the right place to store "the user's spouse's name."
Long-term memory is everything that should outlive the thread: facts ("the user's preferred name is Sam"), episodes ("on April 12 the user complained about double-billing"), procedures ("when this user asks for an appointment, always check the Tuesday block first"). These belong in a separate substrate — LangGraph calls it a `store` — that is keyed by a namespace (typically `(user_id, memory_type)` or `(org_id, agent_id)`) rather than a thread.
The taxonomy we use in production, borrowed loosely from cognitive science:
| Memory type | What it stores | Example | Write trigger |
|---|---|---|---|
| Semantic | Stable facts about entities | "User Sam works at Acme, prefers Tuesdays" | Post-turn extractor finds a fact |
| Episodic | Time-stamped events | "On 2026-04-12, user reported billing bug INC-2841" | Significant event detected |
| Procedural | Learned skills / behaviors | "For this user, always confirm address before booking" | Reinforcement from feedback |
Conflating these three into one bucket is the single most common reason memory systems get worse over time. A user's current preference (semantic) and a year-old event (episodic) should not compete for the same retrieval slot.
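One way to keep the three types from competing is to give each its own namespace and its own retrieval budget. The sketch below is a minimal illustration using only the standard library; `MemoryType`, `memory_namespace`, and the budget numbers are hypothetical names and values chosen for this example, not part of LangGraph.

```python
from enum import Enum


class MemoryType(str, Enum):
    SEMANTIC = "semantic"
    EPISODIC = "episodic"
    PROCEDURAL = "procedural"


def memory_namespace(user_id: str, memory_type: MemoryType) -> tuple[str, str]:
    """Build the (user_id, memory_type) namespace tuple passed to store calls."""
    if not user_id:
        raise ValueError("user_id must be non-empty")
    return (user_id, memory_type.value)


# Per-type retrieval budgets, so an old episode never takes a slot that a
# current preference needs. The numbers are illustrative assumptions.
RETRIEVAL_BUDGET = {
    MemoryType.SEMANTIC: 5,
    MemoryType.EPISODIC: 3,
    MemoryType.PROCEDURAL: 2,
}

semantic_ns = memory_namespace("user-7421", MemoryType.SEMANTIC)
episodic_ns = memory_namespace("user-7421", MemoryType.EPISODIC)
print(semantic_ns, episodic_ns)
```

Each namespace is then searched separately with its own `limit`, and the results are merged only at prompt-building time.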
The store API is small on purpose. The four methods that matter:
```python
from langgraph.store.base import BaseStore
from langgraph.store.memory import InMemoryStore
from langgraph.store.postgres import PostgresStore

# Any BaseStore works here; InMemoryStore keeps the example local.
# The awaits below assume an async context (a notebook cell or an async function).
store: BaseStore = InMemoryStore()
namespace = ("user-7421", "semantic")

# 1. Write (upsert) a record under a key
await store.aput(
    namespace,
    key="preferred_name",
    value={"fact": "User prefers to be called Sam, not Samantha", "source": "turn-12"},
)

# 2. Read a record back by key
record = await store.aget(namespace, key="preferred_name")

# 3. Search within the namespace (semantic only if the store has an embedding index)
hits = await store.asearch(
    namespace,
    query="what does the user like to be called?",
    limit=5,
)

# 4. Delete a record
await store.adelete(namespace, key="preferred_name")
```
Semantic search is the part teams misuse most. `asearch` only ranks results by meaning if you configured the store with an embedding index at construction time. `InMemoryStore()` with no config matches on the namespace prefix, not on the meaning of the query, so it does no vector retrieval. You have to explicitly opt in.
Here is the loop every memory-aware agent runs per turn. Pin it on a wall.
```mermaid
flowchart TD
    A[User message arrives] --> B[Load short-term state from checkpointer]
    B --> C["asearch(namespace, query=user_msg) into long-term store"]
    C --> D[Inject top-K memories into system prompt]
    D --> E[Agent reasons + calls tools + replies]
    E --> F[Persist short-term state via checkpointer]
    F --> G[Background task: extract facts from turn]
    G --> H{New, non-duplicate, non-contradictory?}
    H -->|yes| I[aput into store with provenance]
    H -->|no| J[Skip or update existing record]
    style C fill:#e0f2fe
    style G fill:#fef3c7
    style I fill:#dcfce7
```
Figure 1 — The two-track memory cycle. Read happens on the critical path; write happens off the critical path. This split is what keeps p50 latency bounded.
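The read half of the loop (steps C and D in Figure 1) can be sketched without any framework. The block below is a minimal illustration, not LangGraph's API: `MemoryHit` and `build_system_prompt` are hypothetical names, and the hits are assumed to carry a relevance score, as results from an indexed `asearch` would.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryHit:
    """Hypothetical stand-in for one search result from the long-term store."""
    key: str
    text: str
    score: float


def build_system_prompt(base_prompt: str, hits: list[MemoryHit], top_k: int = 5) -> str:
    """Append the top_k highest-scoring memories to the system prompt."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_k]
    if not ranked:
        return base_prompt  # nothing retrieved: leave the prompt untouched
    lines = [f"- {h.text}" for h in ranked]
    return base_prompt + "\n\nKnown facts about this user:\n" + "\n".join(lines)


hits = [
    MemoryHit("preferred_name", "Prefers to be called Sam", 0.91),
    MemoryHit("schedule", "Prefers Tuesday appointments", 0.74),
    MemoryHit("employer", "Works at Acme", 0.52),
]
prompt = build_system_prompt("You are a scheduling assistant.", hits, top_k=2)
print(prompt)
```

Capping at `top_k` is what keeps the prompt size independent of how many memories the user has accumulated.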
The non-obvious property: the write step runs after the response is sent. If you do extraction synchronously, your user pays for it in latency every turn. We push it onto a background task with `asyncio.create_task` (or a proper queue if you have one), and treat memory writes as eventually consistent. A turn that fails to write a memory is a missed opportunity, not a bug.
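A minimal sketch of the fire-and-forget write, using only `asyncio`. The names `write_memory_stub`, `schedule_write`, and `handle_turn` are hypothetical, and the `written` list stands in for the store. The one detail worth copying is the task set: `asyncio` keeps only weak references to tasks, so a task nobody holds can be garbage-collected before it finishes.

```python
import asyncio
import logging

# Hold strong references so pending tasks are not garbage-collected mid-flight.
_background_tasks: set[asyncio.Task] = set()
written: list[str] = []  # stand-in for the long-term store


async def write_memory_stub(fact: str) -> None:
    """Hypothetical slow extractor plus store write."""
    await asyncio.sleep(0.05)
    written.append(fact)


def _on_done(task: asyncio.Task) -> None:
    _background_tasks.discard(task)
    if not task.cancelled() and task.exception() is not None:
        # A failed write is a missed opportunity: log it and move on.
        logging.warning("memory write failed: %r", task.exception())


def schedule_write(fact: str) -> None:
    """Schedule the write and return immediately."""
    task = asyncio.create_task(write_memory_stub(fact))
    _background_tasks.add(task)
    task.add_done_callback(_on_done)


async def handle_turn(user_msg: str) -> tuple[str, int]:
    reply = f"echo: {user_msg}"
    schedule_write(user_msg)      # off the critical path
    return reply, len(written)    # number of writes visible at response time


async def main() -> tuple[str, int, list[str]]:
    reply, visible_at_reply = await handle_turn("call me Sam")
    await asyncio.gather(*_background_tasks)  # drain pending writes before shutdown
    return reply, visible_at_reply, list(written)


reply, visible_at_reply, final_writes = asyncio.run(main())
```

The reply is produced before the write lands, which is the eventual consistency described above. Draining the task set at shutdown avoids silently dropping writes that were still in flight.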
Strings as memories are a trap. They look easy, then they become impossible to deduplicate, contradict, or update. We model every memory as a Pydantic class with explicit fields, then serialize to JSON in the store.
```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field


class SemanticFact(BaseModel):
    """A stable fact about the user or their world."""

    subject: str = Field(description="Entity the fact is about, e.g. 'user' or 'user.spouse'")
    predicate: str = Field(description="Relation, e.g. 'preferred_name', 'works_at'")
    object: str = Field(description="The value of the relation")
    confidence: float = Field(ge=0.0, le=1.0)
    source_run_id: str
    created_at: datetime
    superseded_by: str | None = None  # key of newer fact, if any


class EpisodicEvent(BaseModel):
    """Something that happened, with a timestamp."""

    event_type: Literal["complaint", "purchase", "appointment", "feedback"]
    summary: str = Field(max_length=400)
    occurred_at: datetime
    related_entities: list[str] = Field(default_factory=list)
    source_run_id: str
```
Two design choices that pay off:
For anything past a toy, use Postgres. `InMemoryStore` is for tests and notebooks. The Postgres store uses pgvector under the hood for `asearch`.
```python
from langchain_openai import OpenAIEmbeddings
from langgraph.store.postgres import AsyncPostgresStore

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # 1536 dims, $0.02/1M tokens


async def main() -> None:
    # from_conn_string is a context manager. The async variant is used here because
    # setup() is awaited; the sync PostgresStore follows the same shape with a plain
    # `with` block and no `await`.
    async with AsyncPostgresStore.from_conn_string(
        "postgresql://...",
        index={
            "dims": 1536,
            "embed": embeddings,
            "fields": ["summary", "object"],  # which Pydantic fields to embed
        },
    ) as store:
        await store.setup()  # creates tables + pgvector index
```
`fields` is the parameter people miss. By default the store would embed the entire JSON blob, which is wasteful and noisy. You almost always want to embed just the human-meaningful text fields, not metadata like timestamps or run IDs.
| Backend | Use case | Persistence | Vector search | Cost signal |
|---|---|---|---|---|
| `InMemoryStore` (no index) | Unit tests | Process-local | Prefix only | Free |
| `InMemoryStore` (w/ embeddings) | Notebook prototyping | Process-local | Yes | Embedding API only |
| `PostgresStore` | Production default | Durable | Yes (pgvector) | DB + embeddings |
| `RedisStore` (community) | Low-latency, ephemeral-ish | TTL-based | Via RediSearch | Memory-bound |
| Custom (`BaseStore` subclass) | Existing vector DB | Depends | Yes | Whatever you wire |
For most teams the answer is `PostgresStore` — you already have Postgres, pgvector is fine up to ~10M vectors per index, and operational simplicity beats marginal latency wins. Reach for a dedicated vector DB when you actually measure pgvector falling over, not before.
This is the function that turns "stuff that happened in the conversation" into structured memory. We run it after the response is streamed.
```python
from datetime import datetime, timezone

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

extractor_llm = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)
extractor = extractor_llm.with_structured_output(SemanticFact)

EXTRACT_PROMPT = ChatPromptTemplate.from_messages([
    ("system",
     "Extract at most ONE durable, non-trivial fact about the user from this turn. "
     "Skip greetings, fillers, and ephemeral context. Return null if nothing qualifies."),
    ("human", "User said: {user_msg}\nAgent replied: {agent_msg}"),
])


async def write_memory(turn, store, namespace, run_id):
    fact = await (EXTRACT_PROMPT | extractor).ainvoke({
        "user_msg": turn.user,
        "agent_msg": turn.agent,
    })
    if fact is None:
        return

    # Dedup by (subject, predicate)
    existing = await store.asearch(
        namespace,
        query=f"{fact.subject} {fact.predicate}",
        limit=3,
    )
    # _same_relation: helper assumed to compare subject and predicate
    if any(_same_relation(e.value, fact) for e in existing):
        return  # already known; could update confidence, but skip for now

    fact.source_run_id = run_id
    fact.created_at = datetime.now(timezone.utc)  # datetime.utcnow() is deprecated in 3.12+
    await store.aput(
        namespace,
        key=f"{fact.subject}:{fact.predicate}",
        value=fact.model_dump(mode="json"),
    )
```
A few production notes:
Memory grows. If you never delete, retrieval quality degrades, latency creeps, and your DB bill climbs. We use three eviction policies in combination:
The combination keeps per-user storage roughly bounded around 200–400 records, regardless of how chatty the user is.
On our voice agent benchmark — 100 users, 12-turn conversations, fact recall test at turn 10 — switching from a naive "stuff the last 20 messages into the prompt" baseline to the structured store improved recall@5 from 0.61 to 0.92, dropped p50 turn latency from 1,840 ms to 1,420 ms (because the prompt got shorter, not longer), and cut output token cost per turn by 38%. The trick is that good memory is not "remember more" — it is "retrieve only what matters and keep the prompt small."
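For readers unfamiliar with the metric, recall@5 asks: of the facts that should have been retrieved, what fraction showed up in the top five results? The sketch below uses one common definition; the exact scoring used in the benchmark above is not specified here, and the function names and sample keys are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant memory keys that appear in the top-k retrieved keys."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("relevant must be non-empty")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average recall@k over (retrieved, relevant) test cases."""
    return sum(recall_at_k(r, rel, k) for r, rel in cases) / len(cases)


cases = [
    # "spouse" is ranked sixth, so it falls outside the top five.
    (["preferred_name", "employer", "schedule", "pet", "city", "spouse"],
     {"preferred_name", "spouse"}),
    (["employer", "city"], {"employer"}),
]
per_case = [recall_at_k(r, rel) for r, rel in cases]
overall = mean_recall_at_k(cases)
print(per_case, overall)
```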
If you want the eval methodology that produced those numbers, jump to the memory eval pipeline post — memory you cannot measure is memory you cannot trust. And if you are building voice or chat agents in production, this two-track pattern (checkpointer + store) is the single biggest architectural bet that pays off in month three, not week one.
Do not try to ship semantic, episodic, and procedural memory in one PR. The order that works:
Ship in that order and you will have a memory system that gets better over time instead of one that quietly accumulates contradictions until someone notices.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.