Letta (formerly MemGPT) in 2026: The OS for Stateful Agent Memory
Letta treats the LLM like an OS that manages its own RAM, recall, and archival memory. Here is when this paradigm beats simple vector stores.
TL;DR — Letta (formerly MemGPT) is an "LLM-as-OS" runtime where the agent manages its own memory tiers like an operating system manages RAM and disk. If your agent needs to learn across sessions and edit its own context, Letta is the most mature option in 2026.
The mental model
flowchart TD
Agent[LLM agent] --> Core[Core memory · in-context block]
Agent --> Recall[Recall memory · searchable conversation history]
Agent --> Archival[Archival memory · vector-indexed cold storage]
Core -. evict / summarize .-> Recall
Recall -. distill .-> Archival
Traditional LLM apps treat memory as something the application layer fetches and stuffs into the prompt. Letta inverts that: the agent decides what to keep in context, what to push to recall, and what to archive. The model has tools to read and write its own memory tiers.
Three tiers:
- Core Memory — a small block that lives in the context window, like RAM. The agent reads and writes it directly each turn. Holds the agent's persona and the most important facts about the user.
- Recall Memory — searchable conversation history outside context, like a disk cache. The agent queries it via tool calls when needed.
- Archival Memory — long-term storage the agent queries via tool calls. Cold storage. Vector-indexed.
When the context is about to overflow, the agent receives a system message ("you are running out of context") and must decide what to evict to recall, what to summarize into core, and what to archive. This is the OS analogy made literal.
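The eviction flow above can be sketched as a toy simulation. This is our own illustration of the mechanics, not Letta's actual runtime code: when the context budget is exceeded, the oldest messages are evicted to recall.

```python
# Toy simulation of tiered-memory eviction (illustrative only, not Letta's
# implementation): context is the "RAM", recall and archival are the disk tiers.
from dataclasses import dataclass, field

@dataclass
class TieredMemory:
    budget: int                                    # max "tokens" allowed in context
    core: str = ""                                 # in-context block, like RAM
    context: list = field(default_factory=list)    # live conversation
    recall: list = field(default_factory=list)     # searchable history
    archival: list = field(default_factory=list)   # cold storage

    def tokens(self):
        # Crude token proxy: word count across the live context.
        return sum(len(m.split()) for m in self.context)

    def add(self, message):
        self.context.append(message)
        # Over budget? Evict oldest messages to recall until we fit again.
        while self.tokens() > self.budget and len(self.context) > 1:
            self.recall.append(self.context.pop(0))

mem = TieredMemory(budget=10)
for msg in ["hello there", "my mom has a breathing issue",
            "she is 82", "call Dr. Lee after 9pm only"]:
    mem.add(msg)
print(len(mem.recall), mem.context)
```

In real Letta the eviction step also involves the model deciding what to summarize into core rather than a blind FIFO, but the tier movement is the same shape.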
What changed in 2026
The MemGPT open-source project was absorbed into Letta. The platform now ships:
- Letta Code — a memory-first coding agent that ranks #1 on the Terminal-Bench leaderboard for model-agnostic OSS coding agents.
- Conversations API — agents share memory across parallel user experiences.
- A rearchitected agent loop that draws lessons from ReAct, MemGPT, and Claude Code, with cleaner tool dispatch and better long-running task handling.
When to pick Letta
Pick Letta when:
- Your agent must remember things across sessions without an external app layer fetching memory.
- You want the agent to edit its own persona and facts as it learns about the user.
- You need a first-class agent runtime, not just a memory bolt-on.
- You're building an assistant that runs for days, weeks, or indefinitely.
Skip Letta when:
- Your workflow is stateless (one-shot tool calls).
- You only need a vector store with metadata — that's simpler and cheaper.
- You're already deeply invested in another agent framework and just need a memory plugin (use mem0 or Zep instead).
How CallSphere thinks about this
CallSphere's voice agents are mostly session-bounded — a single inbound or outbound call is the unit of work. We don't need Letta for that.
But our after-hours product (7 agents with explicit escalation) is exactly Letta-shaped. When a customer's caretaker calls at 11 PM about a recurring issue, the agent benefits from remembering the prior week's escalations, the family member's preferences, the on-call doctor's instructions. That state lives in our Postgres today; we've prototyped a Letta-backed version that lets the agent edit its own "what I know about this household" core memory after each call.
For our Real Estate OneRoof deployment (10 specialist agents), the buyer-journey use case is similar — a buyer searches for 6 months, talks to the agent dozens of times, and the agent should learn their preferences over that span. That's a Letta-shaped problem.
Build steps — your first Letta agent
- Install the client (pip install letta) or run the Letta server: docker run -p 8283:8283 letta/letta.
- Create an agent with a persona and human profile in core memory.
- Add tools for whatever the agent does (DB queries, web search, internal APIs).
- Connect via the Letta SDK from your application.
- Send messages; the agent's core memory updates automatically as it learns.
- Inspect memory via the dashboard or agent.memory.get().
- Persist with the Postgres-backed deployment for production.
Code: a Letta agent that learns about the user
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

agent = client.agents.create(
    name="callsphere-after-hours",
    memory_blocks=[  # seed core memory; the agent edits these blocks as it learns
        {"label": "persona", "value": "I am a calm, careful after-hours support agent."},
        {"label": "human", "value": "Unknown caller. I will learn as we talk."},
    ],
    tools=["lookup_account", "page_on_call_doctor"],  # tools registered with the server
    model="openai/gpt-5",
    embedding="openai/text-embedding-3-large",  # embeds recall and archival memory for search
)

response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "It's about my mom again, the breathing thing"}],
)

# Internally the agent updated its 'human' core memory block to record the
# caller's relationship and the recurring concern. The next call benefits.
Memory tier sizing — what to put where
Sizing the three tiers correctly is the difference between a useful Letta agent and a confused one:
- Core memory should be small and curated. A few hundred tokens at most. Persona, the most important user facts, current task. Anything else competes with the user's input for space.
- Recall is your conversation log. Store everything; the agent searches when needed. Don't manually prune unless you have a privacy reason.
- Archival is for learned knowledge — distilled facts and summaries the agent generated. The agent decides what to write here through tool calls.
Letta's automatic eviction logic moves content between tiers as context fills. The default behavior is reasonable; we override it only when we want a specific fact (like "this household speaks Spanish primarily") to stay in core forever.
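The sizing guidance above can be condensed into a routing rule. This is our own heuristic for deciding where information belongs, not anything Letta ships:

```python
# Heuristic tier routing (our own rule of thumb, not Letta's): small curated
# user facts go to core while it has room, raw messages go to recall,
# everything else (distilled knowledge, overflow) goes to archival.
def route(kind, tokens, core_used, core_budget=300):
    if kind == "user_fact" and core_used + tokens <= core_budget:
        return "core"        # small, curated, always in context
    if kind == "message":
        return "recall"      # full conversation log, searched on demand
    return "archival"        # distilled knowledge, vector-indexed

assert route("user_fact", 20, 100) == "core"
assert route("message", 500, 100) == "recall"
assert route("user_fact", 200, 250) == "archival"  # core is full, so archive
```

The 300-token core budget is an assumption for illustration; tune it to your model's context size and how chatty your persona block is.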
The "agent eats its own context" failure mode
A common Letta failure is the agent over-writing core memory until it loses sight of its persona. Mitigations:
- Mark persona blocks as read-only so the agent can't accidentally overwrite them.
- Cap the size of writeable core blocks — Letta lets you set per-block max tokens.
- Run a periodic "memory health" eval — feed the agent a question it should know based on prior sessions; if it doesn't, your memory pipeline is broken.
We hit this in early prototyping; the read-only persona block fix made the agent's identity stable across hundreds of turns.
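The "memory health" eval from the list above can be sketched as a tiny harness. Here `ask` is a stand-in for a real call into your agent, and the probe questions are hypothetical:

```python
# Memory-health probe: ask the agent questions it should be able to answer
# from prior sessions; any probe whose expected fact is missing from the
# answer signals a broken memory pipeline.
def memory_health(ask, probes):
    return [q for q, fact in probes if fact.lower() not in ask(q).lower()]

# Stub agent that "remembers" one fact and has forgotten another.
known = {"what language does the household speak?": "They speak Spanish."}
ask = lambda q: known.get(q, "I'm not sure.")

probes = [
    ("what language does the household speak?", "Spanish"),
    ("who is the on-call doctor?", "Dr. Lee"),
]
print(memory_health(ask, probes))  # probes the agent failed
```

Run this on a schedule against a staging copy of the agent; a non-empty failure list is a pager-worthy signal that eviction or archival writes are dropping facts.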
Conversations API — shared memory across users
The Conversations API lets multiple users interact with the same agent and share memory. For a household agent (mom, dad, kids all calling about the same elderly relative's care), this is exactly the model. The agent knows it's a household, not three individual relationships, and reasons accordingly.
We're prototyping this for our after-hours product where caretakers and family members both interact with the same support agent.
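The household model reduces to one shared memory that every caller's session reads and writes. A toy illustration of that shape (a simulation, not the Conversations API itself):

```python
# Toy shared-household memory (not the Conversations API): three family
# members write into one shared block, so each caller's session sees what
# the others reported.
shared = {"household": []}

def record(caller, fact):
    shared["household"].append(f"{caller}: {fact}")

record("mom", "oxygen tank refilled Tuesday")
record("daughter", "prefers evening calls")

# A new caller's agent is given the whole household history as context:
context = "\n".join(shared["household"])
```

The point is the keying: memory is attached to the household, not to each caller, which is exactly what the Conversations API provides at the agent level.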
FAQ
Letta vs mem0 vs Zep? Letta is a runtime (the agent lives in Letta). mem0 is a memory library you wire into your existing agent. Zep is a managed memory platform with temporal knowledge graphs. Pick by where you want the agent to live.
Is Letta production-ready? Yes. The Letta Code agent ranks #1 on Terminal-Bench among OSS model-agnostic coding agents — that's a strong production signal.
Does it work with MCP? Yes — Letta agents can mount MCP servers as toolsets.
What's the licensing? Apache 2.0 for the OSS server. Letta also offers a managed cloud.
Where do I see this on CallSphere? Book a demo and we'll walk through our after-hours Letta prototype.