
Hierarchical Memory for AI Agents: Working Memory, Short-Term, and Long-Term Tiers

Learn how to design a three-tier memory architecture for AI agents with working memory, short-term buffers, and long-term stores, including promotion rules, eviction policies, and retrieval priority.

Why a Single Memory Store Falls Short

Most agent frameworks treat memory as a flat list. Every fact, observation, and user message lives in one undifferentiated pool. This works for toy demos, but in production the agent slows down as the memory grows, retrieval quality degrades, and context windows overflow with irrelevant details.

Human cognition solves this with hierarchical memory. Working memory holds the immediate task context. Short-term memory retains recent interactions. Long-term memory stores consolidated knowledge built up over days and weeks. An AI agent benefits from the same layered approach.

The Three-Tier Model

The hierarchy consists of three tiers, each with distinct capacity, retention, and retrieval characteristics.


Working Memory holds the current task context. It is small, fast, and completely replaced when the agent switches tasks. Think of it as the agent's scratchpad.

Short-Term Memory retains recent conversation turns and observations. It has a fixed window size and uses a FIFO eviction policy. Items that prove important get promoted to long-term storage.

Long-Term Memory stores consolidated facts, user preferences, and learned patterns. It persists across sessions and uses semantic search for retrieval.
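The fixed-window FIFO behavior of the short-term tier falls out of Python's `collections.deque` with a `maxlen`: once the buffer is full, appending a new item silently drops the oldest one. A minimal illustration:

```python
from collections import deque

# A short-term buffer that holds at most 3 items.
buffer = deque(maxlen=3)

for turn in ["turn-1", "turn-2", "turn-3", "turn-4"]:
    buffer.append(turn)

# "turn-1" was evicted automatically when "turn-4" arrived.
print(list(buffer))  # → ['turn-2', 'turn-3', 'turn-4']
```

Note that `deque` drops evicted items silently, which is why a sweep for promotion candidates (like `sweep_short_term` below) needs to run periodically, before important items age out of the window.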

from dataclasses import dataclass, field
from datetime import datetime
from collections import deque


@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0
    metadata: dict = field(default_factory=dict)


class HierarchicalMemory:
    def __init__(
        self,
        working_capacity: int = 5,
        short_term_capacity: int = 50,
    ):
        self.working: list[MemoryItem] = []
        self.short_term: deque[MemoryItem] = deque(
            maxlen=short_term_capacity
        )
        self.long_term: list[MemoryItem] = []
        self.working_capacity = working_capacity
        self.promotion_threshold = 0.7

    def add_to_working(self, content: str, importance: float = 0.5):
        item = MemoryItem(
            content=content,
            timestamp=datetime.now(),
            importance=importance,
        )
        self.working.append(item)
        if len(self.working) > self.working_capacity:
            evicted = self.working.pop(0)
            self.short_term.append(evicted)

    def promote_to_long_term(self, item: MemoryItem):
        """Promote important short-term memories."""
        if item.importance >= self.promotion_threshold:
            self.long_term.append(item)
            return True
        return False

    def sweep_short_term(self):
        """Review short-term memories for promotion."""
        promoted = []
        remaining = deque(maxlen=self.short_term.maxlen)
        for item in self.short_term:
            if self.promote_to_long_term(item):
                promoted.append(item)
            else:
                remaining.append(item)
        self.short_term = remaining
        return promoted

Promotion Rules

Promotion from short-term to long-term should not be arbitrary. Three signals determine whether a memory deserves long-term storage.

Importance score — memories tagged with high importance during creation (user preferences, explicit instructions) are promoted immediately.

Access frequency — if the agent retrieves a short-term memory multiple times, it is clearly useful and should be promoted.

Recency-weighted relevance — memories that remain relevant after multiple conversation turns have proven their staying power.

def should_promote(self, item: MemoryItem) -> bool:
    importance_signal = item.importance >= self.promotion_threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long
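To see each signal fire in isolation, here is a standalone version of the same logic (a free function rather than a method, with `MemoryItem` trimmed to the fields the signals use):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0

def should_promote(item: MemoryItem, threshold: float = 0.7) -> bool:
    importance_signal = item.importance >= threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long

now = datetime.now()
# Explicit user preference: promoted on importance alone.
pref = MemoryItem("prefers email contact", now, importance=0.9)
# Retrieved repeatedly: promoted on access frequency.
hot = MemoryItem("order #1234 is delayed", now, access_count=4)
# Old but still occasionally used: promoted on staying power.
old = MemoryItem("project codename", now - timedelta(minutes=10), access_count=1)
# Low importance, never accessed, brand new: stays in short-term.
noise = MemoryItem("small talk about weather", now)

print([should_promote(m) for m in (pref, hot, old, noise)])
# → [True, True, True, False]
```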

Eviction Policies

Each tier needs a different eviction strategy. Working memory uses strict replacement — when a new task begins, the entire working memory is flushed. Short-term memory uses FIFO with a promotion check: before an item is evicted, the system evaluates whether it should be promoted. Long-term memory uses importance-decay eviction — items that have not been accessed in a long time and have low importance are candidates for removal.

def evict_long_term(self, max_items: int = 1000):
    if len(self.long_term) <= max_items:
        return
    self.long_term.sort(
        key=lambda m: m.importance * (m.access_count + 1),
        reverse=True,
    )
    self.long_term = self.long_term[:max_items]
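The sort key above uses importance and access count but not staleness. If you also want the "not accessed in a long time" part of the policy, one option (an extension, not part of the class above) is to fold an exponential time decay into the score. This assumes each item tracks a `last_access` timestamp, which `MemoryItem` does not yet have:

```python
from datetime import datetime, timedelta

def decayed_score(importance: float, access_count: int,
                  last_access: datetime,
                  half_life_hours: float = 72.0) -> float:
    """Eviction score that halves every `half_life_hours` since last access."""
    age_hours = (datetime.now() - last_access).total_seconds() / 3600
    decay = 0.5 ** (age_hours / half_life_hours)
    return importance * (access_count + 1) * decay

now = datetime.now()
fresh = decayed_score(0.8, 2, now)
stale = decayed_score(0.8, 2, now - timedelta(hours=144))  # two half-lives
print(stale < fresh)  # → True: same item scores ~4x lower after 6 idle days
```

The 72-hour half-life is an illustrative default; tune it to how quickly stale knowledge loses value in your domain.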

Retrieval Priority

When the agent needs to recall information, it searches the tiers in order: working memory first (exact match, no embedding needed), then short-term (recency-weighted), then long-term (semantic search). This mirrors the human pattern where recent, immediately relevant memories surface first.

def retrieve(self, query: str, top_k: int = 5) -> list[MemoryItem]:
    results = []
    # Tier 1: working memory — exact substring match
    for item in self.working:
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)
    if len(results) >= top_k:
        return results[:top_k]  # higher tiers take priority

    # Tier 2: short-term — recency bias
    for item in sorted(
        self.short_term,
        key=lambda m: m.timestamp,
        reverse=True,
    ):
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)
    if len(results) >= top_k:
        return results[:top_k]

    # Tier 3: long-term — would use embedding similarity
    # in production; simplified here for clarity
    for item in self.long_term:
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)

    return results[:top_k]
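The tier-3 comment mentions embedding similarity. As a sketch of what that looks like, here is cosine similarity over a toy bag-of-words vector; the `bow_vector` function is a stand-in for a real embedding model, not a substitute for one:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Toy "embedding": raw token counts. A production system would
    # call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

long_term = [
    "user prefers morning meetings",
    "invoice 4821 was paid in March",
    "user likes meetings kept short",
]
query = bow_vector("morning meetings")
ranked = sorted(long_term,
                key=lambda doc: cosine(query, bow_vector(doc)),
                reverse=True)
print(ranked[0])  # → user prefers morning meetings
```

The same ranking loop works unchanged once `bow_vector` is replaced by real embeddings; only the vector representation and similarity kernel differ.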

FAQ

Why not just use a vector database for everything?

A vector database is excellent for long-term semantic retrieval, but it adds latency. Working memory and short-term memory benefit from in-process data structures that return results in microseconds. The hierarchical approach lets you use the right storage engine for each tier.

How do I decide the capacity for each tier?

Working memory should match the context needed for a single task — typically 3 to 10 items. Short-term memory should cover a full conversation session, usually 30 to 100 items. Long-term capacity depends on your storage budget, but start with 1,000 items and add eviction when you exceed it.

Can I persist all three tiers across agent restarts?

Working memory is ephemeral by design and should be rebuilt from the current task state. Short-term memory can be serialized to a session store like Redis with a TTL. Long-term memory should always be persisted to a database or vector store.
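For the short-term tier, serialization is straightforward because the items are plain dataclasses. A minimal sketch using JSON, with the timestamp stored as an ISO string so it survives the round-trip (the `session:{id}` key format and 1-hour TTL in the redis-py comment are illustrative choices):

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class MemoryItem:
    content: str
    timestamp: str  # ISO 8601 string, JSON-friendly
    importance: float = 0.5
    access_count: int = 0

def serialize(items: list[MemoryItem]) -> str:
    return json.dumps([asdict(i) for i in items])

def deserialize(payload: str) -> list[MemoryItem]:
    return [MemoryItem(**d) for d in json.loads(payload)]

items = [MemoryItem("user asked about refunds",
                    datetime.now().isoformat())]
payload = serialize(items)
restored = deserialize(payload)
print(restored[0].content)  # → user asked about refunds

# With redis-py, storing the session with a TTL would then be:
#   r = redis.Redis()
#   r.setex(f"session:{session_id}", 3600, payload)  # 1-hour expiry
```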


#AgentMemory #MemoryArchitecture #WorkingMemory #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

