
Hierarchical Memory for AI Agents: Working Memory, Short-Term, and Long-Term Tiers

Learn how to design a three-tier memory architecture for AI agents with working memory, short-term buffers, and long-term stores, including promotion rules, eviction policies, and retrieval priority.

Why a Single Memory Store Falls Short

Most agent frameworks treat memory as a flat list. Every fact, observation, and user message lives in one undifferentiated pool. This works for toy demos, but in production the agent slows down as the memory grows, retrieval quality degrades, and context windows overflow with irrelevant details.

Human cognition solves this with hierarchical memory. Working memory holds the immediate task context. Short-term memory retains recent interactions. Long-term memory stores consolidated knowledge built up over days and weeks. An AI agent benefits from the same layered approach.

The Three-Tier Model

The hierarchy consists of three tiers, each with distinct capacity, retention, and retrieval characteristics.


Working Memory holds the current task context. It is small, fast, and completely replaced when the agent switches tasks. Think of it as the agent's scratchpad.

Short-Term Memory retains recent conversation turns and observations. It has a fixed window size and uses a FIFO eviction policy. Items that prove important get promoted to long-term storage.

Long-Term Memory stores consolidated facts, user preferences, and learned patterns. It persists across sessions and uses semantic search for retrieval.
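The fixed-window FIFO behavior of the short-term tier falls out of Python's `collections.deque` with a `maxlen`: once the buffer is full, appending a new item silently drops the oldest one. A minimal illustration:

```python
from collections import deque

# A short-term buffer that holds at most 3 items.
buffer = deque(maxlen=3)

for turn in ["turn-1", "turn-2", "turn-3", "turn-4"]:
    buffer.append(turn)

# "turn-1" was evicted automatically when "turn-4" arrived.
print(list(buffer))  # → ['turn-2', 'turn-3', 'turn-4']
```

Note that `deque` drops evicted items silently, which is why a sweep for promotion candidates (like `sweep_short_term` below) needs to run periodically, before important items age out of the window.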

from dataclasses import dataclass, field
from datetime import datetime
from collections import deque


@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0
    metadata: dict = field(default_factory=dict)


class HierarchicalMemory:
    def __init__(
        self,
        working_capacity: int = 5,
        short_term_capacity: int = 50,
    ):
        self.working: list[MemoryItem] = []
        self.short_term: deque[MemoryItem] = deque(
            maxlen=short_term_capacity
        )
        self.long_term: list[MemoryItem] = []
        self.working_capacity = working_capacity
        self.promotion_threshold = 0.7

    def add_to_working(self, content: str, importance: float = 0.5):
        item = MemoryItem(
            content=content,
            timestamp=datetime.now(),
            importance=importance,
        )
        self.working.append(item)
        if len(self.working) > self.working_capacity:
            evicted = self.working.pop(0)
            self.short_term.append(evicted)

    def promote_to_long_term(self, item: MemoryItem):
        """Promote important short-term memories."""
        if item.importance >= self.promotion_threshold:
            self.long_term.append(item)
            return True
        return False

    def sweep_short_term(self):
        """Review short-term memories for promotion."""
        promoted = []
        remaining = deque(maxlen=self.short_term.maxlen)
        for item in self.short_term:
            if self.promote_to_long_term(item):
                promoted.append(item)
            else:
                remaining.append(item)
        self.short_term = remaining
        return promoted

Promotion Rules

Promotion from short-term to long-term should not be arbitrary. Three signals determine whether a memory deserves long-term storage.

Importance score — memories tagged with high importance during creation (user preferences, explicit instructions) are promoted immediately.

Access frequency — if the agent retrieves a short-term memory multiple times, it is clearly useful and should be promoted.

Recency-weighted relevance — memories that remain relevant after multiple conversation turns have proven their staying power.

def should_promote(self, item: MemoryItem) -> bool:
    importance_signal = item.importance >= self.promotion_threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long
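To see each signal fire in isolation, here is a standalone version of the same logic (a free function rather than a method, with `MemoryItem` trimmed to the fields the signals use):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0

def should_promote(item: MemoryItem, threshold: float = 0.7) -> bool:
    importance_signal = item.importance >= threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long

now = datetime.now()
# Explicit user preference: promoted on importance alone.
pref = MemoryItem("prefers email contact", now, importance=0.9)
# Retrieved repeatedly: promoted on access frequency.
hot = MemoryItem("order #1234 is delayed", now, access_count=4)
# Old but still occasionally used: promoted on staying power.
old = MemoryItem("project codename", now - timedelta(minutes=10), access_count=1)
# Low importance, never accessed, brand new: stays in short-term.
noise = MemoryItem("small talk about weather", now)

print([should_promote(m) for m in (pref, hot, old, noise)])
# → [True, True, True, False]
```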

Eviction Policies

Each tier needs a different eviction strategy. Working memory uses strict replacement — when a new task begins, the entire working memory is flushed. Short-term memory uses FIFO with a promotion check: before an item is evicted, the system evaluates whether it should be promoted. Long-term memory uses importance-decay eviction — items that have not been accessed in a long time and have low importance are candidates for removal.

def evict_long_term(self, max_items: int = 1000):
    if len(self.long_term) <= max_items:
        return
    self.long_term.sort(
        key=lambda m: m.importance * (m.access_count + 1),
        reverse=True,
    )
    self.long_term = self.long_term[:max_items]
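The sort key above uses importance and access count but not staleness. If you also want the "not accessed in a long time" part of the policy, one option (an extension, not part of the class above) is to fold an exponential time decay into the score. This assumes each item tracks a `last_access` timestamp, which `MemoryItem` does not yet have:

```python
from datetime import datetime, timedelta

def decayed_score(importance: float, access_count: int,
                  last_access: datetime,
                  half_life_hours: float = 72.0) -> float:
    """Eviction score that halves every `half_life_hours` since last access."""
    age_hours = (datetime.now() - last_access).total_seconds() / 3600
    decay = 0.5 ** (age_hours / half_life_hours)
    return importance * (access_count + 1) * decay

now = datetime.now()
fresh = decayed_score(0.8, 2, now)
stale = decayed_score(0.8, 2, now - timedelta(hours=144))  # two half-lives
print(stale < fresh)  # → True: same item scores ~4x lower after 6 idle days
```

The 72-hour half-life is an illustrative default; tune it to how quickly stale knowledge loses value in your domain.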

Retrieval Priority

When the agent needs to recall information, it searches the tiers in order: working memory first (exact match, no embedding needed), then short-term (recency-weighted), then long-term (semantic search). This mirrors the human pattern where recent, immediately relevant memories surface first.

def retrieve(self, query: str, top_k: int = 5) -> list[MemoryItem]:
    results = []
    # Tier 1: working memory — exact substring match
    for item in self.working:
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)
    if len(results) >= top_k:
        return results[:top_k]  # higher tiers take priority

    # Tier 2: short-term — recency bias
    for item in sorted(
        self.short_term,
        key=lambda m: m.timestamp,
        reverse=True,
    ):
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)
    if len(results) >= top_k:
        return results[:top_k]

    # Tier 3: long-term — would use embedding similarity
    # in production; simplified here for clarity
    for item in self.long_term:
        if query.lower() in item.content.lower():
            item.access_count += 1
            results.append(item)

    return results[:top_k]
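The tier-3 comment mentions embedding similarity. As a sketch of what that looks like, here is cosine similarity over a toy bag-of-words vector; the `bow_vector` function is a stand-in for a real embedding model, not a substitute for one:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    # Toy "embedding": raw token counts. A production system would
    # call an embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

long_term = [
    "user prefers morning meetings",
    "invoice 4821 was paid in March",
    "user likes meetings kept short",
]
query = bow_vector("morning meetings")
ranked = sorted(long_term,
                key=lambda doc: cosine(query, bow_vector(doc)),
                reverse=True)
print(ranked[0])  # → user prefers morning meetings
```

The same ranking loop works unchanged once `bow_vector` is replaced by real embeddings; only the vector representation and similarity kernel differ.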

FAQ

Why not just use a vector database for everything?

A vector database is excellent for long-term semantic retrieval, but it adds latency. Working memory and short-term memory benefit from in-process data structures that return results in microseconds. The hierarchical approach lets you use the right storage engine for each tier.

How do I decide the capacity for each tier?

Working memory should match the context needed for a single task — typically 3 to 10 items. Short-term memory should cover a full conversation session, usually 30 to 100 items. Long-term capacity depends on your storage budget, but start with 1,000 items and add eviction when you exceed it.

Can I persist all three tiers across agent restarts?

Working memory is ephemeral by design and should be rebuilt from the current task state. Short-term memory can be serialized to a session store like Redis with a TTL. Long-term memory should always be persisted to a database or vector store.
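For the short-term tier, serialization is straightforward because the items are plain dataclasses. A minimal sketch using JSON, with the timestamp stored as an ISO string so it survives the round-trip (the `session:{id}` key format and 1-hour TTL in the redis-py comment are illustrative choices):

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class MemoryItem:
    content: str
    timestamp: str  # ISO 8601 string, JSON-friendly
    importance: float = 0.5
    access_count: int = 0

def serialize(items: list[MemoryItem]) -> str:
    return json.dumps([asdict(i) for i in items])

def deserialize(payload: str) -> list[MemoryItem]:
    return [MemoryItem(**d) for d in json.loads(payload)]

items = [MemoryItem("user asked about refunds",
                    datetime.now().isoformat())]
payload = serialize(items)
restored = deserialize(payload)
print(restored[0].content)  # → user asked about refunds

# With redis-py, storing the session with a TTL would then be:
#   r = redis.Redis()
#   r.setex(f"session:{session_id}", 3600, payload)  # 1-hour expiry
```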


#AgentMemory #MemoryArchitecture #WorkingMemory #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

