
Conversation History Management: Sliding Windows, Summarization, and Compaction

Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information.

Why Conversation History Management Matters

Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, and many open-source models cap at 8K or 32K. When your AI agent runs a multi-turn conversation or a long-running task, the raw message history can easily exceed these limits. Without a strategy, you either truncate blindly and lose critical context, or you hit token errors and the agent crashes.

Conversation history management is the discipline of deciding what stays in the context window, what gets compressed, and what gets discarded. There are three primary strategies: sliding windows, summarization, and compaction. Each has distinct tradeoffs between simplicity, fidelity, and compute cost.

Strategy 1: Sliding Window

The simplest approach keeps only the most recent N messages (or N tokens) and drops everything older. This works well for conversational agents where recent context matters most.

from typing import List, Dict

def sliding_window(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    token_counter=None
) -> List[Dict[str, str]]:
    """Keep the system message and the most recent messages that fit."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4  # rough estimate

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(token_counter(m) for m in system_msgs)
    budget = max_tokens - system_tokens
    kept = []
    running = 0

    for msg in reversed(non_system):
        cost = token_counter(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost

    return system_msgs + list(reversed(kept))

The sliding window is fast and predictable, but it has a major flaw: once a message scrolls out of the window, the agent forgets it entirely. If a user stated their name or a key requirement 50 messages ago, that information vanishes.
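To make that failure mode concrete, here is a compact, self-contained version of the function above run against a synthetic 100-turn history. The message sizes, the user's name, and the `rough_tokens` heuristic are all illustrative:

```python
from typing import Dict, List

def rough_tokens(msg: Dict[str, str]) -> int:
    # Same ~4-characters-per-token heuristic as the default above.
    return len(msg["content"]) // 4

def sliding_window(messages: List[Dict[str, str]],
                   max_tokens: int = 4000) -> List[Dict[str, str]]:
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(rough_tokens(m) for m in system_msgs)
    kept, running = [], 0
    for msg in reversed(non_system):
        cost = rough_tokens(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost
    return system_msgs + list(reversed(kept))

# A 100-turn synthetic history; the user's name appears only once, early on.
history = [{"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Hi, my name is Priya."}]
for i in range(99):
    history.append({"role": "user", "content": f"Question {i}: " + "x" * 200})
    history.append({"role": "assistant", "content": f"Answer {i}: " + "y" * 200})

window = sliding_window(history, max_tokens=2000)
print(any("Priya" in m["content"] for m in window))  # → False: the name scrolled out
```

The system message survives, but the introduction does not: only the last ~37 messages fit in the 2,000-token budget, so the agent can no longer answer "what is my name?".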

Strategy 2: Summarization

Summarization compresses older history into a shorter summary that preserves key facts while reducing token count. You periodically call the LLM to summarize the oldest portion of the conversation, then replace those messages with the summary.


from typing import Dict, List

import openai

async def summarize_history(
    messages: List[Dict[str, str]],
    threshold: int = 3000,
    keep_recent: int = 10,
    token_counter=None
) -> List[Dict[str, str]]:
    """Summarize old messages when total tokens exceed threshold."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4

    total = sum(token_counter(m) for m in messages)
    if total <= threshold:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    old_messages = non_system[:-keep_recent]
    recent_messages = non_system[-keep_recent:]
    if not old_messages:  # nothing old enough to summarize yet
        return messages

    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )

    client = openai.AsyncOpenAI()
    summary_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation history. Preserve all key facts, "
                "decisions, user preferences, and action items:\n\n"
                f"{old_text}"
            ),
        }],
        max_tokens=500,
    )

    summary = summary_response.choices[0].message.content
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary}",
    }

    return system_msgs + [summary_msg] + recent_messages

Summarization preserves long-range context at the cost of an extra LLM call and potential information loss during compression.
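The mechanics are easy to verify offline. The sketch below mirrors the control flow of `summarize_history` but accepts the summarizer as a callable, so a stub can stand in for the real LLM call; the stub and message sizes are illustrative:

```python
from typing import Callable, Dict, List

def rough_tokens(msg: Dict[str, str]) -> int:
    return len(msg["content"]) // 4

def summarize_with(messages: List[Dict[str, str]],
                   summarizer: Callable[[str], str],
                   threshold: int = 3000,
                   keep_recent: int = 10) -> List[Dict[str, str]]:
    # Same control flow as summarize_history above, with the LLM call
    # injected as `summarizer` so the logic can run without a network.
    if sum(rough_tokens(m) for m in messages) <= threshold:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    old, recent = non_system[:-keep_recent], non_system[-keep_recent:]
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summarizer(old_text)}",
    }
    return system_msgs + [summary_msg] + recent

# 50 messages of ~100 tokens each: 5000 tokens total, over the 3000 threshold.
history = [{"role": "user", "content": "m" * 400} for _ in range(50)]
stub = lambda text: f"({len(text.splitlines())} old messages condensed)"
compressed = summarize_with(history, stub)
print(len(compressed))  # → 11: one summary message plus the 10 most recent
```

Injecting the summarizer also makes the compression logic unit-testable, which is hard to do when the API call is hardwired into the function.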

Strategy 3: Compaction (Hybrid)

Compaction combines both approaches. It maintains a rolling summary that gets updated incrementally as messages age out of the sliding window. Each time the window shifts, new messages are merged into the existing summary rather than re-summarizing the entire history.

from typing import Dict, List

import openai

class CompactionManager:
    def __init__(self, window_size: int = 20, summary: str = ""):
        self.window_size = window_size
        self.summary = summary
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)

        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Context from earlier: {self.summary}",
            })
        context.extend(self.messages)
        return context

    async def _update_summary(self, new_messages: List[Dict[str, str]]) -> None:
        new_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in new_messages
        )
        client = openai.AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Existing summary: {self.summary}\n\n"
                    f"New messages to incorporate:\n{new_text}\n\n"
                    "Produce an updated summary preserving all key facts."
                ),
            }],
            max_tokens=400,
        )
        self.summary = resp.choices[0].message.content
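To observe the window shift and summary update without making API calls, the stub below mirrors `CompactionManager` but replaces `_update_summary` with an offline placeholder; the stub's summary format is illustrative, and the real class calls the LLM at that point:

```python
import asyncio
from typing import Dict, List

class StubCompactionManager:
    # Same structure as CompactionManager above; _update_summary is a
    # placeholder so the windowing behavior can be observed offline.
    def __init__(self, window_size: int = 20):
        self.window_size = window_size
        self.summary = ""
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({"role": "system",
                            "content": f"Context from earlier: {self.summary}"})
        context.extend(self.messages)
        return context

    async def _update_summary(self, overflow: List[Dict[str, str]]) -> None:
        # Placeholder: the real implementation sends `overflow` to the LLM.
        self.summary = (self.summary + f" +{len(overflow)} msgs folded in").strip()

async def demo():
    mgr = StubCompactionManager(window_size=4)
    for i in range(10):
        mgr.add_message("user", f"turn {i}")
    return mgr, await mgr.get_context("You are a helpful assistant.")

mgr, ctx = asyncio.run(demo())
print(len(mgr.messages), len(ctx))  # → 4 6: window of 4, plus system + summary
```

Ten messages against a window of four: six age out into the summary, and the returned context is the system prompt, the summary message, and the four most recent turns.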

Choosing the Right Strategy

| Strategy | Complexity | Long-Range Memory | Extra LLM Calls | Best For |
| --- | --- | --- | --- | --- |
| Sliding Window | Low | None | Zero | Short conversations, chatbots |
| Summarization | Medium | Good | Periodic | Customer support, assistants |
| Compaction | High | Best | Incremental | Long-running agents, research tasks |

For most production agents, compaction provides the best balance. It keeps recent messages verbatim for accuracy while maintaining a compressed record of everything that came before.

FAQ

How do I count tokens accurately instead of estimating?

Use the tiktoken library for OpenAI models. Call tiktoken.encoding_for_model("gpt-4o") to get an encoder, then len(encoder.encode(text)) for exact token counts. For Claude, Anthropic provides a token counting API endpoint.
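A defensive wrapper might look like the sketch below; the `ImportError` fallback to the ~4-characters-per-token heuristic is an illustrative convenience for environments without `tiktoken` installed:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count via tiktoken when available; otherwise fall back to
    the rough ~4-characters-per-token estimate used earlier."""
    try:
        import tiktoken
        try:
            encoder = tiktoken.encoding_for_model(model)
        except KeyError:
            # Unrecognized model name: use the gpt-4o family encoding.
            encoder = tiktoken.get_encoding("o200k_base")
        return len(encoder.encode(text))
    except ImportError:
        return len(text) // 4

print(count_tokens("Hello, world!"))
```

Exact counts matter most near the budget boundary: the character heuristic can be off by 20% or more on code-heavy or non-English text, which is exactly when a hard token limit bites.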

Should the system message ever be summarized?

No. The system message defines the agent's behavior and should always remain in full. Only user and assistant messages should be candidates for summarization or eviction. Treat the system prompt as immutable context.

Can I combine sliding windows with an external memory store?

Yes, and this is a common production pattern. Use a sliding window for the immediate context, but persist all messages to a database or vector store. When the agent needs old information, it queries the external store and injects relevant results into the current context.
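A minimal sketch of that pattern, using naive keyword overlap in place of a real retriever; the `MessageStore` class and its scoring are hypothetical stand-ins for a database or embedding-similarity search:

```python
import re
from typing import Dict, List

def _words(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))

class MessageStore:
    """Hypothetical external store. A production system would persist to a
    database or vector store and rank by embedding similarity; keyword
    overlap is used here only to keep the sketch self-contained."""
    def __init__(self):
        self._all: List[Dict[str, str]] = []

    def persist(self, msg: Dict[str, str]) -> None:
        self._all.append(msg)

    def search(self, query: str, k: int = 3) -> List[Dict[str, str]]:
        q = _words(query)
        scored = [(len(q & _words(m["content"])), m) for m in self._all]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]

store = MessageStore()
store.persist({"role": "user", "content": "My name is Priya and I prefer email."})
for i in range(100):
    store.persist({"role": "assistant", "content": f"routine reply {i}"})

# Every message is persisted; only relevant ones are pulled back into context.
recalled = store.search("what is the user's name?")
print(recalled[0]["content"])  # → the message containing the user's name
```

The sliding window keeps the immediate exchange verbatim while the store gives the agent recall beyond the window, at the cost of a retrieval step before each LLM call.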


#ConversationHistory #ContextWindow #TokenManagement #LLMMemory #AgenticAI #LearnAI #AIEngineering

Written by CallSphere Team