---
title: "Conversation History Management: Sliding Windows, Summarization, and Compaction"
description: "Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information."
canonical: https://callsphere.ai/blog/conversation-history-management-sliding-windows-summarization-compaction
category: "Learn Agentic AI"
tags: ["Conversation History", "Context Window", "Token Management", "LLM Memory", "Agentic AI"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.677Z
---

# Conversation History Management: Sliding Windows, Summarization, and Compaction

> Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information.

## Why Conversation History Management Matters

Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, and many open-source models cap at 8K or 32K. When your AI agent runs a multi-turn conversation or a long-running task, the raw message history can easily exceed these limits. Without a strategy, you either truncate blindly and lose critical context, or you hit token errors and the agent crashes.

Conversation history management is the discipline of deciding what stays in the context window, what gets compressed, and what gets discarded. There are three primary strategies: sliding windows, summarization, and compaction. Each has distinct tradeoffs between simplicity, fidelity, and compute cost.

## Strategy 1: Sliding Window

The simplest approach keeps only the most recent N messages (or N tokens) and drops everything older. This works well for conversational agents where recent context matters most. The diagram below places this rolling window in a broader memory architecture: the window acts as working memory, while a summarizer and longer-term stores (covered by the later strategies and the FAQ) preserve what scrolls out of it.

```mermaid
flowchart TD
    MSG(["New message"])
    WORKING["Working memory
rolling window"]
    EPISODIC[("Episodic memory
past sessions")]
    SEMANTIC[("Semantic memory
facts and preferences")]
    SUM["Summarizer
compresses old turns"]
    ROUTER{"Retrieve
needed memories"}
    PROMPT["Assembled context"]
    LLM["LLM"]
    UPD["Memory updater
writes new facts"]
    MSG --> WORKING --> ROUTER
    ROUTER -->|Past sessions| EPISODIC
    ROUTER -->|User facts| SEMANTIC
    EPISODIC --> SUM --> PROMPT
    SEMANTIC --> PROMPT
    WORKING --> PROMPT --> LLM --> UPD
    UPD --> EPISODIC
    UPD --> SEMANTIC
    style ROUTER fill:#4f46e5,stroke:#4338ca,color:#fff
    style LLM fill:#f59e0b,stroke:#d97706,color:#1f2937
    style EPISODIC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style SEMANTIC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
```

```python
from typing import List, Dict

def sliding_window(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    token_counter=None
) -> List[Dict[str, str]]:
    """Keep the system message and the most recent messages that fit."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4  # rough estimate

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(token_counter(m) for m in system_msgs)
    budget = max_tokens - system_tokens
    kept = []
    running = 0

    for msg in reversed(non_system):
        cost = token_counter(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost

    return system_msgs + list(reversed(kept))
```
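A quick usage sketch (the message history and budget here are illustrative):

```python
history = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "My name is Dana and my order is late."},
    {"role": "assistant", "content": "Sorry about that, Dana. Checking now."},
    # ... many later turns ...
]
trimmed = sliding_window(history, max_tokens=4000)
# The system message always survives; the oldest non-system turns drop first.
```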

The sliding window is fast and predictable, but it has a major flaw: once a message scrolls out of the window, the agent forgets it entirely. If a user stated their name or a key requirement 50 messages ago, that information vanishes.

## Strategy 2: Summarization

Summarization compresses older history into a shorter summary that preserves key facts while reducing token count. You periodically call the LLM to summarize the oldest portion of the conversation, then replace those messages with the summary.

```python
import openai
from typing import Dict, List

async def summarize_history(
    messages: List[Dict[str, str]],
    threshold: int = 3000,
    keep_recent: int = 10,
    token_counter=None
) -> List[Dict[str, str]]:
    """Summarize old messages when total tokens exceed threshold."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4

    total = sum(token_counter(m) for m in messages)
    if total <= threshold:
        return messages

    # Keep system messages and the most recent turns verbatim;
    # compress everything older into a single summary message.
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    old, recent = non_system[:-keep_recent], non_system[-keep_recent:]
    if not old:
        return messages

    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    client = openai.AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation, preserving names, "
                f"requirements, decisions, and open questions:\n\n{old_text}"
            ),
        }],
        max_tokens=400,
    )
    summary = {
        "role": "system",
        "content": f"Summary of earlier conversation: {resp.choices[0].message.content}",
    }
    return system_msgs + [summary] + recent
```

Summarization preserves long-range facts at the cost of extra LLM calls, and a one-shot rewrite can still drop details. It also re-summarizes a growing block of old messages each time the threshold trips, which gets slower and more expensive as the conversation grows.

## Strategy 3: Compaction

Compaction combines both ideas: keep a fixed-size window of recent messages verbatim, and fold everything that scrolls out into a running summary that is updated incrementally. Each compaction step processes only the newly evicted messages, so cost stays proportional to the overflow rather than to the whole history.

```python
import openai
from typing import Dict, List

class CompactingHistory:
    """A rolling window of verbatim messages plus an incrementally updated summary."""

    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.messages: List[Dict[str, str]] = []
        self.summary = ""

    def add(self, message: Dict[str, str]) -> None:
        self.messages.append(message)

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        # Fold any overflow into the running summary before building context.
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)

        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Context from earlier: {self.summary}",
            })
        context.extend(self.messages)
        return context

    async def _update_summary(self, new_messages):
        new_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in new_messages
        )
        client = openai.AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Existing summary: {self.summary}\n\n"
                    f"New messages to incorporate:\n{new_text}\n\n"
                    "Produce an updated summary preserving all key facts."
                ),
            }],
            max_tokens=400,
        )
        self.summary = resp.choices[0].message.content
```

## Choosing the Right Strategy

| Strategy | Complexity | Long-Range Memory | Extra LLM Calls | Best For |
| --- | --- | --- | --- | --- |
| Sliding Window | Low | None | Zero | Short conversations, chatbots |
| Summarization | Medium | Good | Periodic | Customer support, assistants |
| Compaction | High | Best | Incremental | Long-running agents, research tasks |

For most production agents, compaction provides the best balance. It keeps recent messages verbatim for accuracy while maintaining a compressed record of everything that came before.

## FAQ

### How do I count tokens accurately instead of estimating?

Use the `tiktoken` library for OpenAI models. Call `tiktoken.encoding_for_model("gpt-4o")` to get an encoder, then `len(encoder.encode(text))` for exact token counts. For Claude, Anthropic provides a token counting API endpoint.
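As a sketch, an exact counter can be plugged into the `sliding_window` helper above (the per-message framing overhead of roughly 4 tokens is an approximation, not part of the tiktoken API):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def exact_counter(msg: dict) -> int:
    # Exact count for the content, plus ~4 tokens for role/message framing.
    return len(enc.encode(msg["content"])) + 4

trimmed = sliding_window(history, max_tokens=4000, token_counter=exact_counter)
```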

### Should the system message ever be summarized?

No. The system message defines the agent's behavior and should always remain in full. Only user and assistant messages should be candidates for summarization or eviction. Treat the system prompt as immutable context.

### Can I combine sliding windows with an external memory store?

Yes, and this is a common production pattern. Use a sliding window for the immediate context, but persist all messages to a database or vector store. When the agent needs old information, it queries the external store and injects relevant results into the current context.
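A minimal sketch of the pattern, assuming a hypothetical `store` object with `add` and `search` methods (any vector database client would fill this role), where `history` holds only user and assistant turns:

```python
async def build_context(history, user_msg, system_prompt, store):
    # Persist every turn so nothing is lost when it scrolls out of the window.
    store.add(text=user_msg["content"], metadata={"role": "user"})

    # Pull back old messages relevant to the current request.
    recalled = store.search(query=user_msg["content"], top_k=3)

    context = [{"role": "system", "content": system_prompt}]
    if recalled:
        facts = "\n".join(r.text for r in recalled)
        context.append({
            "role": "system",
            "content": f"Relevant earlier context:\n{facts}",
        })
    context.extend(sliding_window(history + [user_msg], max_tokens=4000))
    return context
```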

