
Conversation History Management: Sliding Windows, Summarization, and Compaction

Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information.

Why Conversation History Management Matters

Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, and many open-source models cap at 8K or 32K. When your AI agent runs a multi-turn conversation or a long-running task, the raw message history can easily exceed these limits. Without a strategy, you either truncate blindly and lose critical context, or you hit token errors and the agent crashes.

Conversation history management is the discipline of deciding what stays in the context window, what gets compressed, and what gets discarded. There are three primary strategies: sliding windows, summarization, and compaction. Each has distinct tradeoffs between simplicity, fidelity, and compute cost.

Strategy 1: Sliding Window

The simplest approach keeps only the most recent N messages (or N tokens) and drops everything older. This works well for conversational agents where recent context matters most.

from typing import List, Dict

def sliding_window(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    token_counter=None
) -> List[Dict[str, str]]:
    """Keep the system message and the most recent messages that fit."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4  # rough estimate

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(token_counter(m) for m in system_msgs)
    budget = max_tokens - system_tokens
    kept = []
    running = 0

    for msg in reversed(non_system):
        cost = token_counter(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost

    return system_msgs + list(reversed(kept))

The sliding window is fast and predictable, but it has a major flaw: once a message scrolls out of the window, the agent forgets it entirely. If a user stated their name or a key requirement 50 messages ago, that information vanishes.
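To make that failure mode concrete, here is a compact, self-contained version of the function above run against a synthetic 100-turn history. The message sizes, the user's name, and the `rough_tokens` heuristic are all illustrative:

```python
from typing import Dict, List

def rough_tokens(msg: Dict[str, str]) -> int:
    # Same ~4-characters-per-token heuristic as the default above.
    return len(msg["content"]) // 4

def sliding_window(messages: List[Dict[str, str]],
                   max_tokens: int = 4000) -> List[Dict[str, str]]:
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(rough_tokens(m) for m in system_msgs)
    kept, running = [], 0
    for msg in reversed(non_system):
        cost = rough_tokens(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost
    return system_msgs + list(reversed(kept))

# A 100-turn synthetic history; the user's name appears only once, early on.
history = [{"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Hi, my name is Priya."}]
for i in range(99):
    history.append({"role": "user", "content": f"Question {i}: " + "x" * 200})
    history.append({"role": "assistant", "content": f"Answer {i}: " + "y" * 200})

window = sliding_window(history, max_tokens=2000)
print(any("Priya" in m["content"] for m in window))  # → False: the name scrolled out
```

The system message survives, but the introduction does not: only the last ~37 messages fit in the 2,000-token budget, so the agent can no longer answer "what is my name?".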

Strategy 2: Summarization

Summarization compresses older history into a shorter summary that preserves key facts while reducing token count. You periodically call the LLM to summarize the oldest portion of the conversation, then replace those messages with the summary.


from typing import Dict, List

import openai

async def summarize_history(
    messages: List[Dict[str, str]],
    threshold: int = 3000,
    keep_recent: int = 10,
    token_counter=None
) -> List[Dict[str, str]]:
    """Summarize old messages when total tokens exceed threshold."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4

    total = sum(token_counter(m) for m in messages)
    if total <= threshold:
        return messages

    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    old_messages = non_system[:-keep_recent]
    recent_messages = non_system[-keep_recent:]
    if not old_messages:  # nothing old enough to summarize yet
        return messages

    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )

    client = openai.AsyncOpenAI()
    summary_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation history. Preserve all key facts, "
                "decisions, user preferences, and action items:\n\n"
                f"{old_text}"
            ),
        }],
        max_tokens=500,
    )

    summary = summary_response.choices[0].message.content
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary}",
    }

    return system_msgs + [summary_msg] + recent_messages

Summarization preserves long-range context at the cost of an extra LLM call and potential information loss during compression.
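The mechanics are easy to verify offline. The sketch below mirrors the control flow of `summarize_history` but accepts the summarizer as a callable, so a stub can stand in for the real LLM call; the stub and message sizes are illustrative:

```python
from typing import Callable, Dict, List

def rough_tokens(msg: Dict[str, str]) -> int:
    return len(msg["content"]) // 4

def summarize_with(messages: List[Dict[str, str]],
                   summarizer: Callable[[str], str],
                   threshold: int = 3000,
                   keep_recent: int = 10) -> List[Dict[str, str]]:
    # Same control flow as summarize_history above, with the LLM call
    # injected as `summarizer` so the logic can run without a network.
    if sum(rough_tokens(m) for m in messages) <= threshold:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    old, recent = non_system[:-keep_recent], non_system[-keep_recent:]
    old_text = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summarizer(old_text)}",
    }
    return system_msgs + [summary_msg] + recent

# 50 messages of ~100 tokens each: 5000 tokens total, over the 3000 threshold.
history = [{"role": "user", "content": "m" * 400} for _ in range(50)]
stub = lambda text: f"({len(text.splitlines())} old messages condensed)"
compressed = summarize_with(history, stub)
print(len(compressed))  # → 11: one summary message plus the 10 most recent
```

Injecting the summarizer also makes the compression logic unit-testable, which is hard to do when the API call is hardwired into the function.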

Strategy 3: Compaction (Hybrid)

Compaction combines both approaches. It maintains a rolling summary that gets updated incrementally as messages age out of the sliding window. Each time the window shifts, new messages are merged into the existing summary rather than re-summarizing the entire history.

from typing import Dict, List

import openai

class CompactionManager:
    def __init__(self, window_size: int = 20, summary: str = ""):
        self.window_size = window_size
        self.summary = summary
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)

        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Context from earlier: {self.summary}",
            })
        context.extend(self.messages)
        return context

    async def _update_summary(self, new_messages: List[Dict[str, str]]) -> None:
        new_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in new_messages
        )
        client = openai.AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Existing summary: {self.summary}\n\n"
                    f"New messages to incorporate:\n{new_text}\n\n"
                    "Produce an updated summary preserving all key facts."
                ),
            }],
            max_tokens=400,
        )
        self.summary = resp.choices[0].message.content
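To observe the window shift and summary update without making API calls, the stub below mirrors `CompactionManager` but replaces `_update_summary` with an offline placeholder; the stub's summary format is illustrative, and the real class calls the LLM at that point:

```python
import asyncio
from typing import Dict, List

class StubCompactionManager:
    # Same structure as CompactionManager above; _update_summary is a
    # placeholder so the windowing behavior can be observed offline.
    def __init__(self, window_size: int = 20):
        self.window_size = window_size
        self.summary = ""
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({"role": "system",
                            "content": f"Context from earlier: {self.summary}"})
        context.extend(self.messages)
        return context

    async def _update_summary(self, overflow: List[Dict[str, str]]) -> None:
        # Placeholder: the real implementation sends `overflow` to the LLM.
        self.summary = (self.summary + f" +{len(overflow)} msgs folded in").strip()

async def demo():
    mgr = StubCompactionManager(window_size=4)
    for i in range(10):
        mgr.add_message("user", f"turn {i}")
    return mgr, await mgr.get_context("You are a helpful assistant.")

mgr, ctx = asyncio.run(demo())
print(len(mgr.messages), len(ctx))  # → 4 6: window of 4, plus system + summary
```

Ten messages against a window of four: six age out into the summary, and the returned context is the system prompt, the summary message, and the four most recent turns.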

Choosing the Right Strategy

| Strategy | Complexity | Long-Range Memory | Extra LLM Calls | Best For |
| --- | --- | --- | --- | --- |
| Sliding Window | Low | None | Zero | Short conversations, chatbots |
| Summarization | Medium | Good | Periodic | Customer support, assistants |
| Compaction | High | Best | Incremental | Long-running agents, research tasks |

For most production agents, compaction provides the best balance. It keeps recent messages verbatim for accuracy while maintaining a compressed record of everything that came before.

FAQ

How do I count tokens accurately instead of estimating?

Use the tiktoken library for OpenAI models. Call tiktoken.encoding_for_model("gpt-4o") to get an encoder, then len(encoder.encode(text)) for exact token counts. For Claude, Anthropic provides a token counting API endpoint.
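A defensive wrapper might look like the sketch below; the `ImportError` fallback to the ~4-characters-per-token heuristic is an illustrative convenience for environments without `tiktoken` installed:

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact count via tiktoken when available; otherwise fall back to
    the rough ~4-characters-per-token estimate used earlier."""
    try:
        import tiktoken
        try:
            encoder = tiktoken.encoding_for_model(model)
        except KeyError:
            # Unrecognized model name: use the gpt-4o family encoding.
            encoder = tiktoken.get_encoding("o200k_base")
        return len(encoder.encode(text))
    except ImportError:
        return len(text) // 4

print(count_tokens("Hello, world!"))
```

Exact counts matter most near the budget boundary: the character heuristic can be off by 20% or more on code-heavy or non-English text, which is exactly when a hard token limit bites.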

Should the system message ever be summarized?

No. The system message defines the agent's behavior and should always remain in full. Only user and assistant messages should be candidates for summarization or eviction. Treat the system prompt as immutable context.

Can I combine sliding windows with an external memory store?

Yes, and this is a common production pattern. Use a sliding window for the immediate context, but persist all messages to a database or vector store. When the agent needs old information, it queries the external store and injects relevant results into the current context.
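A minimal sketch of that pattern, using naive keyword overlap in place of a real retriever; the `MessageStore` class and its scoring are hypothetical stand-ins for a database or embedding-similarity search:

```python
import re
from typing import Dict, List

def _words(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))

class MessageStore:
    """Hypothetical external store. A production system would persist to a
    database or vector store and rank by embedding similarity; keyword
    overlap is used here only to keep the sketch self-contained."""
    def __init__(self):
        self._all: List[Dict[str, str]] = []

    def persist(self, msg: Dict[str, str]) -> None:
        self._all.append(msg)

    def search(self, query: str, k: int = 3) -> List[Dict[str, str]]:
        q = _words(query)
        scored = [(len(q & _words(m["content"])), m) for m in self._all]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]

store = MessageStore()
store.persist({"role": "user", "content": "My name is Priya and I prefer email."})
for i in range(100):
    store.persist({"role": "assistant", "content": f"routine reply {i}"})

# Every message is persisted; only relevant ones are pulled back into context.
recalled = store.search("what is the user's name?")
print(recalled[0]["content"])  # → the message containing the user's name
```

The sliding window keeps the immediate exchange verbatim while the store gives the agent recall beyond the window, at the cost of a retrieval step before each LLM call.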


#ConversationHistory #ContextWindow #TokenManagement #LLMMemory #AgenticAI #LearnAI #AIEngineering

Written by CallSphere Team