
Agentic AI Context Optimization: Managing Million-Token Agent Conversations

Optimize million-token context windows for agentic AI with summarization, compression, sliding windows, and hierarchical context injection.

The Context Window Is Your Agent's Working Memory

Every piece of information in the context window competes for the model's attention. System prompts, conversation history, tool definitions, tool results, retrieved documents — they all consume tokens and influence the model's behavior. As agent conversations grow longer and tools return large payloads, context management becomes a critical engineering challenge.

Modern models offer large context windows — Claude supports up to 200K tokens, Gemini supports up to 1M tokens, and GPT-4o supports 128K tokens. But larger windows do not solve the problem. Research consistently shows that model performance degrades on information placed in the middle of long contexts (the "lost in the middle" effect). Throwing everything into the context is not a strategy — it is an anti-pattern.

Effective context management means putting the right information in the right place at the right time, and aggressively removing information that is no longer relevant.

Conversation Summarization

Long-running agent conversations accumulate history that is no longer directly relevant. A customer support session that started with account verification twenty turns ago does not need those verification turns in full detail — a summary suffices.


Rolling Summarization

After every N turns (typically 5-10), summarize the oldest unsummarized turns and replace them with the summary. This keeps the full context within a budget while preserving the key information.

class ConversationSummarizer:
    def __init__(self, llm_client, max_full_turns: int = 10):
        self.llm = llm_client
        self.max_full_turns = max_full_turns
        self.summaries: list[str] = []
        self.full_turns: list[dict] = []

    async def add_turn(self, role: str, content: str):
        self.full_turns.append({"role": role, "content": content})

        if len(self.full_turns) > self.max_full_turns:
            # Summarize the oldest half of the window; keep the rest verbatim
            batch = self.max_full_turns // 2
            turns_to_summarize = self.full_turns[:batch]
            summary = await self._summarize_turns(turns_to_summarize)
            self.summaries.append(summary)
            self.full_turns = self.full_turns[batch:]

    async def _summarize_turns(self, turns: list[dict]) -> str:
        turn_text = "\n".join(
            f"{t['role']}: {t['content']}" for t in turns
        )
        response = await self.llm.chat(
            system="Summarize this conversation segment concisely. "
                   "Preserve key decisions, facts, and action items. "
                   "Omit pleasantries and redundant confirmations.",
            messages=[{"role": "user", "content": turn_text}],
        )
        return response

    def build_context(self) -> list[dict]:
        context = []
        if self.summaries:
            summary_block = "\n\n".join(self.summaries)
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{summary_block}",
            })
        context.extend(self.full_turns)
        return context

Importance-Based Retention

Not all turns are equal. Turns where the user provided key information (account number, problem description, preferences) or where the agent made important decisions should be retained in full, while routine exchanges can be summarized more aggressively.

class ImportanceScorer:
    HIGH_IMPORTANCE_SIGNALS = [
        "account", "order", "booking", "confirmed", "agreed",
        "decided", "problem is", "issue is", "error",
    ]

    def score_turn(self, turn: dict) -> float:
        content_lower = turn["content"].lower()
        score = 0.5  # Base score

        # Tool calls are always important
        if turn.get("tool_calls"):
            score += 0.3

        # Key information signals
        for signal in self.HIGH_IMPORTANCE_SIGNALS:
            if signal in content_lower:
                score += 0.1

        # Long turns tend to contain more information
        word_count = len(turn["content"].split())
        if word_count > 100:
            score += 0.1

        return min(score, 1.0)
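These scores can then drive retention: turns above a threshold stay verbatim, the rest go to the summarizer. A minimal sketch with a simplified inline scorer (the 0.6 threshold is an illustrative assumption; tune it per application):

```python
HIGH_IMPORTANCE_SIGNALS = ["account", "order", "error", "confirmed"]

def score_turn(turn: dict) -> float:
    """Simplified scorer: base 0.5 plus signal and tool-call bonuses."""
    content = turn["content"].lower()
    score = 0.5
    if turn.get("tool_calls"):
        score += 0.3
    score += 0.1 * sum(signal in content for signal in HIGH_IMPORTANCE_SIGNALS)
    return min(score, 1.0)

def partition_turns(
    turns: list[dict], threshold: float = 0.6
) -> tuple[list[dict], list[dict]]:
    """Split turns into (keep verbatim, candidates for summarization)."""
    keep = [t for t in turns if score_turn(t) >= threshold]
    to_summarize = [t for t in turns if score_turn(t) < threshold]
    return keep, to_summarize
```

The kept turns remain in the context as-is; the rest are batched through the summarizer shown earlier.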

Sliding Window Techniques

For agents that process streams of data (monitoring agents, chat agents handling rapid-fire messages), a sliding window ensures the context stays current without growing unbounded.

Token-Budget Sliding Window

Instead of a fixed number of turns, define a token budget for conversation history and drop the oldest turns when the budget is exceeded.


import tiktoken

class TokenBudgetWindow:
    def __init__(self, token_budget: int = 50000, model: str = "gpt-4o"):
        self.token_budget = token_budget
        self.encoder = tiktoken.encoding_for_model(model)
        self.turns: list[dict] = []

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def add_turn(self, turn: dict):
        self.turns.append(turn)
        self._enforce_budget()

    def _enforce_budget(self):
        total = sum(
            self.count_tokens(t["content"]) for t in self.turns
        )
        while total > self.token_budget and len(self.turns) > 1:
            removed = self.turns.pop(0)
            total -= self.count_tokens(removed["content"])

    def get_turns(self) -> list[dict]:
        return self.turns
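When tiktoken is unavailable, or exact counts are not worth the latency, a characters-per-token heuristic of roughly 4 characters per token for English text is a workable approximation. This sketch applies the same oldest-first eviction policy without the dependency:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def enforce_budget(turns: list[dict], token_budget: int) -> list[dict]:
    """Drop the oldest turns until the estimated total fits the budget,
    always keeping at least the most recent turn."""
    turns = list(turns)
    total = sum(estimate_tokens(t["content"]) for t in turns)
    while total > token_budget and len(turns) > 1:
        removed = turns.pop(0)
        total -= estimate_tokens(removed["content"])
    return turns
```

The heuristic overcounts for code and undercounts for non-Latin scripts, so leave headroom (10-20%) below the hard model limit.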

Context Compression

Sometimes you need all the information in the context but in a more compact form. Context compression techniques reduce token count while preserving information density.


Tool Result Compression

Tool results are often the largest context consumers. A database query might return 50 rows when the agent only needs 3. A web search might return full page content when the agent only needs key paragraphs.

import tiktoken

class ToolResultCompressor:
    def __init__(self, llm_client, model: str = "gpt-4o"):
        self.llm = llm_client
        self.encoder = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    async def compress_tool_result(
        self,
        tool_name: str,
        raw_result: str,
        user_query: str,
        max_tokens: int = 500,
    ) -> str:
        if self.count_tokens(raw_result) <= max_tokens:
            return raw_result

        compressed = await self.llm.chat(
            system=(
                f"Compress the following {tool_name} result to under "
                f"{max_tokens} tokens. Preserve all information relevant "
                f"to the user's query. Remove redundant or irrelevant data."
            ),
            messages=[
                {
                    "role": "user",
                    "content": f"User query: {user_query}\n\n"
                               f"Tool result:\n{raw_result}",
                }
            ],
        )
        return compressed

Structured Data Summarization

When tools return tabular data, convert it to a narrative summary rather than including the raw table.

def summarize_table_result(
    rows: list[dict],
    query_context: str
) -> str:
    if len(rows) <= 5:
        # Small result set: include rows verbatim
        return "\n".join(str(row) for row in rows)

    # Summarize large result sets
    summary_parts = [
        f"Query returned {len(rows)} results.",
        "Key statistics:",
    ]

    # Add relevant aggregations based on data types
    numeric_cols = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    for col in numeric_cols:
        values = [r[col] for r in rows if r.get(col) is not None]
        if values:
            summary_parts.append(
                f"  - {col}: min={min(values)}, max={max(values)}, "
                f"avg={sum(values)/len(values):.1f}"
            )

    # Include top 5 results
    summary_parts.append("\nTop 5 results:")
    for row in rows[:5]:
        summary_parts.append(f"  {row}")

    return "\n".join(summary_parts)
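A cheap pre-compression step before any summarization is column projection: drop columns the query never mentions. A hedged sketch (the name matching here is deliberately naive; a production version would handle synonyms and aliases):

```python
def project_columns(rows: list[dict], query_context: str) -> list[dict]:
    """Keep only columns whose names appear in the query text,
    falling back to all columns when nothing matches."""
    if not rows:
        return rows
    query_lower = query_context.lower()
    keep = [col for col in rows[0] if col.lower() in query_lower]
    if not keep:
        return rows
    return [{col: row.get(col) for col in keep} for row in rows]
```

Projection is lossless with respect to the question being asked, so it composes well with the statistical summarization above.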

Selective Memory Injection

Not all agent memory should be in the context at all times. Selective injection loads relevant memories on demand based on the current conversation turn.


Relevance-Based Memory Loading

class SelectiveMemory:
    def __init__(self, vector_store, max_memory_tokens: int = 2000):
        self.vector_store = vector_store
        self.max_memory_tokens = max_memory_tokens

    async def get_relevant_memories(
        self,
        current_message: str,
        session_id: str,
    ) -> str:
        # generate_embedding / count_tokens: embedding and tokenizer helpers
        # assumed to exist in your stack (e.g. an embeddings API, tiktoken)
        embedding = await generate_embedding(current_message)
        memories = await self.vector_store.query(
            vector=embedding,
            top_k=10,
            filter={"session_id": session_id},
        )

        # Select memories that fit within token budget
        selected = []
        token_count = 0
        for memory in memories.matches:
            memory_tokens = count_tokens(memory.metadata["content"])
            if token_count + memory_tokens > self.max_memory_tokens:
                break
            selected.append(memory.metadata["content"])
            token_count += memory_tokens

        if not selected:
            return ""

        return "Relevant context from earlier in this session:\n" + "\n".join(selected)

Hierarchical Context Structure

Organize the context window into layers with different update frequencies and priority levels.

The Context Hierarchy

  1. System layer (static): Agent identity, role, rules, capabilities — loaded once per session
  2. Session layer (slow-changing): User profile, session metadata, business rules — updated on session events
  3. Conversation layer (dynamic): Recent conversation history — updated every turn
  4. Retrieval layer (per-turn): RAG results, tool outputs — replaced each turn
  5. Instruction layer (static): Output format requirements, safety constraints — loaded once

class HierarchicalContext:
    def __init__(self, total_budget: int = 100000):
        self.budgets = {
            "system": int(total_budget * 0.15),
            "session": int(total_budget * 0.10),
            "conversation": int(total_budget * 0.40),
            "retrieval": int(total_budget * 0.25),
            "instruction": int(total_budget * 0.10),
        }
        self.layers: dict[str, str] = {}

    def set_layer(self, layer: str, content: str):
        # count_tokens / truncate_to_tokens: tokenizer helpers assumed
        # to exist in your codebase (e.g. tiktoken-based)
        tokens = count_tokens(content)
        if tokens > self.budgets[layer]:
            content = truncate_to_tokens(content, self.budgets[layer])
        self.layers[layer] = content

    def build_prompt(self) -> str:
        ordered = ["system", "session", "instruction", "retrieval", "conversation"]
        parts = []
        for layer in ordered:
            if layer in self.layers and self.layers[layer]:
                parts.append(self.layers[layer])
        return "\n\n---\n\n".join(parts)

Token Budgeting Per Agent

Different agents need different context distributions. A customer support agent needs more conversation history budget (to maintain context across a long troubleshooting session) while a research agent needs more retrieval budget (to incorporate multiple sources). Define per-agent token budgets as configuration.
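A minimal sketch of that configuration, assuming the five layers from the hierarchy above (the role names and ratios are illustrative assumptions, not recommendations):

```python
AGENT_BUDGETS: dict[str, dict[str, float]] = {
    # Support agents favor conversation history.
    "support": {"system": 0.10, "session": 0.10, "conversation": 0.55,
                "retrieval": 0.15, "instruction": 0.10},
    # Research agents favor retrieved sources.
    "research": {"system": 0.10, "session": 0.05, "conversation": 0.25,
                 "retrieval": 0.50, "instruction": 0.10},
}

def layer_budgets(agent_role: str, total_tokens: int) -> dict[str, int]:
    """Resolve a role's fractional layer budgets into token counts."""
    ratios = AGENT_BUDGETS[agent_role]
    return {layer: round(total_tokens * ratio) for layer, ratio in ratios.items()}
```

Keeping the ratios in configuration rather than code lets you tune a single agent's distribution without redeploying the fleet.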

Frequently Asked Questions

Does a larger context window mean better agent performance?

Not necessarily. Larger context windows allow more information to be included, but model attention degrades with length. The "lost in the middle" effect means information placed in the middle of long contexts is less likely to be used by the model. Strategic context management — putting the most relevant information at the beginning and end of the context — typically outperforms simply filling a large window with everything available.
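One practical consequence: when injecting a ranked list of retrieved documents, alternate them between the front and the back of the block so the strongest items sit at the edges and the weakest in the middle. A minimal sketch, assuming the input arrives ranked best-first:

```python
def order_for_edges(ranked_docs: list[str]) -> list[str]:
    """Alternate ranked docs between the front and the back so the
    best-ranked items land at the start and end of the context block,
    leaving the weakest in the middle where attention is lowest."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```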

How often should conversation history be summarized?

Summarize when the conversation history exceeds your token budget for that context layer. A common approach is to summarize every 5-10 turns, keeping the most recent turns in full detail and older turns as summaries. For high-stakes conversations (financial transactions, medical consultations), retain more turns in full to ensure no critical detail is lost in summarization.

What is the cost impact of large context windows?

LLM API pricing is typically per-token for both input and output, so a 100K-token context costs roughly 20x more in input tokens per request than a 5K-token context. Context optimization therefore directly reduces API costs: aggressive summarization and compression can shrink context size by 60-80% without meaningful quality loss for most agent applications.

How do you handle tool results that exceed the token budget?

Three strategies: truncation (cut the result to fit, losing tail data), compression (use an LLM to summarize the result, preserving the most relevant information), and pagination (return a subset of results with a "get more" tool the agent can call if needed). Compression is generally preferred because it preserves relevance, but pagination works well for structured data like database query results.
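The pagination strategy can be a thin cursor wrapper around the raw rows: the agent sees one page plus a cursor it can pass to a hypothetical `get_more_results` tool. A sketch:

```python
def paginate_result(
    rows: list[dict], cursor: int = 0, page_size: int = 5
) -> dict:
    """Return one page of rows plus the cursor for the next page
    (None when the result set is exhausted)."""
    page = rows[cursor:cursor + page_size]
    has_more = cursor + page_size < len(rows)
    return {
        "rows": page,
        "total": len(rows),
        "next_cursor": cursor + page_size if has_more else None,
    }
```

Including `total` in every page lets the agent decide whether fetching more is worth the extra tool round-trip.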

Should each agent in a multi-agent system have its own context window?

Yes. Each agent should maintain its own context optimized for its role. A triage agent needs minimal context (just the current request). A specialist agent needs rich domain context. A supervisor agent needs summaries from subordinate agents. Sharing a single context across all agents leads to bloat and confusion.

Written by

CallSphere Team
