Response Compaction: Managing Long Agent Conversations

Master OpenAIResponsesCompactionSession for automatic and manual compaction of long agent conversations, including token management, custom triggers, and compaction strategies.

The Long Conversation Problem

Every AI agent faces a fundamental constraint: the context window. A conversation that starts with a simple question and evolves over dozens of turns accumulates history. At some point, the raw history exceeds the model's context limit — or the input token cost becomes untenable.

Naive solutions (truncating the oldest messages, using a sliding window) throw away potentially important context. The user might reference something from the beginning of the conversation, and if you dropped it, the agent hallucinates or asks the user to repeat themselves.

Response compaction is a smarter approach: instead of dropping old messages, the system summarizes them — compressing the history into a shorter representation that preserves the essential information.

OpenAIResponsesCompactionSession

The OpenAI Agents SDK provides OpenAIResponsesCompactionSession — a session wrapper that automatically compacts conversation history when it gets too long.

from agents.extensions.sessions import (
    SQLiteSession,
    OpenAIResponsesCompactionSession,
)

base_session = SQLiteSession(db_path="./conversations.db")

compaction_session = OpenAIResponsesCompactionSession(
    session=base_session,
)

This wraps any base session with compaction capabilities. When the conversation history crosses a token threshold, the session automatically summarizes older turns before they are sent to the model.

How Auto-Compaction Works

The compaction session monitors the token count of the conversation history. When it crosses the configured threshold, it triggers compaction automatically:

  1. The session estimates the token count of all stored items.
  2. If the count exceeds the threshold, compaction is triggered.
  3. The older portion of the conversation is sent to the model for summarization.
  4. The summary replaces the detailed history.
  5. Recent messages are preserved in full detail.
from agents import Agent, Runner
from agents.extensions.sessions import (
    SQLiteSession,
    OpenAIResponsesCompactionSession,
)

base = SQLiteSession(db_path="./compact_demo.db")
session = OpenAIResponsesCompactionSession(session=base)

agent = Agent(
    name="LongConversationAgent",
    instructions="You are a research assistant helping with a long project.",
)

# This conversation can run for hundreds of turns
# Compaction kicks in automatically when history gets too long
async def research_session(session_id: str):
    questions = [
        "Let's research quantum computing applications.",
        "What about quantum error correction?",
        "How does surface code work?",
        # ... hundreds more turns
        "Summarize everything we've discussed about error correction.",
    ]

    for q in questions:
        result = await Runner.run(
            agent, q, session=session, session_id=session_id
        )
        print(result.final_output)

The agent can handle arbitrarily long conversations without hitting context limits or accumulating unbounded costs.
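The decision logic behind steps 1 to 5 above can be sketched in plain Python. Everything in this sketch is illustrative: `estimate_tokens`, `TOKEN_THRESHOLD`, `KEEP_RECENT`, and `compact` are hypothetical names, not SDK internals, and the characters-divided-by-four estimate is only a rough heuristic.

```python
# Illustrative sketch of an auto-compaction check; names and thresholds
# are hypothetical, not the SDK's internals.
TOKEN_THRESHOLD = 50_000   # assumed trigger point
KEEP_RECENT = 10           # recent items kept in full detail

def estimate_tokens(items: list[dict]) -> int:
    """Rough heuristic: roughly 4 characters per token."""
    return sum(len(str(item.get("content", ""))) for item in items) // 4

def compact(items: list[dict], summarize) -> list[dict]:
    """Replace older items with one summary item, keep recent ones verbatim."""
    if estimate_tokens(items) <= TOKEN_THRESHOLD:
        return items  # under threshold: nothing to do
    old, recent = items[:-KEEP_RECENT], items[-KEEP_RECENT:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

In the real session the summarize step is a model call; here it can be any callable that maps the older items to a summary string.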

Manual Compaction with run_compaction()

Sometimes you want to trigger compaction explicitly — for example, at the end of a logical section of conversation, or before a handoff to another agent.

from agents.extensions.sessions import (
    SQLiteSession,
    OpenAIResponsesCompactionSession,
)

base = SQLiteSession(db_path="./sessions.db")
session = OpenAIResponsesCompactionSession(session=base)

# After a long discussion, manually compact
await session.run_compaction(session_id="project-alpha")

# Now the history is summarized and shorter
items = await session.get_items("project-alpha")
print(f"Items after compaction: {len(items)}")

Manual compaction is useful at natural conversation boundaries:


async def handle_conversation_phase(
    session: OpenAIResponsesCompactionSession,
    session_id: str,
    agent: Agent,
    messages: list[str],
):
    """Process a phase of conversation, then compact."""
    for msg in messages:
        await Runner.run(agent, msg, session=session, session_id=session_id)

    # Compact after each phase to keep history manageable
    await session.run_compaction(session_id)
    print(f"Phase complete, history compacted for {session_id}")

Disabling Auto-Compaction

If you want full control over when compaction happens, disable the automatic trigger:

session = OpenAIResponsesCompactionSession(
    session=base_session,
    auto_compact=False,  # Disable automatic compaction
)

# Now compaction only happens when you call it explicitly
await session.run_compaction(session_id)

This is useful when:

  • You have custom logic for when compaction should occur
  • You want to compact only at specific conversation milestones
  • You need to ensure compaction does not interrupt time-sensitive interactions
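One pattern with auto-compaction disabled is to compact whenever the user signals the end of a conversation phase. The sketch below is an assumption, not SDK behavior: `is_milestone` and the marker phrases are hypothetical, and the commented wiring relies on the `run_compaction()` call shown earlier.

```python
# Hypothetical milestone detector for deciding when to call run_compaction()
# manually; the marker phrases are assumptions, not part of the SDK.
MILESTONE_MARKERS = (
    "let's move on",
    "next topic",
    "that wraps up",
)

def is_milestone(user_message: str) -> bool:
    """True when the user signals the end of a conversation phase."""
    lowered = user_message.lower()
    return any(marker in lowered for marker in MILESTONE_MARKERS)

# Wiring sketch (inside an async handler):
#     result = await Runner.run(agent, msg, session=session, session_id=sid)
#     if is_milestone(msg):
#         await session.run_compaction(sid)
```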

Custom Compaction Triggers with should_trigger_compaction

For fine-grained control, implement a custom callback that decides when compaction should fire:

from agents.extensions.sessions import (
    SQLiteSession,
    OpenAIResponsesCompactionSession,
)

def custom_trigger(items: list, token_estimate: int) -> bool:
    """Custom logic for when to trigger compaction."""
    # Compact if over 50,000 tokens
    if token_estimate > 50_000:
        return True

    # Compact if over 100 items regardless of token count
    if len(items) > 100:
        return True

    # Don't compact small conversations
    return False

base = SQLiteSession(db_path="./sessions.db")
session = OpenAIResponsesCompactionSession(
    session=base,
    should_trigger_compaction=custom_trigger,
)

Advanced: Time-Based Compaction

Compact history that is older than a certain threshold:

from datetime import datetime, timedelta

def time_based_trigger(items: list, token_estimate: int) -> bool:
    """Compact if the oldest item is more than 2 hours old."""
    if not items:
        return False

    # Assumes items carry a naive UTC ISO-8601 "created_at" timestamp
    oldest_timestamp = items[0].get("created_at")
    if oldest_timestamp:
        age = datetime.utcnow() - datetime.fromisoformat(oldest_timestamp)
        if age > timedelta(hours=2) and token_estimate > 10_000:
            return True

    return False

Token Management in Long Conversations

Compaction is one part of a broader token management strategy. Here is a complete approach:

Layer 1: Session Limits

Cap the number of items loaded from the session:

from agents.extensions.sessions import SessionSettings

# Load at most the 50 most recent items when building the model input
settings = SessionSettings(limit=50)

Layer 2: Compaction

Summarize older history to reduce token usage:

session = OpenAIResponsesCompactionSession(session=base)

Layer 3: Token Budgeting

Track and budget token usage across the conversation:

class TokenBudgetManager:
    def __init__(self, max_input_tokens: int = 100_000):
        self.max_input_tokens = max_input_tokens
        self.total_input_tokens = 0
        self.total_output_tokens = 0

    def track_usage(self, result):
        """Track token usage from a run result."""
        usage = result.raw_responses[-1].usage
        self.total_input_tokens += usage.input_tokens
        self.total_output_tokens += usage.output_tokens

    def should_compact(self) -> bool:
        """Signal compaction when approaching budget."""
        return self.total_input_tokens > self.max_input_tokens * 0.8

    def get_report(self) -> dict:
        return {
            "total_input": self.total_input_tokens,
            "total_output": self.total_output_tokens,
            "budget_remaining": self.max_input_tokens - self.total_input_tokens,
        }

Combining All Layers

budget = TokenBudgetManager(max_input_tokens=200_000)

async def managed_conversation(session_id: str, message: str):
    result = await Runner.run(
        agent,
        message,
        session=compaction_session,
        session_id=session_id,
        session_settings=SessionSettings(limit=80),
    )

    budget.track_usage(result)

    if budget.should_compact():
        await compaction_session.run_compaction(session_id)
        print("Compacted due to token budget pressure")

    return result.final_output

What Gets Preserved During Compaction

Compaction is lossy by design, but it is guided summarization rather than blind truncation. The model that performs the compaction is instructed to preserve:

  • Key facts and decisions made during the conversation
  • User preferences and stated requirements
  • Action items and commitments
  • Names, dates, numbers, and other specific details
  • The overall trajectory and context of the conversation

What gets compressed:

  • Verbose explanations that can be summarized
  • Back-and-forth clarification exchanges
  • Redundant information repeated across turns
  • Tool call details (replaced with outcome summaries)

The result is a compact representation that captures the essence of the conversation while using far fewer tokens.
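As an illustration of the preserve/compress split described above, a compaction instruction might look like the following. This is not the SDK's actual internal prompt, only a sketch of what such an instruction typically contains.

```python
# Illustrative compaction prompt; NOT the SDK's actual internal prompt.
COMPACTION_PROMPT = """\
Summarize the conversation so far into a compact brief. Preserve:
- key facts and decisions
- user preferences and stated requirements
- action items and commitments
- names, dates, and numbers exactly as given
Compress verbose explanations, clarification exchanges, and repeated
information. Replace tool-call details with one-line outcome summaries.
"""
```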

Written by

CallSphere Team
