
Contextual Compression for RAG: Reducing Retrieved Context to What Matters

Learn how contextual compression techniques strip irrelevant information from retrieved chunks before they reach the LLM, improving both answer quality and token efficiency.

The Retrieval Noise Problem

When you retrieve the top 5 chunks from a vector store, each chunk is typically 500-1000 tokens. That is 2,500-5,000 tokens of context passed to your LLM. But here is the critical insight: usually only 10-20% of those tokens are actually relevant to the specific question being asked.

A chunk might be retrieved because it contains a paragraph about your topic, but the rest of the chunk covers unrelated details. This noise dilutes the signal, increases token costs, and — most importantly — can confuse the LLM into generating responses that blend relevant and irrelevant information.

Contextual compression addresses this by extracting or summarizing only the question-relevant portions of each retrieved document before passing them to the generator.

Three Approaches to Compression

1. Extractive Compression

Extract only the sentences or passages that directly relate to the query. This preserves exact wording from the source, maintaining fidelity.


2. LLM-Based Abstractive Compression

Use a language model to rewrite each chunk, keeping only query-relevant information. More flexible but introduces the possibility of subtle distortion.

3. Cross-Encoder Reranking with Truncation

Score individual sentences within each chunk for relevance, then keep only the top-scoring sentences. A hybrid approach that balances precision and speed.


Implementing Extractive Compression

from openai import OpenAI
import json
import re

client = OpenAI()

def extractive_compress(
    query: str,
    documents: list[str],
) -> list[str]:
    """Extract only query-relevant sentences from each document."""
    compressed = []

    for doc in documents:
        # Split document into sentences
        sentences = re.split(r'(?<=[.!?])\s+', doc)

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Given a query and numbered sentences, return a "
                    'JSON object with a "relevant_indices" key '
                    "containing a list of sentence numbers (0-indexed) "
                    "that are relevant to answering the query. "
                    "Only include directly relevant sentences."
                )
            }, {
                "role": "user",
                "content": (
                    f"Query: {query}\n\nSentences:\n"
                    + "\n".join(
                        f"[{i}] {s}"
                        for i, s in enumerate(sentences)
                    )
                )
            }],
            response_format={"type": "json_object"}
        )
        result = json.loads(
            response.choices[0].message.content
        )
        indices = result.get("relevant_indices", [])

        relevant_text = " ".join(
            sentences[i] for i in indices
            if i < len(sentences)
        )
        if relevant_text.strip():
            compressed.append(relevant_text)

    return compressed

LLM-Based Abstractive Compression

When exact sentences are too fragmented, abstractive compression creates coherent summaries:

def abstractive_compress(
    query: str,
    documents: list[str],
    max_tokens_per_doc: int = 150,
) -> list[str]:
    """Compress each document to only query-relevant content."""
    compressed = []

    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Extract and summarize ONLY the information "
                    "from this document that is relevant to "
                    "answering the user's query. Omit everything "
                    "else. Keep the summary under "
                    f"{max_tokens_per_doc} tokens. If nothing in "
                    "the document is relevant, respond with "
                    "'NOT_RELEVANT'."
                )
            }, {
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {doc}"
            }],
            max_tokens=max_tokens_per_doc,
        )
        result = response.choices[0].message.content.strip()
        if result != "NOT_RELEVANT":
            compressed.append(result)

    return compressed

Fast Compression with Cross-Encoders

For production systems where LLM compression is too slow, use a cross-encoder to score individual sentences:

from sentence_transformers import CrossEncoder
import re

# Load a small, fast cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_compress(
    query: str,
    documents: list[str],
    top_sentences: int = 10,
) -> str:
    """Use cross-encoder to select most relevant sentences."""
    all_sentences = []
    for doc in documents:
        all_sentences.extend(re.split(r'(?<=[.!?])\s+', doc))

    # Score all sentences against the query
    pairs = [[query, sent] for sent in all_sentences]
    scores = reranker.predict(pairs)

    # Rank sentence positions by score and keep the top ones;
    # tracking positions rather than text avoids mis-ordering
    # when the same sentence appears in more than one chunk
    top_positions = sorted(
        range(len(all_sentences)),
        key=lambda i: scores[i],
        reverse=True,
    )[:top_sentences]

    # Return in original order for coherence
    return " ".join(
        all_sentences[i] for i in sorted(top_positions)
    )
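The sentence splitter used here is a deliberately simple regex. A quick sanity check of its behavior (standard library only, so you can verify it without loading the cross-encoder; note it will over-split on abbreviations like "e.g."):

```python
import re

def split_sentences(doc: str) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace,
    # mirroring the regex used in cross_encoder_compress above
    return re.split(r'(?<=[.!?])\s+', doc)

print(split_sentences("RAG retrieves chunks. Compression trims them! Does it help?"))
# → ['RAG retrieves chunks.', 'Compression trims them!', 'Does it help?']
```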

Putting It All Together

A complete compression-augmented RAG pipeline:

def compressed_rag(
    query: str,
    retriever,
    compression: str = "extractive",
) -> str:
    """RAG pipeline with contextual compression."""
    # Retrieve more documents than usual since we will compress
    raw_docs = retriever.search(query, k=10)

    # Compress based on strategy
    if compression == "extractive":
        context_docs = extractive_compress(query, raw_docs)
    elif compression == "abstractive":
        context_docs = abstractive_compress(query, raw_docs)
    elif compression == "cross_encoder":
        context_docs = [cross_encoder_compress(query, raw_docs)]
    else:
        context_docs = raw_docs

    context = "\n\n".join(context_docs)

    # Generate with compressed context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content

Compression Ratios in Practice

In our testing, extractive compression reduces context by 60-75% while retaining answer quality. Abstractive compression achieves 70-85% reduction. Cross-encoder sentence selection achieves 80-90% reduction. The sweet spot depends on your use case — higher compression saves tokens but risks dropping subtle details that matter for nuanced questions.
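To make those percentages concrete, here is a back-of-the-envelope sketch of how reduction translates into tokens reaching the generator. The chunk count and chunk size are illustrative assumptions, not measurements:

```python
def context_tokens(num_chunks: int, tokens_per_chunk: int, reduction: float) -> int:
    """Tokens reaching the LLM after compressing retrieved context."""
    return round(num_chunks * tokens_per_chunk * (1 - reduction))

# Hypothetical retrieval: 10 chunks of 800 tokens each
raw = context_tokens(10, 800, 0.0)          # 8000 tokens uncompressed
extractive = context_tokens(10, 800, 0.70)  # ~70% reduction -> 2400
cross_enc = context_tokens(10, 800, 0.85)   # ~85% reduction -> 1200
print(raw, extractive, cross_enc)
```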

FAQ

Does compression hurt answer quality?

When done well, compression actually improves answer quality because the LLM sees less noise. The risk is over-compression — removing context that seems irrelevant to a simple classifier but contains nuances the LLM needs. Monitor your answer quality metrics when tuning compression aggressiveness.
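One easy number to track while tuning aggressiveness is the realized compression ratio per query. A minimal helper, using whitespace tokens as a rough stand-in for the model tokenizer:

```python
def compression_ratio(original_docs: list[str], compressed_docs: list[str]) -> float:
    """Fraction of (approximate) tokens removed by compression.

    Whitespace splitting is a crude proxy; a production system
    would count with the actual model tokenizer.
    """
    before = sum(len(d.split()) for d in original_docs)
    after = sum(len(d.split()) for d in compressed_docs)
    return 1 - after / max(before, 1)

print(compression_ratio(["one two three four"], ["one two"]))  # → 0.5
```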

Which compression method should I use in production?

Cross-encoder compression is the best starting point for production. It runs in milliseconds (no LLM call required), provides good compression ratios, and scales well. Graduate to LLM-based compression only if cross-encoder results are insufficient for your quality requirements.

Can I combine compression with reranking?

Yes, and this is a powerful pattern. First rerank your retrieved documents to get the best ordering, then apply compression to the top-ranked results. This ensures you compress the most relevant documents rather than wasting compression effort on documents that would have been discarded anyway.
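A sketch of that rerank-then-compress ordering, with pluggable `score_fn` and `compress_fn` standing in for any real reranker and any of the three compressors above (the word-overlap scorer and identity compressor in the demo are toy placeholders):

```python
from typing import Callable

def rerank_then_compress(
    query: str,
    docs: list[str],
    score_fn: Callable[[str, str], float],           # e.g. a cross-encoder score
    compress_fn: Callable[[str, list[str]], list[str]],
    top_k: int = 3,
) -> list[str]:
    """Rerank retrieved docs first, then compress only the top-k winners."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return compress_fn(query, ranked[:top_k])

# Toy demo with stand-in functions
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
identity = lambda q, ds: ds
best = rerank_then_compress(
    "rag compression",
    ["cats and dogs", "rag compression basics"],
    overlap, identity, top_k=1,
)  # keeps only the best-scoring doc
```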


#ContextualCompression #RAG #TokenOptimization #LLMContext #Retrieval #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
