
Contextual Compression for RAG: Reducing Retrieved Context to What Matters

Learn how contextual compression techniques strip irrelevant information from retrieved chunks before they reach the LLM, improving both answer quality and token efficiency.

The Retrieval Noise Problem

When you retrieve the top 5 chunks from a vector store, each chunk is typically 500-1000 tokens. That is 2,500-5,000 tokens of context passed to your LLM. But here is the critical insight: usually only 10-20% of those tokens are actually relevant to the specific question being asked.

A chunk might be retrieved because it contains a paragraph about your topic, but the rest of the chunk covers unrelated details. This noise dilutes the signal, increases token costs, and — most importantly — can confuse the LLM into generating responses that blend relevant and irrelevant information.

Contextual compression addresses this by extracting or summarizing only the question-relevant portions of each retrieved document before passing them to the generator.

Three Approaches to Compression

1. Extractive Compression

Extract only the sentences or passages that directly relate to the query. This preserves exact wording from the source, maintaining fidelity.


2. LLM-Based Abstractive Compression

Use a language model to rewrite each chunk, keeping only query-relevant information. More flexible but introduces the possibility of subtle distortion.

3. Cross-Encoder Reranking with Truncation

Score individual sentences within each chunk for relevance, then keep only the top-scoring sentences. A hybrid approach that balances precision and speed.


Implementing Extractive Compression

from openai import OpenAI
import json
import re

client = OpenAI()

def extractive_compress(
    query: str,
    documents: list[str],
) -> list[str]:
    """Extract only query-relevant sentences from each document."""
    compressed = []

    for doc in documents:
        # Split document into sentences
        sentences = re.split(r'(?<=[.!?])\s+', doc)

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Given a query and numbered sentences, return a "
                    'JSON object with a "relevant_indices" key '
                    "containing a list of sentence numbers (0-indexed) "
                    "that are relevant to answering the query. "
                    "Only include directly relevant sentences."
                )
            }, {
                "role": "user",
                "content": (
                    f"Query: {query}\n\nSentences:\n"
                    + "\n".join(
                        f"[{i}] {s}"
                        for i, s in enumerate(sentences)
                    )
                )
            }],
            response_format={"type": "json_object"}
        )
        result = json.loads(
            response.choices[0].message.content
        )
        indices = result.get("relevant_indices", [])

        relevant_text = " ".join(
            sentences[i] for i in indices
            if i < len(sentences)
        )
        if relevant_text.strip():
            compressed.append(relevant_text)

    return compressed

LLM-Based Abstractive Compression

When exact sentences are too fragmented, abstractive compression creates coherent summaries:

def abstractive_compress(
    query: str,
    documents: list[str],
    max_tokens_per_doc: int = 150,
) -> list[str]:
    """Compress each document to only query-relevant content."""
    compressed = []

    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Extract and summarize ONLY the information "
                    "from this document that is relevant to "
                    "answering the user's query. Omit everything "
                    "else. Keep the summary under "
                    f"{max_tokens_per_doc} tokens. If nothing in "
                    "the document is relevant, respond with "
                    "'NOT_RELEVANT'."
                )
            }, {
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {doc}"
            }],
            max_tokens=max_tokens_per_doc,
        )
        result = response.choices[0].message.content.strip()
        if result != "NOT_RELEVANT":
            compressed.append(result)

    return compressed

Fast Compression with Cross-Encoders

For production systems where LLM compression is too slow, use a cross-encoder to score individual sentences:

from sentence_transformers import CrossEncoder
import re

# Load a small, fast cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_compress(
    query: str,
    documents: list[str],
    top_sentences: int = 10,
) -> str:
    """Use cross-encoder to select most relevant sentences."""
    all_sentences = []
    for doc in documents:
        all_sentences.extend(re.split(r'(?<=[.!?])\s+', doc))

    # Score all sentences against the query
    pairs = [[query, sent] for sent in all_sentences]
    scores = reranker.predict(pairs)

    # Rank sentence positions by score and keep the top ones;
    # tracking positions rather than text avoids mis-ordering
    # when the same sentence appears in more than one chunk
    top_positions = sorted(
        range(len(all_sentences)),
        key=lambda i: scores[i],
        reverse=True,
    )[:top_sentences]

    # Return in original order for coherence
    return " ".join(
        all_sentences[i] for i in sorted(top_positions)
    )
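The sentence splitter used here is a deliberately simple regex. A quick sanity check of its behavior (standard library only, so you can verify it without loading the cross-encoder; note it will over-split on abbreviations like "e.g."):

```python
import re

def split_sentences(doc: str) -> list[str]:
    # Split after sentence-ending punctuation followed by whitespace,
    # mirroring the regex used in cross_encoder_compress above
    return re.split(r'(?<=[.!?])\s+', doc)

print(split_sentences("RAG retrieves chunks. Compression trims them! Does it help?"))
# → ['RAG retrieves chunks.', 'Compression trims them!', 'Does it help?']
```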

Putting It All Together

A complete compression-augmented RAG pipeline:

def compressed_rag(
    query: str,
    retriever,
    compression: str = "extractive",
) -> str:
    """RAG pipeline with contextual compression."""
    # Retrieve more documents than usual since we will compress
    raw_docs = retriever.search(query, k=10)

    # Compress based on strategy
    if compression == "extractive":
        context_docs = extractive_compress(query, raw_docs)
    elif compression == "abstractive":
        context_docs = abstractive_compress(query, raw_docs)
    elif compression == "cross_encoder":
        context_docs = [cross_encoder_compress(query, raw_docs)]
    else:
        context_docs = raw_docs

    context = "\n\n".join(context_docs)

    # Generate with compressed context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content

Compression Ratios in Practice

In our testing, extractive compression reduces context by 60-75% while retaining answer quality. Abstractive compression achieves 70-85% reduction. Cross-encoder sentence selection achieves 80-90% reduction. The sweet spot depends on your use case — higher compression saves tokens but risks dropping subtle details that matter for nuanced questions.
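To make those percentages concrete, here is a back-of-the-envelope sketch of how reduction translates into tokens reaching the generator. The chunk count and chunk size are illustrative assumptions, not measurements:

```python
def context_tokens(num_chunks: int, tokens_per_chunk: int, reduction: float) -> int:
    """Tokens reaching the LLM after compressing retrieved context."""
    return round(num_chunks * tokens_per_chunk * (1 - reduction))

# Hypothetical retrieval: 10 chunks of 800 tokens each
raw = context_tokens(10, 800, 0.0)          # 8000 tokens uncompressed
extractive = context_tokens(10, 800, 0.70)  # ~70% reduction -> 2400
cross_enc = context_tokens(10, 800, 0.85)   # ~85% reduction -> 1200
print(raw, extractive, cross_enc)
```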

FAQ

Does compression hurt answer quality?

When done well, compression actually improves answer quality because the LLM sees less noise. The risk is over-compression — removing context that seems irrelevant to a simple classifier but contains nuances the LLM needs. Monitor your answer quality metrics when tuning compression aggressiveness.
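One easy number to track while tuning aggressiveness is the realized compression ratio per query. A minimal helper, using whitespace tokens as a rough stand-in for the model tokenizer:

```python
def compression_ratio(original_docs: list[str], compressed_docs: list[str]) -> float:
    """Fraction of (approximate) tokens removed by compression.

    Whitespace splitting is a crude proxy; a production system
    would count with the actual model tokenizer.
    """
    before = sum(len(d.split()) for d in original_docs)
    after = sum(len(d.split()) for d in compressed_docs)
    return 1 - after / max(before, 1)

print(compression_ratio(["one two three four"], ["one two"]))  # → 0.5
```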

Which compression method should I use in production?

Cross-encoder compression is the best starting point for production. It runs in milliseconds (no LLM call required), provides good compression ratios, and scales well. Graduate to LLM-based compression only if cross-encoder results are insufficient for your quality requirements.

Can I combine compression with reranking?

Yes, and this is a powerful pattern. First rerank your retrieved documents to get the best ordering, then apply compression to the top-ranked results. This ensures you compress the most relevant documents rather than wasting compression effort on documents that would have been discarded anyway.
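A sketch of that rerank-then-compress ordering, with pluggable `score_fn` and `compress_fn` standing in for any real reranker and any of the three compressors above (the word-overlap scorer and identity compressor in the demo are toy placeholders):

```python
from typing import Callable

def rerank_then_compress(
    query: str,
    docs: list[str],
    score_fn: Callable[[str, str], float],           # e.g. a cross-encoder score
    compress_fn: Callable[[str, list[str]], list[str]],
    top_k: int = 3,
) -> list[str]:
    """Rerank retrieved docs first, then compress only the top-k winners."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    return compress_fn(query, ranked[:top_k])

# Toy demo with stand-in functions
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
identity = lambda q, ds: ds
best = rerank_then_compress(
    "rag compression",
    ["cats and dogs", "rag compression basics"],
    overlap, identity, top_k=1,
)  # keeps only the best-scoring doc
```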


#ContextualCompression #RAG #TokenOptimization #LLMContext #Retrieval #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
