---
title: "Contextual Compression for RAG: Reducing Retrieved Context to What Matters"
description: "Learn how contextual compression techniques strip irrelevant information from retrieved chunks before they reach the LLM, improving both answer quality and token efficiency."
canonical: https://callsphere.ai/blog/contextual-compression-rag-reducing-retrieved-context-what-matters
category: "Learn Agentic AI"
tags: ["Contextual Compression", "RAG", "Token Optimization", "LLM Context", "Retrieval"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T16:06:52.370Z
---

# Contextual Compression for RAG: Reducing Retrieved Context to What Matters

> Learn how contextual compression techniques strip irrelevant information from retrieved chunks before they reach the LLM, improving both answer quality and token efficiency.

## The Retrieval Noise Problem

When you retrieve the top 5 chunks from a vector store, each chunk is typically 500-1000 tokens. That is 2,500-5,000 tokens of context passed to your LLM. But here is the critical insight: usually only 10-20% of those tokens are actually relevant to the specific question being asked.

A chunk might be retrieved because it contains a paragraph about your topic, but the rest of the chunk covers unrelated details. This noise dilutes the signal, increases token costs, and — most importantly — can confuse the LLM into generating responses that blend relevant and irrelevant information.

Contextual compression addresses this by extracting or summarizing only the question-relevant portions of each retrieved document before passing them to the generator.

## Three Approaches to Compression

### 1. Extractive Compression

Extract only the sentences or passages that directly relate to the query. This preserves exact wording from the source, maintaining fidelity. The diagram below shows where the compression step sits in an end-to-end retrieval pipeline.

```mermaid
flowchart LR
    Q(["User query"])
    EMB["Embed query
text-embedding-3"]
    VEC[("Vector DB
pgvector or Pinecone")]
    RET["Top-k retrieval
k = 8"]
    COMP["Contextual compression
keep relevant sentences"]
    PROMPT["Augmented prompt
system plus context"]
    LLM["LLM generation
Claude or GPT"]
    CITE["Inline citations
and page anchors"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> COMP --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style COMP fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

### 2. LLM-Based Abstractive Compression

Use a language model to rewrite each chunk, keeping only query-relevant information. More flexible but introduces the possibility of subtle distortion.

### 3. Cross-Encoder Reranking with Truncation

Score individual sentences within each chunk for relevance, then keep only the top-scoring sentences. A hybrid approach that balances precision and speed.

## Implementing Extractive Compression

One straightforward scoring strategy is embedding similarity: embed the query and every sentence, then keep the sentences whose cosine similarity to the query clears a threshold.

```python
from openai import OpenAI
import numpy as np
import re

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with the OpenAI embeddings API."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([item.embedding for item in response.data])

def extractive_compress(
    query: str,
    documents: list[str],
    threshold: float = 0.35,  # similarity cutoff; tune on your corpus
) -> list[str]:
    """Extract only query-relevant sentences from each document."""
    compressed = []
    query_vec = embed([query])[0]

    for doc in documents:
        # Split document into sentences, dropping empty fragments
        sentences = [
            s for s in re.split(r'(?<=[.!?])\s+', doc) if s.strip()
        ]
        sentence_vecs = embed(sentences)
        # Cosine similarity between the query and each sentence
        sims = sentence_vecs @ query_vec / (
            np.linalg.norm(sentence_vecs, axis=1) * np.linalg.norm(query_vec)
        )
        # Keep matching sentences in their original order
        relevant = [s for s, sim in zip(sentences, sims) if sim >= threshold]
        if relevant:
            compressed.append(" ".join(relevant))

    return compressed
```

## Implementing Abstractive Compression

For abstractive compression, a small, cheap model rewrites each chunk to keep only what the query needs:

```python
def abstractive_compress(
    query: str,
    documents: list[str],
    max_tokens_per_doc: int = 150,  # cap each per-document summary
) -> list[str]:
    """Compress each document to only query-relevant content."""
    compressed = []

    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": f"""Extract and summarize ONLY the
                information from this document that is relevant
                to answering the user's query. Omit everything
                else. Keep the summary under
                {max_tokens_per_doc} tokens. If nothing in the
                document is relevant, respond with 'NOT_RELEVANT'.
                """
            }, {
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {doc}"
            }],
            max_tokens=max_tokens_per_doc,
        )
        result = response.choices[0].message.content.strip()
        if result != "NOT_RELEVANT":
            compressed.append(result)

    return compressed
```

## Fast Compression with Cross-Encoders

For production systems where LLM compression is too slow, use a cross-encoder to score individual sentences:

```python
from sentence_transformers import CrossEncoder
import re

# Load a small, fast cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_compress(
    query: str,
    documents: list[str],
    top_sentences: int = 10,
) -> str:
    """Use cross-encoder to select most relevant sentences."""
    all_sentences = []
    for doc in documents:
        sentences = re.split(r'(?<=[.!?])\s+', doc)
        all_sentences.extend(s for s in sentences if s.strip())

    # Score every (query, sentence) pair with the cross-encoder
    scores = reranker.predict([(query, s) for s in all_sentences])

    # Keep the top-scoring sentences, restored to document order
    ranked = sorted(range(len(all_sentences)),
                    key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:top_sentences])
    return " ".join(all_sentences[i] for i in keep)
```

## Putting It Together

The full pipeline dispatches on a strategy flag, so you can swap compression methods without touching the rest of the stack:

```python
def compressed_rag(
    query: str,
    compression: str = "cross_encoder",  # extractive | abstractive | cross_encoder
) -> str:
    """RAG pipeline with contextual compression."""
    # Retrieve more documents than usual since we will compress
    # (`retriever` is your vector store client, defined elsewhere)
    raw_docs = retriever.search(query, k=10)

    # Compress based on strategy
    if compression == "extractive":
        context_docs = extractive_compress(query, raw_docs)
    elif compression == "abstractive":
        context_docs = abstractive_compress(query, raw_docs)
    elif compression == "cross_encoder":
        context_docs = [cross_encoder_compress(query, raw_docs)]
    else:
        context_docs = raw_docs

    context = "\n\n".join(context_docs)

    # Generate with compressed context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content
```
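Calling it is then a single line (the question text is illustrative):

```python
answer = compressed_rag(
    "What is the refund window for annual plans?",
    compression="cross_encoder",
)
print(answer)
```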

## Compression Ratios in Practice

In our testing, extractive compression reduces context by 60-75% while retaining answer quality. Abstractive compression achieves 70-85% reduction. Cross-encoder sentence selection achieves 80-90% reduction. The sweet spot depends on your use case — higher compression saves tokens but risks dropping subtle details that matter for nuanced questions.
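To check the ratio on your own corpus, compare token counts before and after compression. A minimal sketch with tiktoken, assuming the `raw_docs` and `query` variables from the pipeline above:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def compression_ratio(raw_docs: list[str], compressed_docs: list[str]) -> float:
    """Fraction of context tokens removed by compression."""
    raw = sum(len(enc.encode(d)) for d in raw_docs)
    kept = sum(len(enc.encode(d)) for d in compressed_docs)
    return 1 - kept / raw

# e.g. 0.72 means the compressed context is 72% smaller
print(compression_ratio(raw_docs, extractive_compress(query, raw_docs)))
```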

## FAQ

### Does compression hurt answer quality?

When done well, compression actually improves answer quality because the LLM sees less noise. The risk is over-compression — removing context that seems irrelevant to a simple classifier but contains nuances the LLM needs. Monitor your answer quality metrics when tuning compression aggressiveness.
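A minimal tuning sweep might look like the sketch below; `eval_set`, `generate`, and `answer_quality` are hypothetical stand-ins for your own labeled queries, generation wrapper, and quality metric:

```python
# Hypothetical: eval_set is a list of (query, reference_answer) pairs
for threshold in (0.25, 0.35, 0.45):
    scores = []
    for query, reference in eval_set:
        docs = extractive_compress(
            query, retriever.search(query, k=10), threshold
        )
        answer = generate(query, docs)                    # hypothetical wrapper
        scores.append(answer_quality(answer, reference))  # hypothetical metric
    print(f"threshold={threshold}: {sum(scores) / len(scores):.3f}")
```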

### Which compression method should I use in production?

Cross-encoder compression is the best starting point for production. It runs in milliseconds (no LLM call required), provides good compression ratios, and scales well. Graduate to LLM-based compression only if cross-encoder results are insufficient for your quality requirements.

### Can I combine compression with reranking?

Yes, and this is a powerful pattern. First rerank your retrieved documents to get the best ordering, then apply compression to the top-ranked results. This ensures you compress the most relevant documents rather than wasting compression effort on documents that would have been discarded anyway.
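A sketch of that pattern, reusing the `reranker` and `cross_encoder_compress` defined above (the `keep` cutoff is illustrative):

```python
def rerank_then_compress(query: str, raw_docs: list[str], keep: int = 5) -> str:
    # Rerank whole documents first...
    doc_scores = reranker.predict([(query, d) for d in raw_docs])
    top_docs = [d for _, d in sorted(
        zip(doc_scores, raw_docs), key=lambda t: t[0], reverse=True
    )[:keep]]
    # ...then compress only the survivors
    return cross_encoder_compress(query, top_docs)
```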

---

#ContextualCompression #RAG #TokenOptimization #LLMContext #Retrieval #AgenticAI #LearnAI #AIEngineering

