---
title: "Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision"
description: "Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently."
canonical: https://callsphere.ai/blog/re-ranking-search-results-cross-encoders-retrieval-precision
category: "Learn Agentic AI"
tags: ["Cross-Encoder", "Re-Ranking", "Semantic Search", "Information Retrieval", "NLP"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T05:01:10.009Z
---

# Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision

> Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.

## The Precision Problem in First-Stage Retrieval

Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.

Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.

## Bi-Encoder vs Cross-Encoder

The key architectural difference:

```mermaid
flowchart LR
    subgraph BI["Bi-encoder: independent encoding"]
        Q1(["Query"]) --> ENCQ["Encoder"]
        D1(["Document"]) --> ENCD["Encoder"]
        ENCQ --> SIM["Cosine similarity"]
        ENCD --> SIM
    end
    subgraph CE["Cross-encoder: joint encoding"]
        PAIR(["Query + Document
concatenated"]) --> TRANS["Transformer
full cross-attention"]
        TRANS --> SCORE["Relevance score"]
    end
    style SIM fill:#4f46e5,stroke:#4338ca,color:#fff
    style SCORE fill:#059669,stroke:#047857,color:#fff
```

- **Bi-encoder**: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
- **Cross-encoder**: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.

The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
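The economics of that split are easy to sanity-check with back-of-envelope arithmetic. A minimal sketch — the per-item costs below are illustrative assumptions, not benchmarks:

```python
# Rough two-stage latency estimate. Assumed costs: ~0.05 microseconds per
# document for a vectorized dot product over pre-computed embeddings, and
# ~4 ms per pair for a MiniLM-class cross-encoder on CPU.
def two_stage_latency_ms(
    corpus_size: int,
    rerank_k: int,
    dot_ms_per_doc: float = 0.00005,
    ce_ms_per_pair: float = 4.0,
) -> float:
    first_stage = corpus_size * dot_ms_per_doc   # bi-encoder scan
    second_stage = rerank_k * ce_ms_per_pair     # cross-encoder re-rank
    return first_stage + second_stage

# Re-ranking 50 of 1M docs is tractable; cross-encoding the whole corpus is not:
print(two_stage_latency_ms(1_000_000, 50))  # 250.0 ms
print(1_000_000 * 4.0)                      # 4,000,000 ms for cross-encoder-only
```

Under these assumptions the cross-encoder sees only 50 pairs per query instead of a million, which is the entire point of the cascade.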

## Building the Re-Ranking Pipeline

```python
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple

class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores all pairs jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
```

## Choosing the Right Cross-Encoder Model

Model selection depends on your latency budget:

```python
# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}

def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the highest-quality model that fits within the latency budget."""
    for name, (params, ms_per_pair, quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],  # sort by nDCG@10
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    # Nothing fits the budget: fall back to the fastest model
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"
```

## Caching Cross-Encoder Scores

Cross-encoder inference dominates re-ranking latency, and production traffic repeats many query-document pairs. A simple in-memory cache keyed on the pair avoids re-scoring:

```python
import hashlib

from sentence_transformers import CrossEncoder

class CachedCrossEncoder:
    """Wraps a cross-encoder with an in-memory score cache."""

    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.cross_encoder = CrossEncoder(model_name)
        self._cache: dict = {}

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []
        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                self._cache[key] = float(score)
                scores[idx] = float(score)

        return scores
```

## Measuring the Impact

Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
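To measure the lift on your own data, nDCG@10 is straightforward to compute from graded relevance labels. A minimal sketch:

```python
import math
from typing import List

def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: List[float], k: int = 10) -> float:
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Toy labels: bi-encoder ranking vs. the same docs after re-ranking.
before = [0, 1, 0, 1, 1]  # relevant docs buried at ranks 2, 4, 5
after = [1, 1, 1, 0, 0]   # re-ranker lifts them to the top
print(ndcg_at_k(before))  # ~0.68
print(ndcg_at_k(after))   # 1.0 (matches the ideal ordering)
```

Run this per query over a labeled evaluation set and average; compare the bi-encoder-only ranking against the re-ranked one to quantify the improvement for your corpus.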

## FAQ

### When should I skip re-ranking and use only a bi-encoder?

Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.

### Can I fine-tune a cross-encoder on my own data?

Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the `sentence-transformers` training API with `CrossEncoder.fit()`.
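A minimal sketch of the data-preparation step, assuming a hypothetical click-log format with `query` and `clicked_doc_id` fields; each resulting tuple would then be wrapped in a `sentence_transformers.InputExample` and fed to `CrossEncoder.fit()` via a DataLoader:

```python
import random
from typing import Dict, List, Tuple

def build_training_pairs(
    click_log: List[Dict],
    corpus: Dict[str, str],
    negatives_per_query: int = 2,
    seed: int = 0,
) -> List[Tuple[str, str, float]]:
    """Turn click logs into (query, doc_text, label) pairs with random negatives."""
    rng = random.Random(seed)
    doc_ids = list(corpus)
    pairs = []
    for entry in click_log:
        query, clicked_id = entry["query"], entry["clicked_doc_id"]
        pairs.append((query, corpus[clicked_id], 1.0))  # clicked doc = positive
        candidates = [d for d in doc_ids if d != clicked_id]
        for neg_id in rng.sample(candidates, min(negatives_per_query, len(candidates))):
            pairs.append((query, corpus[neg_id], 0.0))  # sampled negative
    return pairs
```

Random negatives are the simplest baseline; mining hard negatives from your bi-encoder's top results (documents retrieved but not clicked) typically trains a sharper re-ranker.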

### How many candidates should the first stage retrieve for re-ranking?

Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.
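One way to find that sweet spot is to measure first-stage recall@k on a labeled query set and see where it saturates. A sketch with toy data:

```python
from typing import List, Set

def recall_at_k(ranked_doc_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k of the first stage."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Sweep candidate counts; stop growing retrieve_k once recall plateaus.
ranking = [f"d{i}" for i in range(200)]        # toy first-stage ranking
relevant = {"d3", "d17", "d42", "d88"}         # toy relevance labels
for k in (10, 25, 50, 100):
    print(k, recall_at_k(ranking, relevant, k))
# 10 0.25 / 25 0.5 / 50 0.75 / 100 1.0
```

The cross-encoder can only promote documents the first stage surfaces, so first-stage recall@k is a hard ceiling on final quality: pick the smallest k where the recall curve flattens.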

---

#CrossEncoder #ReRanking #SemanticSearch #InformationRetrieval #NLP #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/re-ranking-search-results-cross-encoders-retrieval-precision
