
Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision

Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.

The Precision Problem in First-Stage Retrieval

Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.
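Because the embeddings are normalized, cosine similarity reduces to a plain dot product, which is why pre-computed document vectors make first-stage retrieval a single matrix multiply. A minimal sketch with toy 3-dimensional vectors (not real embeddings):

```python
import numpy as np

# Toy "document embeddings", pre-computed and L2-normalized ahead of time.
docs = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.9, 0.0],
                 [0.0, 0.1, 0.9]])
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# At query time, only the query needs to be embedded.
query = np.array([1.0, 0.2, 0.0])
query = query / np.linalg.norm(query)

# For unit vectors, cosine similarity is just a dot product.
scores = docs @ query
ranking = np.argsort(scores)[::-1]
print(ranking)  # doc 0 is closest to the query
```

Note that each document's score is computed with no knowledge of the others, and with no token-level interaction with the query: that independence is exactly what the cross-encoder gives up in exchange for precision.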

Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.

Bi-Encoder vs Cross-Encoder

The key architectural difference:

  • Bi-encoder: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
  • Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.

The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.

Building the Re-Ranking Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple

class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores all pairs jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
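To see the two-stage flow without downloading any models, here is a toy version in which both encoders are replaced by made-up stub scorers (`stub_bi_score` and `stub_ce_score` are hypothetical stand-ins, not real models): the first stage narrows the corpus cheaply, and the second stage re-orders only the survivors using a signal the first stage cannot see.

```python
def stub_bi_score(query: str, doc: str) -> float:
    # Stand-in for bi-encoder similarity: crude token overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def stub_ce_score(query: str, doc: str) -> float:
    # Stand-in for the cross-encoder: rewards exact phrase containment,
    # an "interaction" the bag-of-tokens overlap above cannot capture.
    return 2.0 if query.lower() in doc.lower() else stub_bi_score(query, doc)

def two_stage_search(query, corpus, retrieve_k=3, final_k=2):
    # Stage 1: score everything with the cheap function, keep top retrieve_k.
    stage1 = sorted(corpus, key=lambda d: stub_bi_score(query, d), reverse=True)
    candidates = stage1[:retrieve_k]
    # Stage 2: re-rank only the candidates with the expensive function.
    return sorted(candidates, key=lambda d: stub_ce_score(query, d),
                  reverse=True)[:final_k]

corpus = [
    "error codes and handling strategies",
    "tutorial on error handling in python",
    "handling requests",
    "gardening tips",
]
result = two_stage_search("error handling", corpus)
print(result[0])  # re-ranking promotes the exact phrase match to the top
```

Both of the first two documents tie under token overlap, but the cross-encoder stand-in breaks the tie in favor of the exact phrase match, which is the kind of reordering a real cross-encoder performs.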

Choosing the Right Cross-Encoder Model

Model selection depends on your latency budget:

# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}

def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (params, ms_per_pair, quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # fallback

Managing Latency

Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:

  1. Reduce candidate count — retrieve 30-50 instead of 100. Diminishing returns beyond the top 50.
  2. Use smaller models — TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
  3. Batch on GPU — GPU batching drops per-pair time by 10x.
  4. Cache re-ranked results — popular queries hit the same documents repeatedly.
import hashlib

class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        self._cache = {}
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []
        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                if len(self._cache) >= self.cache_size:
                    # Evict the oldest-inserted entry so the cache
                    # actually respects cache_size.
                    self._cache.pop(next(iter(self._cache)))
                key = self._cache_key(*pairs[idx])
                self._cache[key] = float(score)
                scores[idx] = float(score)
        return scores
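A model-free sketch of the same idea, using a hypothetical counting stub in place of the cross-encoder so the cache hit can be observed directly:

```python
import hashlib

class CountingEncoder:
    """Stub cross-encoder that records how many pairs it actually scores."""
    def __init__(self):
        self.pairs_scored = 0

    def predict(self, pairs):
        self.pairs_scored += len(pairs)
        # Fake relevance score: just the document length.
        return [float(len(doc)) for _, doc in pairs]

def cache_key(query, doc):
    return hashlib.md5(f"{query}|||{doc}".encode()).hexdigest()

def cached_predict(encoder, cache, pairs):
    # Score only pairs missing from the cache, then answer everything from it.
    to_score = [p for p in pairs if cache_key(*p) not in cache]
    if to_score:
        for p, s in zip(to_score, encoder.predict(to_score)):
            cache[cache_key(*p)] = s
    return [cache[cache_key(*p)] for p in pairs]

encoder = CountingEncoder()
cache = {}
pairs = [("q", "short doc"), ("q", "a longer doc")]
first = cached_predict(encoder, cache, pairs)
second = cached_predict(encoder, cache, pairs)  # pure cache hits
print(encoder.pairs_scored)  # 2 -- the second call never touched the model
```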

Measuring the Impact

Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
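nDCG@k is straightforward to compute yourself once you have graded relevance labels for each ranked list. A minimal implementation, applied to toy labels (assumed for illustration, not measured results):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance, log-discounted by rank position.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy relevance labels in ranked order (1 = relevant, 0 = not).
bi_only = [0, 1, 0, 1, 1]   # bi-encoder ranking
reranked = [1, 1, 1, 0, 0]  # after cross-encoder re-ranking
print(ndcg_at_k(bi_only, 5), ndcg_at_k(reranked, 5))
```

Because the discount is logarithmic in rank, moving a relevant document from position 4 to position 1 matters far more than reshuffling the tail, which is why re-ranking only the top candidates captures most of the metric gain.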

FAQ

When should I skip re-ranking and use only a bi-encoder?

Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.

Can I fine-tune a cross-encoder on my own data?

Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API with CrossEncoder.fit().

How many candidates should the first stage retrieve for re-ranking?

Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.
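One way to find that sweet spot: take a labeled query set, record where each relevant document lands in the bi-encoder ranking, and measure recall at several cutoffs. The ranks below are hypothetical, assumed only to show the shape of the curve:

```python
def recall_at_k(relevant_ranks, k):
    # Fraction of relevant documents that appear in the top-k candidates.
    hits = sum(1 for r in relevant_ranks if r <= k)
    return hits / len(relevant_ranks)

# Hypothetical 1-based ranks of relevant docs in first-stage results.
relevant_ranks = [1, 3, 4, 8, 12, 27, 41, 88]

for k in (10, 30, 50, 100):
    print(k, recall_at_k(relevant_ranks, k))
```

If recall flattens between 50 and 100, retrieving more candidates only adds re-ranking latency without surfacing new relevant documents.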


#CrossEncoder #ReRanking #SemanticSearch #InformationRetrieval #NLP #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Technical Guides

Post-Call Analytics with GPT-4o-mini: Sentiment, Lead Scoring, and Intent

Build a post-call analytics pipeline with GPT-4o-mini — sentiment, intent, lead scoring, satisfaction, and escalation detection.

Learn Agentic AI

Semantic Search for AI Agents: Embedding Models, Chunking Strategies, and Retrieval Optimization

Comprehensive guide to semantic search for AI agents covering embedding model selection, document chunking strategies, and retrieval optimization techniques for production systems.

Learn Agentic AI

Advanced RAG for AI Agents 2026: Hybrid Search, Re-Ranking, and Agentic Retrieval

Master advanced RAG patterns for AI agents including hybrid vector-keyword search, cross-encoder re-ranking, and agentic retrieval where agents autonomously decide retrieval strategy.

Learn Agentic AI

Building a Knowledge Graph Construction Agent: Extracting Entities and Relations from Documents

Build an AI agent that reads documents, extracts named entities and their relationships, constructs a knowledge graph stored in Neo4j, and provides a natural language query interface over the graph.

Learn Agentic AI

AI-Powered Document Comparison: Redline Generation and Change Tracking with Vision

Build an AI agent that compares two versions of a document, identifies additions, deletions, and modifications, generates visual redlines, and produces annotated change summaries for legal, contract, and policy review workflows.

Learn Agentic AI

Embeddings and Vector Representations: How LLMs Understand Meaning

Learn what embeddings are, how they capture semantic meaning as vectors, how to use embedding models for search and clustering, and the role cosine similarity plays in AI applications.