Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision
Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.
The Precision Problem in First-Stage Retrieval
Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.
Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.
Bi-Encoder vs Cross-Encoder
The key architectural difference:
- Bi-encoder: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
- Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.
The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
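The contrast between the two scoring modes can be sketched with toy functions — `embed`, `bi_score`, and `cross_score` here are illustrative stand-ins, not real models:

```python
import math

def embed(text: str) -> list:
    # Stand-in for a bi-encoder: a fixed-length bag-of-characters vector,
    # computed for each text INDEPENDENTLY (so it can be pre-computed).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def bi_score(query: str, doc: str) -> float:
    # Bi-encoder pattern: encode each side separately, compare with a
    # dot product. The doc embedding never sees the query.
    q, d = embed(query), embed(doc)
    return sum(a * b for a, b in zip(q, d))

def cross_score(query: str, doc: str) -> float:
    # Cross-encoder pattern: one function receives BOTH texts at once,
    # so it can model term-level interactions (here, crude token overlap
    # stands in for transformer attention).
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
```

The structural point survives even in this toy: `bi_score` can only compare two pre-baked summaries, while `cross_score` gets to inspect the actual pair.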
Building the Re-Ranking Pipeline
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple


class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores each (query, document) pair jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
Choosing the Right Cross-Encoder Model
Model selection depends on your latency budget:
# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}


def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (params, ms_per_pair, quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # fallback
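For example, with a 200 ms budget and 50 candidates, the selector walks down from the highest-quality model until one fits. The arithmetic, using the per-pair latencies from the table above (model names shortened for brevity):

```python
candidates = 50
budget_ms = 200.0

# Per-pair latencies, ordered by descending nDCG@10 as select_model sorts them
latencies = {
    "electra-base": 12.0,  # 12.0 * 50 = 600 ms -> over budget
    "MiniLM-L-12": 8.0,    #  8.0 * 50 = 400 ms -> over budget
    "MiniLM-L-6": 4.0,     #  4.0 * 50 = 200 ms -> fits
    "TinyBERT-L-2": 1.5,   #  1.5 * 50 =  75 ms -> fits, but never reached
}

chosen = next(
    name for name, ms in latencies.items() if ms * candidates <= budget_ms
)
# chosen == "MiniLM-L-6"
```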
Managing Latency
Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:
- Reduce candidate count — retrieve 30-50 instead of 100. Diminishing returns beyond the top 50.
- Use smaller models — TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
- Batch on GPU — GPU batching drops per-pair time by 10x.
- Cache re-ranked results — popular queries hit the same documents repeatedly.
import hashlib


class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        self._cache = {}
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []

        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                # Evict the oldest entry once the cache is full
                # (dicts preserve insertion order in Python 3.7+)
                if len(self._cache) >= self.cache_size:
                    self._cache.pop(next(iter(self._cache)))
                self._cache[key] = float(score)
                scores[idx] = float(score)
        return scores
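The effect of caching is easy to verify with a stub in place of the real model — `StubScorer` and `MiniCache` here are test scaffolding, a stripped-down version of the wrapper above, not part of the pipeline:

```python
class StubScorer:
    """Stands in for a CrossEncoder; counts how often it is invoked."""
    def __init__(self):
        self.calls = 0

    def predict(self, pairs):
        self.calls += len(pairs)
        return [0.5 for _ in pairs]


class MiniCache:
    """Minimal caching wrapper: only unseen pairs reach the scorer."""
    def __init__(self, scorer):
        self.scorer = scorer
        self._cache = {}

    def predict(self, pairs):
        missing = [p for p in pairs if p not in self._cache]
        if missing:
            for p, s in zip(missing, self.scorer.predict(missing)):
                self._cache[p] = s
        return [self._cache[p] for p in pairs]


scorer = StubScorer()
ranker = MiniCache(scorer)
ranker.predict([("q", "doc1"), ("q", "doc2")])  # 2 model calls
ranker.predict([("q", "doc1"), ("q", "doc3")])  # only doc3 is scored
# scorer.calls == 3
```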
Measuring the Impact
Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
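To quantify the gain on your own data, compare nDCG@10 before and after re-ranking. A minimal implementation using binary relevance labels (a simplification — graded labels work the same way):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2(rank + 1),
    # where ranks are 1-based
    return sum(
        rel / math.log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (descending) ordering
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of each ranked result (1 = relevant) for one query
before = ndcg_at_k([0, 1, 0, 1, 0])  # relevant docs buried at ranks 2 and 4
after = ndcg_at_k([1, 1, 0, 0, 0])   # re-ranker moved them to the top
# after == 1.0, and after > before
```

Average this over a set of labeled queries to get the per-system nDCG@10 you can compare.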
FAQ
When should I skip re-ranking and use only a bi-encoder?
Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.
Can I fine-tune a cross-encoder on my own data?
Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API with CrossEncoder.fit().
How many candidates should the first stage retrieve for re-ranking?
Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.
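Profiling that trade-off amounts to measuring first-stage recall@k on a labeled sample — a small helper, shown here with made-up data:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k of the ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical first-stage ranking and ground-truth labels for one query
ranked = [7, 2, 9, 4, 1, 8, 3, 5, 6, 0]
relevant = [2, 4, 5]

for k in (3, 5, 10):
    print(k, recall_at_k(ranked, relevant, k))

# Once recall@k plateaus, retrieving more candidates only adds
# re-ranking latency without surfacing new relevant documents.
```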
#CrossEncoder #ReRanking #SemanticSearch #InformationRetrieval #NLP #AgenticAI #LearnAI #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.