Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision
By Sagar Shankaran, Founder of CallSphere
Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.
The Precision Problem in First-Stage Retrieval
Bi-encoder models (such as the SentenceTransformer models in the sentence-transformers library) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast, because document embeddings can be pre-computed, but it also limits their accuracy: a bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.
Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.
Bi-Encoder vs Cross-Encoder
The key architectural difference:
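- Bi-encoder: Embeds query and document separately, then compares the embeddings with cosine similarity (a plain dot product when the embeddings are normalized). Fast, because document embeddings are pre-computed, but lower precision.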
- Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.
The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
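To make the trade-off concrete, here is a rough back-of-the-envelope sketch of per-query cost. The 4 ms per pair figure matches the MiniLM-L-6 estimate used later in this article; the corpus size and the 5 ms query-embedding time are assumptions chosen only for illustration, and vector search time is ignored.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    """Transformer forward passes needed to answer one query."""
    forward_passes: int
    est_latency_ms: float


def cross_encoder_only_cost(corpus_size: int, ms_per_pair: float) -> QueryCost:
    # Every (query, document) pair needs its own forward pass.
    return QueryCost(corpus_size, corpus_size * ms_per_pair)


def two_stage_cost(retrieve_k: int, ms_per_pair: float, ms_query_embed: float) -> QueryCost:
    # One pass to embed the query, then one pass per retrieved candidate.
    # Document embeddings are pre-computed, so they cost nothing at query time.
    return QueryCost(1 + retrieve_k, ms_query_embed + retrieve_k * ms_per_pair)


# Hypothetical numbers: 100,000 documents, 4 ms per pair, 5 ms to embed the query.
full = cross_encoder_only_cost(corpus_size=100_000, ms_per_pair=4.0)
staged = two_stage_cost(retrieve_k=50, ms_per_pair=4.0, ms_query_embed=5.0)
print(full.forward_passes, full.est_latency_ms)      # 100000 400000.0
print(staged.forward_passes, staged.est_latency_ms)  # 51 205.0
```

Under these assumptions, scoring the whole corpus with a cross-encoder would take minutes per query, while the two-stage approach stays near 200 ms.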
Building the Re-Ranking Pipeline
from typing import Dict, List, Tuple

import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer


class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        # Embeddings are normalized, so the dot product equals cosine similarity
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(int(idx), float(scores[idx])) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # The cross-encoder scores each (query, document) pair with joint
        # attention over both texts
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score, highest first
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
Choosing the Right Cross-Encoder Model
Model selection depends on your latency budget:
# Model comparison (approximate, on CPU). Treat these figures as illustrative:
# benchmark latency on your own hardware and check the published model cards
# for exact quality numbers.
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}


def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (_params, ms_per_pair, _quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    # Nothing fits the budget: fall back to the fastest model anyway
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"
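The same budget arithmetic can be turned around to ask how many candidates each model can afford. The sketch below reuses the illustrative per-pair latencies from the table above; `max_candidates` is a hypothetical helper, not part of any library.

```python
import math
from typing import Dict

# Illustrative per-pair CPU latencies in milliseconds, copied from the table above.
MS_PER_PAIR: Dict[str, float] = {
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": 1.5,
    "cross-encoder/ms-marco-MiniLM-L-6-v2": 4.0,
    "cross-encoder/ms-marco-MiniLM-L-12-v2": 8.0,
    "cross-encoder/ms-marco-electra-base": 12.0,
}


def max_candidates(latency_budget_ms: float) -> Dict[str, int]:
    """Largest candidate count each model can re-rank within the budget."""
    return {
        name: math.floor(latency_budget_ms / ms_per_pair)
        for name, ms_per_pair in MS_PER_PAIR.items()
    }


limits = max_candidates(200.0)
for name, count in limits.items():
    print(f"{name}: up to {count} candidates in 200 ms")
```

With a 200 ms budget, these estimates allow 133 candidates for TinyBERT, 50 for MiniLM-L-6, 25 for MiniLM-L-12, and 16 for ELECTRA base.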
Managing Latency
Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:
- Reduce candidate count: retrieve 30-50 instead of 100. Returns diminish beyond the top 50.
- Use smaller models: TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
- Batch on GPU: batched GPU inference can cut per-pair time by roughly 10x, depending on hardware and batch size.
- Cache re-ranked results: popular queries hit the same documents repeatedly.
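import hashlib
from collections import OrderedDict
from typing import List, Tuple

from sentence_transformers import CrossEncoder


class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        # Ordered so the least recently used entry can be evicted first
        self._cache = OrderedDict()
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: List[Tuple[str, str]]) -> List[float]:
        scores = []
        uncached_pairs = []
        uncached_indices = []
        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                self._cache.move_to_end(key)  # mark as recently used
                scores.append(self._cache[key])
            else:
                scores.append(None)  # placeholder, filled in below
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            # Score only the pairs that were not found in the cache
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                self._cache[key] = float(score)
                scores[idx] = float(score)
            # Evict least recently used entries so the cache stays bounded
            while len(self._cache) > self.cache_size:
                self._cache.popitem(last=False)
        return scores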
Measuring the Impact
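Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval, though the size of the gain depends on the corpus and query mix, so measure it on your own labeled queries. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.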
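For reference, nDCG@10 can be computed with a few lines of Python. This is a minimal sketch that assumes the linear-gain form of DCG (some evaluations use an exponential gain instead) and assumes each list holds the graded relevance of every judged document for the query, in ranked order. The relevance lists are made-up examples, not measured results.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance judgments (1 = relevant) in ranked order.
bi_encoder_only = [0, 1, 0, 1, 0]  # relevant documents at ranks 2 and 4
after_rerank = [1, 1, 0, 0, 0]     # relevant documents moved to ranks 1 and 2

before = ndcg_at_k(bi_encoder_only)
after = ndcg_at_k(after_rerank)
print(f"nDCG@10 before: {before:.3f}, after: {after:.3f}")
```

Averaging this score over a labeled query set, once for the bi-encoder ranking and once for the re-ranked output, gives a direct before-and-after comparison.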
FAQ
When should I skip re-ranking and use only a bi-encoder?
Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.
Can I fine-tune a cross-encoder on my own data?
Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API, for example CrossEncoder.fit() (newer releases also provide a CrossEncoderTrainer class).
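One possible way to turn click logs into labeled pairs is sketched below. The log schema (`query`, `doc_text`, `clicked`) and the thresholds are assumptions for illustration. Click-through rate is a noisy proxy for relevance, so treat the resulting labels as a starting point rather than ground truth.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def clicks_to_labeled_pairs(
    log_rows: Iterable[Dict],
    min_impressions: int = 3,
    ctr_threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Aggregate impressions per (query, doc) and emit binary relevance labels."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for row in log_rows:
        key = (row["query"], row["doc_text"])
        shown[key] += 1
        clicked[key] += int(row["clicked"])

    pairs = []
    for (query, doc_text), impressions in shown.items():
        if impressions < min_impressions:
            continue  # too little evidence to label either way
        ctr = clicked[(query, doc_text)] / impressions
        pairs.append((query, doc_text, 1.0 if ctr >= ctr_threshold else 0.0))
    return pairs


# Hypothetical log rows: one impression per row.
rows = (
    [{"query": "reset password", "doc_text": "How to reset your password", "clicked": c}
     for c in (True, True, False)]
    + [{"query": "reset password", "doc_text": "Pricing plans", "clicked": False}] * 3
    + [{"query": "reset password", "doc_text": "Release notes", "clicked": True}]
)
labeled = clicks_to_labeled_pairs(rows)
print(labeled)
```

Each resulting `(query, doc_text, label)` tuple can then be converted into whatever training example format your chosen training API expects.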
How many candidates should the first stage retrieve for re-ranking?
Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.
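To find that sweet spot empirically, one option is to measure first-stage recall at several cutoffs on a small set of labeled queries. The sketch below assumes you already have, for each query, the ranked document IDs from the bi-encoder and the set of IDs judged relevant; the sample data is invented for illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[int], relevant_ids: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def recall_sweep(
    runs: List[Tuple[Sequence[int], Set[int]]],
    ks: Sequence[int] = (10, 50, 100),
) -> Dict[int, float]:
    """Mean recall@k over all queries, for each cutoff k."""
    return {
        k: sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in ks
    }


# Hypothetical first-stage output for two queries: (ranked doc IDs, relevant doc IDs).
runs = [
    (list(range(100)), {3, 40, 90}),
    (list(range(100)), {5, 7}),
]
sweep = recall_sweep(runs)
for k, value in sweep.items():
    print(f"recall@{k}: {value:.3f}")
```

If recall barely moves between two cutoffs, the smaller one is usually the better choice, since every extra candidate adds re-ranking latency.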
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.