Hybrid Retrieval for AI Voice: BM25 + Dense Embeddings in 2026
BM25 alone hits 65% recall@10. Dense alone hits 78%. The hybrid pipeline pushes 91% — and once you bolt on a reranker the gap widens. Here is how CallSphere wires hybrid retrieval into a voice loop with a 200ms budget.
TL;DR — On 2026 benchmarks, BM25 and dense vectors solve different problems. Hybrid (RRF or weighted) retrieval lifts recall@10 from ~65–78% single-mode to ~91% combined, and hybrid + Cohere Rerank pushes recall@5 from 0.587 (dense-only) to 0.816. For a voice agent that must answer in under 600ms, hybrid is not optional — it is the floor.
The technique
Hybrid retrieval runs two indexes in parallel: a sparse lexical index (BM25 or BM25F over a tokenized inverted file) and a dense vector index (HNSW over float embeddings). Each side returns its top-K, then a fusion step — usually Reciprocal Rank Fusion or a weighted sum after min-max normalization — merges the lists into a single ordering. The result captures exact-term hits (drug codes, SKU numbers, error strings) that dense models blur, and semantic hits (paraphrases, synonyms) that BM25 misses.
```mermaid
flowchart LR
    Q[Caller utterance] --> R[Query rewriter]
    R --> B[BM25 index]
    R --> D[Dense HNSW index]
    B --> F[RRF fusion k=60]
    D --> F
    F --> RR[Reranker top-50 to top-5]
    RR --> A[LLM agent]
    A --> V[Voice response]
```
How it works
The BM25 score is the classic Okapi formula built on Robertson–Spärck Jones weighting, with the k1=1.2, b=0.75 defaults. The dense side runs a query embedding (e5-large, BGE-M3, or text-embedding-3-large) through HNSW with M=16, ef_search=64. RRF merges with score = sum(1 / (k + rank_i)), where k=60 is the default constant from the original Cormack et al. 2009 paper. The fused list is then truncated and passed to a cross-encoder reranker for the final ordering.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
For voice, the latency budget is brutal: TTS startup needs the first retrieved chunk in under 250ms. That means BM25 on Postgres tsvector or OpenSearch (single-digit ms), HNSW with ef_search capped at 64 (15–30ms), and the reranker only running on the top-50 candidates. Anything more eats into the response window.
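The fan-out under a hard deadline can be sketched like this — a minimal illustration using concurrent.futures, where sparse_search and dense_search are hypothetical stand-ins (the sleeps mimic the single-digit-ms BM25 and 15–30ms HNSW latencies mentioned above):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def sparse_search(q):
    # Stand-in for the BM25 query (~5ms in practice)
    time.sleep(0.005)
    return [("doc1", 2.1), ("doc3", 1.4)]

def dense_search(q):
    # Stand-in for the HNSW query (~25ms at ef_search=64)
    time.sleep(0.025)
    return [("doc2", 0.91), ("doc1", 0.88)]

def retrieve_with_budget(q, budget_s=0.060):
    """Run both indexes in parallel; if the dense side misses the
    budget, degrade gracefully to sparse-only and keep the voice loop on time."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_sparse = pool.submit(sparse_search, q)
        f_dense = pool.submit(dense_search, q)
        sparse = f_sparse.result(timeout=budget_s)
        try:
            dense = f_dense.result(timeout=budget_s)
        except TimeoutError:
            dense = []  # missed the window: fuse what we have
    return sparse, dense
```

The key design point is that the deadline is enforced at the fusion boundary, not inside each index, so a slow dense lookup never stalls TTS startup.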
CallSphere implementation
CallSphere runs 37 specialist agents across 6 verticals, 90+ tools over 115+ Postgres tables. The UrackIT IT helpdesk uses ChromaDB-backed RAG over runbooks, KB articles, and ticket history; OneRoof real estate runs hybrid search over MLS listings and listing photos with a vision encoder; Healthcare retrieves over patient records, insurance plans, and provider directories. Every vertical uses a hybrid pipeline because exact-term matching of CPT codes, MLS IDs, and ticket IDs is non-negotiable.
Pricing is $149 / $499 / $1499 with a 14-day no-card trial and 22% affiliate. Try it on the trial page, see vertical fits on /industries/it-services and /industries/real-estate, or compare tiers on /pricing.
Build steps with code
- Postgres BM25 via the pg_search extension, or a tsvector column with a GIN index.
- pgvector HNSW on the same row for the dense side — single-row joins, no cross-database fan-out.
- Query rewrite at the edge with a small Llama 3.1 8B to expand pronouns ("the patient" -> "patient ID 4421").
- Fuse with RRF in the application layer.
```python
def hybrid_search(q: str, k: int = 50):
    # Sparse side: lexical match over the tsvector column (ts_rank_cd is
    # Postgres's cover-density ranking, standing in where pg_search/BM25
    # is unavailable).
    sparse = pg.execute(
        "SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS s FROM kb "
        "WHERE tsv @@ plainto_tsquery(%s) ORDER BY s DESC LIMIT %s", (q, q, k))
    # Dense side: cosine distance via pgvector; cast the parameter to vector.
    emb = embed(q)
    dense = pg.execute(
        "SELECT id, 1 - (embedding <=> %s::vector) AS s FROM kb "
        "ORDER BY embedding <=> %s::vector LIMIT %s", (emb, emb, k))
    return rrf_fuse(sparse, dense, k_const=60)[:10]
```
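The rrf_fuse helper called above is not shown; a minimal sketch, assuming each input is a rank-ordered list of (id, score) rows, could look like this:

```python
def rrf_fuse(sparse, dense, k_const=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Raw BM25 and cosine scores are ignored by design -- only ranks matter,
    which is what makes RRF robust to score-distribution drift."""
    scores = {}
    for results in (sparse, dense):
        for rank, (doc_id, _score) in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_const + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked 1st by BM25 and 3rd dense gets 1/61 + 1/63; a document found by only one side still scores, so neither index can veto the other.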
- Rerank with Cohere Rerank 3.5 or a local BGE-reranker-v2-m3 on the fused top-50.
- Cache repeated queries in Redis with a 60-second TTL — voice queries cluster.
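The caching step can be sketched as follows — here with an in-process dict standing in for Redis so the snippet is self-contained; with redis-py the write would be r.setex(key, 60, serialized_value):

```python
import hashlib
import time

_cache = {}  # stand-in for Redis: {key: (expires_at, value)}

def cached_search(q, search_fn, ttl_s=60):
    """Memoize retrieval results for ttl_s seconds, keyed on the
    normalized query -- voice queries cluster, so hit rates are high."""
    key = hashlib.sha256(q.strip().lower().encode()).hexdigest()
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]  # fresh cache hit, skip both indexes entirely
    value = search_fn(q)
    _cache[key] = (now + ttl_s, value)  # redis-py equivalent: r.setex(key, ttl_s, ...)
    return value
```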
Pitfalls
- Tokenizer mismatch: BM25 stems "appointments" -> "appoint" while the embedder treats it as a unit. Run both through the same lower-case + punctuation strip pipeline.
- RRF k constant: too low (k=10) over-rewards rank-1 from each side; too high (k=200) flattens the fusion. Stick near 60.
- Dense-only on rare entities: SKUs, MLS IDs, drug NDC codes need exact match. If you skip BM25, expect 30–40% miss rates on these.
- Latency creep: every reranker hop adds 80–150ms. Budget it before you ship.
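For the tokenizer-mismatch pitfall, a shared normalization pass might look like this — a sketch only; a production pipeline would also fold Unicode and take care not to mangle hyphenated identifiers like NDC codes:

```python
import re
import string

_PUNCT = re.compile("[%s]" % re.escape(string.punctuation))
_WS = re.compile(r"\s+")

def normalize(text: str) -> str:
    """Lower-case and strip punctuation so BM25 and the embedder
    see the same surface form before their own tokenization."""
    text = _PUNCT.sub(" ", text.lower())
    return _WS.sub(" ", text).strip()
```

Run every document at index time and every query at search time through the same function; divergence between the two paths is where silent recall loss creeps in.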
FAQ
Do I need a managed vector DB? No — pgvector with HNSW handles 10M+ vectors comfortably for a single-tenant voice agent.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
RRF or weighted sum? RRF is more robust to score-distribution drift; weighted sum is faster if your scores are well-calibrated.
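For comparison, a weighted-sum sketch with min-max normalization (the 0.3/0.7 weights are illustrative, not a recommendation):

```python
def weighted_fuse(sparse, dense, w_sparse=0.3, w_dense=0.7):
    """Min-max normalize each list's scores to [0, 1], then combine
    with fixed weights. Inputs are lists of (id, score) rows."""
    def norm(results):
        if not results:
            return {}
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        return {doc_id: (s - lo) / span for doc_id, s in results}
    ns, nd = norm(sparse), norm(dense)
    fused = {d: w_sparse * ns.get(d, 0.0) + w_dense * nd.get(d, 0.0)
             for d in set(ns) | set(nd)}
    return sorted(fused, key=fused.get, reverse=True)
```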
How does this play with long context? Hybrid feeds the long-context LLM the right 5–10 chunks. They are complementary, not substitutes.
What reranker? Cohere Rerank 3.5 if you can pay; BGE-reranker-v2-m3 if you self-host. Both clear ColBERT v2 on most BEIR tasks.
Does this help on the demo? Yes — the live demo runs hybrid by default for any vertical you pick.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.