---
title: "Hybrid Retrieval for AI Voice: BM25 + Dense Embeddings in 2026"
description: "BM25 alone hits 65% recall@10. Dense alone hits 78%. The hybrid pipeline pushes 91% — and once you bolt on a reranker the gap widens. Here is how CallSphere wires hybrid retrieval into a voice loop with a 200ms budget."
canonical: https://callsphere.ai/blog/vw6g-hybrid-retrieval-bm25-dense-voice-ai-2026
category: "AI Engineering"
tags: ["RAG", "Hybrid Search", "BM25", "Voice AI", "Retrieval"]
author: "CallSphere Team"
published: 2026-03-15T00:00:00.000Z
updated: 2026-05-07T16:46:09.785Z
---

# Hybrid Retrieval for AI Voice: BM25 + Dense Embeddings in 2026

> BM25 alone hits 65% recall@10. Dense alone hits 78%. The hybrid pipeline pushes 91% — and once you bolt on a reranker the gap widens. Here is how CallSphere wires hybrid retrieval into a voice loop with a 200ms budget.

> **TL;DR** — On 2026 benchmarks, BM25 and dense vectors solve different problems. Hybrid (RRF or weighted) retrieval lifts recall@10 from ~65–78% single-mode to ~91% combined, and Hybrid + Cohere Rerank pushes Recall@5 from 0.587 (dense-only) to 0.816. For a voice agent that must answer in under 600ms, hybrid is not optional — it is the floor.

## The technique

Hybrid retrieval runs two indexes in parallel: a sparse lexical index (BM25 or BM25F over a tokenized inverted file) and a dense vector index (HNSW over float embeddings). Each side returns its top-K, then a fusion step — usually Reciprocal Rank Fusion or a weighted sum after min-max normalization — merges the lists into a single ordering. The result captures exact-term hits (drug codes, SKU numbers, error strings) that dense models blur, and semantic hits (paraphrases, synonyms) that BM25 misses.
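The weighted-sum variant mentioned above can be sketched as follows. This is a minimal illustration, not CallSphere's production code: the function names `min_max` and `weighted_fuse`, the `alpha` weight, and the sample scores are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps every doc to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fuse(sparse: Dict[str, float], dense: Dict[str, float],
                  alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Return (doc_id, fused_score) pairs, best first.

    alpha weights the dense side; a doc missing from one list contributes 0 there.
    """
    s, d = min_max(sparse), min_max(dense)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
             for doc in set(s) | set(d)}
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores: BM25-style on the sparse side, cosine similarity on the dense side.
sparse_hits = {"sku-4421": 12.0, "faq-7": 6.0, "faq-9": 3.0}
dense_hits = {"faq-7": 0.82, "faq-2": 0.80, "sku-4421": 0.60}
ranked = weighted_fuse(sparse_hits, dense_hits, alpha=0.5)
```

Here `faq-7` ranks first because it scores well on both sides, while the exact-match `sku-4421` still lands second despite having the weakest dense score of the three dense hits.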

```mermaid
flowchart LR
  Q[Caller utterance] --> R[Query rewriter]
  R --> B[BM25 index]
  R --> D[Dense HNSW index]
  B --> F[RRF fusion k=60]
  D --> F
  F --> RR[Reranker top-50 to top-5]
  RR --> A[LLM agent]
  A --> V[Voice response]
```

## How it works

BM25 is the Okapi ranking function built on Robertson-Spärck Jones term weighting, used here with the default k1=1.2, b=0.75. The dense side runs a query embedding (e5-large, BGE-m3, or text-embedding-3-large) through HNSW with M=16, ef_search=64. RRF merges the two lists with `score(d) = sum_i 1 / (k + rank_i(d))`, where `rank_i(d)` is the 1-based position of document d in list i and k=60 is the constant used in the original 2009 paper by Cormack, Clarke, and Büttcher. The fused list is then truncated and passed to a cross-encoder reranker for the final ordering.
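The RRF formula is short enough to write out in full. The sketch below assumes each retriever returns `(doc_id, score)` rows already sorted best first, and it reuses the `rrf_fuse` name and `k_const` parameter from the build-step code; the row shape and the tie-breaking rule are assumptions, not something the pipeline above specifies.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

def rrf_fuse(*ranked_lists: Sequence[Tuple[Hashable, float]],
             k_const: int = 60) -> List[Hashable]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k_const + rank).

    Each input is a sequence of (doc_id, score) rows sorted best first.
    Only the position in each list is used; the raw scores are ignored.
    """
    fused = defaultdict(float)
    for rows in ranked_lists:
        for rank, (doc_id, _score) in enumerate(rows, start=1):
            fused[doc_id] += 1.0 / (k_const + rank)
    # Best fused score first; sorted() is stable, so ties keep first-seen order.
    return [doc for doc, _ in sorted(fused.items(), key=lambda kv: -kv[1])]

# Hypothetical result rows from the two indexes.
sparse = [("a", 3.2), ("b", 2.1), ("c", 0.4)]
dense = [("b", 0.91), ("d", 0.88), ("a", 0.70)]
fused_ids = rrf_fuse(sparse, dense, k_const=60)
```

Because only ranks enter the sum, the BM25 and cosine scores never have to be put on a common scale, which is why RRF tolerates score-distribution drift.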

For voice, the latency budget is brutal: TTS startup needs the first retrieved chunk in under 250ms. That means BM25 on Postgres tsvector or OpenSearch (single-digit ms), HNSW with ef_search capped at 64 (15–30ms), and the reranker only running on the top-50 candidates. Anything more eats into the response window.
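One way to keep a slow index from blowing the budget is to run both retrievers concurrently, each under its own deadline, and fuse whatever came back in time. The sketch below uses only `asyncio`; the 100 ms per-side deadline, the helper names, and the fake retrievers are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, List

async def with_deadline(search: Awaitable[List[str]], timeout_s: float) -> List[str]:
    """Return the search result, or an empty list if it misses the deadline."""
    try:
        return await asyncio.wait_for(search, timeout=timeout_s)
    except asyncio.TimeoutError:
        return []

async def retrieve_both(sparse_search: Awaitable[List[str]],
                        dense_search: Awaitable[List[str]],
                        timeout_s: float = 0.1) -> List[List[str]]:
    """Run both retrievers concurrently, each under its own deadline."""
    return await asyncio.gather(
        with_deadline(sparse_search, timeout_s),
        with_deadline(dense_search, timeout_s),
    )

# Hypothetical stand-ins for the real index calls.
async def fake_sparse() -> List[str]:
    await asyncio.sleep(0.001)
    return ["kb-12", "kb-7"]

async def fake_dense_slow() -> List[str]:
    await asyncio.sleep(1.0)  # misses the 100 ms deadline and is cancelled
    return ["kb-3"]

sparse_ids, dense_ids = asyncio.run(retrieve_both(fake_sparse(), fake_dense_slow()))
```

With this shape, a dense-side timeout degrades the call to lexical-only results instead of stalling the voice response.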

## CallSphere implementation

CallSphere runs **37 specialist agents** across **6 verticals**, with **90+ tools** over **115+ Postgres tables**. The UrackIT IT helpdesk uses ChromaDB-backed RAG over runbooks, KB articles, and ticket history; OneRoof real estate runs hybrid search over MLS listings and listing photos with a vision encoder; Healthcare retrieves over patient records, insurance plans, and provider directories. Every vertical uses a hybrid pipeline because exact-term matching of CPT codes, MLS IDs, and ticket IDs is non-negotiable.

Pricing is **$149 / $499 / $1499** with a **14-day no-card trial** and **22% affiliate**. Try it on the [trial page](/trial), see vertical fits on [/industries/it-services](/industries/it-services) and [/industries/real-estate](/industries/real-estate), or compare tiers on [/pricing](/pricing).

## Build steps with code

1. **Postgres BM25** via the `pg_search` extension, or a tsvector column with a GIN index as a lexical stand-in (`ts_rank_cd` is cover-density ranking, not true BM25).
2. **pgvector HNSW** on the same row for the dense side — single-row joins, no cross-database fan-out.
3. **Query rewrite** at the edge with a small model such as Llama 3.1 8B to resolve vague references ("the patient" -> "patient ID 4421").
4. **Fuse with RRF** in the application layer.

```python
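# `pg` and `embed` are the application's DB handle and embedding helper (not shown).
def hybrid_search(q: str, k: int = 50):
    # Sparse side: Postgres full-text ranking over the tsvector column.
    sparse = pg.execute(
      "SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS s FROM kb "
      "WHERE tsv @@ plainto_tsquery(%s) ORDER BY s DESC LIMIT %s", (q, q, k))
    emb = embed(q)
    # Dense side: pgvector cosine distance (<=>); 1 - distance is cosine similarity.
    dense = pg.execute(
      "SELECT id, 1 - (embedding <=> %s) AS s FROM kb "
      "ORDER BY embedding <=> %s LIMIT %s", (emb, emb, k))
    # RRF uses ranks only, so the two score scales need not be comparable.
    return rrf_fuse(sparse, dense, k_const=60)[:10]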
```

5. **Rerank** with Cohere Rerank 3.5 or a local BGE-reranker-v2-m3 on the fused top-50.
6. **Cache** repeated queries in Redis with a 60-second TTL, since voice queries cluster.
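The caching step can be sketched without a Redis server. The `TTLCache` class below is an in-memory stand-in for a Redis key written with an expiry (for example `SET key value EX 60`); the class, the `cache_key` normalization, and the `cached_search` wrapper are hypothetical names introduced for this example.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

class TTLCache:
    """In-memory stand-in for a Redis key with an expiry."""

    def __init__(self, ttl_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, key: str) -> Optional[List[str]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: List[str]) -> None:
        self._store[key] = (self.clock() + self.ttl_s, value)

def cache_key(query: str) -> str:
    """Normalize case and whitespace so near-identical utterances share a key."""
    return " ".join(query.lower().split())

def cached_search(query: str, cache: TTLCache,
                  search: Callable[[str], List[str]]) -> List[str]:
    """Serve from cache when possible; otherwise search and store the result."""
    key = cache_key(query)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = search(query)
    cache.set(key, result)
    return result
```

Normalizing the key before lookup matters for voice, where transcripts of the same question often differ only in casing or spacing.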

## Pitfalls

- **Tokenizer mismatch**: BM25 stems "appointments" -> "appoint" while the embedder treats it as a unit. Run both through the same lower-case + punctuation strip pipeline.
- **RRF k constant**: too low (k=10) over-rewards rank-1 from each side; too high (k=200) flattens the fusion. Stick near 60.
- **Dense-only on rare entities**: SKUs, MLS IDs, drug NDC codes need exact match. If you skip BM25, expect 30–40% miss rates on these.
- **Latency creep**: every reranker hop adds 80–150ms. Budget it before you ship.
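For the tokenizer-mismatch pitfall, a shared pre-processing function applied before both the BM25 analyzer and the embedder might look like the sketch below. The `normalize` name and the exact rules (NFKC folding, dropping every non-word character) are assumptions; stemming stays on the BM25 side only.

```python
import re
import unicodedata

# Anything that is not a word character or whitespace is treated as punctuation.
_PUNCT = re.compile(r"[^\w\s]")

def normalize(text: str) -> str:
    """Lower-case, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = _PUNCT.sub(" ", text)
    return " ".join(text.split())

query = normalize("Reschedule my APPOINTMENTS, please!")
code_query = normalize("CPT   99213?")
```

Running the same function at index time and query time keeps both sides looking at identical surface forms.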

## FAQ

**Do I need a managed vector DB?** No — pgvector with HNSW handles 10M+ vectors comfortably for a single-tenant voice agent.

**RRF or weighted sum?** RRF is more robust to score-distribution drift; weighted sum is faster if your scores are well-calibrated.

**How does this play with long context?** Hybrid feeds the long-context LLM the right 5–10 chunks. They are complementary, not substitutes.

**What reranker?** Cohere Rerank 3.5 if you can pay; BGE-reranker-v2-m3 if you self-host. Both clear ColBERT v2 on most BEIR tasks.

**Does this help on the demo?** Yes — the live demo runs hybrid by default for any vertical you pick.

## Sources

- [Hybrid Search for RAG: BM25, SPLADE, and Vector Search Combined - Prem AI](https://blog.premai.io/hybrid-search-for-rag-bm25-splade-and-vector-search-combined/)
- [Better RAG Accuracy with Hybrid BM25 + Dense Vector Search - Medium](https://medium.com/@pbronck/better-rag-accuracy-with-hybrid-bm25-dense-vector-search-ea99d48cba93)
- [From BM25 to Corrective RAG benchmark - arXiv 2604.01733](https://arxiv.org/html/2604.01733v1)
- [Hybrid Search Guide April 2026 - Supermemory](https://supermemory.ai/blog/hybrid-search-guide/)

