AI Engineering

Hybrid Retrieval for AI Voice: BM25 + Dense Embeddings in 2026

BM25 alone hits 65% recall@10. Dense alone hits 78%. A hybrid pipeline pushes that to 91% — and once you bolt on a reranker the gap widens. Here is how CallSphere wires hybrid retrieval into a voice loop with a 200ms budget.

TL;DR — On 2026 benchmarks, BM25 and dense vectors solve different problems. Hybrid (RRF or weighted) retrieval lifts recall@10 from ~65–78% single-mode to ~91% combined, and hybrid + Cohere Rerank pushes recall@5 from 0.587 (dense-only) to 0.816. For a voice agent that must answer in under 600ms, hybrid is not optional — it is the floor.

The technique

Hybrid retrieval runs two indexes in parallel: a sparse lexical index (BM25 or BM25F over a tokenized inverted file) and a dense vector index (HNSW over float embeddings). Each side returns its top-K, then a fusion step — usually Reciprocal Rank Fusion or a weighted sum after min-max normalization — merges the lists into a single ordering. The result captures exact-term hits (drug codes, SKU numbers, error strings) that dense models blur, and semantic hits (paraphrases, synonyms) that BM25 misses.

flowchart LR
  Q[Caller utterance] --> R[Query rewriter]
  R --> B[BM25 index]
  R --> D[Dense HNSW index]
  B --> F[RRF fusion k=60]
  D --> F
  F --> RR[Reranker top-50 to top-5]
  RR --> A[LLM agent]
  A --> V[Voice response]

How it works

The BM25 score is the classic Okapi formula in the Robertson–Spärck Jones tradition, with the k1=1.2, b=0.75 defaults. The dense side runs a query embedding (e5-large, BGE-m3, or text-embedding-3-large) through HNSW with M=16, ef_search=64. RRF merges with score = sum(1 / (k + rank_i)), where k=60 is the default constant from the original Cormack et al. 2009 paper. The fused list is then truncated and passed to a cross-encoder reranker for the final ordering.
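The RRF formula is small enough to sanity-check in a few lines — a minimal sketch where ranks are 1-based and k=60:

```python
def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every retriever list
    the document appears in. ranks holds that document's 1-based ranks."""
    return sum(1.0 / (k + r) for r in ranks)

# A doc ranked 1st by BM25 and 3rd by dense outscores one that is
# 2nd in a single list — appearing in both lists is rewarded:
both = rrf_score([1, 3])    # 1/61 + 1/63
single = rrf_score([2])     # 1/62
assert both > single
```

Note how insensitive the score is to raw retriever scores: only ranks enter, which is why RRF needs no normalization step.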

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

For voice, the latency budget is brutal: TTS startup needs the first retrieved chunk in under 250ms. That means BM25 on Postgres tsvector or OpenSearch (single-digit ms), HNSW with ef_search capped at 64 (15–30ms), and the reranker only running on the top-50 candidates. Anything more eats into the response window.
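One way to hold that budget is to run both retrievers concurrently and hard-cap every stage with a deadline. A sketch with asyncio — the stage callables and the per-stage budgets are assumptions, not CallSphere's exact values:

```python
import asyncio

# Per-stage budgets in seconds (illustrative, derived from the targets above).
BM25_BUDGET, DENSE_BUDGET, RERANK_BUDGET = 0.010, 0.035, 0.150

async def retrieve_within_budget(query, bm25_search, dense_search, rerank, fuse):
    """Run sparse and dense retrieval in parallel, each under its own deadline.
    A stage that misses its budget contributes an empty list instead of blocking."""
    async def capped(coro, budget):
        try:
            return await asyncio.wait_for(coro, timeout=budget)
        except asyncio.TimeoutError:
            return []  # degrade gracefully rather than blow the voice window

    sparse, dense = await asyncio.gather(
        capped(bm25_search(query), BM25_BUDGET),
        capped(dense_search(query), DENSE_BUDGET),
    )
    fused = fuse(sparse, dense)[:50]  # reranker only ever sees the top-50
    return await capped(rerank(query, fused), RERANK_BUDGET) or fused[:5]
```

If the reranker times out, the fused top-5 ships as-is — a slightly worse ordering beats a silent caller.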

CallSphere implementation

CallSphere runs 37 specialist agents across 6 verticals, 90+ tools over 115+ Postgres tables. The UrackIT IT helpdesk uses ChromaDB-backed RAG over runbooks, KB articles, and ticket history; OneRoof real estate runs hybrid search over MLS listings and listing photos with a vision encoder; Healthcare retrieves over patient records, insurance plans, and provider directories. Every vertical uses a hybrid pipeline because exact-term matching of CPT codes, MLS IDs, and ticket IDs is non-negotiable.

Pricing is $149 / $499 / $1,499 with a 14-day no-card trial and a 22% affiliate program. Try it on the trial page, see vertical fits on /industries/it-services and /industries/real-estate, or compare tiers on /pricing.

Build steps with code

  1. Postgres BM25 via the pg_search extension or a tsvector column with GIN index.
  2. pgvector HNSW on the same row for the dense side — single-row joins, no cross-database fan-out.
  3. Query rewrite at the edge with a small Llama 3.1 8B to expand pronouns ("the patient" -> "patient ID 4421").
  4. Fuse with RRF in the application layer.
def hybrid_search(q: str, k: int = 50):
    # Sparse side: BM25-style ranking over the GIN-indexed tsvector column.
    sparse = pg.execute(
        "SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS s FROM kb "
        "WHERE tsv @@ plainto_tsquery(%s) ORDER BY s DESC LIMIT %s", (q, q, k))
    # Dense side: cosine distance via pgvector's <=> operator on the HNSW index.
    emb = embed(q)
    dense = pg.execute(
        "SELECT id, 1 - (embedding <=> %s::vector) AS s FROM kb "
        "ORDER BY embedding <=> %s::vector LIMIT %s", (emb, emb, k))
    return rrf_fuse(sparse, dense, k_const=60)[:10]
  5. Rerank with Cohere Rerank 3.5 or a local BGE-reranker-v2-m3 on the fused top-50.
  6. Cache repeated queries in Redis with a 60-second TTL — voice queries cluster.
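The rrf_fuse helper referenced in step 4 is small enough to show in full — a sketch assuming each input is a list of (id, score) rows ordered best-first, exactly as the two queries above return them:

```python
def rrf_fuse(sparse, dense, k_const=60):
    """Merge two ranked lists of (id, score) rows with Reciprocal Rank Fusion.
    Fused score = sum over lists of 1 / (k_const + rank), rank starting at 1."""
    scores = {}
    for ranked in (sparse, dense):
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_const + rank)
    # Highest fused score first; the raw per-retriever scores are discarded.
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates two reciprocal terms, which is how cross-retriever agreement gets rewarded without any score calibration.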

Pitfalls

  • Tokenizer mismatch: BM25 stems "appointments" -> "appoint" while the embedder treats it as a unit. Run both through the same lower-case + punctuation strip pipeline.
  • RRF k constant: too low (k=10) over-rewards rank-1 from each side; too high (k=200) flattens the fusion. Stick near 60.
  • Dense-only on rare entities: SKUs, MLS IDs, drug NDC codes need exact match. If you skip BM25, expect 30–40% miss rates on these.
  • Latency creep: every reranker hop adds 80–150ms. Budget it before you ship.
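The tokenizer-mismatch pitfall is worth making concrete: run one shared normalizer before both the tsvector build and the embedding call. A minimal sketch — the regex choices are illustrative assumptions, not CallSphere's exact pipeline:

```python
import re

def normalize(text: str) -> str:
    """Lower-case and strip punctuation so BM25 and the embedder see the
    same surface form. Alphanumerics and hyphens survive, so SKUs, MLS IDs,
    and NDC codes stay matchable as exact terms."""
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)   # drop punctuation, keep hyphens
    return re.sub(r"\s+", " ", text).strip()

# Both sides index and query the same string:
#   tsv = to_tsvector('english', normalize(doc))
#   emb = embed(normalize(query))
```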

FAQ

Do I need a managed vector DB? No — pgvector with HNSW handles 10M+ vectors comfortably for a single-tenant voice agent.


RRF or weighted sum? RRF is more robust to score-distribution drift; weighted sum is faster if your scores are well-calibrated.
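For the weighted-sum path, min-max normalize each score list before mixing. A sketch — the 0.3/0.7 sparse/dense split is an illustrative assumption to be tuned on your own eval set:

```python
def weighted_fuse(sparse, dense, w_sparse=0.3, w_dense=0.7):
    """Min-max normalize each (id, score) list to [0, 1], then mix linearly.
    Documents missing from one list contribute 0 on that side."""
    def minmax(rows):
        if not rows:
            return {}
        lo = min(s for _, s in rows)
        hi = max(s for _, s in rows)
        span = (hi - lo) or 1.0              # guard against constant scores
        return {d: (s - lo) / span for d, s in rows}

    ns, nd = minmax(sparse), minmax(dense)
    fused = {d: w_sparse * ns.get(d, 0.0) + w_dense * nd.get(d, 0.0)
             for d in set(ns) | set(nd)}
    return sorted(fused, key=fused.get, reverse=True)
```

This is where the calibration caveat bites: if one retriever's scores drift, the min-max scaling drifts with them, which is exactly the failure mode RRF sidesteps.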

How does this play with long context? Hybrid feeds the long-context LLM the right 5–10 chunks. They are complementary, not substitutes.

What reranker? Cohere Rerank 3.5 if you can pay; BGE-reranker-v2-m3 if you self-host. Both clear ColBERT v2 on most BEIR tasks.

Does this help on the demo? Yes — the live demo runs hybrid by default for any vertical you pick.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.
