AI Engineering

Hybrid Retrieval for AI Voice: BM25 + Dense Embeddings in 2026

BM25 alone hits 65% recall@10. Dense alone hits 78%. A hybrid pipeline pushes that to 91% — and once you bolt on a reranker the gap widens. Here is how CallSphere wires hybrid retrieval into a voice loop with a 200ms budget.

TL;DR — On 2026 benchmarks, BM25 and dense vectors solve different problems. Hybrid (RRF or weighted) retrieval lifts recall@10 from ~65–78% single-mode to ~91% combined, and hybrid + Cohere Rerank pushes recall@5 from 0.587 (dense-only) to 0.816. For a voice agent that must answer in under 600ms, hybrid is not optional — it is the floor.

The technique

Hybrid retrieval runs two indexes in parallel: a sparse lexical index (BM25 or BM25F over a tokenized inverted file) and a dense vector index (HNSW over float embeddings). Each side returns its top-K, then a fusion step — usually Reciprocal Rank Fusion or a weighted sum after min-max normalization — merges the lists into a single ordering. The result captures exact-term hits (drug codes, SKU numbers, error strings) that dense models blur, and semantic hits (paraphrases, synonyms) that BM25 misses.

flowchart LR
  Q[Caller utterance] --> R[Query rewriter]
  R --> B[BM25 index]
  R --> D[Dense HNSW index]
  B --> F[RRF fusion k=60]
  D --> F
  F --> RR[Reranker top-50 to top-5]
  RR --> A[LLM agent]
  A --> V[Voice response]

How it works

The BM25 score is the classic Okapi formula in the Robertson–Spärck Jones tradition, with the k1=1.2, b=0.75 defaults. The dense side runs a query embedding (e5-large, BGE-m3, or text-embedding-3-large) through HNSW with M=16, ef_search=64. RRF merges with score = sum(1 / (k + rank_i)), where k=60 is the default constant from the original Cormack et al. 2009 paper. The fused list is then truncated and passed to a cross-encoder reranker for the final ordering.
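The RRF formula is small enough to sanity-check in a few lines — a minimal sketch where ranks are 1-based and k=60:

```python
def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every retriever list
    the document appears in. ranks holds that document's 1-based ranks."""
    return sum(1.0 / (k + r) for r in ranks)

# A doc ranked 1st by BM25 and 3rd by dense outscores one that is
# 2nd in a single list — appearing in both lists is rewarded:
both = rrf_score([1, 3])    # 1/61 + 1/63
single = rrf_score([2])     # 1/62
assert both > single
```

Note how insensitive the score is to raw retriever scores: only ranks enter, which is why RRF needs no normalization step.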

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

For voice, the latency budget is brutal: TTS startup needs the first retrieved chunk in under 250ms. That means BM25 on Postgres tsvector or OpenSearch (single-digit ms), HNSW with ef_search capped at 64 (15–30ms), and the reranker only running on the top-50 candidates. Anything more eats into the response window.
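One way to hold that budget is to run both retrievers concurrently and hard-cap every stage with a deadline. A sketch with asyncio — the stage callables and the per-stage budgets are assumptions, not CallSphere's exact values:

```python
import asyncio

# Per-stage budgets in seconds (illustrative, derived from the targets above).
BM25_BUDGET, DENSE_BUDGET, RERANK_BUDGET = 0.010, 0.035, 0.150

async def retrieve_within_budget(query, bm25_search, dense_search, rerank, fuse):
    """Run sparse and dense retrieval in parallel, each under its own deadline.
    A stage that misses its budget contributes an empty list instead of blocking."""
    async def capped(coro, budget):
        try:
            return await asyncio.wait_for(coro, timeout=budget)
        except asyncio.TimeoutError:
            return []  # degrade gracefully rather than blow the voice window

    sparse, dense = await asyncio.gather(
        capped(bm25_search(query), BM25_BUDGET),
        capped(dense_search(query), DENSE_BUDGET),
    )
    fused = fuse(sparse, dense)[:50]  # reranker only ever sees the top-50
    return await capped(rerank(query, fused), RERANK_BUDGET) or fused[:5]
```

If the reranker times out, the fused top-5 ships as-is — a slightly worse ordering beats a silent caller.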

CallSphere implementation

CallSphere runs 37 specialist agents across 6 verticals, 90+ tools over 115+ Postgres tables. The UrackIT IT helpdesk uses ChromaDB-backed RAG over runbooks, KB articles, and ticket history; OneRoof real estate runs hybrid search over MLS listings and listing photos with a vision encoder; Healthcare retrieves over patient records, insurance plans, and provider directories. Every vertical uses a hybrid pipeline because exact-term matching of CPT codes, MLS IDs, and ticket IDs is non-negotiable.

Pricing is $149 / $499 / $1,499 with a 14-day no-card trial and a 22% affiliate program. Try it on the trial page, see vertical fits on /industries/it-services and /industries/real-estate, or compare tiers on /pricing.

Build steps with code

  1. Postgres BM25 via the pg_search extension or a tsvector column with GIN index.
  2. pgvector HNSW on the same row for the dense side — single-row joins, no cross-database fan-out.
  3. Query rewrite at the edge with a small Llama 3.1 8B to expand pronouns ("the patient" -> "patient ID 4421").
  4. Fuse with RRF in the application layer.
def hybrid_search(q: str, k: int = 50):
    # Sparse side: BM25-style ranking over the GIN-indexed tsvector column.
    sparse = pg.execute(
        "SELECT id, ts_rank_cd(tsv, plainto_tsquery(%s)) AS s FROM kb "
        "WHERE tsv @@ plainto_tsquery(%s) ORDER BY s DESC LIMIT %s", (q, q, k))
    # Dense side: cosine distance via pgvector's <=> operator on the HNSW index.
    emb = embed(q)
    dense = pg.execute(
        "SELECT id, 1 - (embedding <=> %s::vector) AS s FROM kb "
        "ORDER BY embedding <=> %s::vector LIMIT %s", (emb, emb, k))
    return rrf_fuse(sparse, dense, k_const=60)[:10]
  5. Rerank with Cohere Rerank 3.5 or a local BGE-reranker-v2-m3 on the fused top-50.
  6. Cache repeated queries in Redis with a 60-second TTL — voice queries cluster.
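The rrf_fuse helper referenced in step 4 is small enough to show in full — a sketch assuming each input is a list of (id, score) rows ordered best-first, exactly as the two queries above return them:

```python
def rrf_fuse(sparse, dense, k_const=60):
    """Merge two ranked lists of (id, score) rows with Reciprocal Rank Fusion.
    Fused score = sum over lists of 1 / (k_const + rank), rank starting at 1."""
    scores = {}
    for ranked in (sparse, dense):
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_const + rank)
    # Highest fused score first; the raw per-retriever scores are discarded.
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates two reciprocal terms, which is how cross-retriever agreement gets rewarded without any score calibration.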

Pitfalls

  • Tokenizer mismatch: BM25 stems "appointments" -> "appoint" while the embedder treats it as a unit. Run both through the same lower-case + punctuation strip pipeline.
  • RRF k constant: too low (k=10) over-rewards rank-1 from each side; too high (k=200) flattens the fusion. Stick near 60.
  • Dense-only on rare entities: SKUs, MLS IDs, drug NDC codes need exact match. If you skip BM25, expect 30–40% miss rates on these.
  • Latency creep: every reranker hop adds 80–150ms. Budget it before you ship.
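The tokenizer-mismatch pitfall is worth making concrete: run one shared normalizer before both the tsvector build and the embedding call. A minimal sketch — the regex choices are illustrative assumptions, not CallSphere's exact pipeline:

```python
import re

def normalize(text: str) -> str:
    """Lower-case and strip punctuation so BM25 and the embedder see the
    same surface form. Alphanumerics and hyphens survive, so SKUs, MLS IDs,
    and NDC codes stay matchable as exact terms."""
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)   # drop punctuation, keep hyphens
    return re.sub(r"\s+", " ", text).strip()

# Both sides index and query the same string:
#   tsv = to_tsvector('english', normalize(doc))
#   emb = embed(normalize(query))
```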

FAQ

Do I need a managed vector DB? No — pgvector with HNSW handles 10M+ vectors comfortably for a single-tenant voice agent.


RRF or weighted sum? RRF is more robust to score-distribution drift; weighted sum is faster if your scores are well-calibrated.
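For the weighted-sum path, min-max normalize each score list before mixing. A sketch — the 0.3/0.7 sparse/dense split is an illustrative assumption to be tuned on your own eval set:

```python
def weighted_fuse(sparse, dense, w_sparse=0.3, w_dense=0.7):
    """Min-max normalize each (id, score) list to [0, 1], then mix linearly.
    Documents missing from one list contribute 0 on that side."""
    def minmax(rows):
        if not rows:
            return {}
        lo = min(s for _, s in rows)
        hi = max(s for _, s in rows)
        span = (hi - lo) or 1.0              # guard against constant scores
        return {d: (s - lo) / span for d, s in rows}

    ns, nd = minmax(sparse), minmax(dense)
    fused = {d: w_sparse * ns.get(d, 0.0) + w_dense * nd.get(d, 0.0)
             for d in set(ns) | set(nd)}
    return sorted(fused, key=fused.get, reverse=True)
```

This is where the calibration caveat bites: if one retriever's scores drift, the min-max scaling drifts with them, which is exactly the failure mode RRF sidesteps.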

How does this play with long context? Hybrid feeds the long-context LLM the right 5–10 chunks. They are complementary, not substitutes.

What reranker? Cohere Rerank 3.5 if you can pay; BGE-reranker-v2-m3 if you self-host. Both clear ColBERT v2 on most BEIR tasks.

Does this help on the demo? Yes — the live demo runs hybrid by default for any vertical you pick.


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.
