---
title: "Hybrid Search for RAG: Combining Vector Similarity with Keyword Search"
description: "Learn how to implement hybrid search for RAG by combining BM25 keyword search with vector similarity, using reciprocal rank fusion and re-ranking to maximize retrieval quality."
canonical: https://callsphere.ai/blog/hybrid-search-rag-vector-similarity-keyword-bm25
category: "Learn Agentic AI"
tags: ["RAG", "Hybrid Search", "BM25", "Vector Search", "Re-ranking", "Information Retrieval"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.601Z
---

# Hybrid Search for RAG: Combining Vector Similarity with Keyword Search

> Learn how to implement hybrid search for RAG by combining BM25 keyword search with vector similarity, using reciprocal rank fusion and re-ranking to maximize retrieval quality.

## Why Vector Search Alone Is Not Enough

Vector search excels at finding semantically similar content — it knows that "automobile" and "car" are related even though they share no characters. But it has blind spots. When a user searches for a specific error code like `ERR_SSL_PROTOCOL_ERROR`, an exact product name like `iPhone 15 Pro Max`, or an acronym like `HIPAA`, vector similarity can miss the exact match in favor of semantically similar but incorrect results.

Keyword search (BM25) excels at exact matching but fails on semantic understanding. It would not connect "how to terminate an employee" with a document titled "staff separation procedures."

Hybrid search combines both approaches, covering each method's weaknesses with the other's strengths. Production RAG systems at companies like Anthropic, Google, and Microsoft almost universally use hybrid retrieval.

## BM25: The Keyword Search Foundation

BM25 (Best Match 25) is a probabilistic ranking function that scores documents based on term frequency, inverse document frequency, and document length normalization:
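For reference, the standard BM25 scoring function for a document `D` and query `Q = {q_1, ..., q_n}` is:

```
score(D, Q) = sum over q_i of:
    IDF(q_i) * f(q_i, D) * (k1 + 1)
    / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))
```

where `f(q_i, D)` is the frequency of term `q_i` in `D`, `|D|` is the document's length in tokens, `avgdl` is the average document length across the corpus, and `k1` (typically 1.2-2.0) and `b` (typically 0.75) control term-frequency saturation and length normalization respectively.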

```mermaid
flowchart LR
    Q(["User query"])
    REWRITE["Query rewrite
HyDE plus expansion"]
    HYBRID{"Hybrid search"}
    BM25["BM25 keyword
Postgres FTS"]
    DENSE["Dense vector
ANN search"]
    FUSE["Reciprocal rank
fusion"]
    RERANK["Cross encoder
reranker"]
    PACK["Context packing
and dedupe"]
    LLM["LLM generation"]
    OUT(["Cited answer"])
    Q --> REWRITE --> HYBRID
    HYBRID --> BM25 --> FUSE
    HYBRID --> DENSE --> FUSE
    FUSE --> RERANK --> PACK --> LLM --> OUT
    style HYBRID fill:#f59e0b,stroke:#d97706,color:#1f2937
    style RERANK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    """Simple whitespace + lowercase tokenizer."""
    return re.findall(r"\w+", text.lower())

# Index documents
documents = [
    "Enterprise refund policy allows full refunds within 30 days",
    "HIPAA compliance checklist for healthcare data processing",
    "Staff separation procedures and exit interview guidelines",
    "ERR_SSL_PROTOCOL_ERROR troubleshooting for nginx servers",
]

tokenized_docs = [tokenize(doc) for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

# Search
query = "ERR_SSL_PROTOCOL_ERROR"
scores = bm25.get_scores(tokenize(query))

for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    if score > 0:
        print(f"[BM25: {score:.2f}] {doc}")
```

BM25 finds the exact error code match immediately, something vector search might rank lower.

## Implementing Hybrid Search from Scratch

Here is a complete hybrid search implementation that combines Chroma vector search with BM25:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from rank_bm25 import BM25Okapi
import numpy as np
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float
    source: str  # "hybrid" or "reranked"

class HybridRetriever:
    def __init__(self, documents: list[dict], persist_dir: str = "./hybrid_db"):
        self.documents = documents
        texts = [d["content"] for d in documents]

        # Build vector index
        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        self.vectorstore = Chroma.from_texts(
            texts=texts,
            embedding=embeddings,
            metadatas=[d.get("metadata", {}) for d in documents],
            persist_directory=persist_dir,
        )

        # Build BM25 index
        self.tokenized_docs = [self._tokenize(t) for t in texts]
        self.bm25 = BM25Okapi(self.tokenized_docs)
        self.raw_texts = texts

    def _tokenize(self, text: str) -> list[str]:
        import re
        return re.findall(r"\w+", text.lower())

    def search(self, query: str, k: int = 5, alpha: float = 0.7) -> list[SearchResult]:
        """
        Hybrid search with reciprocal rank fusion.
        alpha: weight for vector search (1-alpha for BM25)
        """
        # Vector search
        vector_results = self.vectorstore.similarity_search_with_score(query, k=k*2)

        # BM25 search
        bm25_scores = self.bm25.get_scores(self._tokenize(query))
        bm25_ranked = np.argsort(bm25_scores)[::-1][:k*2]

        # Reciprocal Rank Fusion
        rrf_scores = {}
        rrf_constant = 60  # standard RRF constant

        # Score vector results (ranks are 0-based here, hence the +1)
        for rank, (doc, _score) in enumerate(vector_results):
            doc_key = doc.page_content
            rrf_scores[doc_key] = rrf_scores.get(doc_key, 0.0) + alpha * (
                1 / (rrf_constant + rank + 1)
            )

        # Score BM25 results
        for rank, doc_idx in enumerate(bm25_ranked):
            if bm25_scores[doc_idx] > 0:
                doc_key = self.raw_texts[doc_idx]
                rrf_scores[doc_key] = rrf_scores.get(doc_key, 0.0) + (1 - alpha) * (
                    1 / (rrf_constant + rank + 1)
                )

        # Sort by combined score and return top k
        sorted_results = sorted(rrf_scores.items(), key=lambda x: -x[1])[:k]
        return [
            SearchResult(content=text, metadata={}, score=score, source="hybrid")
            for text, score in sorted_results
        ]
```

## Reciprocal Rank Fusion Explained

RRF combines ranked lists from different retrieval methods without requiring score normalization. The formula for each document is:

```
RRF_score = sum(1 / (k + rank_i)) for each retrieval method i
```

Where `rank_i` is the document's 1-based rank in method `i`'s result list and `k` is a smoothing constant (typically 60) that dampens the gap between adjacent ranks, so a single method's top hit cannot dominate the fused list. This works because ranks are comparable across methods even when raw scores are not — BM25 scores might range from 0 to 15 while vector cosine similarities fall between 0 and 1.
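To make the arithmetic concrete, here is the fusion step in isolation, with hypothetical document ids and hand-checkable scores:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Fuse ranked lists of doc ids with reciprocal rank fusion (1-based ranks)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

vector_hits = ["refund-policy", "hipaa-checklist", "ssl-error"]  # dense order
bm25_hits = ["ssl-error", "refund-policy"]                       # keyword order

fused = rrf_fuse([vector_hits, bm25_hits])
# refund-policy: 1/61 (vector rank 1) + 1/62 (bm25 rank 2) ~= 0.0325
# ssl-error:     1/63 (vector rank 3) + 1/61 (bm25 rank 1) ~= 0.0323
best = max(fused, key=fused.get)
print(best)  # refund-policy
```

Note that a document ranked well by both retrievers edges out one that tops only a single list — RRF rewards agreement between methods.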

## Adding a Re-Ranker for Maximum Quality

A cross-encoder re-ranker takes the union of results from both methods and re-scores each document against the query. This is slower but significantly more accurate than bi-encoder similarity:

```python
from sentence_transformers import CrossEncoder

class ReRankedHybridRetriever(HybridRetriever):
    def __init__(self, documents, persist_dir="./hybrid_db"):
        super().__init__(documents, persist_dir)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

    def search_with_rerank(
        self, query: str, k: int = 5, initial_k: int = 20, alpha: float = 0.7
    ) -> list[SearchResult]:
        # Get initial candidates from hybrid search
        candidates = self.search(query, k=initial_k, alpha=alpha)

        # Re-rank with cross-encoder
        pairs = [(query, c.content) for c in candidates]
        rerank_scores = self.reranker.predict(pairs)

        # Sort by re-ranker scores
        reranked = sorted(
            zip(candidates, rerank_scores),
            key=lambda x: -x[1]
        )

        return [
            SearchResult(
                content=r.content,
                metadata=r.metadata,
                score=float(score),
                source="reranked"
            )
            for r, score in reranked[:k]
        ]
```

The pattern is: retrieve broadly (top 20-50 from hybrid search), then re-rank precisely (pick top 5).

## Tuning the Alpha Parameter

The `alpha` parameter controls the balance between vector and keyword search. Optimal values depend on your data:

```python
def tune_alpha(retriever, eval_queries, expected_docs, k=5):
    """Sweep alpha values and report Recall@k.

    expected_docs holds, for each query, a substring that must appear
    in the content of a relevant document.
    """
    best_alpha = 0.5
    best_recall = 0.0

    for alpha in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
        hits = 0
        for query, expected in zip(eval_queries, expected_docs):
            results = retriever.search(query, k=k, alpha=alpha)
            if any(expected in r.content for r in results):
                hits += 1
        recall = hits / len(eval_queries)
        print(f"alpha={alpha:.1f}: Recall@{k} = {recall:.2%}")
        if recall > best_recall:
            best_recall = recall
            best_alpha = alpha

    print(f"\nBest alpha: {best_alpha} (Recall@{k} = {best_recall:.2%})")
    return best_alpha
```

In practice, alpha between 0.5 and 0.7 works well for most RAG applications — slightly favoring vector search while still benefiting from keyword matching.
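As a hedged sketch of why the sweep matters, here is a toy example (hypothetical doc ids) using the same alpha-weighted RRF scoring as `HybridRetriever.search`; the fused winner flips as alpha moves:

```python
K = 60  # standard RRF constant

vector_ranking = ["doc_a", "doc_b", "doc_c"]  # order from the dense retriever
bm25_ranking = ["doc_c", "doc_a", "doc_b"]    # order from the keyword retriever

def fused_top(alpha: float) -> str:
    """Return the top doc id under alpha-weighted reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for rank, doc in enumerate(vector_ranking):
        scores[doc] = scores.get(doc, 0.0) + alpha / (K + rank + 1)
    for rank, doc in enumerate(bm25_ranking):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (K + rank + 1)
    return max(scores, key=scores.get)

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha:.1f} -> top result: {fused_top(alpha)}")
# alpha=0.0 favors BM25's pick (doc_c); alpha >= 0.5 favors the dense pick (doc_a)
```

A sweep over real evaluation queries does exactly this, but measures Recall@k instead of inspecting the top hit by eye.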

## FAQ

### When should I use pure vector search instead of hybrid?

Pure vector search is sufficient when your queries are natural language questions without specific identifiers (no product names, error codes, or acronyms) and your documents are written in consistent natural language. If your corpus contains technical content with specific terms that must match exactly, hybrid search will outperform vector-only retrieval.

### Is re-ranking worth the added latency?

Re-ranking adds 50-200ms depending on the model and number of candidates. For user-facing applications where answer quality matters more than sub-second latency, re-ranking consistently improves retrieval quality by 10-25% on standard benchmarks. For high-throughput batch processing where latency is critical, skip re-ranking.

### Can I use hybrid search with Pinecone or pgvector?

Pinecone does not run a separate BM25 engine, but it supports hybrid retrieval through sparse-dense vectors: you supply BM25-style term weights as a sparse vector alongside the dense embedding. Weaviate has native hybrid search built in. For pgvector, you can implement the keyword side with PostgreSQL full-text search (`tsvector` and `tsquery`) and combine results in your application layer using RRF, which works well since everything lives in the same database.

---

#RAG #HybridSearch #BM25 #VectorSearch #Reranking #InformationRetrieval #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/hybrid-search-rag-vector-similarity-keyword-bm25
