---
title: "Advanced RAG Patterns: Query Rewriting, HyDE, and Multi-Step Retrieval"
description: "Go beyond basic RAG with advanced retrieval patterns including query rewriting, hypothetical document embeddings (HyDE), step-back prompting, and iterative multi-step retrieval chains."
canonical: https://callsphere.ai/blog/advanced-rag-patterns-query-rewriting-hyde-multi-step-retrieval
category: "Learn Agentic AI"
tags: ["RAG", "Advanced Retrieval", "HyDE", "Query Rewriting", "Multi-Step Retrieval", "LLM"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:42.608Z
---

# Advanced RAG Patterns: Query Rewriting, HyDE, and Multi-Step Retrieval

> Go beyond basic RAG with advanced retrieval patterns including query rewriting, hypothetical document embeddings (HyDE), step-back prompting, and iterative multi-step retrieval chains.

## When Basic RAG Falls Short

Basic RAG follows a simple pattern: embed the user's query, find similar documents, generate an answer. This works well for straightforward factual questions but struggles with three common scenarios:

1. **Vague or poorly worded queries** — "how does the thing work" retrieves nothing useful
2. **Vocabulary mismatch** — the user says "cancel my account" but the docs say "subscription termination"
3. **Multi-hop questions** — "Which of our enterprise customers in healthcare had SLA violations last quarter?" requires multiple retrieval steps

Advanced RAG patterns address each of these failure modes. This post covers four production-proven techniques.
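
For a baseline to compare against, here is what that basic loop looks like; a minimal sketch, assuming a Chroma vector store already populated with your documents (the collection name and model choices are illustrative):

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma

llm = ChatOpenAI(model="gpt-4o-mini")
# Assumes documents were embedded into this collection ahead of time
vector_store = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

def basic_rag(question: str) -> str:
    """Embed the query, fetch similar chunks, generate an answer."""
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    return llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    ).content
```

The snippets below assume a `retriever` and `llm` along these lines.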

## Pattern 1: Query Rewriting

Query rewriting uses an LLM to transform the user's original query into one (or more) queries that are more likely to retrieve relevant documents. The diagram below shows where rewriting sits in a fuller production pipeline, ahead of hybrid search, rank fusion, and reranking (both covered later in this post):

```mermaid
flowchart LR
    Q(["User query"])
    REWRITE["Query rewrite
HyDE plus expansion"]
    HYBRID{"Hybrid search"}
    BM25["BM25 keyword
Postgres FTS"]
    DENSE["Dense vector
ANN search"]
    FUSE["Reciprocal rank
fusion"]
    RERANK["Cross encoder
reranker"]
    PACK["Context packing
and dedupe"]
    LLM["LLM generation"]
    OUT(["Cited answer"])
    Q --> REWRITE --> HYBRID
    HYBRID --> BM25 --> FUSE
    HYBRID --> DENSE --> FUSE
    FUSE --> RERANK --> PACK --> LLM --> OUT
    style HYBRID fill:#f59e0b,stroke:#d97706,color:#1f2937
    style RERANK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.3)

def rewrite_query(original_query: str, num_variants: int = 3) -> list[str]:
    """Generate multiple search queries from the original question."""
    prompt = f"""You are a search query optimizer for a RAG system.
Given the user's question, generate {num_variants} different search queries
that would help find the relevant information in a knowledge base.

Each query should approach the question from a different angle or use
different terminology.

User question: {original_query}

Return only the queries, one per line, no numbering."""

    response = llm.invoke(prompt)
    queries = [q.strip() for q in response.content.strip().split("\n") if q.strip()]
    return queries

# Example
original = "how does the thing with payments work"
rewritten = rewrite_query(original)
for q in rewritten:
    print(f"  -> {q}")
# Output:
#   -> How does the payment processing system function?
#   -> What is the billing and payment workflow?
#   -> Payment integration setup and configuration guide
```

Now retrieve with all queries and merge the results:

```python
def multi_query_retrieve(queries: list[str], retriever, k: int = 5) -> list:
    """Retrieve documents using multiple queries, deduplicate by content."""
    all_docs = []
    seen_content = set()

    for query in queries:
        docs = retriever.invoke(query)
        for doc in docs:
            content_hash = hash(doc.page_content)
            if content_hash not in seen_content:
                seen_content.add(content_hash)
                all_docs.append(doc)

    # Naive truncation: keeps whichever docs appeared first, which biases
    # the result set toward the first query; see the rank-fusion merge below
    return all_docs[:k]
```
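
Truncating by order of appearance biases the result set toward the first query's hits. A fairer merge is reciprocal rank fusion, the FUSE stage in the diagram above: score each document by the reciprocal of its rank in every result list it appears in, then sum. A minimal sketch against the same retriever interface:

```python
from collections import defaultdict

def rrf_merge(queries: list[str], retriever, k: int = 5, c: int = 60) -> list:
    """Merge multiple result lists with reciprocal rank fusion."""
    scores = defaultdict(float)
    docs_by_key = {}

    for query in queries:
        for rank, doc in enumerate(retriever.invoke(query), start=1):
            key = hash(doc.page_content)
            docs_by_key[key] = doc
            # Standard RRF score: 1 / (c + rank); c = 60 comes from the
            # original RRF paper and damps the influence of top ranks
            scores[key] += 1.0 / (c + rank)

    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs_by_key[key] for key in ranked[:k]]
```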

## Pattern 2: HyDE — Hypothetical Document Embeddings

HyDE is a counterintuitive but effective technique. Instead of embedding the question, you ask the LLM to generate a hypothetical answer (even if it is wrong), then embed that hypothetical answer and use it as the search vector.

The insight is that a hypothetical answer is closer in embedding space to the real document than the question itself. Questions and answers live in different semantic neighborhoods — HyDE bridges this gap.

```python
def hyde_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Hypothetical Document Embeddings:
    1. Generate a hypothetical answer
    2. Embed the hypothetical answer
    3. Use it to search for real documents
    """
    # Step 1: Generate hypothetical answer
    hyde_prompt = f"""Write a detailed paragraph that would answer the following question.
Write as if you are writing a section of a technical document.
Do not mention that this is hypothetical.

Question: {question}

Answer paragraph:"""

    hypothetical_doc = llm.invoke(hyde_prompt).content

    # Step 2-3: Use the hypothetical doc as the search query
    # The retriever will embed this text and find similar real documents
    docs = retriever.invoke(hypothetical_doc)

    return docs[:k]

# Usage
question = "What security measures protect customer payment data?"
docs = hyde_retrieve(question, retriever, llm)
for doc in docs:
    print(f"Retrieved: {doc.page_content[:100]}...")
```

**When HyDE helps most:** Technical questions where users describe problems in different terms than the documentation. Customer support queries where the question vocabulary differs significantly from the knowledge base vocabulary.

**When to skip HyDE:** Simple factual lookups, queries that already use domain terminology, latency-sensitive applications (HyDE adds an LLM call before retrieval).
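
The original HyDE paper goes a step further: it samples several hypothetical documents, embeds each one along with the query, and searches with the averaged vector, which smooths over any single hallucination. A sketch of that variant, assuming a vector store that exposes `similarity_search_by_vector` (LangChain's Chroma does):

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

def hyde_avg_retrieve(question: str, vector_store, llm, n: int = 3, k: int = 5) -> list:
    """HyDE with multiple sampled hypotheses averaged into one search vector."""
    # Non-zero LLM temperature gives varied hypotheses
    hypotheticals = [
        llm.invoke(f"Write a short documentation paragraph answering: {question}").content
        for _ in range(n)
    ]
    # Average the hypothesis embeddings together with the query embedding
    vectors = embedder.embed_documents(hypotheticals) + [embedder.embed_query(question)]
    mean_vector = np.mean(vectors, axis=0).tolist()
    return vector_store.similarity_search_by_vector(mean_vector, k=k)
```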

## Pattern 3: Step-Back Prompting

Step-back prompting handles overly specific queries by first generating a more general version of the question, retrieving for both, and combining the context.

```python
def step_back_retrieve(question: str, retriever, llm, k: int = 5) -> list:
    """
    Retrieve using both the original question and a more general version.
    """
    # Generate step-back question
    step_back_prompt = f"""Given a specific question, generate a more general
question that would retrieve broader context helpful for answering
the specific question.

Specific question: {question}
General question:"""

    general_question = llm.invoke(step_back_prompt).content.strip()

    # Retrieve for both
    specific_docs = retriever.invoke(question)
    general_docs = retriever.invoke(general_question)

    # Merge with deduplication
    seen = set()
    merged = []
    for doc in specific_docs + general_docs:
        key = hash(doc.page_content)
        if key not in seen:
            seen.add(key)
            merged.append(doc)

    return merged[:k]

# Example
question = "What is the TLS version used for API endpoints in the EU region?"
# Step-back generates: "What are the security and encryption standards for API endpoints?"
# This retrieves both the specific TLS doc and the broader security architecture doc
docs = step_back_retrieve(question, retriever, llm)
```

## Pattern 4: Iterative Multi-Step Retrieval

For complex questions that require information from multiple documents, iterative retrieval performs multiple rounds of search, using information gathered in each round to refine subsequent queries.

```python
def multi_step_retrieve(
    question: str,
    retriever,
    llm,
    max_steps: int = 3,
    k_per_step: int = 3,
) -> dict:
    """
    Iterative retrieval: use each round's findings to inform the next query.
    """
    all_context = []
    seen = set()
    queries_used = [question]

    for step in range(max_steps):
        # Retrieve for the most recent query
        current_query = queries_used[-1]
        docs = retriever.invoke(current_query)[:k_per_step]
        # Skip chunks already gathered in an earlier round
        for doc in docs:
            key = hash(doc.page_content)
            if key not in seen:
                seen.add(key)
                all_context.append(doc.page_content)

        # Check if we have enough to answer
        check_prompt = f"""Given the question and the context gathered so far,
determine if we have enough information to answer completely.

Question: {question}

Context gathered:
{chr(10).join(all_context)}

If we have enough information, respond with: SUFFICIENT
If we need more information, respond with a follow-up search query
that would find the missing pieces."""

        check_response = llm.invoke(check_prompt).content.strip()

        if "SUFFICIENT" in check_response.upper():
            break
        else:
            queries_used.append(check_response)

    return {
        "context": all_context,
        "steps": len(queries_used),
        "queries": queries_used,
    }
```
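
A usage sketch that feeds the gathered context into a final generation call, using the multi-hop example from the introduction:

```python
question = "Which of our enterprise customers in healthcare had SLA violations last quarter?"
result = multi_step_retrieve(question, retriever, llm)

context = "\n\n".join(result["context"])
answer = llm.invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
).content
print(f"Answered after {result['steps']} retrieval round(s) using queries: {result['queries']}")
```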

## Combining Patterns

In production, these patterns compose naturally:

```
User Query
    |
    v
Query Rewriting (generate 3 variants)
    |
    v
For each variant: HyDE (generate hypothetical doc)
    |
    v
Retrieve top-k for each hypothetical doc
    |
    v
Merge + Deduplicate all results
    |
    v
Re-rank with cross-encoder
    |
    v
Top-5 chunks -> LLM generation
```
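
A sketch of that composition, reusing the functions defined above; the cross-encoder comes from sentence-transformers, and the model name is one common choice rather than a requirement:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def advanced_rag(question: str, retriever, llm, k: int = 5) -> str:
    """Rewrite -> HyDE per variant -> merge/dedupe -> rerank -> generate."""
    candidates, seen = [], set()
    for variant in rewrite_query(question):
        for doc in hyde_retrieve(variant, retriever, llm):
            key = hash(doc.page_content)
            if key not in seen:
                seen.add(key)
                candidates.append(doc)

    # Score (question, chunk) pairs and keep the top k
    scores = reranker.predict([(question, d.page_content) for d in candidates])
    top = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: -p[0])][:k]

    context = "\n\n".join(doc.page_content for doc in top)
    return llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    ).content
```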

Each additional layer adds latency but improves retrieval quality. Start with basic RAG, measure where retrieval fails, and add the pattern that addresses your specific failure mode.

## FAQ

### Does HyDE work if the LLM hallucinates the hypothetical answer?

Yes, and this is the counterintuitive insight. Even a factually wrong hypothetical answer uses the right vocabulary, structure, and semantic space of a real answer. The embedding of a wrong answer about "TLS 1.3 encryption for API endpoints" is still closer to the real documentation about API encryption than the original question "What security does the API use?"

### How much latency does query rewriting add?

Query rewriting adds one LLM call (100-500ms with GPT-4o-mini) before retrieval begins. If you then retrieve with 3 query variants in parallel, the total added latency is just the rewriting call — the parallel retrievals take the same time as a single retrieval. This is usually an acceptable tradeoff for the retrieval quality improvement.
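
For reference, a minimal sketch of that parallel fan-out, using the async `ainvoke` that LangChain retrievers inherit from the Runnable interface:

```python
import asyncio

async def parallel_retrieve(queries: list[str], retriever) -> list[list]:
    # Fire all query variants concurrently; total latency ~= slowest single retrieval
    return await asyncio.gather(*(retriever.ainvoke(q) for q in queries))

# doc_lists = asyncio.run(parallel_retrieve(rewrite_query(original), retriever))
```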

### When should I use multi-step retrieval vs. just retrieving more documents?

Multi-step retrieval is better when the answer requires synthesizing information from documents that would not be retrieved together by a single query. For example, answering "Which customers affected by the Q3 outage are also on expired contracts?" requires first finding outage-affected customers, then looking up their contract status. Retrieving more documents with a single query would not find this cross-referenced information.

---

#RAG #AdvancedRetrieval #HyDE #QueryRewriting #MultiStepRetrieval #LLM #AgenticAI #LearnAI #AIEngineering

