---
title: "Query Rewriting and Multi-Query Expansion for AI Search in 2026"
description: "60% of follow-up messages have unresolved coreferences. Query rewriting fixes pronouns, expands recall with multi-query, and applies constraint filters before retrieval ever runs."
canonical: https://callsphere.ai/blog/vw6g-query-rewriting-expansion-multi-query-2026
category: "AI Engineering"
tags: ["Query Rewriting", "Multi-Query", "RAG", "Conversational AI", "Retrieval"]
author: "CallSphere Team"
published: 2026-03-24T00:00:00.000Z
updated: 2026-05-07T16:46:10.425Z
---

# Query Rewriting and Multi-Query Expansion for AI Search in 2026

> 60% of follow-up messages have unresolved coreferences. Query rewriting fixes pronouns, expands recall with multi-query, and applies constraint filters before retrieval ever runs.

> **TL;DR** — Raw user queries are noisy: "what about the second one?" tells the retriever nothing. The 2026 query-rewriting stack handles four jobs in parallel — coreference resolution, expansion (multi-query), step-back abstraction, and constraint extraction — before retrieval ever fires.

## The technique

DMQR-RAG (Diverse Multi-Query Rewriting) and the Multi-Query Retriever pattern both rest on one idea: a single query is an under-specified probe. Generate N rewrites covering different angles, retrieve for each, and fuse the lists. Add a step-back rewrite that goes from specific to abstract ("what is the cancellation policy for premium plans on weekends in NYC?" -> "what is the cancellation policy?") to capture parent-context chunks.

For multi-turn voice/chat, the killer step is **coreference resolution**: replace pronouns and demonstratives with their referents from history. Without it, ~60% of follow-ups retrieve nothing useful.

```mermaid
flowchart LR
  H[Chat history] --> CR[Coreference resolver]
  Q[Raw query] --> CR
  CR --> EX[Multi-query expansion]
  CR --> SB[Step-back abstraction]
  CR --> CN[Constraint extractor]
  EX --> R[Retrieve x N]
  SB --> R
  CN --> FT[Metadata filter]
  R --> FU[RRF fuse]
  FT --> FU
  FU --> A[Agent]
```

## How it works

A small LLM (Haiku 4.5 or Llama 3.1 8B, ~50–80ms) ingests the last 6 turns plus the new utterance, then emits a JSON object with: `resolved_query`, `expansions: [3 paraphrases]`, `stepback`, `filters: { date_range, status, vertical }`. Each rewrite hits the retriever in parallel; results are fused via RRF; metadata filters are applied at the index level (cheap) rather than post-retrieval (expensive).

The DMQR-RAG paper formalizes four expansion strategies at different information levels — equivalence, generalization, specialization, and adversarial — and shows that diversity matters more than count.
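The RRF fusion step is simple enough to sketch inline. A minimal reciprocal-rank-fusion implementation, using the conventional `k=60` smoothing constant (the function name `rrf_fuse` and the input shape, a list of ranked document-ID lists, are assumptions here):

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of
    1 / (k + rank) across the N per-rewrite result lists
    (rank is 1-based within each list)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear near the top of several rewrites' lists dominate, which is exactly why diverse rewrites beat many near-duplicate ones.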

## CallSphere implementation

Every CallSphere agent runs a query rewriter. The Healthcare agent resolves "her" -> "patient ID 4421"; UrackIT IT helpdesk resolves "the same error" by injecting the most recent ticket subject; OneRoof real estate resolves "that listing" by pulling the last MLS ID from session memory. The rewriter also extracts constraints — "this week," "under $500k," "in-network" — into structured metadata filters that hit Postgres indexes directly.
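A sketch of pushing extracted constraints down to the index, assuming psycopg-style `%s` placeholders; the column map and table semantics are illustrative, not CallSphere's actual schema:

```python
# Hypothetical mapping from rewriter filter keys to indexed columns.
# (A real date_range filter would need a BETWEEN predicate, omitted here.)
FILTER_COLUMNS = {"status": "status", "vertical": "vertical"}

def build_filter_clause(filters):
    """Turn the rewriter's structured filters into a parameterized
    WHERE clause so Postgres can use its indexes directly, instead
    of filtering rows in Python after retrieval."""
    clauses, params = [], []
    for key, value in filters.items():
        column = FILTER_COLUMNS.get(key)
        if column and value:
            clauses.append(f"{column} = %s")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params
```

The clause is appended to the retrieval query, so unmapped or empty filters degrade gracefully to `TRUE` rather than breaking the SQL.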

37 agents · 90+ tools · 115+ DB tables · 6 verticals. **$149 / $499 / $1499**, [14-day trial](/trial), [22% affiliate](/affiliate). Try the multi-turn flow on [/demo](/demo) or compare verticals at [/industries/it-services](/industries/it-services) and [/industries/real-estate](/industries/real-estate).

## Build steps with code

```python
import json

# Literal braces in the JSON template are doubled so str.format()
# leaves them intact and only substitutes {history} and {message}.
REWRITE_PROMPT = """Given conversation history and a new user message, output JSON:
{{
  "resolved": "",
  "expansions": [""],
  "stepback": "",
  "filters": {{"date_range": "...", "vertical": "...", "status": "..."}}
}}
History: {history}
New message: {message}"""

def rewrite_and_retrieve(history, msg):
    # small_llm, hybrid_retrieve, and rrf_fuse are the stack's own helpers.
    plan = json.loads(small_llm.complete(REWRITE_PROMPT.format(history=history, message=msg)))
    # Fan out: resolved query, its paraphrases, and the step-back rewrite.
    queries = [plan["resolved"], *plan["expansions"], plan["stepback"]]
    # Filters are pushed down to the index; one retrieval per rewrite.
    results = [hybrid_retrieve(q, filters=plan["filters"]) for q in queries]
    return rrf_fuse(results)
```

1. Pin the rewriter model and prompt — version both as code.
2. Cache rewrites by (last-3-turns, query) hash.
3. Log every rewrite for offline eval; the rewriter is the silent ranker.
4. Apply constraint filters at index level, never in Python.
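Step 2 can be sketched with the standard library alone; the cache here is an in-process dict (a production version would sit in Redis or similar, which is an assumption beyond the text):

```python
import hashlib
import json

_rewrite_cache = {}

def rewrite_cache_key(history, query):
    """Key rewrites on the last 3 turns plus the raw query, as in step 2.
    Canonical JSON keeps the hash stable across equivalent inputs."""
    payload = json.dumps({"turns": history[-3:], "q": query}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_rewrite(history, query, rewrite_fn):
    key = rewrite_cache_key(history, query)
    if key not in _rewrite_cache:
        _rewrite_cache[key] = rewrite_fn(history, query)
    return _rewrite_cache[key]
```

Because only the last three turns feed the key, earlier history can grow without invalidating cache hits for repeated follow-ups.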

## Pitfalls

- **Over-expansion**: 10 rewrites is noise, not signal. 3–4 is the sweet spot.
- **Stepback hallucination**: small models invent constraints. Validate with a regex/JSON schema.
- **Latency tax**: 80ms rewriter + 4 parallel retrieves can blow a voice budget. Run async and timeout aggressively.
- **Coreference loops**: do not let the rewriter resolve a pronoun to itself. Detect and fall back to raw query.
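The latency-tax and coreference-loop pitfalls share one defensive pattern: run the rewriter under a hard deadline and fall back to the raw query whenever it times out or merely echoes its input. A minimal sketch (the `rewriter` callable and the 100ms default budget are assumptions):

```python
import asyncio

async def rewrite_with_budget(raw_query, rewriter, budget_s=0.1):
    """Run the rewriter under a hard deadline; on timeout, or on a
    degenerate rewrite that just echoes the input, fall back to the
    raw query so retrieval still fires within the voice budget."""
    try:
        resolved = await asyncio.wait_for(rewriter(raw_query), timeout=budget_s)
    except asyncio.TimeoutError:
        return raw_query
    # Coreference-loop guard: a rewrite identical to the input adds nothing.
    if resolved.strip().lower() == raw_query.strip().lower():
        return raw_query
    return resolved
```

Retrieval on the raw query is a degraded answer; retrieval that never runs is no answer, so the fallback is always the raw query, never an error.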

## FAQ

**Multi-query or HyDE?** Multi-query for breadth; HyDE for depth on abstract queries. They compose.

**Do I need a finetuned rewriter?** No. A well-prompted Haiku 4.5 or Llama 3.1 8B is enough.

**Voice or chat?** Both. Voice has tighter latency; the rewriter must be sub-100ms.

**Constraint extraction or post-filter?** Always constraint extraction — index-side filtering is 10–100x cheaper.

**Where on the /demo?** Toggle "show internals" to watch the rewriter JSON in real time.

## Sources

- [DMQR-RAG: Diverse Multi-Query Rewriting - OpenReview](https://openreview.net/forum?id=lz936bYmb3)
- [Advanced RAG: Query Expansion - Haystack](https://haystack.deepset.ai/blog/query-expansion)
- [RAG Query Rewriting: 4 Layers That Fix Multi-Turn Retrieval - Alhena](https://alhena.ai/blog/query-rewriting-before-retrieval-multi-turn-rag/)
- [In-Depth RAG Query Transformation - DEV](https://dev.to/jamesli/in-depth-understanding-of-rag-query-transformation-optimization-multi-query-problem-decomposition-and-step-back-27jg)

