Query Rewriting and Multi-Query Expansion for AI Search in 2026
Roughly 60% of follow-up messages carry unresolved coreferences. Query rewriting fixes pronouns, expands recall with multi-query, and applies constraint filters before retrieval ever runs.
TL;DR — Raw user queries are noisy: "what about the second one?" tells the retriever nothing. The 2026 query-rewriting stack handles four jobs in parallel — coreference resolution, expansion (multi-query), step-back abstraction, and constraint extraction — before retrieval ever fires.
The technique
DMQR-RAG (Diverse Multi-Query Rewriting) and the Multi-Query Retriever pattern both rest on one idea: a single query is an under-specified probe. Generate N rewrites covering different angles, retrieve for each, and fuse the lists. Add a step-back rewrite that goes from specific to abstract ("what is the cancellation policy for premium plans on weekends in NYC?" -> "what is the cancellation policy?") to capture parent-context chunks.
For multi-turn voice/chat, the killer step is coreference resolution: replace pronouns and demonstratives with their referents from history. Without it, ~60% of follow-ups retrieve nothing useful.
flowchart LR
H[Chat history] --> CR[Coreference resolver]
Q[Raw query] --> CR
CR --> EX[Multi-query expansion]
CR --> SB[Step-back abstraction]
CR --> CN[Constraint extractor]
EX --> R[Retrieve x N]
SB --> R
CN --> FT[Metadata filter]
R --> FU[RRF fuse]
FT --> FU
FU --> A[Agent]
How it works
A small LLM (Haiku 4.5 or Llama 3.1 8B, ~50–80ms) ingests the last 6 turns plus the new utterance, then emits JSON with four fields: resolved (the coreference-resolved query), expansions (3 diverse paraphrases), stepback (a more abstract version), and filters ({ date_range, status, vertical }). Each rewrite hits the retriever in parallel; results are fused via RRF; metadata filters are applied at the index level (cheap) rather than post-retrieval (expensive).
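The RRF fusion step can be sketched in a few lines. This is a minimal implementation, not a specific library's API: `k=60` is the conventional RRF constant, and the doc IDs stand in for whatever your retriever returns.

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: merge N ranked lists of doc ids into one.

    A doc appearing near the top of several lists accumulates the
    highest score; k dampens the influence of any single list.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc ranked first in one list and second in two others will beat a doc ranked first in a single list, which is exactly the behavior you want when fusing diverse rewrites.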
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
The DMQR-RAG paper formalizes four expansion strategies at different information levels — equivalence, generalization, specialization, and adversarial — and shows that diversity matters more than count.
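As one illustration, the four strategy levels can be expressed as per-strategy prompt fragments fed to the rewriter. The wording below is our own paraphrase of each level, not the paper's exact instructions:

```python
# Illustrative prompt fragments, one per DMQR-style strategy level.
# The phrasing is a paraphrase, not taken from the DMQR-RAG paper.
STRATEGIES = {
    "equivalence": "Paraphrase the query without changing its meaning.",
    "generalization": "Rewrite the query one level more abstract.",
    "specialization": "Rewrite the query with a concrete detail implied by context.",
    "adversarial": "Rewrite the query using entirely different vocabulary for the same intent.",
}

def expansion_prompts(query):
    """Build one rewriting prompt per strategy for a single user query."""
    return [f"{instruction}\nQuery: {query}" for instruction in STRATEGIES.values()]
```

Sampling one rewrite per strategy, rather than four paraphrases of the same kind, is what buys the diversity the paper measures.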
CallSphere implementation
Every CallSphere agent runs a query rewriter. The Healthcare agent resolves "her" -> "patient ID 4421"; UrackIT IT helpdesk resolves "the same error" by injecting the most recent ticket subject; OneRoof real estate resolves "that listing" by pulling the last MLS ID from session memory. The rewriter also extracts constraints — "this week," "under $500k," "in-network" — into structured metadata filters that hit Postgres indexes directly.
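To make index-side filtering concrete, here is a hypothetical mapping from the extracted filters to a parameterized SQL predicate. The column names (created_at, status, vertical) are illustrative, not CallSphere's actual schema:

```python
def filters_to_sql(filters):
    """Turn extracted constraint filters into a WHERE clause + params.

    Assumes date_range is a (start, end) pair; columns are illustrative.
    Parameterized placeholders keep the query plannable and injection-safe.
    """
    clauses, params = [], []
    if filters.get("date_range"):
        clauses.append("created_at >= %s AND created_at < %s")
        params.extend(filters["date_range"])
    if filters.get("status"):
        clauses.append("status = %s")
        params.append(filters["status"])
    if filters.get("vertical"):
        clauses.append("vertical = %s")
        params.append(filters["vertical"])
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params
```

Because each clause targets an indexed column, Postgres can prune rows before any vector or keyword scoring happens.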
37 agents · 90+ tools · 115+ DB tables · 6 verticals. $149 / $499 / $1499, 14-day trial, 22% affiliate. Try the multi-turn flow on /demo or compare verticals at /industries/it-services and /industries/real-estate.
Build steps with code
import json

REWRITE_PROMPT = """Given conversation history and a new user message, output JSON:
{{
  "resolved": "<query with all pronouns resolved>",
  "expansions": ["<3 diverse rewrites>"],
  "stepback": "<more abstract version>",
  "filters": {{"date_range": "...", "vertical": "...", "status": "..."}}
}}
History: {history}
New message: {message}"""

def rewrite_and_retrieve(history, msg):
    # One small-LLM call produces the full rewrite plan as JSON.
    plan = json.loads(small_llm.complete(REWRITE_PROMPT.format(history=history, message=msg)))
    # Fan out: resolved query, expansions, and step-back all hit the retriever.
    queries = [plan["resolved"], *plan["expansions"], plan["stepback"]]
    results = [hybrid_retrieve(q, filters=plan["filters"]) for q in queries]
    return rrf_fuse(results)
- Pin the rewriter model and prompt — version both as code.
- Cache rewrites by (last-3-turns, query) hash.
- Log every rewrite for offline eval; the rewriter is the silent ranker.
- Apply constraint filters at index level, never in Python.
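The caching step above can be sketched as a stable hash over the last three turns plus the query. The key shape (JSON-serialized turns, SHA-256 digest) is an assumption, not a prescribed format:

```python
import hashlib
import json

def rewrite_cache_key(history, query, turns=3):
    """Stable cache key over the last N turns plus the new query.

    sort_keys makes the serialization deterministic, so identical
    (history window, query) pairs always hash to the same key.
    """
    payload = json.dumps({"turns": history[-turns:], "q": query}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on only the last three turns means a cache hit survives older context changing, which is usually what you want for coreference resolution.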
Pitfalls
- Over-expansion: 10 rewrites is noise, not signal. 3–4 is the sweet spot.
- Stepback hallucination: small models invent constraints. Validate with a regex/JSON schema.
- Latency tax: 80ms rewriter + 4 parallel retrieves can blow a voice budget. Run async and timeout aggressively.
- Coreference loops: do not let the rewriter resolve a pronoun to itself. Detect and fall back to raw query.
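A minimal guard against the schema-hallucination and fallback pitfalls above, assuming the rewriter returns a raw string. Any parse or shape failure degrades to retrieving with the raw query:

```python
import json

REQUIRED_KEYS = ("resolved", "expansions", "stepback", "filters")

def safe_plan(raw_output, original_query):
    """Validate rewriter output; on any failure, fall back to the raw query."""
    fallback = {"resolved": original_query, "expansions": [],
                "stepback": original_query, "filters": {}}
    try:
        plan = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return fallback
    # Shape checks: all keys present, expansions a list, filters a dict.
    if any(key not in plan for key in REQUIRED_KEYS):
        return fallback
    if not isinstance(plan["expansions"], list) or not isinstance(plan["filters"], dict):
        return fallback
    return plan
```

A stricter version would validate against a full JSON Schema, but even this cheap check catches the common failure mode of a small model emitting prose instead of JSON.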
FAQ
Multi-query or HyDE? Multi-query for breadth; HyDE for depth on abstract queries. They compose.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Do I need a finetuned rewriter? No. A well-prompted Haiku 4.5 or Llama 3.1 8B is enough.
Voice or chat? Both. Voice has tighter latency; the rewriter must be sub-100ms.
Constraint extraction or post-filter? Always constraint extraction — index-side filtering is 10–100x cheaper.
Where on the /demo? Toggle "show internals" to watch the rewriter JSON in real time.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available, no signup required.