---
title: "Contextual Retrieval Architecture for Claude Agents"
description: "End-to-end contextual retrieval for Claude agents: chunk enrichment, dual embedding plus BM25 indexes, rank fusion, and reranking before the model reads."
canonical: https://callsphere.ai/blog/contextual-retrieval-architecture-for-claude-agents
category: "Agentic AI"
tags: ["agentic ai", "claude", "contextual retrieval", "rag", "vector search", "bm25", "architecture"]
author: "CallSphere Team"
published: 2026-01-30T08:00:00.000Z
updated: 2026-06-07T01:28:23.638Z
---

# Contextual Retrieval Architecture for Claude Agents

> End-to-end contextual retrieval for Claude agents: chunk enrichment, dual embedding plus BM25 indexes, rank fusion, and reranking before the model reads.

Plain retrieval-augmented generation has an embarrassing failure mode: it chops a document into chunks, embeds each chunk in isolation, and then acts surprised when a chunk that says *"the limit was raised to 50,000 in Q3"* never surfaces for the query *"what is the API rate limit?"*. The chunk lost the context that told you which API, which quarter, which environment. For a one-shot chatbot you might tolerate the miss. For an agent built on Claude that chains six tool calls off the first retrieval, one bad chunk poisons the whole trajectory.

Contextual retrieval is the architectural fix. Instead of storing raw chunks, you prepend a short, model-generated description that situates each chunk inside its parent document, then index that enriched text in *both* a semantic vector store and a lexical BM25 store. This post walks the full architecture — what each component does, how the pieces connect, and where Claude sits in the loop.

## Key takeaways

- Contextual retrieval enriches every chunk with a 50–100 token, document-aware description **before** embedding, which sharply cuts the rate at which the right chunk fails to surface.
- The architecture runs two indexes in parallel — semantic embeddings and lexical BM25 — and fuses their results so exact identifiers and fuzzy meaning both win.
- Context generation is a one-time, offline cost; use Claude Haiku 4.5 with prompt caching on the full document to keep it cheap.
- A reranking stage after fusion is what turns "good recall" into "the model only sees the 5 chunks that matter."
- For agents, retrieval quality compounds: a clean first retrieval shortens the whole tool-call trajectory.

## What problem is the architecture actually solving?

Standard chunking destroys reference. A 200-token slice from page 14 of a contract might read "Either party may terminate with 30 days notice." Embedded alone, that vector lives near every other termination clause on the internet. It does not know it belongs to *the master services agreement with Acme, governed by New York law*. So a query like "how do I cancel the Acme MSA?" pulls back a neighbor's lease instead.

The second, quieter problem is lexical. Embeddings are great at meaning and terrible at exact strings. Queries that hinge on an error code, a SKU, a function name, or a version number ("ERR_2041", "v4.6.2") need exact-match retrieval, which is precisely what BM25 gives you and dense vectors fumble. A serious architecture refuses to choose between the two.

Contextual retrieval is a technique that prepends each chunk with a short, LLM-generated explanation of how the chunk fits into its source document, then indexes the combined text for both semantic and keyword search. That single definition captures the whole design: enrich, then index twice.

## How the components fit together

The pipeline splits cleanly into an offline indexing path and an online query path. Indexing happens once per document version; querying happens on every agent turn that needs grounding.

```mermaid
flowchart TD
  A["Source document"] --> B["Split into chunks"]
  B --> C["Claude Haiku writes chunk context (whole doc cached)"]
  C --> D["Enriched chunk = context + original"]
  D --> E["Embed into vector index"]
  D --> F["Tokenize into BM25 index"]
  G["Agent query"] --> H["Retrieve top-K from both indexes"]
  E --> H
  F --> H
  H --> I["Rank fusion + rerank to top-N"]
  I --> J["Claude reads N chunks & acts"]
```

The left branch (A through F) is your batch job. The right branch (G through J) runs in milliseconds at request time. Keeping them separate matters: context generation is the expensive, latency-tolerant step, and you never want it on the hot path.

## The enrichment step in detail

For each chunk, you send Claude the entire parent document plus the specific chunk, and ask for a terse situating sentence — not a summary of the chunk, but the context the chunk is missing. The output is something like "This clause is from the termination section of the Acme Master Services Agreement (2026), governed by New York law." You prepend that to the original chunk text before indexing.

```
CONTEXT_PROMPT = """{full_document}
Here is a chunk we want to situate within the whole document:
{chunk_text}
Give a short, standalone context (1-2 sentences) that says what
this chunk is about and where it sits in the document. Answer
with ONLY that context."""
```

The economic trick is prompt caching. The full document is identical across every chunk in the same document, so you cache it once and pay the cheap cached-read rate for each subsequent chunk. With Haiku 4.5 as the writer, enriching a large knowledge base becomes a back-of-the-envelope rounding error rather than a budget line.

## Dual indexing and rank fusion

Each enriched chunk goes into two stores. The vector index (any standard embedding model) handles semantic similarity. The BM25 index handles exact lexical overlap. At query time you pull the top results from each — say top-20 semantic and top-20 lexical — then merge them.

The merge is usually reciprocal rank fusion: each document gets a score based on its rank position in each list, and you sum across lists. A chunk that ranks #2 semantically and #4 lexically beats one that ranks #1 in only a single list. This is deliberately rank-based, not score-based, so you do not have to normalize incompatible similarity scales.

| Stage | Wins at | Misses |
| --- | --- | --- |
| Semantic only | Paraphrase, intent | Exact IDs, codes |
| BM25 only | Exact strings | Synonyms, meaning |
| Fused + context | Both | Very little |

## Why a reranker closes the loop

Fusion gives you high recall — the right chunk is probably in your top-20. But an agent should not read 20 chunks; that wastes context window and dilutes attention. A cross-encoder reranker reads the query and each candidate together and produces a true relevance score, letting you cut from 20 candidates to the 5 the model actually consumes.

For Claude agents this is where architecture meets economics. Fewer, sharper chunks mean a tighter system prompt, less chance of the model anchoring on an irrelevant passage, and shorter tool-call chains downstream. The reranker is the difference between an agent that confidently answers from the right paragraph and one that hedges across five half-relevant ones.

It is worth naming why this ordering of stages is not arbitrary. Recall and precision pull in opposite directions, and the architecture deliberately maximizes each at the stage best suited to it. Fusion is a recall stage: cast a wide net so the right chunk is almost certainly somewhere in the candidate pool, accepting that the pool is noisy. Reranking is a precision stage: read each candidate against the query with a heavier model that you could never afford to run over the whole corpus, and trust it to surface the few that truly answer the question. Trying to collapse the two — using one model to do both wide search and precise ranking — either costs too much to run at corpus scale or is too blunt to separate the near-misses from the hits. Splitting them is what makes contextual retrieval both affordable and accurate at once.

## Ship this architecture in 5 steps

1. Chunk your corpus with a sane splitter (respect headings and natural boundaries; aim for a few hundred tokens per chunk).
2. Generate per-chunk context with Claude Haiku 4.5, caching the full document across the batch.
3. Index every enriched chunk into both a vector store and a BM25 store.
4. At query time, retrieve top-K from each, fuse with reciprocal rank fusion, and rerank to the top 5–8.
5. Pass only those final chunks to Claude, with their source metadata, and let the agent act.

## Common pitfalls

- **Re-running enrichment on the hot path.** Context generation is offline. If you see it in your request latency, your architecture is wrong — cache the enriched chunks at index time.
- **Skipping BM25 because "embeddings are better."** They are not better at exact identifiers. Drop lexical search and your error-code and SKU queries quietly rot.
- **Forgetting prompt caching.** Without it, enrichment cost scales with chunks times document size and becomes genuinely expensive. With it, it nearly vanishes.
- **Stuffing the reranked output too wide.** Sending 20 chunks to the model defeats the point. Rerank hard; precision beats volume in an agent loop.
- **Letting context descriptions balloon.** One or two sentences. A paragraph of context per chunk just adds noise to the embedding.

## Frequently asked questions

### Does contextual retrieval replace fine-tuning?

No — they solve different problems. Fine-tuning changes how the model behaves; contextual retrieval changes what facts the model can see at inference. Most teams reach for retrieval first because it is cheaper, instantly updatable, and auditable.

### How much does the extra context inflate my index?

Each chunk grows by 50–100 tokens. That modestly increases storage and embedding cost at index time, but it is a one-time cost and the recall gain dwarfs it. Query-time cost is unchanged because you still retrieve a fixed top-K.

### Which Claude model should generate the context?

Haiku 4.5. The task is short, well-specified, and runs over your entire corpus, so you want the cheapest capable model with prompt caching. Reserve Sonnet or Opus for the agent's reasoning, not for bulk enrichment.

## Bringing agentic AI to your phone lines

CallSphere builds the same contextual-retrieval and multi-agent patterns into **voice and chat** assistants that answer every call, pull the right account detail mid-conversation, and book work around the clock. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/contextual-retrieval-architecture-for-claude-agents
