---
title: "Build Contextual Retrieval Step by Step with Claude"
description: "Runnable walkthrough: chunk, contextualize with Claude Haiku + prompt caching, index vectors and BM25, fuse, rerank, and ground a Claude agent."
canonical: https://callsphere.ai/blog/build-contextual-retrieval-step-by-step-with-claude
category: "Agentic AI"
tags: ["agentic ai", "claude", "contextual retrieval", "rag", "implementation", "prompt caching", "anthropic"]
author: "CallSphere Team"
published: 2026-01-30T08:23:11.000Z
updated: 2026-06-07T01:28:23.644Z
---

# Build Contextual Retrieval Step by Step with Claude

> Runnable walkthrough: chunk, contextualize with Claude Haiku + prompt caching, index vectors and BM25, fuse, rerank, and ground a Claude agent.

Architecture diagrams are nice, but at some point you have to write the code. This post is the build log: a working contextual-retrieval pipeline you can follow command by command, from a folder of documents to a Claude agent answering grounded questions. No hand-waving — every stage has the actual call shape it needs.

We will assume you already have an Anthropic API key and a Python environment. The pipeline has two halves: an offline indexer you run once per corpus version, and an online retriever the agent calls per turn. We build the offline half first because the online half depends on its output.

## Key takeaways

- You can stand up contextual retrieval in an afternoon — the hard part is wiring prompt caching, not the search itself.
- Generate chunk context with one Claude Haiku 4.5 call per chunk, caching the full document so repeated reads are cheap.
- Store enriched chunks in a vector index and a BM25 index; both read from the same enriched text.
- At query time, retrieve, fuse with reciprocal rank fusion, rerank, then hand the top results to Claude.
- Persist your enriched chunks to disk so you never regenerate context for unchanged documents.

## Step 1 — chunk the corpus

Start by splitting each document into chunks that respect structure. Splitting on blind character counts breaks sentences mid-thought; split on headings and paragraphs first, then pack to a target size of a few hundred tokens. Keep a back-reference from every chunk to its parent document, because the next step needs the whole document on hand.

```
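# load_documents and split_on_headings are helpers you supply:
# load_documents yields objects with .id and .text; split_on_headings returns strings.
chunks = []
for doc in load_documents("./corpus"):
    for piece in split_on_headings(doc.text, target_tokens=300):
        chunks.append({"doc_id": doc.id,
                       "doc_text": doc.text,  # kept only for Step 2, dropped before indexing
                       "text": piece})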
```

Carrying `doc_text` on each chunk looks wasteful, but it is exactly what lets us cache the document in the next step. We discard it before indexing.

A word on chunk size, because it quietly governs everything downstream. Too small and a chunk loses enough surrounding text that even contextualization cannot rescue it; too large and a single chunk spans two unrelated topics, so neither index ranks it cleanly for either. A few hundred tokens is a defensible default for prose. For structured material — tables, code, configuration — prefer to split on logical units like a whole table or a whole function even when that pushes a chunk slightly over target, because cutting through a table mid-row destroys its meaning faster than length ever would. Keep a small overlap between adjacent chunks if your documents flow continuously, so a sentence that straddles a boundary still appears whole in at least one chunk.
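As a rough illustration of the structure-first splitting described above, here is a minimal sketch of a `split_on_headings` helper. It assumes Markdown-style `#` headings and blank-line paragraph breaks, and it approximates token counts by counting words (`approx_tokens` is a hypothetical stand-in for a real tokenizer). It does not implement overlap or special handling for tables and code.

```python
import re

def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

def split_on_headings(text: str, target_tokens: int = 300) -> list[str]:
    """Split on Markdown headings, then pack paragraphs up to target_tokens."""
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        current, size = [], 0
        for para in paragraphs:
            n = approx_tokens(para)
            # Close the current chunk if adding this paragraph would overshoot.
            if current and size + n > target_tokens:
                chunks.append("\n\n".join(current))
                current, size = [], 0
            current.append(para)
            size += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than the target is kept whole rather than cut, which matches the advice to let a logical unit run slightly over instead of splitting it.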

## Step 2 — generate context with prompt caching

This is the core move. For each chunk, send Claude the full parent document and the chunk, and ask for one or two situating sentences. Mark the document block as cacheable so every chunk after the first in the same document reads it at the cheap cached rate.

```
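import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(doc_text, chunk_text):
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=120,
        messages=[{"role": "user", "content": [
            # The document goes first and carries the cache marker, so the
            # cached prefix is identical for every chunk of this document.
            {"type": "text",
             "text": f"<document>\n{doc_text}\n</document>",
             "cache_control": {"type": "ephemeral"}},
            # The chunk and the instruction come after the cached prefix.
            {"type": "text",
             "text": f"Chunk:\n{chunk_text}\n\nGive 1-2 sentences "
                     "situating this chunk in the document. "
                     "Output only that context."}
        ]}])
    return msg.content[0].text.strip()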
```

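The `cache_control` marker on the document block is what makes bulk enrichment affordable. Process all chunks of one document back to back so the cache stays warm for that document before you move to the next. Two limits are worth knowing: a cache entry expires after a few minutes without a hit (five by default), and a prompt shorter than the model's minimum cacheable length is not cached at all, so very short documents pay the normal input rate on every call.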

It helps to picture the cost. Without caching, each chunk re-reads the entire document, so a hundred-chunk document pays for a hundred full document reads. The cost scales with chunk count times document length, which adds up quickly on a real knowledge base. With caching, the first call writes the document to the cache at a modest premium over the normal input rate, and every subsequent chunk reads it back at a small fraction of that rate. Choosing Haiku 4.5 for the writer compounds the saving: it is the cheapest capable model in the family and the situating task is short and well-bounded, so you are not paying frontier-model rates to write a one-sentence header. Treat enrichment as a batch job you run on a schedule, not something a user ever waits on.
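To put rough numbers on that, the sketch below compares input cost with and without caching, measured in "base-rate token equivalents" so no dollar prices are needed. The multipliers (1.25x for a cache write, 0.10x for a cache read) are assumptions based on commonly published pricing for the default cache lifetime; check the current pricing page before relying on them. The document and chunk sizes are made up for illustration.

```python
def enrichment_input_tokens(doc_tokens, n_chunks, chunk_tokens,
                            cache_write_mult=1.25, cache_read_mult=0.10):
    """Return (uncached, cached) input cost in base-rate token equivalents."""
    # Without caching, every call pays for the whole document plus the chunk.
    uncached = n_chunks * (doc_tokens + chunk_tokens)
    cached = (doc_tokens * cache_write_mult                    # first call writes the cache
              + (n_chunks - 1) * doc_tokens * cache_read_mult  # later calls read it
              + n_chunks * chunk_tokens)                       # chunk text is never cached
    return uncached, cached

# Illustrative sizes only: a 30,000-token document split into 100 chunks of 300 tokens.
uncached, cached = enrichment_input_tokens(doc_tokens=30_000, n_chunks=100, chunk_tokens=300)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}  ratio: {uncached / cached:.1f}x")
```

Under these assumed multipliers the cached run costs roughly an eighth of the uncached one for this document, and the gap widens as the document gets longer relative to its chunks.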

## How the build flows

```mermaid
flowchart TD
  A["./corpus folder"] --> B["Step 1: chunk"]
  B --> C["Step 2: Haiku context (cached doc)"]
  C --> D["enriched = context + chunk"]
  D --> E["Step 3a: vector index"]
  D --> F["Step 3b: BM25 index"]
  D --> G["persist enriched.jsonl"]
  H["agent query"] --> I["Step 4: retrieve + fuse + rerank"]
  E --> I
  F --> I
  I --> J["Step 5: Claude answers"]
```

Notice the `persist` branch (G). Writing enriched chunks to a JSONL file means a re-index after a code change never re-pays the enrichment cost — you only re-run Step 2 for documents whose text actually changed.
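One way to implement that persistence is sketched below: each record in the JSONL file carries a hash of its parent document, and a document is re-enriched only when no record with its id and hash exists yet. The record layout and the function names (`doc_hash`, `load_enriched`, `enrich_corpus`) are assumptions for illustration; `contextualize` and `split` are passed in so the sketch does not depend on any particular implementation.

```python
import hashlib
import json
from pathlib import Path

def doc_hash(doc_text: str) -> str:
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()

def load_enriched(path: Path) -> dict:
    """Map (doc_id, doc_hash) -> enriched chunk records already on disk."""
    by_key = {}
    if path.exists():
        with path.open(encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                by_key.setdefault((rec["doc_id"], rec["doc_hash"]), []).append(rec)
    return by_key

def enrich_corpus(docs, path: Path, contextualize, split):
    """docs: iterable of (doc_id, doc_text). Only new or changed docs call contextualize."""
    cached = load_enriched(path)
    records = []
    with path.open("a", encoding="utf-8") as out:
        for doc_id, doc_text in docs:
            key = (doc_id, doc_hash(doc_text))
            if key in cached:
                records.extend(cached[key])  # unchanged document: reuse, no API calls
                continue
            for piece in split(doc_text):
                rec = {"doc_id": doc_id, "doc_hash": key[1], "text": piece,
                       "context": contextualize(doc_text, piece)}
                out.write(json.dumps(rec) + "\n")
                records.append(rec)
    return records
```

Records for superseded document versions stay in the file in this sketch; a periodic compaction pass that drops stale hashes is left out for brevity.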

## Step 3 — build both indexes

Embed the enriched text into your vector store of choice, and tokenize the same enriched text into a BM25 index. The key discipline: both indexes read the *enriched* string, not the raw chunk. If your BM25 index sees only raw text while your vectors see enriched text, fusion compares apples to oranges.

```
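context = contextualize(chunk["doc_text"], chunk["text"])
enriched = f"{context}\n\n{chunk['text']}"
# Drop doc_text here: store only the clean chunk and its source id.
meta = {"doc_id": chunk["doc_id"], "text": chunk["text"]}
vector_index.add(embed(enriched), meta=meta)
bm25_index.add(tokenize(enriched), meta=meta)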
```

Store the original chunk text and source metadata alongside each entry. At answer time you want to show Claude the clean chunk plus a citation, not the context sentence you generated for retrieval purposes.

Pick the embedding model deliberately and then pin it. Re-embedding a large corpus is expensive, so a mid-quality model you keep is often better in practice than a top model you swap out every quarter. Whatever you choose, the vector index and the BM25 index must be built from the same enriched string in the same pass — the most common reason fusion underperforms is that the two indexes drifted out of sync because someone rebuilt one and forgot the other. Write a single indexing function that adds to both stores from one record, so it is structurally impossible for them to disagree.
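A minimal sketch of that single indexing function follows. The store interface (`add(payload, meta=...)`) mirrors the snippet above and is an assumption, not a real library's API; `embed` and `tokenize` are injected so the same wrapper works with whatever embedding model and tokenizer you pinned.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class DualIndexer:
    vector_index: Any  # hypothetical store exposing add(vector, meta=...)
    bm25_index: Any    # hypothetical store exposing add(tokens, meta=...)
    embed: Callable[[str], list[float]]
    tokenize: Callable[[str], list[str]]

    def add(self, record: dict) -> None:
        """Index one enriched record in both stores from the same string."""
        enriched = f"{record['context']}\n\n{record['text']}"
        meta = {"doc_id": record["doc_id"], "text": record["text"]}
        # Compute both payloads before touching either store, so a failure
        # in embed or tokenize leaves the two indexes in agreement.
        vector = self.embed(enriched)
        tokens = self.tokenize(enriched)
        self.vector_index.add(vector, meta=meta)
        self.bm25_index.add(tokens, meta=meta)
```

Because callers never see the two stores directly, there is no code path that adds to one index and skips the other.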

## Step 4 — retrieve, fuse, and rerank

On each query, pull the top-K from both indexes, fuse with reciprocal rank fusion, then rerank the fused list with a cross-encoder down to the handful you will actually pass to Claude.

```
def retrieve(query, k=20, final=6):
    v = vector_index.search(embed(query), k)
    b = bm25_index.search(tokenize(query), k)
    fused = reciprocal_rank_fusion([v, b])
    return rerank(query, fused)[:final]
```

Reciprocal rank fusion needs no score normalization: it scores each chunk by rank position, summing `1 / (rank + constant)` over every list the chunk appears in (60 is a common choice for the constant). That makes it robust to the fact that cosine similarity and BM25 scores live on totally different scales.
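The formula is short enough to write out. This sketch assumes each input list holds chunk ids ordered best-first and uses 1-based ranks with a constant of 60, a commonly used default; if your search calls return richer hit objects, map them to ids first. The ids in the example are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, constant=60):
    """Fuse ranked lists of ids; ranks are 1-based, higher fused score is better."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] += 1.0 / (rank + constant)
    # Sort ids by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
# ['c2', 'c7', 'c4', 'c9']
```

Chunks that appear in both lists (`c2`, `c7`) rise above chunks that appear in only one, which is the behavior you want from a hybrid retriever.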

## Step 5 — hand the chunks to a Claude agent

Finally, format the top chunks with their sources and put them in the agent's context. Keep the system prompt strict: answer only from the provided chunks, cite the source, and say so when the chunks do not contain the answer. That last instruction is what keeps a retrieval-grounded agent honest.

One detail decides whether this whole build feels production-grade: pass the source id with every chunk and require Claude to cite it. Citations are not decoration — they make answers auditable, let a human jump straight to the underlying document, and give you a cheap signal during evaluation, because you can check whether the cited chunk actually supports the claim. When you later wire retrieval behind a tool the agent calls on demand, the same formatted-chunk-with-citation shape carries over unchanged, so it is worth getting right now while the pipeline is small enough to reason about end to end.

```
context_block = "\n\n".join(
    f"[{c['doc_id']}] {c['text']}" for c in retrieve(query))
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system="Answer only from the chunks. Cite [doc_id]. "
           "If the answer is not present, say so.",
    messages=[{"role": "user",
               "content": f"{context_block}\n\nQ: {query}"}])
answer = resp.content[0].text
```
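The cheap evaluation signal mentioned above can start as a simple citation check. The sketch below assumes answers cite sources in the `[doc_id]` form the system prompt asks for; `cited_ids` and `check_citations` are hypothetical helper names. It only verifies that every cited id was actually shown to the model, not that the cited chunk supports the claim, which still needs a human or a grader model.

```python
import re

def cited_ids(answer: str) -> set[str]:
    """Extract every [doc_id] style citation from the answer text."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))

def check_citations(answer: str, retrieved: list[dict]) -> dict:
    allowed = {c["doc_id"] for c in retrieved}
    cited = cited_ids(answer)
    return {"cited": cited & allowed,   # citations that match a retrieved chunk
            "unknown": cited - allowed, # ids the model cited but was never shown
            "uncited": not cited}       # True when the answer has no citation at all
```

A non-empty `unknown` set is a strong hint that the model invented a source, and `uncited` answers are worth sampling by hand.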

## Common pitfalls

- **Embedding raw chunks but BM25-indexing enriched text (or vice versa).** Keep them identical or fusion is meaningless.
- **Re-contextualizing the whole corpus on every deploy.** Persist enriched chunks and key them by document hash; only redo what changed.
- **Letting the cache go cold.** Cache entries expire after a few minutes without a hit, so interleaving documents during enrichment can let a document's entry lapse between its chunks, and you pay the write price again. Batch by document.
- **Passing the context sentence to the model as if it were source text.** The context is for retrieval; show Claude the original chunk and cite the real source.
- **Skipping the "say so when unknown" instruction.** Without it, a thin retrieval turns into a confident hallucination.

## Frequently asked questions

### How long does indexing a mid-size corpus take?

The bottleneck is enrichment API calls, which parallelize well. A few thousand chunks finish in minutes if you run several concurrent requests and keep document caches warm. Embedding and BM25 indexing are usually quick by comparison.
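One way to combine concurrency with warm caches is sketched below. It assumes a cache entry only becomes readable once the first request for a document has started responding, so the first chunk of each document runs alone and the remaining chunks run in parallel. `enrich_concurrently` is a hypothetical name, `contextualize` is passed in, and `max_workers=4` is an arbitrary starting point that you should tune against your rate limits.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def enrich_concurrently(chunks, contextualize, max_workers=4):
    """chunks: dicts with doc_id, doc_text, text. Returns contexts in input order."""
    # Group chunk positions by parent document so each cache is used back to back.
    by_doc = defaultdict(list)
    for i, chunk in enumerate(chunks):
        by_doc[chunk["doc_id"]].append(i)

    contexts = [None] * len(chunks)

    def run(i):
        contexts[i] = contextualize(chunks[i]["doc_text"], chunks[i]["text"])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for indices in by_doc.values():
            run(indices[0])                   # first call runs alone to write the cache
            list(pool.map(run, indices[1:]))  # the rest run in parallel; list() surfaces errors
    return contexts
```

Documents are still processed one after another here, which keeps each cache hot; parallelizing across documents as well is possible but makes rate limiting harder to reason about.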

### Can I add new documents without rebuilding everything?

Yes. Contextual retrieval is incremental — chunk the new document, contextualize just its chunks, and append to both indexes. Nothing about existing entries changes, which is the whole reason you persisted them.

### Do I need a managed vector database?

Not to start. An in-memory vector index plus an in-process BM25 library is enough to prove the pipeline. Reach for a managed store when corpus size or concurrent traffic outgrows a single process.
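To show how little is needed on the keyword side, here is a minimal in-process BM25 index in pure Python. It is a sketch of standard Okapi BM25 with typical parameter defaults (`k1=1.5`, `b=0.75`), not a replacement for a tested library; the class name `TinyBM25` and its `add`/`search` interface are assumptions chosen to match the snippets in this post.

```python
import math
from collections import Counter

class TinyBM25:
    """Minimal in-memory Okapi BM25 with the usual k1 and b parameters."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs: list[Counter] = []  # term frequencies per entry
        self.metas: list[dict] = []
        self.df: Counter = Counter()   # number of entries containing each term
        self.total_len = 0

    def add(self, tokens: list[str], meta: dict) -> None:
        tf = Counter(tokens)
        self.docs.append(tf)
        self.metas.append(meta)
        self.df.update(tf.keys())
        self.total_len += len(tokens)

    def search(self, query_tokens: list[str], k: int = 20) -> list[dict]:
        n = len(self.docs)
        if n == 0:
            return []
        avg_len = self.total_len / n
        scored = []
        for tf, meta in zip(self.docs, self.metas):
            doc_len = sum(tf.values())
            score = 0.0
            for term in set(query_tokens):
                if term not in tf:
                    continue
                # Smoothed IDF that stays positive even for very common terms.
                idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
                norm = tf[term] + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * tf[term] * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, meta))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [meta for _, meta in scored[:k]]
```

Every search scans all entries, which is fine for a proof of concept and is exactly the point at which a larger corpus pushes you toward a real search engine.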

## Bringing agentic AI to your phone lines

CallSphere takes this exact build pattern — enrich, index, fuse, rerank, ground — and runs it behind **voice and chat** agents that answer every call and message, fetch the right record mid-conversation, and book jobs 24/7. See it working at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

