---
title: "Contextual Retrieval Code Patterns for Claude Agents"
description: "Reusable contextual RAG patterns: structure the enrichment prompt, shape retrieved chunks, and expose retrieval as a typed tool Claude agents call on demand."
canonical: https://callsphere.ai/blog/contextual-retrieval-code-patterns-for-claude-agents
category: "Agentic AI"
tags: ["agentic ai", "claude", "contextual retrieval", "rag", "prompt engineering", "tool use", "design patterns"]
author: "CallSphere Team"
published: 2026-01-30T08:46:22.000Z
updated: 2026-06-07T01:28:23.648Z
---

# Contextual Retrieval Code Patterns for Claude Agents

> Reusable contextual RAG patterns: structure the enrichment prompt, shape retrieved chunks, and expose retrieval as a typed tool Claude agents call on demand.

Once you have built contextual retrieval once, you start noticing the same handful of design decisions every time: how to phrase the enrichment prompt, how to shape what comes back from retrieval, where to draw the line between the retriever and the agent. Getting these patterns right is the difference between a pipeline you reuse across five projects and one you rewrite from scratch each time. This post is the pattern library.

The framing here is deliberately code-level. We are not re-deriving the architecture — we are codifying the reusable shapes: prompt structure, data structure, and the tool contract that lets a Claude agent treat retrieval as just another tool it can call when it decides it needs grounding.

## Key takeaways

- Put the document *before* the chunk in the enrichment prompt and ask for context, not a summary — order and intent both matter.
- Return retrieved chunks as structured records with text, source, and score so the agent can cite and reason about confidence.
- Expose retrieval as a typed tool the agent calls on demand, not a step you always run before every turn.
- Cache the document block, keep the instruction block uncached, and you get cheap, correct enrichment.
- Separate "retrieval text" from "display text" — enrich for search, show the original for citation.

## Pattern 1 — the enrichment prompt shape

The enrichment prompt has a fixed skeleton worth memorizing: cacheable document block first, then the specific chunk, then a tight instruction that asks for situating context and forbids preamble. Order matters because the cache key is a prefix — the document must come first to be reusable across chunks.

```
[
  {"type": "text", "text": "...full document...",
   "cache_control": {"type": "ephemeral"}},
  {"type": "text", "text":
     "Chunk:\n{chunk}\n\nWrite 1-2 sentences situating this "
     "chunk in the document above. No preamble, context only."}
]
```

The phrase "no preamble, context only" earns its keep. Without it, smaller models love to answer "Sure! Here is the context:" and you end up storing that politeness in your index. Constrain the output shape in the prompt, not with post-processing.

Two refinements pay off once you run this at scale. First, give the instruction a concrete shape to imitate — a single example of the situating sentence you want — and the model converges on it immediately instead of guessing your house style. Second, cap the output length explicitly in the request so a chatty model cannot turn a one-line header into a paragraph that dilutes the embedding. The enrichment prompt is one of the few places where being terse and prescriptive strictly beats being open-ended, because you are generating index metadata, not prose a human will read.

## Pattern 2 — the enriched-chunk record

Do not flatten everything into one string. Keep a structured record so downstream stages can use each field for its own purpose: the enriched text feeds the indexes, the original text feeds the model, and the metadata feeds citations and filters.

```
{
  "id": "acme-msa#14",
  "doc_id": "acme-msa",
  "text": "Either party may terminate with 30 days notice.",
  "context": "Termination clause of the Acme MSA (2026), NY law.",
  "retrieval_text": "Termination clause... Either party...",
  "section": "Termination"
}
```

The split between `retrieval_text` (context + original, indexed) and `text` (original, shown to Claude) is the single most reused pattern in this whole post. Indexes search the enriched string; the agent reads and cites the clean string.

Keeping the record structured also future-proofs you. The `section` field becomes a filter when someone asks for "the termination terms only." The stable `id` becomes a citation handle and a cache key. The `doc_id` lets you invalidate every chunk of a document in one operation when that document changes. None of these uses are obvious on day one, but each one is painful to retrofit if you started by flattening everything into a single opaque string. Spend the few extra bytes per record now; they buy you flexibility that compounds across every feature you build on top of retrieval later.

## How an agent invokes the pattern

```mermaid
flowchart TD
  A["Agent turn"] --> B{"Need grounding?"}
  B -->|No| C["Answer directly"]
  B -->|Yes| D["Call search_kb tool"]
  D --> E["Retriever: fuse + rerank"]
  E --> F["Return structured chunks"]
  F --> G["Agent reads text, cites source"]
  G --> H{"Enough to answer?"}
  H -->|No| D
  H -->|Yes| I["Final grounded answer"]
```

The loop back from H to D is the pattern that distinguishes an agent from a pipeline. The model decides whether the first retrieval was enough and re-queries with a refined search if not. Your tool contract has to make that cheap and natural.

## Pattern 3 — retrieval as a typed tool

Rather than always retrieving before the model speaks, expose retrieval as a tool the agent calls when it judges grounding is needed. This keeps cheap conversational turns cheap and lets the model issue multiple, sharper searches when a question is genuinely complex.

```
{
  "name": "search_kb",
  "description": "Search the knowledge base for grounding "
    "facts. Call when you need specifics you are unsure of.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string"},
      "top_k": {"type": "integer", "default": 6}
    },
    "required": ["query"]
  }
}
```

The description is doing real work: it tells Claude *when* to call the tool, not just what it does. A vague description ("searches documents") gets called erratically; an intent-laden one ("call when you need specifics you are unsure of") gets called at the right moments.

The deeper pattern is letting the agent own the decision to retrieve. In a pipeline, retrieval is a fixed step that always fires; in an agent, it is a capability the model invokes when its own uncertainty warrants it. That shift is what enables the refinement loop — the agent can read a thin first result, reformulate the query with the vocabulary it just learned from the returned chunks, and search again. To make that loop productive rather than wasteful, keep the tool fast and cheap enough that a second call is never something the model should hesitate over, and make the returned structure rich enough that the agent can tell a genuinely empty result from a merely imperfect one. A tool that is expensive or opaque quietly discourages the very behavior that makes agentic retrieval better than a one-shot pipeline.

## Pattern 4 — structured returns the model can reason over

Return tool results as a list of records, each carrying text, a source id, and a relevance score. The score lets the model calibrate: if every returned chunk scored low, a well-prompted agent will hedge or re-query rather than fabricate. Hand back bare strings and you throw away that signal.

Be honest about what the score means, though. A reranker's relevance score is a useful relative ordering, not a calibrated probability of truth, so the right way to use it is as a soft signal in the prompt rather than a hard threshold the model applies blindly. Tell the agent, in plain language, that lower-scored chunks are weaker evidence and that a result set where everything scores low is a cue to hedge or re-query. Combined with the loopback in the diagram above, this gives you an agent that degrades gracefully: when grounding is thin, it asks again or admits uncertainty instead of confidently stitching together whatever the retriever happened to return.

```
[
  {"source": "acme-msa#14", "score": 0.91,
   "text": "Either party may terminate with 30 days notice."},
  {"source": "acme-msa#15", "score": 0.62,
   "text": "Notice must be in writing to the addresses in Ex. A."}
]
```

## When to apply each pattern

| Pattern | Use when | Skip when |
| --- | --- | --- |
| Cached doc block | Enriching many chunks per doc | One-off single chunk |
| retrieval_text vs text split | Always | Never skip this |
| Retrieval-as-tool | Agentic, multi-turn | Single-shot Q&A |
| Scored structured returns | Agent must self-calibrate | Display-only UI |

## Common pitfalls

- **Putting the chunk before the document in the prompt.** The cache prefix must be the document; reorder and caching silently stops working.
- **Returning one big concatenated string from the tool.** The agent loses per-chunk sources and scores and can no longer cite or self-calibrate.
- **Always retrieving on every turn.** For agentic systems, let the model decide; forced retrieval wastes tokens on "thanks, that helps" turns.
- **Writing a vague tool description.** Claude routes tool calls off the description — invest a sentence in *when*, not just *what*.
- **Indexing the original instead of the enriched text.** You lose the entire benefit of contextual retrieval while keeping all of its indexing cost.

## Frequently asked questions

### Should the enrichment prompt summarize or contextualize?

Contextualize. A summary restates what the chunk already says; context adds what the chunk is missing — which document, which section, which entity. Only the missing context improves retrieval, because the chunk text is already in the index.

### How do I keep these patterns consistent across services?

Package the enriched-chunk record and the `search_kb` tool schema as a shared library. Every service that needs grounding imports the same retriever and the same tool definition, so the agent contract never drifts between projects.

### Does retrieval-as-a-tool cost more than always retrieving?

Sometimes more, often less. The model skips retrieval on turns that do not need it (saving tokens) but may retrieve twice on hard ones. Net cost usually drops because most turns in a real conversation do not need fresh grounding.

## Bringing agentic AI to your phone lines

CallSphere wires these retrieval patterns into **voice and chat** agents that decide mid-call when to look something up, pull the exact record, cite it, and book the work — 24/7. See the patterns in action at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/contextual-retrieval-code-patterns-for-claude-agents
