Cut Token Cost in Contextual Retrieval Claude Agents
Keep contextual-retrieval RAG cheap and fast on Claude: prompt caching, the Batch API, reranking, and context budgets — with a clear cost decision table.
Contextual retrieval makes Claude agents noticeably smarter — and, if you are not careful, noticeably expensive. Every chunk you contextualize, every retrieval you stuff into the prompt, and every multi-turn tool loop adds tokens, and tokens are the meter that runs while your agent thinks. Teams that move from a prototype to real traffic often see their per-request cost double or triple, not because the model changed, but because the context grew unchecked. The good news: the levers that bring cost down — prompt caching, batching, and disciplined context budgets — also make the agent faster, because fewer tokens means lower latency.
This post is a practical guide to keeping contextual-retrieval runs cheap and fast on Claude, covering where the tokens actually go, how prompt caching changes the math, when to batch the contextualization step, and how to budget the live context window so an agent stays snappy under load.
Key takeaways
- The two big token sinks are the contextualization pass (one-time, at index build) and the live retrieval context (every request). Optimize them separately.
- Prompt caching is the single biggest lever for live cost: cache the stable prefix — system prompt, tool definitions, retrieved reference docs — so repeat reads are far cheaper than fresh input tokens.
- Batch the contextualization step with the Message Batches API; it is asynchronous, half-price, and ideal for indexing thousands of chunks where latency doesn't matter.
- Use a smaller model (Haiku) for the contextualization pass and reserve Opus or Sonnet for the live agent reasoning.
- Cap retrieved chunks (often 5–20, reranked) instead of dumping the top 50; precision beats volume and saves tokens.
Where the tokens actually go
Contextual retrieval has two distinct cost centers, and conflating them is the most common budgeting mistake. The first is index-time cost: before you can serve queries, you run each chunk through a model once to generate its context header ("This excerpt is from the enterprise SLA and defines the uptime credit schedule"). For a large corpus this is thousands or millions of small generation calls. It is a one-time or periodic cost, and crucially it is latency-insensitive — nobody is waiting on it.
The second is query-time cost: every user request pulls retrieved chunks into the prompt, plus your system prompt and tool definitions, plus the growing conversation history. This is what runs on every single request, so a few hundred wasted tokens here multiply across all your traffic. The whole optimization strategy follows from this split — make index-time cheap with batching and a small model, make query-time cheap with caching and tight budgets.
flowchart TD
A["Raw corpus"] --> B["Index-time: contextualize chunks"]
B --> C{"Latency matters?"}
C -->|No| D["Batch API + Haiku, async, half price"]
D --> E["Embedded & indexed"]
E --> F["Query-time: retrieve top-k, rerank"]
F --> G{"In cached prefix?"}
G -->|Yes| H["Cheap cache read"]
G -->|No| I["Full-price input tokens"]
H --> J["Claude answers"]
I --> J
Prompt caching: the biggest query-time lever
Prompt caching lets you mark a stable prefix of your prompt so that on subsequent requests Claude reads it from cache at a steep discount instead of re-processing it as fresh input. For agents this is transformative, because a large fraction of every prompt is identical request to request: the system prompt, the tool definitions, and often a set of reference documents that rarely change.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
The pattern is to order your prompt from most-stable to least-stable and place a cache breakpoint after the stable part. Everything before the breakpoint gets cached; the user's actual question and the freshly retrieved chunks come after it.
messages = client.messages.create(
model="claude-sonnet-4-6",
system=[
{"type": "text", "text": LONG_SYSTEM_PROMPT},
{"type": "text", "text": STABLE_REFERENCE_DOCS,
"cache_control": {"type": "ephemeral"}}
],
tools=TOOL_DEFINITIONS, # also cacheable, keep stable
messages=conversation,
)
Two rules make caching pay off. First, keep the cached prefix byte-stable — even reordering tool definitions or changing whitespace invalidates the cache. Second, put volatile content (the user turn, retrieved chunks for this query) after the breakpoint, so the expensive part stays reusable. With a busy agent that shares a system prompt and reference set across all users, cache hit rates on the prefix can be very high, and the savings on input tokens are large.
Batching the contextualization pass
Generating context headers for a big corpus is the perfect job for the Message Batches API. You submit many independent requests as one batch, Anthropic processes them asynchronously within a window, and the price per token is roughly half the synchronous rate. Because indexing has no user waiting on it, the asynchronous turnaround is free upside.
Combine batching with model selection. The contextualization task — summarizing how a chunk fits its document in one sentence — does not need your most capable model. Use Haiku for the header generation and reserve Sonnet or Opus for live reasoning. The header quality from a small model is more than enough for retrieval precision, and the cost difference at corpus scale is dramatic.
One more index-time saving: only re-contextualize chunks that changed. Hash each chunk's raw text and skip regeneration when the hash matches, so a nightly re-index touches only new and edited content rather than the whole corpus.
Budgeting the live context window
The instinct under pressure is to retrieve more chunks "to be safe." This is backwards. Dumping the top 50 chunks into the prompt bloats tokens, slows the response, and — because of the lost-in-the-middle effect — often lowers answer quality. A tight retrieval budget of a handful of well-ranked chunks usually beats a fat one.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
The reliable recipe is to over-retrieve cheaply, then rerank and trim. Pull the top 50–100 candidates from your vector and keyword search, run them through a reranker, and pass only the top 5–20 to Claude. You get the recall benefit of a wide net without paying to feed all of it into the expensive reasoning step. Also cap conversation history: summarize or drop turns once the running history exceeds a threshold, so a long session doesn't quietly balloon every subsequent request.
Ship a cheaper run in five steps
- Split your spend dashboard into index-time and query-time so you optimize the right one.
- Move contextualization to the Message Batches API with Haiku and skip unchanged chunks via content hashing.
- Restructure prompts: stable system prompt, tool defs, and reference docs first, with a cache breakpoint; volatile content after.
- Add a reranker and cap passed chunks to a small top-k instead of a large raw top-k.
- Set a conversation-history budget that summarizes or truncates old turns before they inflate every request.
Common pitfalls
- Invalidating the cache by accident. Injecting a timestamp or a per-user token into the cached prefix means you never get a cache hit. Keep the prefix identical across requests.
- Contextualizing with your biggest model. Running Opus on a million chunk headers is a waste; Haiku produces fine headers at a fraction of the cost.
- Over-retrieving for safety. More chunks means more tokens, more latency, and often worse answers. Rerank and trim.
- Ignoring conversation growth. A 40-turn session can carry a huge history into every request. Summarize or window it.
- Synchronous indexing. Building an index with live synchronous calls pays full price for work no user is waiting on. Batch it.
Cost decision table
| Lever | Where it helps | Typical effect | When to use |
|---|---|---|---|
| Prompt caching | Query-time | Large input-token savings on stable prefix | Shared system prompt / reference docs |
| Batch API + Haiku | Index-time | ~half price, async | Bulk contextualization |
| Rerank + trim top-k | Query-time | Fewer tokens, often better answers | Always |
| History windowing | Query-time | Stops per-request growth | Long sessions |
Frequently asked questions
Does prompt caching help if every user has different data?
Yes, as long as part of the prompt is shared. The system prompt, tool definitions, and any common reference material are identical across users and can be cached; only the per-user retrieved chunks and the question need to be fresh. Structure the prompt so the shared part is the cached prefix and the per-user part follows it.
Is contextual retrieval worth the extra index-time cost?
For most agents, yes. The contextualization pass is a one-time or periodic cost that you can batch at half price with a small model, and it meaningfully improves retrieval precision — which in turn lets you pass fewer chunks at query time. The query-time savings often offset the index-time spend.
How many chunks should I retrieve per query?
Over-retrieve to a wide candidate set cheaply, then rerank and pass a small top-k — commonly somewhere between 5 and 20 — to the model. The exact number depends on your domain, but passing dozens of raw chunks almost always costs more and answers worse than a reranked handful.
Will a smaller model for contextualization hurt retrieval quality?
Rarely in a way that matters. The contextualization task is a short, constrained summary of how a chunk relates to its document. Haiku handles that well, and any small loss in header nuance is outweighed by the ability to index your whole corpus affordably and refresh it often.
Faster, cheaper agents on every call
CallSphere brings these same efficiency patterns to voice and chat — caching stable context, retrieving tightly, and keeping latency low so agents answer instantly and at a sane cost on every call and message. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.