Skip to content
Agentic AI
Agentic AI8 min read0 views

Context Design for Retrieval-Grounded Claude Agents

What to put in a Claude agent's context and what to cut: ordering reranked chunks, deduping, citations, and grounding rules for better RAG answers.

You can build a flawless contextual-retrieval pipeline and still get mediocre answers if you dump the results into the model carelessly. The context window is a scarce, attention-limited resource, and how you fill it determines whether Claude grounds its answer in the right paragraph or drowns in five half-relevant ones. This post is about the last mile: context design — deciding what goes into the window, in what order, and what you deliberately leave out.

The instinct to maximize recall ("include everything that might be relevant, just in case") is exactly backwards for agents. More chunks dilute attention, lengthen prompts, and increase the odds the model anchors on a passage that merely looks relevant. Good context design is mostly subtraction.

Key takeaways

  • Precision beats volume — pass the 4–8 reranked chunks that matter, not the top 20 from fusion.
  • Order chunks by relevance and put the strongest evidence where the model attends best.
  • Strip the generated context sentence before display; show Claude the original chunk plus a real citation.
  • Tell the agent explicitly to answer only from context and to say when the answer is absent.
  • Reserve the system prompt for stable rules and the user turn for the query and chunks — don't blur them.

What belongs in the window

Three things, and not much else: the agent's stable instructions (its role and grounding rules), the reranked retrieved chunks with their source ids, and the user's actual question. Everything beyond that is overhead competing for attention. If a chunk did not survive reranking, it does not belong in the window — that is the entire point of reranking.

Each chunk should arrive as clean source text with a citation handle, not as the enriched string you indexed. The context sentence you generated during enrichment did its job at retrieval time; showing it to the model now just adds words it has to read past. Display the original; cite the source.

There is a budgeting mindset that helps here. Think of the context window as having a fixed attention budget that every token spends down, whether or not that token earns its place. A chunk that is only loosely relevant does not merely fail to help — it actively competes with the chunks that do, pulling some of the model's attention toward a passage that looks on-topic but answers a slightly different question. That is why the discipline is subtractive: you are not asking "what could conceivably be useful," you are asking "what can I remove and still answer correctly." The smallest context that fully supports the answer is almost always the best one.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

What to leave out, and why

Leave out low-scoring chunks, duplicate passages, and the retrieval machinery's internal artifacts. A surprising amount of bad grounding comes from near-duplicate chunks: the same fact appears three times with slightly different wording, the model over-weights it, and a single source masquerades as a consensus. Deduplicate before you assemble the window.

flowchart TD
  A["Reranked candidates"] --> B["Drop below score threshold"]
  B --> C["Deduplicate near-identical chunks"]
  C --> D["Order by relevance"]
  D --> E["Attach source ids"]
  E --> F["Assemble: rules + chunks + query"]
  F --> G["Claude answers, cites, or says unknown"]

Read this as a funnel that keeps narrowing. Each stage removes something — weak scores, duplicates, then irrelevant ordering — so that by the time the window is assembled, every token in it is pulling its weight. Nothing in the funnel adds chunks; it only ever subtracts or reorders.

How ordering changes the answer

Models do not attend uniformly across a long context. Evidence positioned at the edges of the chunk block tends to land harder than evidence buried in the middle of a long list. So order matters: put your highest-confidence chunk first, and avoid a giant undifferentiated wall where the best evidence sits in the soft middle. With only 4–8 well-reranked chunks, this is easy — which is another reason to rerank hard before assembling.

A practical rule: if you find yourself wanting to include a tenth chunk "to be safe," your reranker is undertuned. Fix retrieval upstream rather than compensating by flooding the window downstream.

Deduplication deserves more care than it usually gets, because the failure it prevents is invisible in the logs. When the same fact appears three times in slightly different wording — a clause repeated across an original contract and two amendments, say — the model sees apparent corroboration and over-weights it, and a lone source masquerades as a consensus of three. Worse, if one of those near-duplicates is subtly outdated, the repetition can drown out the single correct, current chunk. Collapse near-identical passages before assembly, keep the highest-scoring representative, and you remove both the false-consensus effect and a whole class of stale-answer bugs in one step.

The grounding instruction that keeps agents honest

Context design is not only about the chunks — it is also about the instruction that frames them. The system prompt should state plainly that the agent answers only from the provided context, cites the source for each claim, and explicitly says when the context does not contain the answer. That last clause is what converts a thin retrieval into an honest "I don't know" instead of a confident fabrication.

SYSTEM = (
  "Answer ONLY using the chunks provided in the user turn. "
  "Cite the [source_id] after each claim. If the chunks do "
  "not contain the answer, say you don't have that "
  "information and suggest what to search for next."
)

The "suggest what to search for next" clause matters in an agentic loop: it nudges Claude to issue a refined retrieval rather than stall, turning a miss into a productive second query instead of a dead end.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Phrasing the grounding rule as a hard constraint rather than a gentle preference changes behavior more than most teams expect. "Prefer the provided context" leaves the model room to fall back on its parametric memory, which is exactly where stale or plausible-but-wrong facts sneak in. "Answer only from the provided context" closes that door, and pairing it with an explicit escape hatch — say that you do not have the information — keeps the closed door from forcing a fabrication when the context genuinely falls short. The combination is what makes a grounded agent trustworthy: it cannot wander off its sources, and it has a sanctioned way to admit a gap instead of papering over it. Test this directly by asking questions you know the corpus does not cover and confirming the agent declines rather than guesses.

System prompt vs. user turn — keep them separate

Stable rules go in the system prompt; volatile content — the chunks and the question — goes in the user turn. Mixing them is a common, subtle mistake: teams paste retrieved chunks into the system prompt, and then every turn rebuilds the system prompt, which defeats prompt caching and bloats the stable layer with transient data. Keep the system prompt fixed and cacheable; keep retrieval in the turn.

Goes in contextStays out
Top 4–8 reranked chunksThe other 12+ candidates
Original chunk textThe enrichment context sentence
Source ids for citationInternal scores, vectors
Stable grounding rulesPer-turn chunks in system prompt

Common pitfalls

  • Maximizing recall in the window. Twenty chunks dilute attention. Pass the few that survived reranking and nothing else.
  • Leaving duplicates in. Near-identical chunks let one source impersonate a consensus and skew the answer.
  • Showing the enrichment sentence as source text. It was for retrieval; the model should read and cite the original.
  • Putting retrieved chunks in the system prompt. It breaks caching and pollutes the stable layer with per-turn data.
  • Omitting the "say when unknown" rule. Without it, thin context becomes confident hallucination — the failure mode grounding was supposed to prevent.

Frequently asked questions

How many chunks should I actually pass?

Usually four to eight after reranking. The right number depends on chunk size and question complexity, but if you are tempted past ten, tune the reranker instead. Precision in the window beats raw coverage almost every time.

Should I include source metadata in the prompt?

Include a compact source id per chunk so the model can cite it, and keep the rest of the metadata out. Internal scores and vectors are for your retriever's logic, not for the model's reasoning, and they only consume attention.

Does context design interact with prompt caching?

Strongly. Keep stable instructions in a cacheable system prompt and put volatile chunks in the user turn. That way the expensive, fixed portion is cached across turns while only the small retrieval payload changes each time.

Bringing agentic AI to your phone lines

CallSphere applies this disciplined context design to voice and chat agents that ground every answer in the right record, cite their source, and admit when they need to check — then book the work 24/7. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.