Claude Context Design: What to Include and Omit
Practical context and prompt design for Claude agents: what to include, what to leave out, and why lean context is more reliable, cheaper, and faster.
There is a seductive but wrong instinct in agent building: when an agent gets something wrong, add more to its context. More instructions, more examples, more retrieved documents. It feels like you are helping. Often you are making it worse. A context window stuffed with marginally-relevant material does not make Claude smarter; it dilutes the signal, raises cost, and increases the odds the model anchors on the wrong detail. Good context design is as much about what you leave out as what you put in.
This post is a practical guide to deciding what belongs in Claude's context on any given turn. It is the discipline that separates agents that stay sharp at scale from ones that slowly degrade as their prompts accrete. The principles apply whether you are building a coding agent, a support bot, or a research assistant.
Key takeaways
- Context is a budget, not a bucket — every token you add competes for the model's attention.
- Include only what this turn needs: the task, the few relevant facts, and the tools to act.
- Leave out stale history, redundant docs, and "just in case" instructions — they cost accuracy and money.
- Use retrieval to pull facts on demand instead of pre-loading everything.
- Put durable knowledge in Skills and durable facts in memory, so context stays lean per turn.
Why more context can hurt
It is tempting to treat the context window as free space — a 1M-token window invites you to fill it. But the model has to attend across everything you include, and relevance is not uniform. When the genuinely important instruction sits between paragraphs of tangential background, it competes for attention with all of it. Teams routinely find that trimming an over-stuffed prompt improves answer quality, because the signal-to-noise ratio went up. Bigger windows raise the ceiling on what you can include; they do not change the principle that you should include what is relevant.
Cost and latency compound the case. Every token in context is paid for on every turn and adds to time-to-first-token. An agent that re-sends 40,000 tokens of history it never references is burning money and patience for no accuracy benefit. The right question on each turn is not "what could possibly help?" but "what does this specific step need?"
A working definition of context engineering
Context engineering is the deliberate practice of curating what enters a model's context window on each turn — selecting the relevant instructions, facts, and tools, ordering them effectively, and excluding everything that does not earn its place. It treats the context window as a scarce resource to be allocated, not a scratchpad to be filled. Framed this way, the job becomes a series of include/exclude decisions you can reason about explicitly.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["New turn"] --> B{"Needed for THIS step?"}
B -->|Core task| C["Include: instruction + goal"]
B -->|Relevant fact| D["Retrieve & include top-k"]
B -->|Occasional how-to| E["Load Skill on demand"]
B -->|Stale or tangential| F["Exclude"]
C --> G["Assemble lean context"]
D --> G
E --> G
G --> H["Claude reasons & acts"]What to include
Three things almost always earn their place. First, the task and constraints — what the agent should do this turn and the hard rules it must respect, stated concisely. Second, the relevant facts — the specific records, code, or documents this step touches, retrieved on demand rather than pre-loaded. Third, the tools the agent might need, defined clearly so it can act. Keep each of these tight; a focused instruction beats a verbose one, and three precise documents beat twenty loosely-related chunks.
For retrieval specifically, rank and trim aggressively. If your retriever returns ten chunks, including the top three by relevance usually produces better answers than including all ten, because the bottom of the list is mostly noise that competes for attention. Quality of selection beats quantity of inclusion almost every time.
There is a positional dimension too. Information placed at the very start or very end of a long context tends to be attended to more reliably than material buried in the middle. So when you do include several pieces, order them with the most important first rather than scattering the critical instruction halfway down a long block. The practical rule: decide what the single most important thing for this turn is, put it where the model will see it clearly, and let everything else earn a place beneath it or not appear at all.
What to leave out
Several things masquerade as helpful but mostly hurt. Stale conversation history: once a sub-task is resolved, its back-and-forth rarely needs to ride along on every future turn — summarize it and drop the transcript. Redundant documents: three sources saying the same thing add tokens, not information. "Just in case" instructions: rules for situations this agent never encounters dilute the ones that matter. Whole files when a function will do: for coding agents, pulling an entire module when the task touches one function buries the relevant code.
| Item | Include? | Better alternative |
|---|---|---|
| This turn's task & constraints | Yes | Keep it concise |
| Top-k relevant retrieved facts | Yes | Rank and trim hard |
| Full conversation history | No | Summarize resolved threads |
| Every related document | No | Dedupe to the few that matter |
| Occasional procedures | No, in prompt | Load as a Skill on demand |
Patterns that keep context lean
Three patterns do most of the heavy lifting. Summarize and compact: when a long sub-task finishes, replace its transcript with a short summary of the outcome so future turns carry the conclusion, not the deliberation. Retrieve on demand: instead of pre-loading a knowledge base, let the agent fetch the specific record it needs via a tool, so context contains only what was actually used. Externalize the durable: put recurring procedures in Skills and persistent facts in a memory store, both pulled in only when relevant. Together these keep each turn's context proportional to the work the turn is doing.
Compaction is especially powerful for long-running agents. A coding agent that has been working for an hour may have accumulated dozens of tool results and dead-ends; carrying all of that forward both costs tokens and risks the model re-anchoring on an abandoned approach. Periodically compacting the history into "here is what we have established and what remains" keeps the agent oriented on the current state of the work rather than re-living every step that got it there. The summary becomes the working memory; the raw transcript can be dropped or archived for audit without riding along in context.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Tune your context in 5 steps
- Instrument token counts per turn and find the requests carrying the most context.
- For the biggest ones, audit what the model actually referenced versus what you sent.
- Cut stale history and redundant docs; replace resolved threads with summaries.
- Move occasional procedures into Skills and durable facts into a memory store.
- Switch pre-loaded knowledge to on-demand retrieval with aggressive top-k trimming, then re-run your evals to confirm quality held or improved.
Common pitfalls
- Treating the big window as free. A 1M-token window is a ceiling, not a target. Relevance still rules.
- Never compacting history. Letting transcripts accumulate degrades both cost and reasoning. Summarize resolved threads.
- Including all retrieval hits. The long tail of a retrieval result is mostly noise. Trim to the top few.
- Stuffing every rule into the prompt. Rules for situations that never occur dilute the ones that do. Scope instructions to the agent's real job.
- Sending whole files to coding agents. Provide the relevant function or section, not the module, so the signal stays on top.
Frequently asked questions
Doesn't a 1M-token context window mean I can stop worrying about this?
No. A larger window raises what you can include, but the model still attends across everything you put in, and irrelevant material still competes for attention and costs tokens. Curate regardless of window size.
How do I decide between retrieval and pre-loading?
Pre-load only small, stable, always-needed material (it also caches well). Use retrieval for anything large or situational, so context contains the specific facts a turn used rather than the whole corpus.
What's the difference between context, Skills, and memory?
Context is what is in the window this turn. Skills are procedures Claude loads on demand when relevant. Memory is a persistent store of facts the harness queries and selectively injects. Skills and memory exist so context can stay lean.
How do I know if I'm including too much?
Trim and measure. If cutting stale history or redundant documents leaves eval scores flat or improved while lowering cost and latency, that material was noise. Make trimming an ongoing habit, not a one-time cleanup.
Bringing agentic AI to your phone lines
CallSphere uses disciplined context design to keep voice and chat agents fast and accurate on every call — pulling exactly the customer facts a moment needs and nothing more. Experience lean, sharp agents at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.