Context Engineering for Claude Agents: What to Include (Building AI Agents For Startups)
Prompt and context design for Claude agents: what to include, what to leave out, compaction, and tuning context with evals to keep agents sharp.
Every Claude agent has one truly scarce resource, and it is not the model's intelligence — it is the context window's attention. The model is only ever as good as what you put in front of it on a given turn, and the most common reason capable agents give mediocre answers is not a weak prompt but a cluttered context. Context engineering is the deliberate practice of curating that window: deciding what earns a place in it, what stays out, and how to keep the signal high as a task unfolds. This post is about doing that well.
Why context is a budget, not a bucket
It is tempting to treat a large context window — Claude Code offers up to a million tokens — as a place to dump everything that might help. That instinct backfires. Models attend less reliably to information buried in a sea of marginally relevant text, and every token you add is a token competing for the model's focus and adding to your cost. The right mental model is a budget you spend, not a bucket you fill. The goal of context engineering is to find the smallest set of high-signal tokens that reliably produces the behavior you want.
Context engineering is the practice of curating exactly what enters a model's context window on each turn — instructions, retrieved data, tool results, and history — so the model has the relevant information and nothing that distracts from it. Framed that way, the discipline has two halves: a curation half (what to include) and a hygiene half (how to keep it clean as the conversation grows). Both matter, and most agents that degrade over a long session are failing the second half.
What belongs in the window
Four things reliably earn their place. First, the system prompt — the agent's role, rules, and operating boundaries, written specifically enough to be operational. Second, tool definitions — but only the tools relevant to the current task, since each unused tool is selection noise. Third, the active task context — the specific user, the specific records, the immediate question — fetched just in time rather than pre-loaded. Fourth, recent reasoning and tool results that the next step actually depends on.
flowchart TD
A["New turn begins"] --> B["Include: system prompt & relevant tools"]
B --> C["Fetch active task data just in time"]
C --> D{"History near budget?"}
D -->|No| E["Send curated context to Claude"]
D -->|Yes| F["Compact old turns into summary"]
F --> G["Drop raw history, keep decisions"]
G --> E
E --> H["Claude reasons on high-signal context"]The unifying principle is relevance over completeness. A window holding the one policy that applies and the one order in question beats a window holding the entire policy manual and the customer's full order history. When the agent needs something it does not have, the answer is a tool call to fetch it, not a bigger upfront dump. This is why just-in-time retrieval and lean context are two sides of one coin.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
What to deliberately leave out
Knowing what to exclude is the harder skill. Leave out raw, unsummarized tool output — a 50KB API response should be trimmed to the fields the agent uses before it re-enters the window. Leave out stale history from earlier phases of a task that have already concluded; once a sub-goal is done, its blow-by-blow no longer needs to occupy space. Leave out speculative knowledge — documents included "just in case" that the task may never touch. And leave out duplicated instructions that say the same thing three ways.
A useful test: for each chunk of context, ask whether the next decision actually depends on it. If you cannot name the decision it informs, it is probably noise. This is also where Skills earn their keep — specialized procedures load only when the task triggers them, so your idle context stays lean instead of carrying instructions for every capability the agent might someday use. Exclusion is not deprivation; it is making room for the model to attend to what matters.
Keeping context clean over a long run
Long-running agents fail in a characteristic way: the window fills with the residue of completed steps, signal-to-noise drops turn by turn, and reasoning gets foggier until the agent loses the thread. The remedy is compaction — periodically summarizing earlier turns into a compact synopsis that preserves decisions, commitments, and open questions while discarding the verbose history that produced them. Done well, compaction holds context cost roughly flat and keeps the agent as sharp at turn fifty as at turn five.
Two companions strengthen compaction. A persistent plan or scratchpad stored outside the conversation gives the agent a stable anchor that survives summarization, so even after raw history is gone, the agent re-reads its goals and remains coherent. And structured note-taking — having the agent record key findings to a durable place it can re-read — lets it offload memory from the window to storage. Together these turn the context window back into what it should be: scarce working memory, not an ever-growing transcript.
Tuning context with evals, not vibes
Context decisions should be measured, not guessed. The reliable method is an eval set — representative tasks with expected behaviors — that you run as you add or remove context. Discover that an instruction changes no outcomes? Delete it; it was only costing tokens and attention. Find that trimming a tool result breaks a behavior? Restore the field that mattered. This turns context engineering from folklore into engineering, and it compounds: each pruned token makes the next decision cheaper and clearer.
The payoff of treating context as a tuned, measured budget is an agent that stays fast, cheap, and reliable as its world grows. Most teams discover that aggressively cutting context — fewer tools, leaner results, disciplined compaction — improves quality rather than harming it, because the model finally gets to attend to the things that matter. Less, curated well, beats more.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
Does a bigger context window mean I can stop curating?
No. A larger window raises the ceiling but does not remove the cost of clutter — attention still degrades when relevant facts are buried, and you still pay per token. Even with a million-token window, curate for relevance; the window is a budget, not a reason to skip the work.
What is the difference between context engineering and prompt engineering?
Prompt engineering focuses on the wording of instructions. Context engineering is broader: it governs everything in the window each turn — instructions, tools, retrieved data, history — and how that set evolves over a long task. Prompt engineering is one slice of context engineering.
How do I keep a long conversation from degrading?
Compact older turns into summaries that preserve decisions and open questions, store the plan outside the conversation as a stable anchor, and have the agent take durable notes it can re-read. These keep signal high and cost roughly flat across a long run.
How do I know what to remove from context?
For each chunk, name the decision it informs; if you cannot, it is likely noise. Then confirm with an eval set — remove the chunk, rerun, and keep the removal if behavior holds. Measure rather than guess.
Bringing agentic AI to your phone lines
CallSphere applies this context discipline to voice and chat — keeping each call's window lean and relevant so the agent answers fast, uses tools mid-conversation, and books work without losing the thread. Hear it for yourself at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.