---
title: "Claude Context Design: What to Include and Leave Out"
description: "Design context for Claude agents the right way — what belongs in the window, what to exclude, and why curation beats stuffing for accuracy and cost."
canonical: https://callsphere.ai/blog/claude-context-design-what-to-include-and-leave-out
category: "Agentic AI"
tags: ["agentic ai", "claude", "context engineering", "prompt engineering", "anthropic", "rag"]
author: "CallSphere Team"
published: 2026-04-25T09:32:44.000Z
updated: 2026-06-07T01:28:22.494Z
---

# Claude Context Design: What to Include and Leave Out

> Design context for Claude agents the right way — what belongs in the window, what to exclude, and why curation beats stuffing for accuracy and cost.

Ask an experienced Claude engineer what most determines whether an agent works, and you will rarely hear "the prompt." You will hear "the context." Everything the model knows in a given turn lives in its context window, and the difference between a sharp agent and a confused, expensive one is almost always what made it into that window and what was wisely kept out. Even with Claude Code's roughly one-million-token capacity, more is not better — the entire window is reprocessed every turn, so each token you add costs latency, money, and a little reasoning clarity. Context design is the discipline of spending that budget well.

## Key takeaways

- Context is a **budget, not a bucket**: every token is reprocessed each turn, so relevance beats volume.
- Keep **stable, always-needed material** (role, constraints, tool schemas) at the front so prompt caching can reuse it cheaply.
- **Retrieve narrowly** — pull the few relevant facts, not the whole knowledge base.
- **Summarize and compact** old turns and raw tool output before they crowd out the task.
- Deliberately **exclude** stale data, redundant docs, and anything the model could be tempted to over-trust.

## Why "just add more context" fails

The intuition that a bigger window means a smarter agent is the most expensive mistake in this field. There are three concrete costs. First, money and latency: the model reads the whole window on every single turn, so a multi-turn agent re-pays for bloated context many times over. Second, dilution: when the few facts that matter are buried among thousands of marginally relevant tokens, the model's attention is split and accuracy drops. Third, false confidence: irrelevant or stale material in context can lead the model to cite or act on the wrong thing because it was "right there."

Context engineering is the practice of deciding what information enters a model's context window, in what form, and what is deliberately excluded, so the model has exactly what it needs and no more. Framed that way, the goal is obvious: a small, high-signal window beats a large, noisy one almost every time.

## A layout that holds up

Think of the window as ordered layers, from most stable to most volatile. The stable layers go first because Claude's prompt caching can reuse a long, unchanging prefix at a fraction of the cost. The volatile layers — the user's current request and freshly retrieved facts — go last. The diagram shows how a single turn assembles its window and where the gates that keep it lean live.

```mermaid
flowchart TD
  A["Incoming turn"] --> B["Stable prefix: role, constraints, tool schemas"]
  B --> C["Skills loaded only if task-relevant"]
  C --> D{"Need external facts?"}
  D -->|Yes| E["Retrieve top-k relevant chunks"]
  D -->|No| F["Skip retrieval"]
  E --> G["Compact prior turns into summary"]
  F --> G
  G --> H["Append current user request"]
  H --> I["Send curated window to Claude"]
```

Each branch is a decision about inclusion. Skills load only when relevant, so a thirty-tool capability set does not sit in every prompt. Retrieval is conditional and top-k bounded. Prior turns get compacted into a running summary rather than carried verbatim. The result is a window that stays roughly constant in size even as a long session accumulates history.

## What to include

Three things earn a permanent place. The **role and constraints** — who the agent is and the hard rules it must never break — anchor behavior and are the first thing you check when something goes wrong. The **tool schemas** the agent might need this turn, because the model can only call what it can see. And the **current request plus its directly relevant facts**: the specific order, the specific policy clause, the specific file section the task touches. Everything in this group is high-signal by construction.

A useful test for any candidate addition: would removing it change the correct answer for this turn? If not, it probably does not belong in the window right now. It can live in a tool, a skill, or a retrieval index and be pulled in only when a turn actually needs it.

## What to leave out

The harder discipline is exclusion. Leave out entire reference documents when a paragraph will do. Leave out raw tool dumps — a handler that returns ten thousand rows should be summarized to the relevant few before its output enters context. Leave out stale conversation history that a summary already captures. Leave out duplicate or near-duplicate sources that add tokens without adding facts. And be cautious with anything the model might over-trust: outdated policy text or a half-finished draft sitting in context can quietly steer an answer wrong.

This is where subagents become a context tool, not just a scaling tool. When a subtask is unavoidably noisy — exploring many files, trying several queries — run it in a subagent with its own window and let it return a tight summary. The orchestrator's context never sees the noise, only the conclusion.

## Order matters as much as content

Two windows can contain the exact same tokens and produce different answers depending on how those tokens are arranged. The model reads top to bottom, and material near the instruction it is currently following tends to weigh more heavily. The practical rule is to place the most decision-relevant facts close to the task statement, not buried three thousand tokens up. If an agent keeps ignoring a constraint, check where the constraint sits — a rule stranded at the top of a long window competes with everything after it.

This is why the stable-to-volatile layout is more than a caching trick. Putting the current request and its directly relevant facts last means they sit right where the model's attention is sharpest as it composes a response. It also pairs cleanly with retrieval: the chunks you pull for this turn belong near the question they answer, not interleaved with reference material from earlier turns. When you debug a context problem, ask two questions in order — is the right fact present, and is it in the right place? The second question is the one most teams forget to ask.

## Include or exclude: a quick reference

| Material | Default | Why |
| --- | --- | --- |
| Role & hard constraints | Include (front) | Anchors behavior, cacheable |
| Relevant tool schemas | Include | Model can only call what it sees |
| Top-k retrieved facts | Include (back) | High signal for this turn |
| Full reference docs | Exclude | Retrieve the relevant slice instead |
| Raw tool dumps | Exclude | Summarize to what matters |
| Stale history | Exclude | Replace with a running summary |

## Common pitfalls

- **Pasting whole documents.** Putting an entire manual in context to answer one question dilutes attention and inflates every turn's cost. Retrieve the relevant section.
- **Carrying full history forever.** Long sessions balloon if every turn is kept verbatim. Compact older turns into a summary you keep updating.
- **Returning unbounded tool output.** A tool that dumps thousands of rows poisons the window. Trim and structure results in the handler.
- **Putting volatile content in the cached prefix.** If per-request data sits at the front, you lose the caching benefit. Keep the prefix stable and append the variable parts.
- **Leaving contradictory or stale facts in context.** The model may anchor on the wrong one. Remove superseded material rather than hoping the model ignores it.

## Tighten your context in 5 steps

1. Order the window **stable-to-volatile** and verify the prefix never changes per request.
2. Replace any full document with a **retrieval step** that returns only relevant chunks.
3. Cap tool handlers so they return **summaries, not raw dumps**.
4. Add a **turn-compaction** step that rolls older history into a summary.
5. Move noisy exploration into a **subagent** that returns a conclusion, not its scratch work.

## Frequently asked questions

### If Claude has a million-token window, why curate at all?

Because the whole window is reprocessed every turn, large context means higher cost and latency on each iteration and weaker focus on the facts that matter. Capacity is a ceiling, not a target — use the smallest window that answers the question well.

### How do I keep a long conversation from filling the window?

Compact older turns into a running summary and keep only recent, relevant exchanges verbatim. The summary preserves what the agent learned without paying for the full transcript on every turn.

### What is the single highest-leverage context change?

Replacing document-stuffing with narrow retrieval. Pulling the few relevant chunks instead of whole sources usually improves accuracy and cuts cost at the same time — the rare change that helps on both axes.

### How do I know if my context is too big?

Watch token counts per turn and latency as a session grows. If they climb steadily while answer quality flattens or dips, the window is accumulating noise. Add turn compaction and tighten retrieval until per-turn token use stabilizes — a healthy agent's context size should plateau, not grow without bound across a long conversation.

## Bringing agentic AI to your phone lines

CallSphere applies this context discipline to **voice and chat** agents, giving each call exactly the customer facts and tools it needs and nothing that would slow it down. Hear curated context in action at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/claude-context-design-what-to-include-and-leave-out