Claude Context Design for Caching: What to Keep and Cut
Design Claude context to cache well: a three-layer budget for what to keep, what to cut, and how to inject dynamic facts without breaking the cache.
Most caching advice stops at mechanics — where to put the breakpoint, how to read the usage fields. But the highest-leverage caching decision happens earlier, when you decide what goes into the context at all. Every token you include is a token that must either stay frozen to cache or vary and invalidate. Context design and caching design are the same discipline viewed from two angles: a prompt that is well-organized for the model to reason over is almost always well-organized for the cache, and vice versa. This post is about making those decisions deliberately.
We will work through what belongs in context and what does not, how to separate the durable from the disposable, and why cutting the wrong thing hurts both cache-hit rate and answer quality. The framing is opinionated: treat context as a layered budget, put the stable foundation first, and be ruthless about keeping per-request noise out of the part you want to cache.
Key takeaways
- Include what is reusable and reasoning-relevant; cut what is per-request noise. The same call improves caching and reduces distraction.
- Separate context into a frozen foundation (persona, policies, schemas), a per-session layer (user, retrieved docs), and a per-turn layer (history, latest message).
- Never interpolate timestamps, IDs, or session state into the system prompt: doing so drives the cache-hit rate to zero for everything after the interpolated value and adds nothing the model needs there.
- Prefer injecting dynamic facts as a late message or a role: "system" message over rewriting the cached prefix.
- Trim aggressively: redundant boilerplate both pollutes the model's attention and inflates the tokens you pay to cache.
What context actually is, and why placement is design
Context engineering is the practice of deciding which information the model can see for a given request and how that information is arranged, so the model has exactly what it needs to reason well and nothing that distracts it. Caching adds a second axis to every such decision: not just "should this be in context" but "how often does it change." Those two questions, answered together, determine both quality and cost.
The reason they align is that good context and good cache structure both favor stability. The information a model relies on across many requests — its instructions, the policies it enforces, the schemas of its tools — is exactly the information that should be frozen and cached. The information that distracts a model — a stale timestamp, a request ID, a one-off note — is exactly the per-request noise that invalidates the cache. So the discipline of cutting noise and freezing the foundation serves both goals at once.
The three-layer context budget
The practical model is to treat context as three layers ordered by change frequency, mapped directly onto the render order. The foundation layer never changes within a deployment: the agent's role, behavioral rules, policy text, and tool schemas. The session layer changes per user or per conversation: a profile, retrieved knowledge, account state. The turn layer changes every message: the conversation history and the newest input.
| Layer | Contents | Cache behavior |
|---|---|---|
| Foundation | Persona, policies, tool schemas, few-shot examples | Frozen, cached once, read by every request |
| Session | User profile, retrieved docs, account context | Cached per session, read across that session's turns |
| Turn | History, latest message, dynamic facts | Recomputed per turn; only the tail is fresh |
Designing to this budget means each piece of incoming information gets assigned to a layer by its change frequency, and the layers are placed in stability order. The payoff is that a typical turn only pays full price for the newest message — the foundation and session layers are served from cache. Quality benefits too, because the model sees a clean, consistently structured prompt rather than a jumble of stable and volatile content interleaved.
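The layered budget can be sketched as a request builder. This is a minimal sketch that assumes the Anthropic Messages API request shape, where `system` text blocks accept a `cache_control` marker and tools render ahead of the system prompt. The helper name `build_request`, the model placeholder, and the sample strings in the usage note are hypothetical.

```python
from typing import Any

# Marker in the shape the Anthropic Messages API documents for prompt caching.
CACHE_MARK = {"type": "ephemeral"}


def build_request(
    tools: list[dict[str, Any]],
    foundation_text: str,
    session_text: str,
    history: list[dict[str, Any]],
    user_message: str,
) -> dict[str, Any]:
    """Assemble a request with layers in stability order (hypothetical helper)."""
    system = [
        # Breakpoint 1: end of the shared foundation. Tools render ahead of the
        # system prompt, so this breakpoint covers the tool schemas as well.
        {"type": "text", "text": foundation_text, "cache_control": CACHE_MARK},
        # Breakpoint 2: end of the per-session layer.
        {"type": "text", "text": session_text, "cache_control": CACHE_MARK},
    ]

    # Turn layer: prior history plus the newest input, which is the fresh tail.
    messages = list(history) + [{"role": "user", "content": user_message}]

    return {
        "model": "MODEL_ID_PLACEHOLDER",  # assumption: substitute a real model ID
        "max_tokens": 1024,
        "tools": list(tools),
        "system": system,
        "messages": messages,
    }
```

Calling `build_request(tools, "You are a support agent.", "User: Ada", [], "Hi")` for two different users yields an identical first system block, which is what lets the foundation be shared across sessions while each session block caches separately.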
flowchart TD
A["New information to add"] --> B{"Changes how often?"}
B -->|Never| C["Foundation layer — freeze + cache"]
B -->|Per session| D["Session layer — cache per session"]
B -->|Per turn| E["Turn layer — recomputed each turn"]
B -->|Per request noise| F{"Does the model need it?"}
F -->|No| G["Cut it"]
F -->|Yes| H["Inject as late message, never in system prompt"]

The decision tree is the whole method. For every candidate piece of context, ask how often it changes; if the answer is "per request," ask whether the model genuinely needs it, and if it does, place it as late as possible instead of in the cached prefix. Most per-request content fails the need test and should simply be cut.
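The same decision tree can be written as a small classifier. This is only an illustrative sketch: the `Layer` enum, the `assign_layer` function, and the frequency labels are hypothetical names chosen to mirror the flowchart.

```python
from enum import Enum


class Layer(Enum):
    FOUNDATION = "foundation"
    SESSION = "session"
    TURN = "turn"
    LATE_MESSAGE = "late_message"
    CUT = "cut"


def assign_layer(changes: str, model_needs_it: bool = True) -> Layer:
    """Map a change frequency to a layer, following the decision tree."""
    if changes == "never":
        return Layer.FOUNDATION
    if changes == "per_session":
        return Layer.SESSION
    if changes == "per_turn":
        return Layer.TURN
    if changes == "per_request":
        # Per-request noise survives only if the model genuinely needs it,
        # and even then it goes late, never into the cached prefix.
        return Layer.LATE_MESSAGE if model_needs_it else Layer.CUT
    raise ValueError(f"unknown change frequency: {changes!r}")


print(assign_layer("never"))                              # Layer.FOUNDATION
print(assign_layer("per_request", model_needs_it=False))  # Layer.CUT
```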
What to keep: reusable, reasoning-relevant context
Keep the things the model uses to make decisions across many requests. Tool schemas earn their place at position zero because the model consults them constantly and they never change. Policy and persona text belong in the foundation because they shape every response. Few-shot examples that demonstrate the desired output format are worth caching because they improve quality on every call and, after the initial cache write, are billed at the discounted cache-read rate rather than the full input price.
Retrieved documents are the interesting middle case. A knowledge-base article that is relevant for an entire session belongs in the session layer, cached once and read across the conversation. The same article re-retrieved freshly on every turn — with a different ranking order or a timestamped header — would invalidate per turn and waste the savings. The design move is to retrieve once per session, normalize the rendering so it is byte-stable, and cache it behind a session breakpoint.
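One way to normalize the rendering is sketched below. It assumes each retrieved document is a dict with `id`, `title`, and `body` fields plus volatile extras such as a score or retrieval timestamp; those field names and the `render_session_docs` helper are hypothetical.

```python
import json
from typing import Any


def render_session_docs(docs: list[dict[str, Any]]) -> str:
    """Render retrieved documents so the same set always yields the same bytes."""
    # Sort by a stable key instead of the retriever's ranking, which can shift.
    ordered = sorted(docs, key=lambda d: d["id"])
    parts = []
    for doc in ordered:
        # Keep only stable fields; drop scores, timestamps, and request IDs.
        stable = {
            "id": doc["id"],
            "title": doc["title"],
            "body": doc["body"].strip(),
        }
        parts.append(json.dumps(stable, sort_keys=True, ensure_ascii=False))
    return "\n".join(parts)
```

Two retrievals that return the same documents in a different ranking order, or with different timestamps, then render to identical text, so the session-layer cache entry stays valid.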
What to cut: per-request noise and redundancy
The most common mistake is leaving things in context that the model does not need and that change every request. A Current time: ... line at the top of the system prompt is the canonical example: it almost never affects the answer, and it interpolates a fresh value into the front of the prefix, guaranteeing a zero cache-hit rate for everything after it. If the model truly needs the time for a particular request, it goes in the latest user message, not the frozen system prompt.
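The effect is easy to see by hashing the prefix. In this sketch a hash stands in for the cache's prefix match; the `SYSTEM_PROMPT` text, the helper names, and the sample times are all made up for illustration.

```python
import hashlib
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a scheduling assistant. Follow the booking policy."


def fingerprint(text: str) -> str:
    """Hash a prefix; a changed hash means the cached prefix no longer matches."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def volatile_system(now: datetime) -> str:
    # Anti-pattern: a fresh value at the front changes the prefix every request.
    return f"Current time: {now.isoformat()}\n{SYSTEM_PROMPT}"


def user_turn(text: str, now: datetime) -> dict:
    # Preferred: the time rides in the latest user message, after the cached prefix.
    return {"role": "user", "content": f"{text}\n\n(Current time: {now.isoformat()})"}


t1 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2025, 1, 1, 9, 1, tzinfo=timezone.utc)

print(fingerprint(volatile_system(t1)) == fingerprint(volatile_system(t2)))  # False
print(fingerprint(SYSTEM_PROMPT) == fingerprint(SYSTEM_PROMPT))              # True
```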
Redundancy is the quieter cost. Boilerplate that restates the same instruction three different ways, verbose preambles, and copy-pasted policy fragments inflate both the token count you pay to cache and the surface area competing for the model's attention. Trimming them tightens the cache budget and sharpens reasoning simultaneously. A good test: if removing a line does not change the model's behavior, it was noise — and noise in the cached prefix is paid for on every cold write.
Injecting dynamic facts without breaking the prefix
Sometimes you genuinely need to deliver a fact mid-conversation — a mode toggled, the user supplied new context, the remaining budget dropped. The wrong move is to rewrite the system prompt, because that sits ahead of the entire history and reprocesses every cached turn. The right move is to append the fact after the cached history.
    messages = history + [
        {"role": "user", "content": user_message},
        {"role": "system",
         "content": "Auto-approve mode is on; do not ask for confirmation."},
    ]

A role: "system" message placed after the conversation carries operator authority but leaves the cached prefix intact, on models that support it. Phrase such injections as context rather than override commands: state the fact and let the model act on it. This pattern lets the foundation and session layers stay frozen and cached while still steering behavior in real time, which is exactly the balance context design is trying to strike.
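Because not every API or model accepts a system-role message in the middle of a conversation, a fallback is worth having. The sketch below is one possible approach, not an established pattern from the source: the `with_dynamic_fact` helper, the `system_role_supported` switch, and the bracketed `[Context: ...]` convention are all assumptions.

```python
from typing import Any


def with_dynamic_fact(
    history: list[dict[str, Any]],
    user_message: str,
    fact: str,
    system_role_supported: bool,
) -> list[dict[str, Any]]:
    """Append a dynamic fact after the cached history (hypothetical helper)."""
    if system_role_supported:
        # User turn first, then a late system message carrying the fact.
        tail = [
            {"role": "user", "content": user_message},
            {"role": "system", "content": fact},
        ]
    else:
        # Fallback: carry the fact inside the newest user turn, clearly delimited.
        tail = [{"role": "user", "content": f"{user_message}\n\n[Context: {fact}]"}]
    # The history list is copied, not mutated, so earlier turns stay byte-identical.
    return list(history) + tail
```

Either branch only appends to the end of the message list, so everything the cache has already seen remains an unchanged prefix.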
Common pitfalls
- A timestamp or session ID in the system prompt. Drives the cache-hit rate to zero for everything after it and rarely helps the answer. Move it to the latest turn or cut it.
- Re-retrieving and re-rendering session documents every turn. Invalidates the session cache repeatedly. Retrieve once, normalize, cache behind a session breakpoint.
- Hoarding context "just in case." Extra tokens distract the model and inflate cache-write cost. Cut anything that does not change behavior.
- Rewriting the system prompt to deliver a mid-session fact. Reprocesses the whole history. Append a role: "system" message after the cached turns instead.
- Conditional system sections. Each flag combination is a distinct prefix; branch in message content, not in the cached foundation.
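The last pitfall is easy to quantify. The sketch below counts the distinct prefixes that conditional sections produce; the flag names, the `BASE` prompt, and the bracketed message format are hypothetical.

```python
from itertools import product

BASE = "You are a support agent."
FLAGS = ["auto_approve", "verbose", "beta_tools"]


def conditional_system(enabled: dict[str, bool]) -> str:
    # Anti-pattern: each flag adds or removes a section of the cached text.
    sections = [BASE] + [f"Mode enabled: {name}." for name in FLAGS if enabled[name]]
    return "\n".join(sections)


def flags_as_message(enabled: dict[str, bool]) -> dict:
    # Preferred: the system prompt stays BASE; flags travel in message content.
    active = ", ".join(name for name in FLAGS if enabled[name]) or "none"
    return {"role": "user", "content": f"[Active modes: {active}]"}


combos = [dict(zip(FLAGS, values)) for values in product([False, True], repeat=len(FLAGS))]
distinct_prefixes = {conditional_system(c) for c in combos}
print(len(distinct_prefixes))  # 8 distinct prefixes from 3 boolean flags
```

With the flags moved into message content, every one of those eight combinations shares the single `BASE` prefix.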
Frequently asked questions
Should I cache retrieved documents or keep them out of the prompt?
If a document is relevant for the whole session, retrieve it once, render it deterministically, and cache it in the session layer — it is read across every turn and pays for itself quickly. If a document is relevant only to a single turn, include it inline in that turn without a breakpoint; caching a one-off document only charges the write premium with no read.
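The break-even point can be worked out with simple arithmetic. In this sketch the cost multipliers are illustrative assumptions (a 1.25x premium on cache writes and a 0.10x rate on cache reads, relative to normal input price); check your provider's current pricing before relying on them.

```python
def caching_pays_off(uses: int, write_mult: float = 1.25, read_mult: float = 0.10) -> bool:
    """Compare cached vs. uncached cost for a segment, in units of its base input cost."""
    if uses < 1:
        raise ValueError("uses must be at least 1")
    # Assumed multipliers: one premium-priced write, then discounted reads.
    cached = write_mult + (uses - 1) * read_mult
    # Without caching, the segment is paid at full price on every use.
    uncached = float(uses)
    return cached < uncached


print(caching_pays_off(1))  # False: a one-off document only pays the write premium
print(caching_pays_off(2))  # True: 1.35 units cached vs. 2.0 uncached
```

Under these assumed multipliers a segment needs to be read only once after the write for caching to come out ahead, which is why a session-long document earns a breakpoint and a single-turn document does not.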
Does cutting context to improve caching ever hurt answer quality?
Cutting genuine signal would, but the context you cut for caching is almost always noise — timestamps, IDs, redundant boilerplate — which also distracts the model. The two goals align: a leaner, stable prompt caches better and reasons more clearly. Only cut what does not change behavior, and quality is preserved or improved.
Where do few-shot examples belong in the layered budget?
In the foundation layer, behind the same breakpoint as the system prompt, because they are fixed across requests and improve output quality on every call. Once written to the cache, they are read at the discounted cache-read rate rather than the full input price, which makes generous few-shot prompting far cheaper than it looks.
How do I deliver per-user data without a per-user prefix?
Put it in the session layer behind its own breakpoint, after the shared foundation. The foundation stays identical and cached across all users, while each user's profile and context cache separately for that session. This gives cross-user sharing on the foundation and per-session reuse on the personal data without ever putting user data in the global prefix.
Bringing agentic AI to your phone lines
CallSphere designs context the same layered way for voice and chat agents — assistants that answer every call and message, use tools mid-conversation, and book work 24/7 on a prompt that stays stable enough to cache. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.