Prompt Caching Architecture in Claude Code, Explained

How prompt caching works inside Claude Code: cache breakpoints, prefix matching, TTLs, and the append-only loop that keeps agents fast and cheap.

The first time you watch Claude Code tear through a large codebase, the surprising part isn't the reasoning — it's the speed. An agent that rereads a 40,000-token system prompt, a tool catalog, and half a dozen files on every single turn should be painfully slow and expensive. It isn't. The reason is prompt caching, and once you understand how the pieces fit together, almost every other architectural decision in an agentic system starts to make sense. This post walks the full internal path: what gets cached, where the breakpoints sit, how prefix matching works, and why the entire agent loop is built around protecting that cache.

What prompt caching actually is

Prompt caching is a server-side optimization that lets the model reuse the computed attention state for a stable prefix of your prompt instead of recomputing it on every request. When you send a request with a cache breakpoint, Anthropic's infrastructure hashes the token prefix up to that point, stores the intermediate representation, and on the next request that shares the same prefix it loads the stored state rather than re-encoding those tokens. You pay a small premium to write the cache (about 25% over the base input price at the default TTL) and get a steep discount, roughly a tenth of the base input price, to read it.

That single mechanism reshapes the economics of long-running agents. A Claude Code session is fundamentally a loop: the model reads a fat, stable context (system instructions, tool definitions, project facts) and then appends a thin, changing tail (the latest tool result, the newest user message). Without caching, every iteration re-pays for the fat part. With caching, the fat part is written once and read for pennies on every subsequent turn. The architecture exists to keep that fat prefix byte-for-byte identical across turns.
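
To make the economics concrete, here is a rough cost model in Python. The multipliers (1.25x the base input price to write, 0.1x to read) and the token counts are illustrative assumptions, not quoted prices, and the model simplifies by holding the prefix and tail at a fixed size on every turn. Costs are expressed in "base-price token equivalents".

```python
# Illustrative multipliers relative to the base input-token price.
# These are assumptions for this sketch; check current pricing before relying on them.
WRITE_MULTIPLIER = 1.25  # cost of writing a prefix into the cache
READ_MULTIPLIER = 0.10   # cost of reading a cached prefix


def cost_without_cache(prefix_tokens: int, tail_tokens: int, turns: int) -> float:
    """Every turn re-pays full price for the stable prefix plus the new tail."""
    return float(turns * (prefix_tokens + tail_tokens))


def cost_with_cache(prefix_tokens: int, tail_tokens: int, turns: int) -> float:
    """Turn 1 writes the prefix; every later turn reads it at a discount."""
    if turns <= 0:
        return 0.0
    first_turn = prefix_tokens * WRITE_MULTIPLIER + tail_tokens
    later_turns = (turns - 1) * (prefix_tokens * READ_MULTIPLIER + tail_tokens)
    return first_turn + later_turns
```

Under these assumptions, a 40,000-token prefix with a 500-token tail over 50 turns costs about 2,025,000 token equivalents uncached and about 271,000 cached.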

The layered prefix: how context is ordered

The key architectural insight is that caching only works on an exact, contiguous prefix. The moment a single token changes early in the prompt, every cached block after it is invalidated. So Claude Code lays out its context in strict order of volatility — most stable first, most volatile last — and places cache breakpoints at the boundaries.
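
The cascade can be illustrated with a toy model. This is not how Anthropic's servers are implemented; it is a sketch that hashes each cumulative prefix of a list of text blocks and counts how many leading blocks two prompts share. The function names are hypothetical.

```python
import hashlib


def prefix_hashes(blocks: list[str]) -> list[str]:
    """Return one digest per cumulative prefix: blocks[:1], blocks[:2], ..."""
    hasher = hashlib.sha256()
    digests = []
    for block in blocks:
        encoded = block.encode("utf-8")
        # Length-prefix each block so block boundaries affect the digest.
        hasher.update(len(encoded).to_bytes(8, "big"))
        hasher.update(encoded)
        digests.append(hasher.hexdigest())
    return digests


def shared_prefix_blocks(old: list[str], new: list[str]) -> int:
    """Count the leading blocks whose cumulative digests match."""
    count = 0
    for old_digest, new_digest in zip(prefix_hashes(old), prefix_hashes(new)):
        if old_digest != new_digest:
            break
        count += 1
    return count
```

Appending a block to `["system", "tools", "project", "user: hi"]` leaves all four earlier blocks reusable, while editing the first block leaves zero, because every later cumulative digest depends on it.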

In practice the layers look like this: the system prompt and behavioral rules sit at the very top because they almost never change within a session. Tool and MCP server schemas come next; they change only when a server connects or disconnects. (At the API level the cacheable prefix is serialized as tool definitions, then system blocks, then messages, so a change to the tool list also invalidates the cached system prompt; both layers are stable enough within a session that the distinction rarely matters.) Then comes durable project context: directory structure, key file contents, conventions pulled from a project memory file. Only after all of that does the live conversation transcript begin, growing turn by turn. Breakpoints are placed at the end of each stable layer so the system, tools, and project context each become their own reusable cache segment.
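
As a sketch of what that layering might look like on the wire, the function below assembles a Messages API style request body with a `cache_control` marker at the end of each stable layer. The `cache_control` field with `{"type": "ephemeral"}` is the API's breakpoint marker; the function name, the placeholder model ID, and the choice of three breakpoints are assumptions for illustration, not Claude Code's actual internals. No network call is made.

```python
def build_request(system_rules: str, tools: list[dict], project_context: str,
                  transcript: list[dict], model: str = "MODEL_ID_PLACEHOLDER") -> dict:
    """Assemble a request body with a breakpoint at the end of each stable layer."""
    # Shallow-copy the tool definitions so the caller's list is not mutated.
    tools_out = [dict(tool) for tool in tools]
    if tools_out:
        # Breakpoint 1: end of the tool schemas.
        tools_out[-1]["cache_control"] = {"type": "ephemeral"}
    system_blocks = [
        # Breakpoint 2: end of the behavioral rules.
        {"type": "text", "text": system_rules,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 3: end of the durable project context.
        {"type": "text", "text": project_context,
         "cache_control": {"type": "ephemeral"}},
    ]
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools_out,
        "system": system_blocks,
        "messages": transcript,  # volatile tail, appended to each turn
    }
```

Keeping the builder a pure function of its inputs makes it easy to verify that the stable layers come out identical on every turn.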

flowchart TD
  A["New turn begins"] --> B["Assemble prompt: system + tools + project + transcript"]
  B --> C{"Prefix matches a live cache?"}
  C -->|Yes, hit| D["Load stored attention state (cheap read)"]
  C -->|No / changed| E["Recompute prefix & write new cache (premium)"]
  D --> F["Model processes only the new tail tokens"]
  E --> F
  F --> G{"Tool call or final answer?"}
  G -->|Tool call| H["Append result < refresh TTL >"] --> A
  G -->|Answer| I["Return to user, keep cache warm"]

Cache breakpoints and the TTL clock

A cache breakpoint is an explicit marker you attach to a content block telling the server "hash everything up to here and store it." You get at most four breakpoints per request, so they are spent deliberately on the layer boundaries that matter. Each cached segment also carries a time-to-live. The default window is short, five minutes, which is enough to cover the rapid back-and-forth of an active agent loop. There is also a one-hour option for sessions that pause between bursts, traded off against a higher write cost.

The subtle part is that every cache read refreshes the TTL. As long as the agent keeps taking turns inside the window, the prefix stays warm indefinitely. The danger zone is idle time: if a developer steps away and the window lapses, the next turn pays the full write price again to rewarm the prefix. Well-built agents treat the TTL as a resource to be defended — they avoid unnecessary pauses mid-task and batch related work so the cache never goes cold between dependent steps.
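
The refresh-on-read behavior is easy to model. The class below is a toy simulation, not the real service: it assumes a 300-second window and takes the current time as an explicit argument so the behavior is deterministic. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrefixCache:
    """Toy model of a TTL cache whose entries are refreshed on every read."""
    ttl_seconds: float = 300.0  # assumed default window for this sketch
    _expires_at: dict[str, float] = field(default_factory=dict)

    def lookup(self, key: str, now: float) -> str:
        """Return 'hit' or 'write', and restart the TTL clock for the key."""
        expires = self._expires_at.get(key)
        hit = expires is not None and now < expires
        # Both a hit and a fresh write restart the window.
        self._expires_at[key] = now + self.ttl_seconds
        return "hit" if hit else "write"
```

With turns at 0, 200, 400, and 600 seconds the first lookup is a write and the rest are hits, even though 600 seconds is well past the original window. A turn at 1,000 seconds, after a 400-second pause, is a write again.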

Why append-only is non-negotiable

This is where caching dictates agent design rather than merely optimizing it. Because invalidation cascades from the first changed token, an agent must treat its context as append-only. You never edit an earlier message to "clean it up." You never reorder tool definitions between turns. You never inject a freshly fetched timestamp near the top of the system prompt. Any of those mutations would silently blow away the cache for everything downstream and turn a cheap turn into an expensive one.

Real agents that need to compact a bloated transcript do it deliberately and rarely: in a single step they replace the old conversation with a summary, leaving the system prompt, tool schemas, and project context untouched, and then resume appending. That one step invalidates the cached transcript, so the agent accepts one cache rewrite as the price of a smaller ongoing prefix. The mental model that keeps engineers out of trouble is simple: the prompt is a log, not a document. You add to the bottom; you do not rewrite the top.
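
One way to enforce the log discipline in code is a transcript type that exposes only an append operation. This is a minimal sketch with hypothetical names, not a description of Claude Code's internals.

```python
class AppendOnlyTranscript:
    """A transcript that can grow at the end but never be edited in place."""

    def __init__(self) -> None:
        self._messages: list[tuple[str, str]] = []

    def append(self, role: str, text: str) -> None:
        """Add one message to the bottom of the log."""
        self._messages.append((role, text))

    def messages(self) -> tuple[tuple[str, str], ...]:
        """Return an immutable snapshot so callers cannot rewrite history."""
        return tuple(self._messages)

    def extends(self, earlier: tuple[tuple[str, str], ...]) -> bool:
        """True if `earlier` is an exact prefix of the current transcript."""
        return self._messages[: len(earlier)] == list(earlier)
```

A check like `extends` can run before each request: if the snapshot from the previous turn is no longer a prefix of the current transcript, something mutated history and the next request will miss the cache.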

How tool results flow through the cache

The agentic loop is a tight cycle of model output and tool input, and caching shapes both ends. When Claude decides to call a tool, that tool-use block is appended to the transcript. The tool runs, and its result is appended right after. On the next request, everything up to and including the previous result is now part of the cacheable prefix, while the model's fresh reasoning becomes the new tail. So the cache boundary marches forward one exchange at a time, always trailing just behind the latest activity.
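
The forward march of the boundary can be sketched as a helper that, before each request, moves a single `cache_control` marker to the final content block of the transcript. It assumes each message's `content` is a list of block dictionaries; the helper name is hypothetical, and this is one common pattern rather than a confirmed description of Claude Code.

```python
import copy


def advance_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with one breakpoint on the final content block."""
    updated = copy.deepcopy(messages)  # never mutate the caller's transcript
    for message in updated:
        for block in message["content"]:
            block.pop("cache_control", None)  # clear markers from earlier turns
    if updated and updated[-1]["content"]:
        updated[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return updated
```

Calling this after appending each tool result keeps the number of transcript breakpoints at one while the cached region grows by one exchange per turn.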

This is why large, stable tool outputs are friendlier to caching than chatty, ever-changing ones. A tool that returns a deterministic file snapshot lets that snapshot become a durable cached block. A tool that injects a live clock or a random request ID into its output produces different bytes every time it runs, so any context rebuilt from a fresh call (a replayed session, a re-rendered project block) can never match a previously cached prefix. Designing tools to return stable, content-addressable results is, in effect, designing for cacheability.
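
A small sketch of the difference, using only the standard library. Both function names and the payload shape are hypothetical; the point is that deterministic serialization plus a content hash yields identical bytes for identical input, while per-call values do not.

```python
import hashlib
import json


def stable_tool_result(path: str, content: str) -> str:
    """Serialize a file snapshot deterministically, keyed by a content hash."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    payload = {"path": path, "sha256": digest, "content": content}
    # sort_keys and fixed separators make the output bytes reproducible.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


def volatile_tool_result(path: str, content: str,
                         request_id: str, timestamp: float) -> str:
    """Anti-pattern: per-call values make every result unique."""
    payload = {"path": path, "content": content,
               "request_id": request_id, "timestamp": timestamp}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))
```

Two calls to `stable_tool_result` with the same file contents return the same string; two calls to `volatile_tool_result` differ whenever the request ID or timestamp does.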

The end-to-end picture

Put the layers together and the architecture is coherent. A stable system prompt and tool catalog form a long-lived cache base. Project context layers on top, refreshed only when files materially change. The conversation grows append-only beneath them, with the cache boundary creeping forward each turn and the TTL refreshing on every read. The model only ever pays full price for the thin slice of genuinely new tokens. That is how an agent can reread tens of thousands of tokens hundreds of times in a session without the bill — or the latency — exploding. Prompt caching isn't a feature bolted onto the agent; it is the load-bearing wall the whole structure rests on.

Frequently asked questions

Does changing the user's message at the end break the cache?

No. The user's newest message is part of the volatile tail, which sits after every cache breakpoint. New tail tokens are exactly what you expect to pay full price for. Caching breaks only when something before a breakpoint changes, which is why stable layers go first and live conversation goes last.

How long does a cached prefix survive?

By default a cached segment lives for five minutes, and every read of that segment resets the clock. An active agent keeps it warm indefinitely; an idle one lets it expire. A one-hour option exists for bursty workloads at a higher write cost.

Is cache reuse guaranteed if my prefix matches?

It is best-effort, not contractual. A matching prefix within the TTL almost always hits, but caches can be evicted under load. Build for the common case by designing stable prefixes, but budget for the occasional miss, which costs a full cache write. Measure your actual hit rate from the token usage reported with each response rather than assuming it.
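
Measuring is straightforward because each API response reports token usage. The sketch below assumes you have collected the `usage` object from each response as a plain dictionary with the fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; the function name is hypothetical.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of all input tokens that were served from the cache."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    uncached = sum(u.get("input_tokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0
```

A rate that drops sharply between two otherwise similar sessions is a useful signal that something near the top of the prompt has started to change between turns.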

Where should I put my cache breakpoints?

At the boundaries between layers ordered by volatility: after the system prompt, after the tool and MCP schemas, and after durable project context. That gives you the longest possible reusable prefix, and when a later layer genuinely changes, the layers before it still hit.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
