Implement Claude Prompt Caching: A Step-by-Step Guide
Add prompt caching to a Claude app in five steps: place cache_control breakpoints, verify hits in usage, fix invalidators, and pre-warm the cache.
Reading about prefix matching is one thing; getting a real request to come back with cache_read_input_tokens above zero on your own prompt is another. The first time most engineers add cache_control, they put the breakpoint in a plausible-looking spot, run two requests, and see no read at all — and assume caching is broken. It almost never is. The breakpoint was placed where the content varies, or the prefix was below the size threshold, or a stray timestamp moved the bytes. This post is the implementation walkthrough that gets you from a working but uncached Claude call to a verified, cost-reducing caching pipeline, one concrete step at a time.
We will use Python and the official anthropic SDK throughout, against claude-opus-4-8. Every snippet is runnable; the only thing you supply is an API key in the environment. By the end you will have a request that caches a large shared context, a way to prove the cache is being read, and a checklist for rolling it across a codebase.
Key takeaways
- Start from a working uncached call, then add exactly one breakpoint and measure — do not place four markers blind.
- Use top-level cache_control for the simple case; switch to per-block cache_control when you need precise placement.
- The breakpoint goes at the end of the shared prefix, before any per-request content, or it caches nothing reusable.
- Confirm success by reading usage.cache_creation_input_tokens on the first call and usage.cache_read_input_tokens on the second.
- For interactive apps, consider pre-warming with a max_tokens: 0 request so the first real user does not eat the cold write.
Step 1: Get a baseline uncached request running
Before caching anything, confirm the plain request works and note its token usage. This is your control. Here we send a large shared document as the system prompt and ask a question about it.
    import anthropic

    client = anthropic.Anthropic()

    LARGE_DOC = open("handbook.md").read()  # e.g. ~30K tokens

    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=LARGE_DOC,
        messages=[{"role": "user", "content": "Summarize the refund policy."}],
    )
    print(resp.usage.input_tokens)             # full price
    print(resp.usage.cache_read_input_tokens)  # expect 0 here

Run it once and write down input_tokens. That number is what you are about to stop paying repeatedly. If the document is smaller than 4,096 tokens on Opus, caching will not engage at all, so verify the size first; that single check saves an hour of confused debugging.
Step 2: Add one breakpoint and measure the write
The simplest way to cache is top-level cache_control, which auto-places the marker on the last cacheable block — here, the system prompt. Add it and rerun.
    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},  # auto-caches the last cacheable block
        system=LARGE_DOC,
        messages=[{"role": "user", "content": "Summarize the refund policy."}],
    )
    print(resp.usage.cache_creation_input_tokens)  # now non-zero: you wrote the cache

On this first cached call, cache_creation_input_tokens should jump to roughly the size of your document and input_tokens should drop to just the user question. You paid the 1.25x write premium on the document this once. Nothing is saved yet; the payoff comes on the next read.
Step 3: Fire a second request and confirm the read
Send a different question against the same cached prefix within the five-minute TTL. The system prompt bytes are identical, so the prefix matches.
    resp2 = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},
        system=LARGE_DOC,
        messages=[{"role": "user", "content": "What is the warranty window?"}],
    )
    print(resp2.usage.cache_read_input_tokens)  # ~document size, billed at ~0.1x
    print(resp2.usage.input_tokens)             # just the new question

If cache_read_input_tokens is large and input_tokens is tiny, caching works end to end. If the read is still zero, you have a silent invalidator. The most common culprit is dynamic content in the system prompt, which the next step addresses.
It is worth pausing here to read the economics off these two numbers, because they tell you whether caching is actually paying for itself rather than just appearing to function. On the first call you paid roughly 1.25 times the normal input price to write the document into the cache. On the second call you paid roughly one tenth of the normal price to read it back. So the break-even point arrives almost immediately: two requests over the same prefix cost about 1.35 times one uncached pass instead of 2 times, and every later request pays only the read rate for the cached span. For a support assistant answering hundreds of questions against the same handbook within a five-minute window, the document is written once and read hundreds of times. The savings are not marginal; they are the difference between a viable and an unviable unit cost.
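The break-even arithmetic above can be checked in a few lines. This is a rough sketch, not a billing calculator: the 1.25x write and 0.1x read multipliers are the approximate figures quoted in this post, relative_cost is a hypothetical helper, and it ignores output tokens and the small uncached tail of each request.

```python
WRITE_MULTIPLIER = 1.25  # approximate cache-write premium quoted in this post
READ_MULTIPLIER = 0.10   # approximate cache-read price quoted in this post


def relative_cost(requests: int, cached: bool) -> float:
    """Cost of sending the same prefix `requests` times, measured in units of
    one uncached pass over that prefix. Assumes every call after the first
    lands inside the TTL and hits the cache."""
    if requests <= 0:
        return 0.0
    if not cached:
        return float(requests)
    # One write on the first call, then reads on every later call.
    return WRITE_MULTIPLIER + READ_MULTIPLIER * (requests - 1)


for n in (1, 2, 10, 100):
    uncached = relative_cost(n, cached=False)
    cached = relative_cost(n, cached=True)
    print(f"{n:>3} requests: uncached {uncached:6.2f}  cached {cached:6.2f}")
```

Under these assumptions a single request is slightly more expensive with caching (1.25 versus 1.00), two requests are already cheaper (about 1.35 versus 2.00), and the gap widens with every further read.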
    flowchart TD
        A["Baseline uncached call"] --> B["Add cache_control breakpoint"]
        B --> C["First call writes cache (~1.25x)"]
        C --> D["Second call, same prefix"]
        D --> E{"cache_read > 0 ?"}
        E -->|Yes| F["Roll out + add monitoring"]
        E -->|No| G["Diff rendered bytes, fix invalidator"]
        G --> D

Treat the loop in the diagram as your inner development cycle: never move to rollout until the second call shows a positive read. The diff-and-fix branch is short because there are only a handful of things that can move the prefix bytes.
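The decision node in the diagram can be written as a small classifier over the usage fields. A minimal sketch, assuming you have copied the counters from resp.usage into a plain dict (for example, before logging them); cache_outcome is a hypothetical helper name, and missing or None fields are treated as zero.

```python
def cache_outcome(usage: dict) -> str:
    """Classify one response from its usage counters.

    `usage` is assumed to be a plain dict holding the fields read from
    resp.usage in the steps above; missing or None values count as zero.
    """
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    if read > 0:
        return "read"      # at least part of the prefix came from the cache
    if written > 0:
        return "write"     # the prefix was written to the cache on this call
    return "uncached"      # nothing was cached or read


# Illustrative numbers only, shaped like the first and second calls above.
first = {"input_tokens": 12, "cache_creation_input_tokens": 30000, "cache_read_input_tokens": 0}
second = {"input_tokens": 11, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 30000}
print(cache_outcome(first), cache_outcome(second))
```

A second call that still comes back as "write" or "uncached" is the "No" branch: the prefix changed between the two requests, or it never qualified for caching.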
Step 4: Move volatile content out of the prefix
Suppose you discovered the read was zero because your system prompt began with f"Current date: {datetime.now()}". That single interpolation changes the front of the stream every request, so nothing downstream ever matches. The fix is to freeze the system prompt and inject the date later, in the message history, where it invalidates only the turns after it.
    from datetime import date

    STABLE_SYSTEM = LARGE_DOC  # no timestamps, no IDs, no per-request text
    today = date.today().isoformat()  # per-request value, kept out of the prefix

    resp = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=[{"type": "text", "text": STABLE_SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[
            {"role": "user", "content": f"(Context: today is {today}.) What is the warranty window?"}
        ],
    )

This is the per-block form of cache_control, which gives you explicit placement. The rule generalizes: anything that varies per request (session IDs, user names, the current time, a request UUID) must live after the last breakpoint, never in the frozen prefix.
To actually find an invalidator when the read stays stubbornly at zero, the fastest move is to capture the exact rendered payload of two consecutive requests and diff them. Serialize the full request body you send — system, tools, and the message prefix up to your breakpoint — to JSON and compare. The offending difference is almost always one of a short list: a non-sorted json.dumps producing different key order, a tool list that was rebuilt in a different sequence, or an f-string that quietly interpolated a value. Once you can see the byte difference, the fix is obvious, and the diff becomes a permanent debugging habit rather than a one-time fire drill.
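One way to make that diff routine is sketched below, using only the standard library. The helper names render_prefix and diff_prefixes are hypothetical, and the sample payloads are invented to show a timestamp invalidator; in practice you would pass in the system, tools, and message prefix your code actually sends.

```python
import difflib
import json


def render_prefix(system, tools=None, messages=None) -> str:
    """Serialize the cacheable part of a request to stable, readable JSON.
    sort_keys=True keeps the rendering itself deterministic, so the diff
    shows content differences rather than dict-ordering noise."""
    payload = {"tools": tools or [], "system": system, "messages": messages or []}
    return json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False)


def diff_prefixes(first: str, second: str) -> list[str]:
    """Return unified-diff lines between two renders (empty list if identical)."""
    return list(difflib.unified_diff(
        first.splitlines(), second.splitlines(),
        fromfile="request_1", tofile="request_2", lineterm="",
    ))


# Two renders of the "same" system prompt; only the interpolated time differs.
a = render_prefix([{"type": "text", "text": "Current date: 2025-01-01T10:00:00\nHandbook..."}])
b = render_prefix([{"type": "text", "text": "Current date: 2025-01-01T10:00:07\nHandbook..."}])

for line in diff_prefixes(a, b):
    print(line)
```

An empty diff means the prefix is stable and the cause lies elsewhere, such as the size threshold or an expired TTL. A non-empty diff points at the line that moved.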
Step 5: Pre-warm and roll out with monitoring
For interactive apps, the first user after a cold start eats the full write latency. You can pay that down ahead of traffic with a max_tokens: 0 request at startup, which runs the prefill and writes the cache without generating output.
    client.messages.create(
        model="claude-opus-4-8",
        max_tokens=0,
        system=[{"type": "text", "text": STABLE_SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "warmup"}],
    )

Put the breakpoint on the block shared with real requests, not on the placeholder message. Then, in production, log the three usage fields per request and alert if your read ratio drops; a sudden fall to zero almost always means someone edited the system prompt or changed the tool set. Roll the pattern out one call site at a time, re-running the two-request check at each, so a regression is caught at the source rather than in aggregate metrics.
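The read-ratio alert could look something like the sketch below. It assumes the three usage fields have been logged as plain dicts and that total input is the sum of the three counters; read_ratio, should_alert, and the 0.5 threshold are illustrative choices, not values from the SDK.

```python
def read_ratio(usage: dict) -> float:
    """Fraction of one request's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


def should_alert(recent_usages: list[dict], min_ratio: float = 0.5) -> bool:
    """True when the mean read ratio over a window falls below `min_ratio`.
    The 0.5 default is an arbitrary placeholder; tune it per call site."""
    if not recent_usages:
        return False
    mean = sum(read_ratio(u) for u in recent_usages) / len(recent_usages)
    return mean < min_ratio


# Illustrative windows: steady cache reads versus a prefix that stopped matching.
healthy = [{"input_tokens": 10, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 30000}] * 5
broken = [{"input_tokens": 30010, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}] * 5
print(should_alert(healthy), should_alert(broken))
```

Averaging over a window rather than alerting on a single request keeps the occasional legitimate write, such as the first call after the TTL lapses, from paging anyone.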
Ship this in five steps
- Capture baseline input_tokens on the uncached call and confirm the prefix exceeds the model's minimum (4,096 on Opus 4.8).
- Add top-level cache_control and confirm cache_creation_input_tokens is non-zero on the first call.
- Send a second request with the same prefix and confirm cache_read_input_tokens is non-zero.
- If the read is zero, freeze the system prompt and move all per-request content after the last breakpoint.
- Pre-warm with a max_tokens: 0 call for interactive paths, then roll out with usage-field monitoring.
Frequently asked questions
Top-level versus per-block cache_control — which should I use?
Top-level is the simplest: it auto-places one marker on the last cacheable block and is perfect when you just want to cache a big system prompt or document. Switch to per-block markers when you need multiple breakpoints, want to cache the shared half of a message and not the varying half, or are building a multi-turn agent where placement must be precise.
How many breakpoints can I set?
Up to four cache_control breakpoints per request. Most apps need one or two — one on the tools-plus-system prefix, optionally one on the last conversation turn. Reserve the others for long agent turns where you need an intermediate marker inside the twenty-block lookback window.
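A quick pre-flight check for that limit might look like the sketch below. It is an illustration under stated assumptions: count_breakpoints is a hypothetical helper, request is the keyword-argument dict you would pass to client.messages.create, and only per-block markers on top-level tools, system, and message content blocks are counted (a top-level cache_control field and blocks nested inside other blocks are not inspected).

```python
MAX_BREAKPOINTS = 4  # per-request limit stated in the answer above


def count_breakpoints(request: dict) -> int:
    """Count blocks that carry a cache_control marker in tools, system, and
    message content. String-valued system or content has no blocks to mark."""
    blocks = list(request.get("tools") or [])
    system = request.get("system")
    if isinstance(system, list):
        blocks.extend(system)
    for message in request.get("messages") or []:
        content = message.get("content")
        if isinstance(content, list):
            blocks.extend(content)
    return sum(1 for b in blocks if isinstance(b, dict) and "cache_control" in b)


# Illustrative request: one marker on the system block, none on the question.
request = {
    "system": [{"type": "text", "text": "handbook", "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": [{"type": "text", "text": "question"}]}],
}
n = count_breakpoints(request)
print(n, n <= MAX_BREAKPOINTS)
```

Running a check like this in a unit test catches a fifth marker at build time instead of as a rejected request in production.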
Will caching help a request whose prompt changes from the start every time?
No. If the first thousand tokens differ on every request there is no reusable prefix, and adding cache_control only charges you the write premium with no reads. Leave caching off for genuinely unique prompts and spend the effort on the requests that share a large fixed preamble.
Does the second request have to be identical to hit the cache?
Only the prefix up to your breakpoint must be byte-identical. The content after the breakpoint — the actual user question — can and should differ. That is exactly the shape that makes caching valuable: shared context cached once, distinct questions answered cheaply.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.