---
title: "Implement Prompt Caching in Claude Code: A Walkthrough"
description: "A step-by-step walkthrough to add prompt caching to a Claude agent loop: structure requests, place breakpoints, prove hits, and avoid cold starts."
canonical: https://callsphere.ai/blog/implement-prompt-caching-in-claude-code-a-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "claude code", "claude agent sdk", "implementation guide"]
author: "CallSphere Team"
published: 2026-04-30T08:23:11.000Z
updated: 2026-06-06T21:47:42.810Z
---

# Implement Prompt Caching in Claude Code: A Walkthrough

> A step-by-step walkthrough to add prompt caching to a Claude agent loop: structure requests, place breakpoints, prove hits, and avoid cold starts.

Reading about prompt caching is one thing; wiring it into a real agent loop and watching your token bill drop by an order of magnitude is another. This walkthrough takes you from a naive, cache-blind agent to a properly cached one, step by step, in the way an engineer would actually do it on a Tuesday afternoon. We'll structure the request, place breakpoints, prove the cache is hitting, and handle the failure modes that quietly cost money. The examples use the shape of a Claude Code-style agent, but the technique transfers to anything you build on the Claude Agent SDK.

## Step 1: Start with a request you can measure

Before optimizing anything, instrument it. Every response from the API reports usage fields that tell you exactly what the cache did: how many input tokens were written to cache, how many were read from cache, and how many were processed fresh. Log these on every turn. If you can't see `cache_read_input_tokens` climbing turn over turn, you have no idea whether your changes help. The single most common mistake is "adding caching" and never confirming a hit; you end up paying write premiums forever with no reads to amortize them.

So step one is a baseline run with logging. Send a multi-turn conversation with no cache control at all and record the input-token cost of each turn. You'll see the fat prefix re-billed at full price every iteration. That number is the thing you're about to crush.

## Step 2: Separate the stable from the volatile

Now physically reorganize how you build the request. Pull every piece of content that stays constant within a session to the front: the system prompt, the tool definitions, any retrieved reference material that won't change mid-task. Keep the things that mutate every turn — the latest user message, the freshest tool result — at the very back. This is a code change in how you assemble the message array, not a config flag. If your current code interpolates a timestamp or a request ID anywhere in the system prompt, rip it out or move it to the tail; it is silently invalidating your cache on every call.

```mermaid
flowchart TD
  A["Baseline: log usage, no caching"] --> B["Reorder content: stable first, volatile last"]
  B --> C["Mark breakpoint after system + tools"]
  C --> D["Send turn 1: cache_write tokens recorded"]
  D --> E["Send turn 2: check cache_read tokens"]
  E --> F{"cache_read > 0 ?"}
  F -->|No| G["Hunt the changing token in the prefix"] --> B
  F -->|Yes| H["Add breakpoint after project context"]
  H --> I["Measure hit rate & cost across the loop"]
```

## Step 3: Place the breakpoint and write the cache

With content ordered, attach a cache-control marker to the last block of your stable region — typically the final tool definition or the closing block of your reference context. This tells the server to hash everything up to that point and store it. On the very first request, you'll see a large `cache_creation_input_tokens` value and a near-zero read. That's expected: turn one pays the write premium. Don't panic at the slightly higher first-turn cost; you're buying an asset you'll reuse.

On turn two, if you've done it right, the same prefix is sent again and the usage flips: `cache_read_input_tokens` jumps to roughly the size of your stable prefix, and `cache_creation` drops toward zero. That read is billed at the deep discount. This is the moment the whole thing pays off, and it's why step one's logging mattered — you can literally see the numbers move.

## Step 4: Debug a cache that won't hit

If turn two still shows zero reads, something in your prefix changed between turns. The usual culprits, in rough order of frequency: a timestamp or date injected into the system prompt; tool definitions assembled from an unordered map so their serialization order shifts run to run; a randomly generated session ID embedded high in the context; or whitespace and JSON formatting that differs because two code paths serialize the same data differently. Diff the exact bytes of turn one's prefix against turn two's. The mismatch is almost always something you didn't think of as "content," like key ordering in a serialized object.

The fix is determinism. Serialize tool schemas in a fixed order. Freeze any dynamic values out of the stable region. Make your prompt-assembly function pure: same session state in, identical bytes out. Once the prefix is byte-stable, the cache hits reliably.

## Step 5: Add the second and third breakpoints

One breakpoint is good; layered breakpoints are better. Add a second after your tool catalog and a third after durable project context. Now each layer can invalidate independently. If a user connects a new MCP server mid-session, only the tool layer and everything after it rewrites — the system-prompt segment stays cached. Without layered breakpoints, that same event would rewrite the entire prefix. Spend your limited breakpoints on the boundaries where independent change actually happens.

A practical pattern: breakpoint one closes the system prompt, breakpoint two closes the tool and MCP schemas, breakpoint three closes the static project context. The live transcript grows after all three with no breakpoint of its own, since it changes every turn anyway and there's nothing stable to cache there.

## Step 6: Keep the cache warm and avoid cold starts

The last step is operational. Cached prefixes expire after an idle window, so a cold start — the first turn after a pause — pays full write price again. For interactive agents this is usually fine; the user is typing, the gap is short, the cache survives. For agents that run on a schedule or sit idle between bursts, consider the longer-TTL option, or fire a tiny keep-warm turn before a known burst of work. Either way, measure: track your cache-read ratio over a full session and treat any unexpected dip as a regression to investigate.

When you finish, compare against your step-one baseline. A well-cached agent loop typically reads the bulk of its input tokens from cache on every turn after the first, turning a linearly growing cost into a nearly flat one. That flat line is the entire point of the exercise.

## Frequently asked questions

### How do I know my cache is actually working?

Check the usage object on each response. A working cache shows a large `cache_read_input_tokens` on turns after the first and a small `cache_creation_input_tokens`. If reads stay at zero, your prefix is changing between turns — diff the raw bytes to find the culprit.

### Why was my first turn more expensive than expected?

The first turn writes the cache, which carries a premium over the base input price. That's normal and is recouped on every subsequent read. The model to keep in mind is: pay a little extra once, save a lot many times. If you only ever send one turn, caching won't pay off.

### How many breakpoints should I use?

You have a small fixed budget per request, so place them at genuine volatility boundaries — typically after the system prompt, after tools, and after static context. More breakpoints aren't better; well-chosen ones are. Each lets a layer invalidate without taking down the layers above it.

### What ruins a cache hit most often?

Non-determinism in the stable prefix: injected timestamps, random IDs, and inconsistent serialization order of tool schemas. Make prompt assembly a pure function of session state so the same state always produces identical bytes, and the hit rate becomes reliable.

## Bringing agentic AI to your phone lines

These same step-by-step caching techniques keep latency low when an agent has to answer in real time. CallSphere uses them in **voice and chat agents** that respond instantly, call tools mid-conversation, and book jobs 24/7. Hear it for yourself at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/implement-prompt-caching-in-claude-code-a-walkthrough
