---
title: "Prompt Caching With Claude: A Full Build Walkthrough"
description: "A realistic end-to-end build: how a support-agent team cut latency and cost with prompt caching on Claude, from slow response to a shipped, measured win."
canonical: https://callsphere.ai/blog/prompt-caching-with-claude-a-full-build-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "use case", "support agent", "latency"]
author: "CallSphere Team"
published: 2026-02-06T17:46:22.000Z
updated: 2026-06-07T01:28:24.169Z
---

# Prompt Caching With Claude: A Full Build Walkthrough

> A realistic end-to-end build: how a support-agent team cut latency and cost with prompt caching on Claude, from slow response to a shipped, measured win.

Abstract advice about prompt caching with Claude is easy to nod along to and hard to act on. So this post does something different: it walks through one realistic build from the symptom that started it to the shipped, measured outcome. The team is a fictional-but-typical one — a five-person group running a customer-support agent on Claude that answers questions over a product knowledge base and can call a few tools. Their problem is concrete, their fix is concrete, and every decision along the way is one you will face if you do this yourself.

We will not skip the messy parts. The first attempt does not work cleanly, the cache mysteriously stops hitting, and the team has to debug it. That is the realistic path, and seeing it end to end is more useful than a tidy success story.

## Key takeaways

- A real caching win starts from a **latency or cost symptom**, not from a desire to use the feature.
- The first refactor is usually **restructuring the request** so the stable knowledge base and tools sit in a cacheable prefix.
- Expect a **debugging detour**: the cache often does not hit on the first try because something volatile leaked into the prefix.
- Measure with the **usage fields** Claude returns (`cache_creation_input_tokens` and `cache_read_input_tokens`), and compare p50/p95 latency and per-request cost before and after.
- The shipped outcome is typically a large drop in time-to-first-token and input cost on the hot path, with identical answer quality.

## The problem: a support agent that feels slow and costs too much

The team's support agent works like this. Each user question is answered by Claude with a hefty request: a detailed system prompt describing tone and escalation rules, definitions for four tools (order lookup, refund status, shipping estimate, and a knowledge search), and a chunk of frequently-referenced policy text pasted inline so the agent answers consistently. Then the user's actual question is appended at the end. Every single request reships that entire stable mass — easily several thousand tokens — before the model even sees the question.

Two symptoms drove the project. First, **time-to-first-token was poor**: users waited noticeably before the answer began streaming, because the model had to process those thousands of prefix tokens every time. Second, **input cost dominated the bill**. The answers were short, but the input was enormous and repeated on every request. The team realized the same multi-thousand-token prefix was being reprocessed thousands of times a day, identical each time. That is the textbook signature of a workload that caching was built for.

## The build: restructuring for a cacheable prefix

The fix began not with the API but with the request layout. The team reorganized every request into a strict order: system prompt first, then tool definitions, then the inline policy text, then — and only then — the user's question and any per-conversation context. The principle was simple: everything that is identical across requests goes up top, everything that changes goes at the bottom. Then they placed a cache breakpoint at the boundary between the stable mass and the user turn.

Here is the shape of the restructured request. The cacheable block is everything in the system content; the user turn carries only what is specific to this conversation:

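```python
# SYSTEM_PROMPT, TOOL_DOCS, POLICY_TEXT, and question are defined elsewhere in the app.

# Stable prefix: identical bytes on every request, so it can be cached.
system = [
    {
        "type": "text",
        "text": SYSTEM_PROMPT + TOOL_DOCS + POLICY_TEXT,
        # Cache breakpoint: everything up to and including this block is cacheable.
        "cache_control": {"type": "ephemeral"},
    }
]

# Volatile tail: changes on every call, so it carries no cache marker.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Customer question: {question}",
            }
        ],
    }
]
```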

The key line is `cache_control` on the stable block. It marks the end of the cacheable prefix: Claude processes that prefix once and reuses it on subsequent requests that begin with the identical bytes and arrive before the cache entry expires (an `ephemeral` entry lives for five minutes by default, refreshed each time it is read). The user's question, appended as a separate piece with no cache marker, is the volatile tail that changes every call. On the first request, the team saw `cache_creation_input_tokens` jump, which meant the cache was being written. They expected the next request to read it back cheaply.
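Whether a given request wrote the cache or read from it can be checked per response. The sketch below is an illustration rather than the team's code: `build_request` and `classify_usage` are hypothetical helpers, the model name and `max_tokens` value are left to the caller or assumed, and the only API details relied on are the `system`/`messages` request shape and the `cache_creation_input_tokens` and `cache_read_input_tokens` usage fields.

```python
from typing import Any


def build_request(model: str, stable_prefix: str, question: str) -> dict[str, Any]:
    """Assemble keyword arguments for a Messages API call (hypothetical helper)."""
    return {
        "model": model,
        "max_tokens": 1024,  # assumed value, for illustration only
        "system": [
            {
                "type": "text",
                "text": stable_prefix,
                # Breakpoint: everything up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile tail: no cache marker, changes on every call.
        "messages": [
            {"role": "user", "content": f"Customer question: {question}"}
        ],
    }


def classify_usage(usage: Any) -> str:
    """Label a response as a cache 'read', a cache 'write', or 'uncached'."""
    # The fields may be missing or None on some responses, so default to 0.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "read"
    if written > 0:
        return "write"
    return "uncached"
```

With the official Python SDK this would be used roughly as `response = client.messages.create(**build_request(...))` followed by `classify_usage(response.usage)`. A healthy hot path should show one `"write"` followed by a run of `"read"` results.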

```mermaid
flowchart TD
  A["User asks a question"] --> B["Assemble request: system + tools + policy, then question"]
  B --> C{"First request with this prefix?"}
  C -->|Yes| D["Cache write: process whole prefix"]
  C -->|No| E["Cache read: skip prefix processing"]
  D --> F["Claude answers, may call a tool"]
  E --> F
  F --> G["Stream answer, log usage fields"]
  G --> H{"cache_read ratio rising?"}
  H -->|No| I["Debug volatile leak in prefix"]
  H -->|Yes| J["Win confirmed, monitor"]
```

## The detour: why the cache wasn't hitting

It did not work on the first try. The team shipped to staging, ran a hundred test questions, and saw `cache_creation_input_tokens` on nearly every request — the cache was being written constantly but almost never read. Costs had not dropped. This is the most common caching surprise, and the cause is almost always the same: something in the supposedly-stable prefix was not actually stable.

They bisected. By logging the exact prefix bytes and diffing two consecutive requests, they found it: the policy text was being assembled from a dictionary whose key order was not fixed, so the serialized policy came out in a slightly different order each time. The bytes differed, so Claude correctly treated each as a new prefix and wrote a fresh cache entry it would never read again. The fix was one line — sort the keys before serializing — and suddenly the second request and every one after it showed a large `cache_read_input_tokens` value and a near-zero creation value. The cache was warm.
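The post does not say how the policy text was serialized, so the following is only a sketch of the sort-the-keys fix under the assumption that it was rendered as JSON. `render_policy` and `prefix_fingerprint` are hypothetical names, and the policy entries are made-up sample data.

```python
import hashlib
import json


def render_policy(policy: dict) -> str:
    """Serialize policy entries to text that is byte-identical across runs."""
    # sort_keys removes any dependence on the order the entries were inserted.
    return json.dumps(policy, sort_keys=True, ensure_ascii=False)


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the prefix, cheap to log next to the usage fields."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


# Same entries, different insertion order (hypothetical policy data).
a = {"refunds": "30 days", "shipping": "3-5 business days"}
b = {"shipping": "3-5 business days", "refunds": "30 days"}

# Both orderings now produce the same bytes, and therefore the same fingerprint.
assert render_policy(a) == render_policy(b)
assert prefix_fingerprint(render_policy(a)) == prefix_fingerprint(render_policy(b))
```

Logging the fingerprint with every request makes an unstable prefix visible at a glance: if the hash changes between requests that should share a prefix, the cache cannot hit.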

The lesson the team wrote into their runbook: *a cache that always writes and never reads means your prefix is not byte-stable.* Diff two consecutive prefixes and find the wobble. In practice the culprit is rarely the API; it is almost always something in your own assembly that you did not realize was non-deterministic.
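The "diff two consecutive prefixes" step can be as small as locating the first byte where they disagree. This is a minimal sketch, not the team's tooling; `first_divergence` is a hypothetical helper, and it assumes the two prefixes were captured as strings from consecutive requests.

```python
from typing import Optional


def first_divergence(a: str, b: str, context: int = 20) -> Optional[tuple[int, str, str]]:
    """Return (byte_offset, snippet_a, snippet_b) at the first difference, or None."""
    raw_a, raw_b = a.encode("utf-8"), b.encode("utf-8")
    limit = min(len(raw_a), len(raw_b))
    # First index where the bytes differ; `limit` if one is a prefix of the other.
    offset = next((i for i in range(limit) if raw_a[i] != raw_b[i]), limit)
    if offset == limit and len(raw_a) == len(raw_b):
        return None  # byte-identical prefixes
    start = max(0, offset - context)
    end = offset + context
    return (
        offset,
        raw_a[start:end].decode("utf-8", errors="replace"),
        raw_b[start:end].decode("utf-8", errors="replace"),
    )
```

A `None` result means the prefixes match byte for byte; anything else points at the region of the assembly code worth inspecting.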

## The outcome: what shipped and what it bought

With the cache warm, the results were exactly what the workload predicted. Time-to-first-token on the hot path dropped sharply, because the model no longer reprocessed thousands of prefix tokens on each request — it read them from cache. Per-request input cost fell substantially on every cached read, since cache-read tokens are far cheaper than fresh input tokens. Critically, the team confirmed with an eval suite that answer quality was identical: caching changed the cost and speed of producing the same outputs, not the outputs themselves.

They also learned where caching did *not* help. A handful of admin and onboarding flows ran each prefix only once or twice; there, caching paid the write premium without enough reads to recover it, so they left those uncached. The win was concentrated exactly where the symptoms had pointed: the high-traffic path with a large, repeated, stable prefix.

| Aspect | Before caching | After caching |
| --- | --- | --- |
| Prefix processing | Reprocessed every request | Processed once, then read from cache |
| Time-to-first-token (hot path) | Slow, noticeable wait | Markedly faster |
| Input cost per request | High and repeated | Low on every cache read |
| Answer quality | Baseline | Identical (verified by eval) |
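
The hot-path versus cold-path trade-off comes down to simple arithmetic. The sketch below assumes a cache write is billed at 1.25 times the base input price and a cache read at 0.10 times; those multipliers are assumptions for illustration and should be checked against current pricing. `relative_prefix_cost` is a hypothetical helper.

```python
def relative_prefix_cost(
    writes: int,
    reads: int,
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
    read_multiplier: float = 0.10,   # assumed discount for reading the cache
) -> float:
    """Prefix cost with caching, relative to sending it uncached every time.

    `writes` counts requests that created a cache entry (including repeats
    after the entry expired); `reads` counts requests served from cache.
    A result below 1.0 means caching is cheaper for the prefix.
    """
    total_requests = writes + reads
    if total_requests < 1:
        raise ValueError("need at least one request")
    cached_cost = writes * write_multiplier + reads * read_multiplier
    return cached_cost / total_requests
```

Under these assumptions, a cold flow that runs twice, hours apart, pays for two writes and no reads (`relative_prefix_cost(2, 0)` is 1.25, a 25% premium), while a hot path with one write and 99 reads comes out at roughly 0.11 of the uncached prefix cost.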

## Common pitfalls this build avoided

- **Caching before measuring.** The team started from a real latency and cost symptom, so they knew exactly what success looked like.
- **Trusting the cache without reading usage.** They logged `cache_read_input_tokens` vs `cache_creation_input_tokens` from day one, which is how they caught the silent miss immediately.
- **Non-deterministic serialization.** The dictionary ordering bug is the canonical caching trap; sorting keys fixed it.
- **Caching cold paths.** They resisted the urge to cache everything and left one-off flows alone.
- **Skipping the quality check.** An eval confirmed identical answers, so nobody could later blame caching for a regression it did not cause.

## Ship your own caching win in five steps

1. Find a hot path with a large, repeated, stable prefix and a real latency or cost symptom.
2. Reorder the request so all stable content (system prompt, tools, reference text) precedes the user turn.
3. Add a cache breakpoint at the stable/volatile boundary and deploy to staging.
4. Log the usage fields; if reads stay near zero, diff consecutive prefixes and kill the volatile leak.
5. Once reads are warm, measure latency and cost before/after and run an eval to confirm identical quality.
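
Steps 4 and 5 can be sketched with two small functions over logged data. This is an assumption-laden illustration: it supposes each request's usage was logged as a dict with the `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` fields, and that latencies were recorded in milliseconds. `cache_read_ratio` and `p50_p95` are hypothetical helpers.

```python
import statistics


def cache_read_ratio(usages: list[dict]) -> float:
    """Share of all input tokens that were served from cache."""
    # `or 0` tolerates fields that are missing or logged as None.
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    fresh = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + fresh
    return read / total if total else 0.0


def p50_p95(latencies_ms: list[float]) -> tuple[float, float]:
    """Median and 95th-percentile latency; needs at least two samples."""
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Running `p50_p95` on time-to-first-token samples from before and after the change, alongside `cache_read_ratio` on the same window, gives the before/after comparison the steps call for. A ratio that stays near zero is the signal to go back to step 4.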

## Frequently asked questions

### How long does an end-to-end caching project like this take?

The structural work — reordering the request and adding a breakpoint — is often an afternoon. The realistic time sink is the debugging detour when the cache does not hit, plus writing the eval to confirm quality. Budget a few days end to end for a careful first rollout, much less for subsequent ones once the team knows the pattern.

### What if my prefix is small — is the walkthrough still relevant?

Less so. Caching pays off when the stable prefix is large and reused often; a small prefix leaves little to save, and a prefix below the model's minimum cacheable length is not cached at all. If your requests are dominated by a short instruction and a long user turn, caching will not move your numbers much, and you can skip it for that path.

### How did the team know caching didn't change answers?

They ran the same set of inputs through the agent before and after the caching change and compared outputs with an eval suite. Because caching only affects how the prefix is processed, not the model's reasoning, the answers matched — and having that proof on record prevented future false blame.

### What was the single biggest source of savings?

Reusing the multi-thousand-token stable prefix on the high-traffic path. That prefix was being reprocessed on every one of thousands of daily requests; turning those into cheap cache reads is where almost all of the latency and cost improvement came from.

## Bringing agentic AI to your phone lines

This same end-to-end discipline — find the hot path, restructure, measure, ship — is how CallSphere builds agentic **voice and chat** assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*
