---
title: "How Claude Prompt Caching Works: Internals and Architecture"
description: "Inside Claude prompt caching: prefix hashing, tools-system-messages render order, invalidation tiers, and TTLs that cut latency and API cost."
canonical: https://callsphere.ai/blog/how-claude-prompt-caching-works-internals-and-architecture
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "anthropic", "llm architecture", "latency optimization", "cost optimization"]
author: "CallSphere Team"
published: 2026-02-06T08:00:00.000Z
updated: 2026-06-07T01:28:24.080Z
---

# How Claude Prompt Caching Works: Internals and Architecture

> Inside Claude prompt caching: prefix hashing, tools-system-messages render order, invalidation tiers, and TTLs that cut latency and API cost.

The first time I watched a long agent loop run for twenty minutes and bill only four thousand fresh input tokens, I stopped trusting my mental model of how the Claude API charges for input. The agent had pushed a 90,000-token prefix through dozens of turns, yet almost none of it was reprocessed. That gap between what I expected and what actually happened is the whole story of prompt caching: it is not a feature you bolt on, it is a property of how the request is hashed and matched on Anthropic's side. Understanding the machinery is what lets you design prompts that hit the cache by default instead of fighting it.

This post is the architecture tour. We will trace a request from the bytes you send to the cache entry that gets written, look at why the render order of `tools`, `system`, and `messages` matters, and unpack the tier system that decides what a given change actually invalidates. The goal is a model precise enough that you can predict your own cache-hit rate before you ship.

## Key takeaways

- **Caching is a prefix match.** The cache key is the exact bytes of the rendered prompt up to each breakpoint — one changed byte invalidates everything after it.
- **Render order is fixed:** `tools` → `system` → `messages`. Stable content must physically precede volatile content or nothing caches.
- **There are three invalidation tiers** (tools, system, messages); a change only invalidates its own tier and everything downstream.
- **Cache reads cost ~0.1x** base input price; writes cost 1.25x (5-min TTL) or 2x (1-hour TTL).
- **Minimum cacheable prefix is model-dependent** — 4096 tokens on Opus 4.8, lower on some Sonnet/Haiku models. Below it, caching silently no-ops.
- Verify everything with `usage.cache_read_input_tokens`; a persistent zero means a silent invalidator is in your prefix.

## What is prompt caching, precisely?

Prompt caching is a server-side optimization in which Anthropic stores the internal model state produced by processing a stable prefix of your prompt, then reuses that state on later requests whose prefix matches byte-for-byte, charging roughly a tenth of the normal input price for the reused span. The key word is *prefix*. The cache does not understand your prompt semantically; it hashes the rendered token sequence and looks for a previously computed continuation point.

Mechanically, when you mark a content block with `cache_control: {"type": "ephemeral"}`, you are planting a breakpoint. The API computes a cache key from every token rendered up to and including that block. On a later request, if the rendered tokens up to that same position are identical, the model skips recomputing attention over that span and loads the saved key-value state instead. This is why a single byte change at position N poisons every breakpoint at positions greater than or equal to N — the saved state is only valid for the exact sequence that produced it.

Two cost numbers govern whether this pays off. A cache *write* costs about 1.25x normal input price for the default five-minute TTL, because the model still has to do the full forward pass plus persist state. A cache *read* costs about 0.1x. So a single cached block needs to be read at least once to break even, and from the second read onward it is nearly free. For high-frequency prefixes — an agent system prompt, a large retrieved document, a fixed tool set — the savings compound fast.

## The render pipeline: tools, system, messages

Every request is flattened into one token stream in a fixed order before hashing: tool definitions first, then the system prompt, then the message history. This ordering is the single most important architectural fact, because it dictates where stable and volatile content can live. A breakpoint on the last system block caches both the tools and the system prompt together, since they render before it. A breakpoint on the final user turn caches the entire conversation up to that point.

```mermaid
flowchart TD
  A["Incoming request"] --> B["Render tools (position 0)"]
  B --> C["Render system prompt"]
  C --> D["Render messages[] in order"]
  D --> E{"cache_control breakpoint?"}
  E -->|Prefix bytes match a live entry| F["Load saved KV state — bill ~0.1x"]
  E -->|No match| G["Full forward pass — write entry, bill ~1.25x"]
  F --> H["Generate response"]
  G --> H
```

Read the diagram as a decision made at each breakpoint, walking the stream left to right. The model finds the deepest breakpoint whose entire preceding prefix matches a live cache entry, loads that state, and only does fresh computation for whatever follows. If your volatile content — a per-request timestamp, a session ID interpolated into the system header — sits early in that stream, it shifts the match point all the way back to the start and you pay full price for everything.

The practical rule that falls out: freeze the front of the stream. Tools should be serialized deterministically (sort by name, stable JSON key order). The system prompt should contain nothing that varies per request. Anything dynamic belongs in the messages array, as late as possible, ideally after your last breakpoint.

## Invalidation tiers and what each change costs

Not every change is catastrophic. The cache is organized into three tiers that mirror the render order, and a change only invalidates its own tier and the tiers below it. This is the difference between a tweak that costs you nothing and one that forces a full rebuild.

| Change | Tools cache | System cache | Messages cache |
| --- | --- | --- | --- |
| Add/remove/reorder a tool | Lost | Lost | Lost |
| Switch model | Lost | Lost | Lost |
| Edit system prompt text | Kept | Lost | Lost |
| Toggle `tool_choice` or `thinking` | Kept | Kept | Lost |
| Append a message | Kept | Kept | Lost (rebuilds from new turn) |

The takeaway is that tool-definition changes and model switches are the only edits that nuke the entire cache. Everything else is contained. You can flip `tool_choice` per request or enable thinking on one call without losing your expensive tools-plus-system cache. That means "modes" should never be implemented by swapping the tool set — give the model a tool that records a mode transition, or pass the mode as message content, so the position-0 tool block stays byte-identical.

## TTL, the lookback window, and concurrency edges

Cache entries live for five minutes by default, refreshed on each read, or one hour if you set `ttl: "1h"` on the breakpoint. The one-hour TTL doubles the write cost (2x instead of 1.25x), so it only pays off for bursty traffic with gaps longer than five minutes between requests. If real requests arrive more often than every five minutes, they keep the entry warm on their own and a longer TTL is wasted money.

Two edges trip up agent builders. First, each breakpoint walks backward at most twenty content blocks looking for a prior entry. In a tool-heavy turn that emits more than twenty tool_use and tool_result blocks, the next request's breakpoint can fall outside that window and silently miss — the fix is an intermediate breakpoint every fifteen or so blocks. Second, a cache entry only becomes readable after the first response begins streaming. Fire N identical requests in parallel at cold start and all N pay the write premium because none can read what the others are still writing. Send one, await its first token, then fan out the rest.

## Verifying the model in your head matches reality

The response `usage` object is your ground truth. `cache_creation_input_tokens` is what you wrote this request at the 1.25x premium; `cache_read_input_tokens` is what you reused at 0.1x; `input_tokens` is the uncached remainder at full price. Total prompt size is the sum of all three — so an agent that ran for hours but shows `input_tokens` of 4,000 was almost entirely served from cache.

If `cache_read_input_tokens` stays at zero across repeated requests that should share a prefix, you have a silent invalidator: a `datetime.now()` in the system prompt, an unsorted `json.dumps()` producing nondeterministic key order, or a tool set that varies per user. Diff the rendered prompt bytes of two consecutive requests and the offending difference jumps out. Treat that diff as the canonical debugging move — the architecture is simple enough that any cache miss has a byte-level cause.

## Frequently asked questions

### Does prompt caching change the model's output?

No. Caching reuses the internal state computed from a prefix; it does not alter sampling or the response distribution. A cached and uncached request with the same prompt and parameters produce statistically equivalent outputs. The only observable difference is latency and the `usage` token breakdown.

### Why is my cache write happening but never being read?

Almost always because the breakpoint sits at the end of content that varies per request. If you place the marker after the unique question rather than after the shared context, every request writes a distinct entry and none is ever reused. Move the breakpoint to the end of the shared portion.

### Is there a minimum size before caching kicks in?

Yes, and it is model-specific. On Opus 4.8 and Opus 4.7 the minimum cacheable prefix is 4,096 tokens; some Sonnet and Haiku models cache from 2,048 or 1,024. Below the threshold the marker is accepted but no entry is written — `cache_creation_input_tokens` comes back zero with no error.

### Do model upgrades invalidate my cache?

Yes. Caches are scoped to the exact model ID. Switching from one Claude model to another, even within the same family, means the first request on the new model writes the cache fresh. Plan a brief warm-up cost into any model migration.

## Bringing agentic AI to your phone lines

CallSphere builds on these same caching internals to keep **voice and chat** agents fast and affordable — assistants that answer every call and message, call tools mid-conversation, and book work around the clock. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/how-claude-prompt-caching-works-internals-and-architecture
