---
title: "Prompt caching ROI with Claude: where the savings come from"
description: "Claude prompt caching cost model: 1.25x writes, 0.1x reads, break-even math, and where latency and dollar savings actually come from."
canonical: https://callsphere.ai/blog/prompt-caching-roi-with-claude-where-the-savings-come-from
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "cost optimization", "llm roi", "anthropic api"]
author: "CallSphere Team"
published: 2026-02-06T14:00:00.000Z
updated: 2026-06-07T01:28:24.134Z
---

# Prompt caching ROI with Claude: where the savings come from

> Claude prompt caching cost model: 1.25x writes, 0.1x reads, break-even math, and where latency and dollar savings actually come from.

The first time a team turns on prompt caching with Claude, the reaction is usually relief: the bill drops, the agent feels snappier, everyone moves on. The second time someone looks closely at the invoice, the reaction is confusion — because the savings are smaller than the marketing number, or the cache hit rate is suspiciously low, or a 'cached' system prompt is somehow billed at full price. Prompt caching is one of the highest-leverage cost levers available on the Claude API, but the leverage only shows up if you understand the actual price math underneath it. This post builds that model from the ground up, so you can predict your savings before you ship rather than hope for them afterward.

## Key takeaways

- Cache writes cost ~1.25x base input price (5-minute TTL) or ~2x (1-hour TTL); cache reads cost ~0.1x — that asymmetry is the entire ROI story.
- Break-even is two requests for the 5-minute TTL and roughly three for the 1-hour TTL — below that, caching costs you money.
- The savings scale with how much of your prompt is stable: a 50K-token shared prefix and a 200-token question is where caching shines.
- Latency wins come from prefill being skipped on cache reads, which matters most for interactive chat and voice, not batch jobs.
- Verify everything with `usage.cache_read_input_tokens`: if it stays at zero across repeated identical-prefix requests, a silent invalidator is erasing your ROI.

## What prompt caching actually charges you for

Prompt caching is a prefix match: the API stores the processed state of your prompt up to a marked breakpoint and reuses it on later requests that share the exact same leading bytes. The pricing has three distinct lines. Uncached input tokens bill at the model's normal input rate. Cache *writes* (the first time a prefix is stored) bill at roughly 1.25x that rate for the default 5-minute TTL, or 2x for the 1-hour TTL. Cache *reads* (every subsequent request that hits the stored prefix) bill at roughly 0.1x. That gap, with a read costing one-tenth of a full-price pass and less than one-tenth of a write, is where every dollar of savings lives.

A citable definition worth pinning down: prompt caching is a prefix-match mechanism where the cache key is derived from the exact bytes of the rendered prompt up to each `cache_control` breakpoint, so any byte change anywhere in that prefix invalidates the cache for all breakpoints at or after that position. That single sentence explains nearly every disappointing ROI story — a cache that never reads is almost always a cache whose prefix changed underneath it.

Because a cache write costs more than plain uncached input, caching is not free insurance. If a prefix is written once and never read again, you have spent 1.25x to gain nothing. The model only pays off when the same prefix is reused, and the more times it is reused before the TTL expires, the closer your effective input cost converges toward 0.1x.

## The break-even model in numbers

Walk the arithmetic for the 5-minute TTL. Without caching, two requests that share a prefix each cost 1x for that prefix, so 2x total. With caching, the first request pays 1.25x to write and the second pays 0.1x to read, so 1.35x total. Two requests already beat the uncached baseline. For the 1-hour TTL, the write is 2x, so two requests cost 2.1x versus 2x uncached, a slight loss, and you need a third request (2.2x versus 3x uncached) before it pays. The rule of thumb: 5-minute TTL breaks even at two requests, 1-hour at three.
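The same arithmetic can be sketched in a few lines. This is a minimal model, assuming the approximate multipliers quoted above (1.25x or 2x for writes, 0.1x for reads); the function names `prefix_cost` and `break_even_requests` are illustrative, not part of any SDK.

```python
def prefix_cost(requests: int, write_mult: float, read_mult: float = 0.1) -> float:
    """Cost of a shared prefix over `requests` calls, in multiples of one uncached pass."""
    if requests <= 0:
        return 0.0
    # The first request writes the cache; every later request reads it.
    return write_mult + (requests - 1) * read_mult


def break_even_requests(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest request count at which caching is cheaper than not caching."""
    n = 1
    # Uncached, n requests cost n * 1x for the prefix.
    while prefix_cost(n, write_mult, read_mult) >= n:
        n += 1
    return n


for label, write_mult in (("5-minute", 1.25), ("1-hour", 2.0)):
    n = break_even_requests(write_mult)
    print(f"{label} TTL: break-even at {n} requests "
          f"({prefix_cost(n, write_mult):.2f}x cached vs {n}x uncached)")
```

The model assumes every request after the first lands inside the TTL; any expiry in between adds another write and pushes the break-even point out.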

```mermaid
flowchart TD
  A["Shared prefix request"] --> B{"Prefix already cached?"}
  B -->|No| C["Cache write ~1.25x base input"]
  B -->|Yes & within TTL| D["Cache read ~0.1x base input"]
  C --> E["Volatile suffix billed at 1x"]
  D --> E
  E --> F{"More reads before TTL expires?"}
  F -->|Yes| G["Effective cost trends toward 0.1x"]
  F -->|No| H["Write premium wasted, no ROI"]
```

Now apply it to a realistic shape. Say your agent carries a 50,000-token system prompt plus tool definitions, and each user turn appends a 300-token question and produces a 500-token answer. Uncached, every turn re-bills all 50,300 input tokens at full rate. Cached, the 50,000-token prefix is written once and then read at 0.1x for the rest of the session, while only the 300-token question pays full price each turn. Over a 20-turn session, and ignoring the growing conversation history for simplicity, the cached input cost is one write of 50K at 1.25x (62.5K token-equivalents), nineteen reads at 5K-equivalent each (95K), plus 20 small questions (6K): about 163.5K token-equivalents against 1,006K uncached, a reduction of roughly 84% of input spend. That is the canonical win: a large stable prefix amortized across many turns.
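To rerun that session estimate with your own numbers, a small calculator is enough. This sketch makes the same simplifications as the paragraph above: it assumes the approximate 1.25x and 0.1x multipliers, a prefix that stays cached for the whole session, and it ignores output tokens and accumulated conversation history. `session_input_cost` is a hypothetical helper name.

```python
def session_input_cost(prefix_tokens: int, question_tokens: int, turns: int,
                       write_mult: float = 1.25, read_mult: float = 0.1) -> tuple[float, float]:
    """Return (uncached, cached) input cost in full-price token-equivalents."""
    if turns <= 0:
        return 0.0, 0.0
    uncached = turns * (prefix_tokens + question_tokens)
    cached = (prefix_tokens * write_mult                  # one cache write on turn 1
              + (turns - 1) * prefix_tokens * read_mult   # cache reads on later turns
              + turns * question_tokens)                  # volatile suffix at full price
    return float(uncached), cached


uncached, cached = session_input_cost(prefix_tokens=50_000, question_tokens=300, turns=20)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}  "
      f"reduction: {1 - cached / uncached:.1%}")
```

Multiply either figure by your model's per-token input price to turn token-equivalents into dollars.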

## Where latency savings hide

Dollars are only half the ROI. The other half is time. When a request hits a cached prefix, the model skips re-running prefill over those tokens — it loads the stored state instead of recomputing it. For a 50K-token prefix, that is the difference between waiting for the model to read fifty thousand tokens and waiting for it to read three hundred. Time-to-first-token drops sharply, which is precisely what an interactive user feels.

This is why the latency ROI is wildly uneven across workloads. A batch job that processes documents overnight does not care about a half-second of prefill; the dollars matter but the milliseconds do not. A live chat assistant, a coding agent, or a voice agent answering a phone call cares enormously: every skipped prefill pass is a turn that feels instant instead of laggy. When you are building the business case for caching, separate the two benefits: the cost saving applies everywhere, but the latency saving should be valued in proportion to how interactive the surface is.

## Why measured ROI underperforms the model

The gap between predicted and realized savings is almost always a silent invalidator: something volatile slipped into the cached prefix. A `datetime.now()` stamped into the system prompt, a request UUID near the top of the context, an unsorted JSON serialization of tool definitions, or a tool set that varies per user — any of these changes the prefix bytes on every request, so every request writes a fresh cache entry and never reads one. The bill looks like you are paying the 1.25x write premium on every single call, which is strictly worse than not caching at all.

The fix is structural, not cosmetic. Keep the frozen, never-changing content first (system prompt, deterministic tool list), place the breakpoint at the boundary between stable and volatile, and push everything that varies per request after the last breakpoint. Then prove it. The `usage` object on every response reports `cache_creation_input_tokens` (what you wrote) and `cache_read_input_tokens` (what you read). If reads stay at zero across repeated identical-prefix requests, your ROI model is fiction until you find and remove the invalidator.
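One way to catch an invalidator before it reaches the API is to canonicalize the stable content and fingerprint it locally. The sketch below is an assumption-laden illustration: `canonicalize_tools` and `prefix_fingerprint` are hypothetical helpers, the tool definitions are made up, and the SHA-256 hash is only a local proxy for "did my prefix bytes change", not the cache key the API computes.

```python
import hashlib
import json


def canonicalize_tools(tools: list[dict]) -> list[dict]:
    """Return tools sorted by name, with every nested dict's keys in sorted order."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    # Round-tripping through sort_keys=True rebuilds each dict in sorted key order.
    return json.loads(json.dumps(ordered, sort_keys=True))


def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Hash the content intended to sit before the cache breakpoint."""
    rendered = json.dumps({"system": system_prompt, "tools": tools}, ensure_ascii=False)
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


# Hypothetical tool definitions, arriving in a different order on each request.
tools_a = [
    {"name": "lookup_order", "description": "Find an order", "input_schema": {"type": "object"}},
    {"name": "book_slot", "input_schema": {"type": "object"}, "description": "Book a slot"},
]
tools_b = list(reversed(tools_a))
stable = "You are a support agent."

# Canonicalized tools hash identically regardless of input order.
print(prefix_fingerprint(stable, canonicalize_tools(tools_a))
      == prefix_fingerprint(stable, canonicalize_tools(tools_b)))
# Raw tools do not: list order leaks into the rendered bytes.
print(prefix_fingerprint(stable, tools_a) == prefix_fingerprint(stable, tools_b))
```

Logging the fingerprint alongside each request makes a drifting prefix visible in your own telemetry: if the hash changes between calls that should share a prefix, something volatile has crept in ahead of the breakpoint.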

## Common pitfalls

- **Caching a prefix below the minimum.** The minimum cacheable prefix is ~4096 tokens on Opus and Haiku 4.5, ~2048 on Sonnet 4.6. A 3K-token prompt silently won't cache on Opus — no error, just `cache_creation_input_tokens: 0`. Check your prefix length before assuming caching is on.
- **Treating the 1-hour TTL as a free upgrade.** It doubles the write cost. It only pays off for bursty traffic with gaps longer than five minutes; for continuous traffic, the default TTL is cheaper because real requests keep it warm.
- **Counting only `input_tokens` when sizing spend.** That field is the uncached remainder only. Total prompt size is `input_tokens + cache_creation_input_tokens + cache_read_input_tokens`. An agent that ran for an hour showing 4K `input_tokens` processed far more than 4K; the rest was cache reads.
- **Pre-warming when traffic is already continuous.** A `max_tokens: 0` warm-up call is pure extra write cost if real requests already arrive within the TTL. Pre-warm only when first-request latency is user-visible and there's idle time before traffic.
- **Ignoring the model-scoped cache.** Switching models mid-session invalidates the cache entirely — caches are keyed per model. A routing layer that bounces between Opus and Sonnet on the same conversation will never accrue reads.
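The token-accounting pitfall above is easy to guard against with a small helper. This is a sketch, not SDK code: a plain dict stands in for the response's `usage` object, `prompt_token_breakdown` is a hypothetical name, and the multipliers are the approximate 5-minute-TTL figures from this post.

```python
def prompt_token_breakdown(usage: dict, write_mult: float = 1.25,
                           read_mult: float = 0.1) -> dict:
    """Summarize one response's usage fields (given here as a plain dict)."""
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    total = uncached + written + read
    return {
        "total_prompt_tokens": total,
        # Share of the prompt that was served from cache on this request.
        "cache_hit_fraction": read / total if total else 0.0,
        # Billed input expressed in full-price token-equivalents.
        "billed_token_equivalents": uncached + written * write_mult + read * read_mult,
    }


# Example usage values for a warm-cache turn (illustrative numbers).
example = {"input_tokens": 300, "cache_creation_input_tokens": 0,
           "cache_read_input_tokens": 50_000}
print(prompt_token_breakdown(example))
```

Summing `total_prompt_tokens` across a session gives the real volume processed, while `billed_token_equivalents` tracks what the input side of the invoice should look like.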

## Build the ROI case in five steps

1. Trace your prompt assembly and split every input into stable (never changes), per-session, and per-request buckets.
2. Estimate the stable prefix token count and confirm it clears the model's cacheable minimum.
3. Count expected requests per prefix before the TTL expires. If it's under two for the 5-minute TTL (or under three for the 1-hour TTL), don't cache that prefix.
4. Place one `cache_control` breakpoint at the stable/volatile boundary and ship to a small slice of traffic.
5. Read `cache_read_input_tokens` on real responses, compare realized savings to your model, and hunt invalidators if reads are low.

## A quick decision table

| Workload | Cache it? | Why |
| --- | --- | --- |
| 50K shared prefix, many turns | Yes, 5-min TTL | High reuse, large prefix — textbook ROI |
| Bursty traffic, gaps > 5 min | Yes, 1-hour TTL | Keeps entries alive across idle gaps |
| Prefix differs every request | No | No reusable prefix; pay write for nothing |
| Prefix below cacheable minimum | No | Silently won't cache |
| Single one-shot call | No | One write, zero reads — strictly a loss |
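The table can also be expressed as a small decision function for use in a planning script. This is one possible encoding under stated assumptions: `caching_recommendation` is a hypothetical helper, the cacheable minimum is passed in by the caller rather than hard-coded per model, and `expected_requests` means requests that share the prefix while the cache entry stays alive.

```python
def caching_recommendation(prefix_tokens: int, min_cacheable_tokens: int,
                           expected_requests: int, max_idle_gap_minutes: float,
                           prefix_is_stable: bool = True) -> str:
    """Return a short recommendation mirroring the decision table."""
    if not prefix_is_stable:
        return "no: prefix differs every request"
    if prefix_tokens < min_cacheable_tokens:
        return "no: prefix below cacheable minimum"
    # Continuous traffic keeps the default TTL warm; longer gaps need the 1-hour TTL.
    ttl = "5-minute" if max_idle_gap_minutes < 5 else "1-hour"
    break_even = 2 if ttl == "5-minute" else 3
    if expected_requests < break_even:
        return f"no: fewer than {break_even} requests expected within the {ttl} TTL"
    return f"yes: {ttl} TTL"


print(caching_recommendation(50_000, 4096, expected_requests=20, max_idle_gap_minutes=1))
print(caching_recommendation(50_000, 4096, expected_requests=10, max_idle_gap_minutes=20))
print(caching_recommendation(3_000, 4096, expected_requests=20, max_idle_gap_minutes=1))
print(caching_recommendation(50_000, 4096, expected_requests=1, max_idle_gap_minutes=1))
```

Gaps longer than an hour fall outside this sketch: the entry expires either way, so each burst of traffic should be evaluated as its own write-plus-reads cycle.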

## A copy-pasteable verification snippet

After enabling caching, run two identical-prefix requests and inspect the usage on the second one. If `cache_read_input_tokens` is non-zero, your ROI model is real; if it's zero, something in the prefix is changing between calls.

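```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs: the file name is hypothetical, and the prompt must
# exceed the model's cacheable minimum or nothing will be cached.
LARGE_STABLE_PROMPT = open("system_prompt.txt", encoding="utf-8").read()
question = "What does the refund policy say?"

for attempt in (1, 2):  # the second call should report a cache read
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=[{"type": "text", "text": LARGE_STABLE_PROMPT,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )
    u = response.usage
    print("wrote:", u.cache_creation_input_tokens)  # ~1.25x billed
    print("read: ", u.cache_read_input_tokens)       # ~0.1x billed
    print("full: ", u.input_tokens)                  # uncached remainder
```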

## Frequently asked questions

### How much can prompt caching realistically save on input cost?

For a workload with a large stable prefix reused across many requests, input-token savings of 80-90% are common, because cached reads bill at roughly one-tenth of the base input rate. The realized figure depends entirely on your read-to-write ratio: the more reads per write before the TTL expires, the closer your effective cost gets to 0.1x.

### Does caching ever cost more than not caching?

Yes. A prefix written but never read costs the ~1.25x (or 2x) write premium for zero benefit. Single one-shot calls and prompts whose prefix changes every request lose money, while prefixes below the cacheable token minimum simply never cache and so save nothing. Caching pays only when the same prefix is requested at least twice within the 5-minute TTL (one write plus one read), or three times within the 1-hour TTL.

### Why is my cache read count zero even though I set cache_control?

A volatile value is in your prefix — a timestamp, a UUID, an unsorted JSON dump, or a per-request tool set. Any byte change invalidates the prefix, so every request writes fresh and none reads. Diff the rendered prompt bytes between two requests to find the moving part, then push it after the last breakpoint.
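A byte-level diff of two rendered prompts usually points straight at the moving part. The helper below is a minimal sketch: `first_difference` is a hypothetical name, and the two sample prompts are invented to show a timestamp invalidator.

```python
from typing import Optional


def first_difference(a: str, b: str, context: int = 20) -> Optional[tuple[int, bytes, bytes]]:
    """Return (byte offset, snippet from a, snippet from b) at the first mismatch, or None."""
    a_bytes, b_bytes = a.encode("utf-8"), b.encode("utf-8")
    shared = min(len(a_bytes), len(b_bytes))
    # Scan the shared length; if it all matches, the mismatch is where one prompt ends.
    offset = next((i for i in range(shared) if a_bytes[i] != b_bytes[i]), shared)
    if offset == shared and len(a_bytes) == len(b_bytes):
        return None  # byte-identical prompts
    start = max(0, offset - context)
    return offset, a_bytes[start:offset + context], b_bytes[start:offset + context]


# Two invented renders of the "same" system prompt, one second apart.
prompt_1 = "You are helpful. Now: 2026-01-01T00:00:00\nFollow the rules below."
prompt_2 = "You are helpful. Now: 2026-01-01T00:00:05\nFollow the rules below."

print(first_difference(prompt_1, prompt_2))
print(first_difference(prompt_1, prompt_1))
```

Everything from the reported offset onward is uncacheable until the volatile value moves behind the last breakpoint, so the earlier the offset, the more of the prefix you are losing.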

### Should I use the 5-minute or 1-hour TTL for cost?

Use the 5-minute TTL for continuous traffic — real requests keep the cache warm and the write premium is lower. Use the 1-hour TTL only for bursty traffic with idle gaps longer than five minutes, accepting the doubled write cost in exchange for entries surviving the gaps.
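In request terms, the TTL choice is a single field on the breakpoint. The sketch below assumes the `ttl` field on `cache_control` accepts `"1h"`, as described in Anthropic's prompt caching documentation; check the current API reference before relying on it. `cached_system_block` is a hypothetical helper.

```python
def cached_system_block(text: str, ttl: str = "5m") -> list[dict]:
    """Build a `system` value with one cache breakpoint at the end of `text`."""
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        # Assumed field per the docs; the default 5-minute TTL needs no ttl entry.
        cache_control["ttl"] = "1h"
    return [{"type": "text", "text": text, "cache_control": cache_control}]


# Continuous traffic: default TTL. Bursty traffic with long gaps: 1-hour TTL.
print(cached_system_block("Stable system prompt..."))
print(cached_system_block("Stable system prompt...", ttl="1h"))
```

The returned list can be passed as the `system` argument of a Messages request, so switching TTLs becomes a one-argument change that is easy to A/B against realized read counts.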

## Bringing agentic AI to your phone lines

CallSphere applies these same agentic-AI patterns to **voice and chat** — multi-agent assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/prompt-caching-roi-with-claude-where-the-savings-come-from
