---
title: "Measuring Prompt Caching Success With Claude"
description: "Measure prompt caching with Claude: the cache-read ratio, time-to-first-token, cost-per-request, and the signals that prove caching is actually working."
canonical: https://callsphere.ai/blog/measuring-prompt-caching-success-with-claude
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "metrics", "observability", "latency"]
author: "CallSphere Team"
published: 2026-02-06T18:09:33.000Z
updated: 2026-06-07T01:28:24.175Z
---

# Measuring Prompt Caching Success With Claude

> Measure prompt caching with Claude: the cache-read ratio, time-to-first-token, cost-per-request, and the signals that prove caching is actually working.

You can turn on prompt caching with Claude and feel like something good happened — the responses seem snappier, the bill seems lower — without ever knowing whether it is actually working. Caching is invisible by design: there is no success banner, no error when it fails. If you do not measure it deliberately, you are running on vibes, and vibes are exactly how teams end up paying the cache-write premium for months while believing they are saving money. This post is about the metrics and signals that turn caching from a hope into a fact.

We will define the small set of numbers that actually matter, show where they come from, and explain how to read them so you can prove caching is delivering — and catch the moment it quietly stops.

## Key takeaways

- The single most important metric is the **cache-read ratio**: cache-read tokens as a share of total input tokens on a route.
- Pair it with **time-to-first-token** (latency) and **cost-per-request** (economics) to capture both halves of the win.
- All the raw numbers come straight from the **usage fields** Claude returns on every response.
- Track metrics **per route or prefix**, not globally — a healthy average can hide a dead cache on one path.
- The most valuable alert is a **drop in cache-read ratio**, which is your early warning of a silent miss.

## The four numbers that prove caching works

Prompt caching is a feature that reuses the processed form of a stable prompt prefix to reduce latency and input cost, and its effectiveness is fully observable through four measurements. The first and most diagnostic is the **cache-read ratio**: of all the input tokens you sent on a route, what fraction were served from cache rather than freshly processed? Here "all the input tokens" means fresh, cache-write, and cache-read tokens added together, not the `input_tokens` usage field alone, which counts only the uncached portion. A high ratio means your hot prefix is warm and being reused; a low one means you are paying full price. This is the number to put on the wall.

The second is **time-to-first-token (TTFT)**. Caching's latency benefit comes from skipping the processing of the cached prefix, which is precisely the work that happens before the first output token streams. So TTFT is where caching shows up in user experience. Measure it at p50 and p95 on the cached route, before and after enabling caching; the difference should be roughly the prefix-processing time you no longer pay on a cache hit.
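
As a rough illustration of the TTFT measurement, the sketch below times the gap between sending a request and receiving the first streamed chunk, then summarizes a batch of samples at p50 and p95. It is a minimal sketch under stated assumptions: `start_stream` is a hypothetical zero-argument callable standing in for whatever streaming client you use, and the function names are illustrative rather than part of any SDK.

```python
import time
from statistics import quantiles
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def consume_with_ttft(
    start_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], str]:
    """Run one streamed request and return (TTFT in seconds, full text).

    `start_stream` is a hypothetical callable that sends the request and
    returns an iterable of text chunks.
    """
    started = time.perf_counter()  # clock starts just before the request is sent
    ttft: Optional[float] = None
    parts: List[str] = []
    for chunk in start_stream():
        if ttft is None:
            ttft = time.perf_counter() - started  # first chunk has arrived
        parts.append(chunk)
    return ttft, "".join(parts)  # ttft stays None if nothing was streamed


def p50_p95(samples: List[float]) -> Dict[str, float]:
    """Median and 95th percentile of TTFT samples (needs at least two)."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}
```

Collect one sample per request on the cached route, then compare the `p50_p95` summary for a window before caching was enabled against a window after.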

The third is **cost-per-request**, computed from token usage and the relative prices of fresh input, cache-write, and cache-read tokens. The fourth is a **quality signal** — an eval pass rate or answer-match check — that confirms caching did not change outputs. That last one is easy to forget and important to keep: it is what lets you trust that a cost win is not secretly a quality loss.

```mermaid
flowchart TD
  A["Claude response with usage fields"] --> B["Extract input, cache_read, cache_creation tokens"]
  B --> C["Compute cache-read ratio per route"]
  A --> D["Record time-to-first-token p50 / p95"]
  B --> E["Compute cost-per-request from token mix"]
  C --> F{"Ratio above baseline?"}
  F -->|No| G["Alert: silent miss"]
  F -->|Yes| H["Dashboard: caching healthy"]
  E --> H
  D --> H
```

## Where the numbers come from

You do not need a special metrics product; Claude hands you the raw data on every response. The usage object distinguishes the token types you need. Read these fields and you can compute everything else:

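```python
def caching_metrics(usage: dict) -> dict:
    """Derive per-request caching metrics from a Claude usage payload."""
    # `or 0` also covers fields that are present but null.
    read = usage.get("cache_read_input_tokens") or 0
    write = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens") or 0  # uncached input only
    # Writes stay in the denominator so a write-only request scores near zero.
    total_in = read + write + fresh
    return {
        "cache_read_ratio": read / total_in if total_in else 0.0,
        "cache_write_tokens": write,
        "fresh_input_tokens": fresh,
        "output_tokens": usage.get("output_tokens") or 0,
    }
```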

The `cache_read_ratio` this returns is your headline metric. On a well-cached hot path it should be high and steady; if it collapses, something changed your prefix. The function deliberately includes write tokens in the denominator so that a request that only ever writes — the silent-miss case — shows a read ratio near zero rather than being hidden. Emit these per request, tag them with the route name, and aggregate.

For TTFT, instrument the time between sending the request and receiving the first streamed token. For cost-per-request, multiply each token category by its price and sum. Both are cheap to add and both belong on the same dashboard as the read ratio, because together they tell the whole story: is the cache warm, is it making things faster, and is it making things cheaper?
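
The cost calculation described above can be sketched as a weighted sum over the four token categories. This is an illustrative sketch, not a pricing reference: the `Prices` class and `cost_per_request` function are hypothetical names, and the numbers in the example are placeholders chosen for easy arithmetic, so substitute the per-million-token prices from your current price sheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Per-million-token prices. All values here are caller-supplied."""
    fresh_input: float
    cache_write: float
    cache_read: float
    output: float


def cost_per_request(usage: dict, prices: Prices) -> float:
    """Price each token category separately, then sum."""
    fresh = usage.get("input_tokens") or 0
    write = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    out = usage.get("output_tokens") or 0
    total = (
        fresh * prices.fresh_input
        + write * prices.cache_write
        + read * prices.cache_read
        + out * prices.output
    )
    return total / 1_000_000  # prices are quoted per million tokens


# Placeholder prices (assumption): writes cost more than fresh input, reads far less.
example_prices = Prices(fresh_input=4.0, cache_write=5.0, cache_read=0.4, output=20.0)
warm = {"input_tokens": 1_000, "cache_read_input_tokens": 9_000, "output_tokens": 500}
cold = {"input_tokens": 10_000, "output_tokens": 500}
warm_cost = cost_per_request(warm, example_prices)
cold_cost = cost_per_request(cold, example_prices)
```

Trending `warm_cost` against `cold_cost` per route is what turns the read ratio into a dollar figure; with these placeholder prices the warm request comes out at roughly a third of the cold one.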

## Read it per route, not globally

A global cache-read ratio is comforting and misleading. Imagine you have one massive hot path that caches beautifully and a dozen smaller paths where caching does nothing. The aggregate ratio looks great, dominated by the hot path, while a freshly broken prefix on a secondary route bleeds money invisibly under the average. The fix is to slice every metric by route or by prefix identity, so each cached object has its own read ratio, its own TTFT, and its own cost line.

This per-route view is what makes regressions findable. When someone edits a shared prefix and breaks caching for one feature, the global number barely moves but that feature's read ratio drops off a cliff. An alert on per-route ratio catches it in minutes; a global alert might never fire. The discipline is the same one you apply to any cache: measure hit rate per cache, not across all caches at once.

| Metric | What it tells you | Healthy signal |
| --- | --- | --- |
| Cache-read ratio | Is the prefix actually being reused? | High and steady per hot route |
| Time-to-first-token | Did latency improve for users? | p95 drops after enabling |
| Cost-per-request | Are the economics working? | Falls on cached routes |
| Eval pass rate | Did caching change outputs? | Unchanged from baseline |
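
To make the per-route slicing concrete, here is a minimal sketch that aggregates token counts by route and computes a token-weighted read ratio for each one. The function name, the `(route, usage)` record shape, and the route names in the example are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def read_ratio_by_route(records: Iterable[Tuple[str, dict]]) -> Dict[str, float]:
    """Token-weighted cache-read ratio per route from (route, usage) records."""
    read_tokens: Dict[str, int] = defaultdict(int)
    total_tokens: Dict[str, int] = defaultdict(int)
    for route, usage in records:
        read = usage.get("cache_read_input_tokens") or 0
        write = usage.get("cache_creation_input_tokens") or 0
        fresh = usage.get("input_tokens") or 0
        read_tokens[route] += read
        total_tokens[route] += read + write + fresh  # writes count in the denominator
    return {
        route: (read_tokens[route] / total if total else 0.0)
        for route, total in total_tokens.items()
    }


# Hypothetical traffic: one large healthy route, one small route that only writes.
records = [
    ("support_chat", {"input_tokens": 10_000, "cache_read_input_tokens": 90_000}),
    ("invoice_lookup", {"input_tokens": 500, "cache_creation_input_tokens": 5_000}),
]
ratios = read_ratio_by_route(records)
```

In this made-up data the blended ratio across both routes is about 0.85, which looks healthy, while `invoice_lookup` on its own sits at 0.0. That gap is the reason to alert on the per-route values.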

## Common pitfalls in measuring caching

- **Watching only the bill.** The invoice lags and aggregates; by the time it moves, a broken cache has bled for weeks. Watch the read ratio in near-real-time instead.
- **Reporting a global ratio.** Averages hide dead caches on individual routes. Always slice per route or prefix.
- **Ignoring the write tokens.** If you compute hit rate without including cache-write tokens, a pure-write silent miss can look fine. Put writes in the denominator.
- **No quality baseline.** Without an eval, you cannot distinguish a legitimate cost win from a quality regression that happened to coincide.
- **Measuring once and forgetting.** Caching health drifts as prefixes change; metrics must be continuous, not a one-time check at launch.

## Stand up caching metrics in five steps

1. Emit the three token counts (fresh input, cache-read, cache-write) plus route name on every Claude call.
2. Compute and chart the cache-read ratio per route; this is your primary health metric.
3. Instrument time-to-first-token at p50 and p95 on cached routes and compare before/after.
4. Derive cost-per-request from the token mix and trend it per route.
5. Alert when any route's cache-read ratio falls below its established baseline, and keep an eval running to guard quality.
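
The alert in step 5 can be sketched as a comparison between each route's current read ratio and its recorded baseline. This is one possible shape, not a prescribed one: the function name, the absolute `tolerance` of 0.10, and the choice to treat a route with no current data as a ratio of 0.0 are all assumptions you would tune to your own traffic.

```python
from typing import Dict, List


def routes_below_baseline(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.10,  # assumed absolute drop allowed before alerting
) -> List[str]:
    """Return routes whose read ratio fell more than `tolerance` below baseline."""
    return sorted(
        route
        for route, expected in baseline.items()
        # Assumption: a route missing from `current` is treated as ratio 0.0.
        if current.get(route, 0.0) < expected - tolerance
    )
```

A small tolerance keeps the brief dip after an intentional prefix change from paging anyone, while a sustained step down still fires.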

## What "good" looks like over time

A healthy caching deployment has a characteristic shape on the dashboard. The cache-read ratio for each hot route is high and flat, with brief, explainable dips right after intentional prefix changes — a deploy that updates the system prompt will momentarily cold-start the cache, then recover as it re-warms. TTFT on those routes is visibly lower than the uncached baseline and stable. Cost-per-request sits well below the pre-caching line. And the eval pass rate is pinned to baseline, confirming nothing about quality moved.

The signals that demand action are equally characteristic: a read ratio that steps down and stays down (a prefix went non-deterministic or a volatile field leaked in), a TTFT that creeps back up (the cache is cold-starting too often), or a cost line that rises despite steady traffic (you are paying write premiums without reads). Each of those has a clear diagnosis precisely because you measured per route. The point of metrics is not to admire them when they are green; it is to make the failure modes loud enough to catch before the invoice does.

## Frequently asked questions

### What is the single best metric for prompt caching?

The cache-read ratio per route — cache-read tokens divided by total input tokens for that path. It directly answers the question that matters: is the stable prefix actually being reused? Everything else (latency, cost) follows from it, and a drop in this ratio is the earliest warning that caching has silently broken.

### Where do I get the data to compute these metrics?

From the usage fields Claude returns on every response: input tokens, cache-creation tokens, and cache-read tokens, plus output tokens. You compute the read ratio and cost directly from those, and you instrument time-to-first-token in your own client. No special tooling is required beyond logging.

### Why measure per route instead of overall?

Because a global average is dominated by your busiest path and can completely hide a broken cache on a smaller route that is quietly bleeding money. Per-route metrics make individual regressions visible and alertable, which is the whole point of measuring at all.

### How do I prove caching didn't hurt answer quality?

Keep an eval suite running the same inputs and asserting the outputs match the pre-caching baseline. Because caching changes only how the prefix is processed, quality should be identical — and having that check on record lets you confidently attribute any future quality issue to something other than caching.
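
One way to express that check is a pass rate over a fixed set of eval cases. The sketch below is illustrative only: `eval_pass_rate` and `normalized_equal` are hypothetical names, and the default matcher (equality after collapsing whitespace and case) is a placeholder for whatever grader your eval suite already uses.

```python
from typing import Callable, Dict


def normalized_equal(expected: str, actual: str) -> bool:
    """Placeholder matcher: equal after collapsing whitespace and case."""
    return " ".join(expected.split()).lower() == " ".join(actual.split()).lower()


def eval_pass_rate(
    baseline: Dict[str, str],
    candidate: Dict[str, str],
    matches: Callable[[str, str], bool] = normalized_equal,
) -> float:
    """Share of baseline cases whose candidate output matches; missing cases fail."""
    if not baseline:
        return 0.0
    passed = sum(
        1
        for case_id, expected in baseline.items()
        if case_id in candidate and matches(expected, candidate[case_id])
    )
    return passed / len(baseline)
```

Run it once against pre-caching outputs to record the baseline pass rate, then keep running it on the cached route and chart the result next to the read ratio.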

## Bringing agentic AI to your phone lines

CallSphere instruments its agentic **voice and chat** assistants with this same metric discipline (read ratios, latency percentiles, and quality evals) so the agents that answer calls and book work around the clock stay fast, cheap, and correct. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/measuring-prompt-caching-success-with-claude