Measuring prompt caching success in Claude Code
Caching failures are silent. Track cache hit rate, cost per completed task, latency, and eval pass rate to prove your Claude Code agent is working.
If prompt caching is the thing that makes a Claude Code agent affordable, then "is caching working?" is one of the most important production questions you can ask — and most teams cannot answer it. They know their agent runs. They have a vague sense the bill is lower than it would be otherwise. But they have no instrumentation that would tell them the morning the cache silently stopped helping. This post is about the metrics and signals that prove caching is earning its keep, and how to wire them so a regression is obvious instead of invisible.
The reason this is worth a whole article is that caching failures are quiet. Nothing breaks. The agent still answers, still calls tools, still ships. The only symptom is that everything costs more and runs slower, and unless you are measuring the right things, you will not notice until a finance review or an angry user. Good measurement turns that silent failure into a loud one.
The one metric that matters most: cache hit rate
The headline signal is the fraction of your input tokens that were served from cache rather than recomputed. A healthy long-running agent should be reading the overwhelming majority of its prefix from cache on every turn after the first. If that fraction drops, something volatile has crept into your prefix or your cache is expiring between turns.
Track cache hit rate per agent and per session, not just as a global average, because the global number can look fine while one agent thrashes badly. A useful framing: the first turn of a session is your cache-warming cost, and every turn after that should be mostly cache reads. If late turns in a session are not benefiting, your breakpoint or your prompt ordering is wrong.
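As a minimal sketch of the per-session view described above, the snippet below computes hit rate from per-turn usage records. It assumes each record carries a `usage` dict with the token-count fields `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`; the `session_id` key, the record layout, and the function names are hypothetical, and turns are assumed to arrive in chronological order.

```python
from collections import defaultdict


def cache_hit_rate(usage: dict) -> float:
    """Fraction of one response's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


def hit_rate_by_session(turns: list) -> dict:
    """Aggregate hit rate per session, skipping each session's first turn.

    The first turn is treated as cache warming, so it is excluded.
    Assumes `turns` is ordered chronologically (hypothetical record layout).
    """
    totals = defaultdict(lambda: [0, 0])  # session_id -> [cache reads, all input]
    seen = set()
    for turn in turns:
        sid = turn["session_id"]
        if sid not in seen:
            seen.add(sid)  # first turn of the session: warming cost, not counted
            continue
        usage = turn["usage"]
        read = usage.get("cache_read_input_tokens", 0)
        total = (
            read
            + usage.get("cache_creation_input_tokens", 0)
            + usage.get("input_tokens", 0)
        )
        totals[sid][0] += read
        totals[sid][1] += total
    return {sid: (r / t if t else 0.0) for sid, (r, t) in totals.items()}
```

Grouping by session rather than averaging globally is what lets a single thrashing agent stand out; the same grouping could be keyed on an agent name instead.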
Cost per completed task, not cost per call
The number executives care about is cost, but the naive version, cost per API call, is misleading for agents. An agent completes work over many calls, so the honest unit is cost per completed task: per fixed bug, per resolved ticket, per booked appointment. Caching's entire value shows up at this level, because it makes the input side of the second-through-hundredth call in a task far cheaper: the cached prefix is billed at a reduced cache-read rate, while new input and output tokens are still charged in full.
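One way to compute that unit is sketched below. The `Prices` values are placeholders to be replaced with your actual per-million-token rates, and the task record layout (`calls`, `completed`) is a hypothetical logging format, not something the tooling emits for you.

```python
from dataclasses import dataclass


@dataclass
class Prices:
    """Per-million-token prices. Hypothetical fields; fill in your real rates."""
    input: float
    cache_write: float
    cache_read: float
    output: float


def call_cost(usage: dict, p: Prices) -> float:
    """Cost of a single API call from its token usage."""
    return (
        usage.get("input_tokens", 0) * p.input
        + usage.get("cache_creation_input_tokens", 0) * p.cache_write
        + usage.get("cache_read_input_tokens", 0) * p.cache_read
        + usage.get("output_tokens", 0) * p.output
    ) / 1_000_000


def cost_per_completed_task(tasks: list, p: Prices) -> float:
    """Total spend across all tasks divided by the number that completed.

    Spend on abandoned or failed tasks stays in the numerator on purpose:
    it is still money paid to get the completed work.
    """
    total = sum(call_cost(u, p) for t in tasks for u in t["calls"])
    completed = sum(1 for t in tasks if t["completed"])
    return total / completed if completed else float("inf")
```

Keeping failed tasks in the numerator is a design choice: it stops the metric from improving just because the agent gives up more often.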
flowchart TD
A["Agent session"] --> B["Measure: cache hit rate per turn"]
A --> C["Measure: cost per completed task"]
A --> D["Measure: latency to first action"]
B --> E{"Hit rate dropped?"}
C --> F{"Cost per task up?"}
E -->|Yes| G["Volatile content in prefix or cache expiring"]
F -->|Yes| G
D --> H{"Latency up?"}
H -->|Yes| G
G --> I["Alert tied to prefix version, investigate & roll back"]
The diagram shows the three measurements converging on a single conclusion. Cache hit rate, cost per task, and latency-to-first-action are not independent dashboards; they are three windows onto the same underlying health. When any of them degrades, the cause is usually the same — the cache stopped helping — and the response is the same: find the prefix change that did it.
Latency as a caching signal
Latency is the metric your users feel and the one that builds or destroys trust in an agent. It is also a clean proxy for cache health, because reprocessing a large prefix is slow. When caching works, time-to-first-action is short because the model is not re-digesting a hundred thousand tokens. When it breaks, the agent gets noticeably sluggish before the bill even arrives.
Measure latency to the agent's first meaningful action, not just total session time, since that early moment is where prefix reprocessing dominates. A rising first-action latency is often the earliest visible symptom of a caching regression, sometimes hours before a cost report would surface it.
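A rough sketch of that measurement follows. The event types (`request_start`, `tool_call`, `final_answer`), the `t` timestamp field, and the 1.5x regression ratio are all assumptions made for illustration; events are assumed to be in chronological order.

```python
import statistics
from typing import Optional


def first_action_latency(events: list) -> Optional[float]:
    """Seconds from request start to the first tool call or final answer.

    `events` is a hypothetical chronological trace for one turn, where each
    event is a dict with a `type` and a timestamp `t` in seconds.
    """
    start = next((e["t"] for e in events if e["type"] == "request_start"), None)
    first = next(
        (e["t"] for e in events if e["type"] in ("tool_call", "final_answer")),
        None,
    )
    if start is None or first is None:
        return None
    return first - start


def latency_regressed(baseline: list, recent: list, ratio: float = 1.5) -> bool:
    """True when the recent median exceeds the baseline median by `ratio`."""
    if not baseline or not recent:
        return False
    return statistics.median(recent) > ratio * statistics.median(baseline)
```

Comparing medians keeps one slow outlier from tripping the check; a percentile such as p95 may suit you better if tail latency is what users complain about.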
Quality signals that caching has not degraded behavior
Cost and speed are only half the picture. You also need to prove that the agent is still doing the right thing, because a cache redesign can change behavior subtly. The discipline here is a small, continuously-run eval suite: a fixed set of representative tasks with known-good outcomes, executed against production regularly, scoring whether the agent produced the right tool calls and results.
For a coding agent that might be a set of bugs with known fixes; for a support agent, a set of tickets with correct resolutions. The point is to have a behavioral tripwire that fires if a prompt or caching change quietly made the agent worse. Cost going down while quality drops is not a win, and only a live eval suite catches that combination.
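The eval suite can be very small. The sketch below scores a fixed set of cases on both the tool calls made and the final outcome; `run_agent` is a hypothetical callable standing in for however you invoke your agent, and the case fields (`prompt`, `expected_tool_calls`, `check`) are illustrative names.

```python
from typing import Callable


def eval_pass_rate(cases: list, run_agent: Callable[[str], dict]) -> float:
    """Share of eval cases where the agent made the expected tool calls
    and produced an output that passes the case's own check.

    `run_agent` is a hypothetical hook: it takes a prompt and returns a dict
    with `tool_calls` (list of tool names) and `output` (string).
    """
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        result = run_agent(case["prompt"])
        tools_ok = result.get("tool_calls") == case["expected_tool_calls"]
        outcome_ok = case["check"](result.get("output", ""))
        if tools_ok and outcome_ok:
            passed += 1
    return passed / len(cases)
```

Exact matching on the tool-call sequence is deliberately strict here; if your agent has several valid paths to the same fix, a looser comparison may be more appropriate.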
Tying every signal to the prefix version
The metrics are only actionable if you can attribute a change to a cause. Stamp every agent action with the version of the cached prefix that produced it, and tag your cost, latency, and quality metrics with that version. Then any regression points directly at a specific change. Without versioning, you see that costs rose last Tuesday but cannot say which edit did it, and your investigation becomes archaeology.
This is the same attribution discipline good teams already apply to deploys. Applied to the cached prefix, it turns a fuzzy "the agent feels worse lately" into "version 47 dropped cache hit rate from 94% to 60%, here is the diff." That specificity is the whole payoff of measuring well.
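One simple way to get a prefix version is to hash the content that forms the cached prefix and tag every metric record with it. This is a sketch under assumptions: it treats the prefix as a system prompt plus tool definitions, and the record fields (`prefix_version`, the metric name) are hypothetical. A sequential number like "version 47" from your deploy system would work just as well.

```python
import hashlib
import json
from collections import defaultdict


def prefix_version(system_prompt: str, tools: list) -> str:
    """Short content hash of the prefix, usable as a version stamp.

    Keys are not sorted before hashing, so a change in key or tool order
    yields a new version, on the assumption that serialized order matters
    for a prefix match.
    """
    payload = json.dumps({"system": system_prompt, "tools": tools})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def mean_by_version(records: list, metric: str) -> dict:
    """Average a metric per prefix version so a regression points at one change."""
    sums = defaultdict(lambda: [0.0, 0])  # version -> [running total, count]
    for record in records:
        bucket = sums[record["prefix_version"]]
        bucket[0] += record[metric]
        bucket[1] += 1
    return {version: total / n for version, (total, n) in sums.items()}
```

With records tagged this way, comparing `mean_by_version(records, "hit_rate")` across two adjacent versions gives the kind of before-and-after statement described above.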
Putting it together as a scorecard
A practical agent scorecard has four lines: cache hit rate (trending high and stable), cost per completed task (trending down or flat as volume grows), latency to first action (low and stable), and eval pass rate (high and stable). Reviewed together each week against prefix versions, those four numbers tell you whether caching is delivering and warn you the moment it stops. Any one of them moving the wrong way is your signal to investigate before the problem compounds across thousands of sessions.
The teams that win with agentic AI are not the ones with the cleverest prompts. They are the ones who can see what their agent costs and how well it works, in near real time, attributed to a cause. Measurement is what converts "caching is everything" from a slogan into an operational fact.
Leading versus lagging signals
It helps to sort your four scorecard lines into leading and lagging signals, because they serve different jobs. Cache hit rate and latency to first action are leading indicators: they move the instant something volatile enters the prefix, often within the first sessions after a bad deploy. Cost per completed task and eval pass rate are lagging: they confirm the damage but take longer to accumulate enough signal to be trustworthy. A mature setup alerts on the leading signals for speed and reviews the lagging ones for confirmation, so you react fast without overreacting to noise.
The trap is to monitor only the lagging signals because they are the ones leadership asks about. By the time cost per task has visibly risen, thousands of sessions have already run expensively. By the time eval pass rate has dropped enough to be statistically clear, the degraded agent has been touching users for a while. Watching cache hit rate and first-action latency closely is what buys you the hours that turn a contained incident into a non-event. The lagging metrics tell you the story afterward; the leading ones let you change the ending.
Avoiding metric theater
One caution: it is easy to build a beautiful dashboard that nobody acts on. Metrics only manage risk if they are wired to a response. Each of the four lines should have a defined threshold, an owner, and a known first action when it trips — not a vague intention to "look into it." A cache-hit-rate alert that pages the context owner with the suspect prefix diff attached is worth more than ten dashboards admired in a weekly review and forgotten. Instrument for action, not for decoration, and the measurement actually earns its place.
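To make "threshold, owner, first action" concrete, the sketch below encodes each scorecard line as a rule and turns a scorecard into a list of actionable alerts. Every threshold, owner name, and action string here is a made-up placeholder, not a recommended value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    higher_is_better: bool
    owner: str
    first_action: str

    def tripped(self, value: float) -> bool:
        """A rule trips when the value is on the wrong side of its threshold."""
        if self.higher_is_better:
            return value < self.threshold
        return value > self.threshold


# Hypothetical thresholds and owners, for illustration only.
RULES = [
    AlertRule("cache_hit_rate", 0.80, True, "context-owner",
              "diff current prefix against last good version"),
    AlertRule("first_action_latency_s", 5.0, False, "context-owner",
              "check hit rate for the same prefix version"),
    AlertRule("cost_per_task_usd", 2.0, False, "platform-lead",
              "compare cost by prefix version"),
    AlertRule("eval_pass_rate", 0.95, True, "agent-owner",
              "roll back to last passing prefix version"),
]


def evaluate(scorecard: dict, rules: list = RULES) -> list:
    """Return one alert line per tripped rule, naming the owner and first action."""
    return [
        f"{r.metric}={scorecard[r.metric]} -> page {r.owner}: {r.first_action}"
        for r in rules
        if r.metric in scorecard and r.tripped(scorecard[r.metric])
    ]
```

For example, `evaluate({"cache_hit_rate": 0.6, "eval_pass_rate": 0.99})` yields a single alert for the hit rate, addressed to its owner with the first step already spelled out.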
Frequently asked questions
What cache hit rate should I aim for?
There is no universal target, but for a long-running agent the great majority of prefix tokens should be cache reads after the first turn, since the prefix is stable by design. The number that matters is the trend: a stable, high hit rate is healthy, and a sudden drop is your most reliable early warning that something volatile entered the prefix.
Why is cost per call a bad metric for agents?
Because an agent completes work across many calls, and caching makes later calls in a task far cheaper than the first. Cost per call hides that structure. Cost per completed task captures the real economics and is the unit that actually changes when caching works or breaks.
How do I catch a silent quality regression from a cache change?
Run a small fixed eval suite against production continuously, scoring known-good tasks. If a caching or prompt change quietly degrades behavior, the eval pass rate drops even though nothing errors. Tie that pass rate to the prefix version so you know exactly which change to roll back.
Bringing agentic AI to your phone lines
Measuring agents in production is exactly how you keep live phone coverage reliable. CallSphere instruments its voice and chat agents for cost-per-resolution, latency, and answer quality so every call is handled well and any regression is caught fast. See the metrics-driven approach at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.