How caching shipped a Claude Code test-fixing agent
End-to-end walkthrough of a Claude Code flaky-test agent that only shipped once prompt caching fixed the multi-turn loop economics.
It is easy to talk about prompt caching in the abstract. It is more useful to watch it decide whether a real project ships or stalls. So let me walk through one end to end: a team that wanted a Claude Code agent to triage and fix a backlog of flaky integration tests across a large monorepo. The problem looked like a model problem. It turned out to be a caching problem, and solving the caching is what turned a $9-a-run prototype into something they could run on every pull request.
The point of this walkthrough is not the test-fixing itself. It is to show the exact sequence — problem, naive build, the wall they hit, the cache redesign, and the shipped outcome — so you can recognize the same arc in your own work and skip a few of the painful steps.
The problem they actually had
The monorepo held roughly two million lines across a dozen services, and the integration suite produced a steady trickle of flaky failures: tests that passed locally and failed in CI for reasons buried in timing, fixtures, and shared state. Engineers spent hours per week chasing these. The team wanted an agent that, given a failing test, could read the relevant code and test, form a hypothesis, and propose a fix as a pull request.
The context the agent needed was large and mostly stable: a description of the repository layout, the testing conventions, the common flakiness patterns the team had documented, the MCP tools for running tests and reading files, and a set of skills for the most common fix types. Loaded together, that was a sizable prefix — well over a hundred thousand tokens — and the agent needed to consult it on every single turn of a multi-step debugging session.
The naive build and where it hit the wall
The first version worked and was unaffordable. Each debugging session ran dozens of turns, and every turn re-sent the entire context at full token price. A single test fix cost several dollars, and the team wanted to run the agent against hundreds of flaky tests and on every PR. The math did not close. Worse, latency was bad: re-processing the full prefix each turn made the agent feel sluggish, which killed developer trust.
This is the wall almost every serious agent hits. The capability is there; the unit economics are not. The instinct at this point is to shrink the context — trim the instructions, drop some skills, summarize the repo map. That is the wrong move, because it trades away the very grounding that made the agent good. The right move is to stop paying full price for the part that never changes.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Flaky test fails in CI"] --> B["Agent loads cached prefix:
repo map + conventions + tools + skills"]
B --> C["Read failing test & code
(volatile suffix)"]
C --> D{"Hypothesis
formed?"}
D -->|No| E["Run test with tracing
via MCP tool"] --> C
D -->|Yes| F["Apply fix on branch"]
F --> G["Re-run test"]
G -->|Still fails| C
G -->|Passes| H["Open PR with
explanation"]
The diagram shows why caching is the linchpin. The whole loop from C back to itself runs many times per test, and each iteration only changes the volatile suffix — the latest test output and file reads. The big prefix at step B is identical every time. If you pay for it once and reuse it, the loop becomes cheap. If you do not, every iteration of the loop is a fresh full-price call.
The cache redesign that unblocked it
The fix was to restructure the prompt so the stable grounding sat in one contiguous block at the front, marked as cacheable, with a clean breakpoint before the volatile region. Everything that changed per turn — the current test, the latest command output, the working hypothesis — moved to the end. They also pulled a few volatile details out of the prefix entirely: the list of currently-failing tests was fetched via a tool at runtime rather than baked into the context, so it could not invalidate the cache.
The effect was immediate. The first turn of a session paid full price to warm the cache; every subsequent turn paid a fraction for the cached prefix and full price only for the small suffix. Because a debugging session is many turns, the average cost per session dropped by a large multiple, and latency improved because the prefix no longer had to be reprocessed each turn. The agent felt faster and cost a fraction as much, without removing any of the grounding that made it accurate.
Shipping it on every pull request
With the economics fixed, the team wired the agent into CI. When a PR introduced or surfaced a flaky test, the agent ran automatically, attempted a fix in a sandboxed branch, and either opened a follow-up PR with its proposed change and reasoning or posted a comment explaining what it found if it could not fix it confidently. A human reviewed every proposed fix; the agent never merged on its own.
The outcome that mattered was not a flashy benchmark. It was that flaky-test toil dropped sharply, the team trusted the agent enough to leave it on, and the cost stayed low enough that nobody questioned the line item. None of that was possible at the naive build's price point. Caching was the difference between a demo and a shipped tool.
The lessons that generalize
Three things from this project transfer to almost any agent. First, design the prompt around what changes, not around what reads nicely: stable grounding in front, volatile state behind a cache breakpoint. Second, move anything that expires out of the cache and into a tool call, so freshness and cache reuse stop fighting each other. Third, judge readiness on cost-per-completed-task in a realistic loop, not on the cost of a single call, because the loop is where caching pays off and where naive estimates mislead you.
The meta-lesson is the one the title promises. The team's hard problem was never "can Claude fix a flaky test." It could from day one. The hard problem was making the loop cheap and fast enough to run thousands of times, and that was entirely a caching and context-layout problem dressed up as a model problem.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
The unglamorous work that made it stick
The cache redesign got the agent into CI, but a few unglamorous decisions are what kept it there. The first was bounding the agent's authority. It ran in a sandboxed branch, could not push to main, and could not touch infrastructure. The worst case for a wrong fix was a bad PR a human declined to merge, which is a tolerable failure mode. Pairing the cheap loop with tight scope is what let the team leave the agent on without anxiety.
The second was a small continuous eval. The team kept a fixed set of historical flaky tests with known good fixes and ran the agent against them on a schedule, tracking the pass rate against the current prefix version. When someone later tried to trim the repo conventions out of the cached prefix to save a few thousand tokens, the eval pass rate dropped and the change was reverted before it shipped. That tripwire is what protected the quality the caching work had unlocked.
The third was attribution. Every PR the agent opened recorded the prefix version that produced it. When a batch of fixes looked off, the team could point to the exact context version responsible instead of guessing. None of this is glamorous, but it is the difference between an agent that survives in production and a demo that quietly gets switched off after the third confusing incident.
Frequently asked questions
Why not just use a smaller context to save money?
Because the large context is what makes the agent accurate. Trimming grounding to cut cost trades quality for price. Caching gives you both: you keep the full grounding but only pay full price for it once per session, then a fraction on every reuse. Shrinking the prefix is solving the wrong problem.
How much did caching actually save in this case?
The exact figure depends on session length, but because debugging sessions ran many turns and the prefix dwarfed the per-turn suffix, the per-session cost dropped by a large multiple — enough to move from "too expensive to run broadly" to "fine to run on every PR." The longer the session, the bigger the win.
Does this pattern only apply to coding agents?
No. Any agent with a big stable context and a long multi-turn loop benefits identically — support agents, research agents, voice agents. The shape is universal: pay once for the durable grounding, reuse it across the loop, keep volatile facts in tool calls.
Bringing agentic AI to your phone lines
The same loop economics decide whether a live phone agent is viable. CallSphere builds voice and chat agents that hold a rich, cached operating context across long conversations while staying fast and affordable, calling tools mid-call and booking work 24/7. See the walkthrough live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.