Keeping Claude Agent Runs Cheap: Caching and Batching
Token-cost engineering for Claude agents: prompt caching, the Message Batches API, and context discipline to keep coding runs fast and inexpensive.
Claude's strength on coding tasks is easy to fall in love with and easy to overspend on. An agent that reads the whole repo, reasons across a million tokens of context, and runs dozens of tool calls per task is genuinely capable — and if you run it naively, the bill grows faster than the value. The teams that ship agentic features profitably are not the ones with the cheapest model; they are the ones who treat tokens as a budget they actively manage.
This post is a working engineer's guide to keeping Claude agent runs cheap and fast without dumbing them down. We will cover the three levers that move cost the most — prompt caching, batching, and context discipline — and exactly when each one applies. The goal is concrete: by the end you should be able to look at a run and know where the tokens are going and what to do about it.
Key takeaways
- Prompt caching is the highest-leverage cost lever for agents because the long, stable prefix (system prompt, tools, repo context) is re-sent on every turn.
- Cache hits are dramatically cheaper than fresh input tokens, so structure your prompt with the stable parts first and the changing parts last.
- The Message Batches API trades latency for a large discount on throughput work — ideal for evals, bulk classification, and offline runs, not for interactive turns.
- Context discipline (summarize, prune, scope retrieval) attacks cost at the source by sending fewer tokens in the first place.
- Pick the smallest model that passes your eval bar per task; route the easy turns to Haiku and reserve Opus for the hard reasoning.
Where the tokens actually go
Before optimizing, measure. In a typical Claude coding agent, the input tokens dwarf the output tokens, and the input is dominated by the same content sent over and over: the system prompt, the full set of tool definitions, and whatever repo or document context you loaded. Every turn of a multi-turn run re-sends that prefix plus the growing transcript. A ten-turn task can send your 30,000-token prefix ten times — 300,000 input tokens — even though it never changed.
That repetition is the opportunity. A useful framing: prompt caching is a mechanism that lets the model reuse the computation for an unchanged prefix of the prompt across requests, charging cached tokens at a steep discount instead of full price. Once you see that the prefix is both large and stable, the optimization writes itself — make the prefix cacheable and keep it stable so it stays cached.
The diagram below shows how a turn routes depending on whether its prefix is already cached, and where batching diverts work that does not need to be interactive.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["New agent turn"] --> B{"Stable prefix cached?"}
B -->|Yes| C["Reuse cache: pay discounted rate"]
B -->|No| D["Write cache: pay full + small write cost"]
C --> E{"Latency-sensitive?"}
D --> E
E -->|Yes| F["Live request, stream result"]
E -->|No| G["Queue in Message Batches API"]
G --> H["Collect results, pay batch discount"]Prompt caching: the first thing to turn on
For agents, prompt caching is not a nice-to-have; it is the difference between a feature that pencils out and one that does not. The mechanics are simple: you mark a stable boundary in your prompt, and subsequent requests that share that exact prefix read it from cache instead of reprocessing it. Cache reads cost a small fraction of fresh input tokens, and the only catch is that the cache entry expires after a short idle window, so it pays off most when turns come in quick succession — exactly the pattern of an agent loop.
The rule that unlocks it: order your prompt from most stable to least stable. Put the system prompt and tool definitions first, then long-lived context like loaded files, then the volatile conversation last. Mark the cache breakpoint after the stable block. Here is the shape of a cached request.
{
"model": "claude-sonnet-4-6",
"system": [
{
"type": "text",
"text": "<long stable system prompt + repo conventions>",
"cache_control": { "type": "ephemeral" }
}
],
"tools": [ /* stable tool defs, also covered by the cached prefix */ ],
"messages": [ /* only this grows each turn */ ]
}The mistake that silently disables caching is changing the stable block between turns — injecting a timestamp, reshuffling tool order, or appending a per-turn note into the system prompt. Any byte difference in the prefix is a cache miss and you pay full price. Keep the cached block byte-for-byte identical across turns and put everything dynamic into the messages.
Batching: trade latency for a discount
Not every Claude call needs to answer in two seconds. Eval suites, bulk code review across hundreds of files, offline data extraction, nightly summarization — these are throughput problems, not interactive ones. The Message Batches API is built for exactly this: you submit many requests as a batch, the work runs asynchronously, and you collect results when they are ready, at a substantial discount versus live requests.
The decision is about latency tolerance. If a human is waiting on the result, run it live and stream it. If a machine will consume the result minutes or hours later, batch it. A common high-value pattern is running your whole eval set as a batch every night: you get the cost discount and the latency is irrelevant because nobody is watching the run in real time.
| Workload | Run it live? | Why |
|---|---|---|
| Interactive coding turn | Yes, with caching | A human is waiting; latency matters |
| Nightly eval suite | No, batch it | No one is waiting; take the discount |
| Bulk file classification | No, batch it | Throughput job, latency-insensitive |
| Live chat tool call | Yes, with caching | Round-trip latency is user-visible |
Context discipline: send fewer tokens
Caching and batching make the tokens you send cheaper; context discipline sends fewer of them. This is where the largest sustainable savings live, because it also makes runs faster and often more accurate — a bloated context can bury the relevant detail. The instinct to dump the entire repo into context "just in case" is the single most expensive habit in agent engineering.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Three practices pay off immediately. Scope retrieval: instead of loading every file, let Claude search and read only what it needs for the current step. Summarize on long runs: when a transcript grows past a threshold, replace old turns with a compact summary of decisions and findings so the model keeps the gist without carrying the full history. And prune tool output: a 5,000-line log compressed to the relevant 20 lines saves tokens on this turn and every cached turn after it.
Common pitfalls
- Breaking the cache with dynamic prefixes. A timestamp or per-turn note in the system block means a cache miss every turn. Keep the cached prefix byte-identical.
- Batching latency-sensitive work. If a user is waiting, batching just makes them wait longer for no benefit. Reserve batches for offline throughput.
- Running everything on Opus. Many turns are easy. Route routine turns to Haiku or Sonnet and reserve Opus for genuinely hard reasoning; measure the quality difference before assuming you need the biggest model.
- Never summarizing long runs. An agent that carries a 200-turn raw transcript pays for all of it on every turn. Summarize and prune.
- Optimizing without measurement. Track input vs. output tokens and cache-hit rate per run before you tune anything; you will usually find the cost is somewhere you did not expect.
Cut agent cost in five steps
- Instrument every run to report input tokens, output tokens, cache reads, and turn count.
- Reorder prompts stable-first and add a cache breakpoint after the system prompt and tool definitions.
- Verify cache-hit rate climbs across turns; if it does not, hunt down what is mutating the prefix.
- Move every latency-insensitive workload (evals, bulk jobs) onto the Message Batches API.
- Add context discipline — scoped retrieval, summarization, pruned tool output — and route easy turns to a smaller model.
Frequently asked questions
How much can prompt caching actually save on an agent?
It depends on prefix size and turn count, but agents are the ideal case because the large prefix repeats every turn and the turns arrive quickly enough to stay within the cache window. The longer the run and the larger the stable context, the bigger the win — this is usually the first optimization worth doing.
Can I combine caching and batching?
Yes, and you should where it fits. Caching cuts the per-request input cost; batching cuts the rate for non-interactive jobs. They address different axes, so a batched eval run with a cached stable prefix gets both discounts.
Will summarizing the transcript hurt accuracy?
Done carelessly it can drop a detail the model later needs. Done well — summarizing decisions, open questions, and findings while keeping recent turns verbatim — it usually helps, because the model is not distracted by stale history. Test it against your evals before trusting it on hard tasks.
Should I always use the cheapest model to save money?
No. Use the cheapest model that clears your quality bar for that specific task. Route easy turns down to Haiku and keep Opus for the hard reasoning steps; a failed cheap run that has to be redone on Opus costs more than just using Opus once.
Bringing agentic AI to your phone lines
Cost discipline is not academic when an agent runs thousands of live conversations a day. CallSphere applies the same caching and context-trimming patterns to voice and chat agents so they answer every call fast and affordably, use tools mid-conversation, and book work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.