Cutting Claude agent token costs: caching and batching
Slash Claude Code agent costs with prompt caching, model tiering, batching, and lean context. A practical guide to keeping agentic runs cheap and fast.
The first invoice was a wake-up call. My six-week app worked, users liked it, and the token bill was three times what I had budgeted. I had built it the obvious way — every request rebuilt the full context, every step used the strongest model, and the agent re-read the same large reference files on every turn. None of that was necessary. Over the next two weeks I cut the cost by more than half without touching a single feature, just by treating tokens as a resource to engineer rather than an afterthought.
Where the money actually goes
Before optimizing anything, I instrumented the agent to log tokens per step. The result was clarifying: the bulk of my spend was not the model thinking hard about novel problems. It was the same large, unchanging context — system instructions, tool definitions, a few reference documents — being sent again and again on every single turn of every single conversation. The novel part of each request was tiny. The repeated part was enormous, and I was paying full price for it every time.
Token cost optimization for agents is the practice of minimizing how many tokens each step consumes by reusing stable context, choosing the right model for the work, and avoiding redundant tool calls. Once I saw the breakdown, the strategy wrote itself: stop paying repeatedly for things that do not change, and stop using a heavyweight model for lightweight steps.
Prompt caching: pay once for the stable parts
The single biggest lever was prompt caching. A large agent prompt is mostly stable — the same system instructions and tool schemas across every turn — with only a small dynamic tail that changes. Prompt caching lets you mark that stable prefix so it is processed once and then reused at a steep discount on subsequent calls, as long as the prefix is identical. Structuring the prompt so the unchanging content sits at the front and the volatile content sits at the back is what makes caching pay off.
This had an architectural consequence I did not expect: it rewarded stability. Anything that perturbed the prefix — a timestamp injected near the top, reordered tool definitions, a per-request greeting — broke the cache and quietly doubled cost. So I moved all volatile content to the tail and froze the prefix. The lesson generalizes: design your context so the expensive, reusable part never moves.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Incoming request"] --> B{"Stable prefix cached?"}
B -->|Yes| C["Reuse cached prefix at a discount"]
B -->|No| D["Process full prefix once & cache it"]
C --> E{"Step complexity?"}
D --> E
E -->|Simple| F["Route to Haiku"]
E -->|Hard| G["Route to Sonnet or Opus"]
F --> H["Return result & log tokens"]
G --> H
Model tiering: stop using a sledgehammer for everything
My second mistake was using the most capable model for every step. In an agentic workflow, most steps are not hard. Classifying an intent, extracting a field, deciding which of three branches to take — these are easy, and a smaller, faster model handles them at a fraction of the cost and latency. The 2026 Claude family makes this practical: route trivial steps to Haiku 4.5, route most work to Sonnet 4.6, and reserve Opus 4.8 for the genuinely hard reasoning where its capability earns its price.
The practical pattern is a cheap router. A small model looks at the task and decides whether it is simple enough to handle itself or needs to escalate. Most traffic never escalates, so most traffic is cheap and fast. I was surprised how much of my workload turned out to be routine once I measured it — the hard problems were a small minority, and I had been paying premium rates on all of it.
Batching and avoiding redundant work
The third lever was eliminating redundancy. My agent re-read the same reference file on nearly every turn because nothing told it the content was stable. Caching the file once per session and passing a compact summary thereafter cut a surprising amount of waste. Where I had many similar independent tasks — enriching a list of records, for instance — batching them into fewer, larger calls beat issuing one call per item, because the fixed per-call overhead got amortized.
There is also a multi-agent caution here. Spawning parallel subagents is powerful, but multi-agent runs typically use several times more tokens than a single agent, because each subagent carries its own context. That is sometimes worth it for genuinely parallel research. It is wasteful when one agent could have done the job linearly. I made a rule: reach for multiple agents only when the work is truly independent and the latency win justifies the token premium.
Keeping context lean as the app grows
Even with caching and tiering, context quietly bloats over time. Long conversations accumulate history; agents pull in more reference material than they need. The fix is active context management. I summarize older turns once they are no longer load-bearing, so the agent keeps the gist without re-sending every word. I keep tool results out of the running context once they have been used. And I lean on Skills to load instructions only when relevant, rather than stuffing every possible instruction into the base prompt where it costs tokens on every call whether used or not.
The mental model that helped most was thinking of context as a working set, not an archive. The agent needs what is relevant right now, retrieved or summarized on demand — not the entire history of the session shipped on every turn. Lean context is cheaper, and it is also more reliable, because a smaller, sharper context produces better decisions than a sprawling one.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
How much can prompt caching realistically save?
It depends on how much of your prompt is stable, but agent workloads are unusually cacheable because system instructions and tool schemas repeat on every turn. When the stable prefix dominates, caching can cut the cost of that portion dramatically. The key is keeping the prefix byte-identical across calls so the cache actually hits.
Does using a cheaper model hurt quality?
Only if you route hard steps to it. The win is matching model to task: a smaller model for classification and extraction, a stronger one for genuine reasoning. A cheap router that escalates when needed gives you most of the savings with little quality loss, because the easy steps were never the ones at risk.
When is a multi-agent setup worth the extra tokens?
When subtasks are genuinely independent and running them in parallel meaningfully cuts wall-clock time. Multi-agent runs cost several times more tokens than a single agent, so use them deliberately. If one agent could do the work sequentially without an unacceptable delay, that is usually the cheaper and simpler choice.
What is the first thing I should measure?
Tokens per step, broken down into stable versus dynamic content. That single view tells you whether to invest in caching, model tiering, or context trimming first. Most teams discover their spend is dominated by repeated, unchanging context — which is the easiest thing to fix.
Bringing agentic AI to your phone lines
Keeping runs cheap and fast is exactly what makes real-time voice viable — every wasted token is latency a caller feels. CallSphere brings these same agentic efficiency patterns to voice and chat, with assistants that respond instantly, use tools mid-call, and stay economical at scale. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.