Cut Claude Code Token Costs: Caching and Batching
Cut Claude Code agent costs with prompt caching, batching, context hygiene, and model routing — keep multi-turn runs cheap and fast without losing quality.
An agent that works is not the same as an agent you can afford to run a thousand times a day. The moment a Claude Code workflow moves from a demo to production traffic, the conversation shifts from "does it complete the task" to "what does each completion cost and how long does it take." I've watched a perfectly good agent quietly rack up a token bill ten times larger than it needed, simply because nobody had looked at where the tokens were going. Performance work on agents is mostly the unglamorous discipline of not wasting context.
This post covers the levers that actually move the needle on cost and latency for Claude agents: prompt caching, batching, context hygiene, and model routing. None of them require exotic infrastructure. They require knowing where your tokens go and being deliberate about every one of them.
Know where the tokens go first
Before optimizing, measure. An agent's cost is dominated by input tokens, not output tokens, because every turn re-sends the growing conversation history — system prompt, tool definitions, prior messages, and tool results — to the model. A ten-turn run doesn't pay for the context once; it pays for it ten times, accumulating. This is the single most important fact about agent economics, and it explains why the biggest savings come from the input side.
Instrument each run to record input tokens, output tokens, and cache reads per turn. Once you can see the curve, the waste becomes obvious: a 40,000-token system prompt re-sent on every one of twelve turns is half a million tokens before the agent does anything useful. That is exactly the kind of cost caching was built to erase.
Prompt caching: the highest-leverage lever
Prompt caching lets you mark a stable prefix of your request — system prompt, tool definitions, large reference documents — so that on subsequent calls the model reads it from cache at a steep discount instead of reprocessing it. For agent loops, where the prefix is identical turn after turn, this is transformative. The structure that wins is simple: put everything static at the front and mark the cache boundary, then let only the genuinely new content (the latest user message or tool result) vary at the end.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["New turn"] --> B{"Static prefix unchanged?"}
B -->|Yes| C["Read prefix from cache (cheap)"]
B -->|No| D["Reprocess full prefix (full price)"]
C --> E["Process only new tokens"]
D --> E
E --> F["Model responds"]
F --> G{"More turns?"}
G -->|Yes| A
G -->|No| H["Run complete"]
The discipline caching demands is ordering. If you interleave volatile content into your prefix — a timestamp, a per-turn counter, a freshly shuffled tool list — you invalidate the cache and pay full price every time. Treat the cached prefix as immutable for the duration of a run. Keep tool definitions stable; don't regenerate their descriptions dynamically. Put the conversation's moving parts strictly after the cache breakpoint. Teams that get this right routinely cut input cost on multi-turn runs dramatically, because the expensive part is read cheaply on every turn after the first.
Batching independent work
The second lever is batching. Many agent workloads contain work that is genuinely independent — classify these 500 tickets, summarize these 200 documents, extract fields from these 1,000 records. Running those one synchronous call at a time is slow and gives up throughput. Batch processing lets you submit a large set of independent requests for asynchronous handling, typically at a meaningful discount and without holding open a live connection per item.
The rule for batching is that the items must not depend on each other. A conversational agent turn can't be batched — turn two needs turn one's result. But the fan-out phase of a multi-agent run often can. If an orchestrator spawns ten subagents to investigate ten independent leads, those investigations are batch-friendly. Designing your workflow to separate the sequential spine from the parallelizable leaves is what makes batching possible in the first place.
Context hygiene: stop carrying dead weight
Even with caching, a context that grows without bound gets expensive and, worse, gets worse — models attend less reliably over very long contexts. Practice active context hygiene. When a tool returns a 50,000-token document and the agent only needs three facts from it, don't keep the whole document in the running history; summarize it down to the facts and discard the raw payload. For long-running agents, periodically compact the conversation: replace a stretch of resolved back-and-forth with a concise summary of what was decided.
Retrieval helps here too. Rather than stuffing an entire knowledge base into every prompt, fetch only the relevant chunks per turn. The goal is a context that contains what the current decision needs and little else. Smaller contexts are cheaper, faster, and produce sharper decisions — a rare case where the cost optimization and the quality optimization point the same direction.
Route the model to the task
Not every step needs your most capable model. A run might use Opus 4.8 for the hard planning and reasoning steps, then drop to Sonnet 4.6 or Haiku 4.5 for mechanical subtasks — formatting output, classifying a short string, extracting a field. Model routing matches the model's cost to the difficulty of each step. In a multi-agent system this maps cleanly: a capable orchestrator delegates to cheaper, narrowly-scoped subagents that don't need frontier-level reasoning to do their one job.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Be empirical about it. Promote a step to a bigger model only when a cheaper one demonstrably fails your evals on that step, and demote whenever a cheaper model passes. The combination — cache the static prefix, batch the independent work, prune the context, and right-size the model per step — is what turns an expensive prototype into an agent you can run at scale without flinching at the invoice.
Frequently asked questions
What makes Claude agent runs expensive?
Input tokens dominate, because every turn re-sends the entire growing conversation — system prompt, tool definitions, and prior results — to the model. A ten-turn run pays for that context roughly ten times. Reducing and caching the input side is where the savings live.
How does prompt caching reduce cost?
It lets you mark a stable prefix (system prompt, tool definitions, reference docs) so subsequent calls read it from cache at a steep discount instead of reprocessing it. Keep that prefix immutable during a run and put all volatile content after the cache breakpoint, or you invalidate the cache and pay full price.
When should I use batching versus live calls?
Use batching for independent work — classifying, summarizing, or extracting over many items that don't depend on each other — to get higher throughput and a discount. Use live calls for sequential conversation turns, where each step needs the previous step's result.
Does using a cheaper model hurt quality?
Only if you route badly. Use a capable model for hard reasoning and planning steps and a cheaper one for mechanical subtasks, and let evals decide: promote a step to a bigger model only when the cheaper one measurably fails, and demote when it passes.
Agentic AI that pays for itself on every call
Cheap, fast agent runs are exactly what make 24/7 voice automation viable at scale. CallSphere applies these same agentic-AI efficiency patterns to voice and chat — assistants that answer every call and message, call tools mid-conversation, and book work without running up runaway costs. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.