Cutting Claude Code Token Cost: Caching & Batching
Keep Claude Code runs cheap and fast with prompt caching, batched tool calls, context pruning, and model routing across the 1M-token window.
An agent that solves the task but costs ten dollars and four minutes per run is a prototype, not a product. The moment Claude Code becomes part of a real workflow — a CI gate, a support automation, a nightly refactor — token cost and latency stop being abstractions and start being line items. The good news is that most agentic spend is waste, and waste is fixable. This post is about the specific levers that make long-context Claude Code runs cheap and fast without making them dumber.
The core mental model: every token the model reads, on every turn, is paid for again. A 200K-token context that the agent revisits across thirty turns is not a one-time cost: without caching, it is roughly six million input tokens billed over the session. This is why naive long-context usage gets expensive fast, and why the techniques below (caching, batching, pruning, and routing) all aim at the same target: read fewer tokens, fewer times.
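To make that arithmetic concrete, here is a minimal sketch of how input tokens accumulate when the whole context is re-read each turn. The function name and the `growth_per_turn` parameter are illustrative assumptions, not part of any Claude Code API.

```python
def cumulative_input_tokens(context_tokens: int, turns: int, growth_per_turn: int = 0) -> int:
    """Total input tokens read over a session that re-reads the full context every turn."""
    total = 0
    for turn in range(turns):
        # Each turn reads the starting context plus whatever has been appended so far.
        total += context_tokens + turn * growth_per_turn
    return total


# The example from the text: 200K tokens revisited across thirty turns.
print(cumulative_input_tokens(200_000, 30))  # 6000000
```

Even with no growth, the session reads thirty times the context size, and any per-turn growth only pushes the total higher.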
Prompt caching: stop paying for the same prefix
Prompt caching is the single highest-leverage optimization for agentic workloads, because agents are built around a large, stable prefix (system instructions, tool definitions, project context) followed by a small, changing tail. Without caching, that entire prefix is re-processed on every turn. With caching, the stable portion is processed once and then read back at a steep discount on subsequent turns, for as long as the cached entry has not expired.
To benefit, you have to structure the context so the stable parts come first and never change mid-session. Put system prompt, tool schemas, and durable project facts at the front. Keep the volatile material — the latest tool result, the current sub-task — at the end. If you mutate something early in the context, you invalidate the cache from that point onward and pay full price again, so treat the prefix as append-only during a session.
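The append-only rule can be sketched as a toy cost model. Everything here is an assumption made for illustration: the block names, the token counts, and the `cached_rate` of 0.1 are invented, and the model ignores any extra cost of writing to the cache. It only shows why an early edit is so much worse than a late append.

```python
Block = tuple[str, int]  # (text, token_count); counts are made up for illustration


def shared_prefix_tokens(prev: list[Block], new: list[Block]) -> int:
    """Tokens in the leading run of blocks that are identical in both turns."""
    tokens = 0
    for prev_block, new_block in zip(prev, new):
        if prev_block != new_block:
            break  # the first difference invalidates everything after it
        tokens += new_block[1]
    return tokens


def turn_input_cost(prev: list[Block], new: list[Block], cached_rate: float = 0.1) -> float:
    """Input cost in full-price-token units; cached_rate is an assumed discount."""
    total = sum(count for _, count in new)
    cached = shared_prefix_tokens(prev, new)
    return cached * cached_rate + (total - cached)


turn_1 = [("system prompt", 5_000), ("tool schemas", 15_000),
          ("project facts", 30_000), ("tool result 1", 2_000)]
append_only = turn_1 + [("tool result 2", 1_000)]
mutated = [("system prompt, edited", 5_000)] + append_only[1:]

print(turn_input_cost(turn_1, append_only))  # only the new tail is full price
print(turn_input_cost(turn_1, mutated))      # an early edit makes the whole turn full price
```

In this toy setup, editing the first block costs the same as having no cache at all, which is the reason to keep volatile material at the end.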
flowchart TD
A["New turn"] --> B{"Prefix unchanged\nsince last turn?"}
B -->|Yes| C["Read cached prefix\nat discount"]
B -->|No| D["Reprocess prefix\nat full cost"]
C --> E["Process small\nchanging tail"]
D --> E
E --> F["Model responds"]
F --> G["Append result,\nkeep prefix stable"]
G --> A
Batching tool calls to collapse round-trips
Every tool call that has to complete before the next one can start is a serialized round-trip: the model emits a call, waits for the result, reads it, emits the next. When several actions are independent — reading four files, querying three endpoints — running them one at a time multiplies both latency and the number of turns, and every extra turn re-reads the whole context.
The fix is to issue independent tool calls together so they execute in parallel and return in a single batch. Claude Code can emit multiple tool calls in one turn when it recognizes the actions don't depend on each other, but it does this far more reliably when you tell it to. A line in your instructions like "when gathering information from multiple independent sources, request them all in one turn rather than sequentially" measurably reduces turn count. Fewer turns means fewer full-context reads, which is where the savings compound.
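On the execution side, the difference between serialized and batched calls can be sketched with `asyncio`. The `read_file` coroutine below is a hypothetical stand-in for a real tool, with a sleep simulating I/O latency. It is not how Claude Code dispatches tools internally.

```python
import asyncio
import time


async def read_file(path: str) -> str:
    """Hypothetical tool call; the sleep stands in for I/O latency."""
    await asyncio.sleep(0.05)
    return f"contents of {path}"


async def run_sequential(paths: list[str]) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await read_file(p) for p in paths]


async def run_batched(paths: list[str]) -> list[str]:
    # Independent calls are started together and awaited as one batch.
    return list(await asyncio.gather(*(read_file(p) for p in paths)))


paths = ["a.py", "b.py", "c.py", "d.py"]
for runner in (run_sequential, run_batched):
    start = time.perf_counter()
    asyncio.run(runner(paths))
    print(runner.__name__, round(time.perf_counter() - start, 2), "seconds")
```

The batched version finishes in about the time of the slowest single call, and in an agent loop it also produces one turn instead of four.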
Batching also applies at the orchestration layer. If you fan work out to subagents, give each one a self-contained brief so they run concurrently instead of waiting on a shared bottleneck. Multi-agent runs use more tokens overall, so the parallelism has to buy you enough wall-clock speed and quality to justify it — measure both before committing.
Context hygiene: prune before you pay
The 1M-token window is a budget, not a goal. Just because you can stuff a million tokens of logs and source into the context doesn't mean you should — the model pays attention to all of it and you pay for all of it, every turn. The cheapest token is the one you never put in context.
Practical hygiene looks like this: feed the agent summaries instead of raw dumps where a summary suffices; have it read files on demand through tools rather than pre-loading the whole repository; and compact the running history when a session grows long, replacing fifty turns of exploration with a tight summary of what was learned and decided. A definition worth keeping: context engineering is the discipline of deciding what information enters the model's window, in what form, and when — and it is the largest lever on both cost and quality in long-running agents.
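History compaction can be sketched in a few lines. This is a rough illustration, assuming the history is a plain list of turn strings and that you supply your own `summarize` callable (in practice probably a model call). Both are assumptions, not a description of Claude Code's built-in compaction.

```python
from typing import Callable


def compact_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace all but the most recent turns with a single summary entry."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    return [summarize(turns[:split])] + turns[split:]


def naive_summary(old_turns: list[str]) -> str:
    # Placeholder summarizer; a real one would capture what was learned and decided.
    return f"[summary of {len(old_turns)} earlier turns]"


history = [f"turn {i}" for i in range(1, 51)]
compacted = compact_history(history, keep_recent=5, summarize=naive_summary)
print(len(history), "->", len(compacted))  # 50 -> 6
```

The quality of the summary is what decides whether this saves money or just loses information, so the summarizer deserves the same testing as any other step.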
Watch especially for tool results that balloon. A command that prints ten thousand lines, pasted into context, will be paid for on every subsequent turn of the session. Truncate, filter, or summarize large outputs at the tool boundary so the agent gets the signal without carrying the bulk.
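One way to truncate at the tool boundary is to keep the head and tail of a large output and replace the middle with a marker. The function name and the default of twenty lines on each side are arbitrary choices for this sketch.

```python
def clip_output(text: str, head: int = 20, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines; replace the middle with a marker."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


big_log = "\n".join(f"line {i}" for i in range(10_000))
clipped = clip_output(big_log)
print(len(big_log.splitlines()), "->", len(clipped.splitlines()))  # 10000 -> 41
```

Head-and-tail clipping suits build logs, where the command and the final error tend to sit at the ends. For other tools a filter or a summary may preserve more signal.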
Model routing: use the right Claude for the step
Not every step needs the most capable model. The 2026 family — Opus 4.8, Sonnet 4.6, Haiku 4.5 — spans a wide cost-and-capability range, and the cheapest path through a task often routes the hard reasoning to a stronger model and the mechanical steps to a faster, cheaper one. Classification, simple extraction, and routine formatting rarely need top-tier reasoning.
A common pattern is a strong orchestrator that plans and delegates, with lighter subagents doing well-scoped, mechanical work. Another is to start a session on a faster model and escalate only when the task proves genuinely hard. The trap to avoid is routing by guesswork — measure the quality of each step on each model before you downgrade it, because a cheap model that fails and triggers a retry on an expensive one is more costly than just using the expensive one once.
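The retry trap can be put into numbers with a simple expected-cost comparison. This sketch assumes that a failed cheap attempt is always detected, that the strong model then succeeds on one retry, and that costs are in arbitrary units. All three are simplifications for illustration.

```python
def expected_cost_with_escalation(
    cheap_cost: float, strong_cost: float, cheap_success_rate: float
) -> float:
    """Expected cost of trying the cheap model first and escalating on failure."""
    return cheap_cost + (1.0 - cheap_success_rate) * strong_cost


def should_route_to_cheap(
    cheap_cost: float, strong_cost: float, cheap_success_rate: float
) -> bool:
    """True when cheap-first is expected to cost less than going straight to strong."""
    return expected_cost_with_escalation(cheap_cost, strong_cost, cheap_success_rate) < strong_cost


# Made-up numbers: a mechanical step the cheap model nearly always gets right...
print(should_route_to_cheap(cheap_cost=1.0, strong_cost=10.0, cheap_success_rate=0.95))  # True
# ...and a hard step where it usually fails and forces a retry.
print(should_route_to_cheap(cheap_cost=4.0, strong_cost=10.0, cheap_success_rate=0.3))   # False
```

Rearranged, cheap-first wins only when `cheap_cost` is below `cheap_success_rate * strong_cost`, which is why the measured success rate per step matters more than the price gap.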
Measuring what you actually spend
You cannot optimize what you don't measure. Instrument every run to record tokens read and written per turn, cache hit rate, tool-call count, and wall-clock time. The numbers usually reveal that a small fraction of turns — typically the ones that re-read a huge context or loop on a flaky tool — account for most of the spend. Fix those and the bill drops disproportionately.
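A per-turn record and two small helpers are enough to start. The field names and the sample numbers below are invented for this sketch. Populate the record from whatever usage data your own harness exposes.

```python
from dataclasses import dataclass


@dataclass
class TurnMetrics:
    turn: int
    input_tokens: int
    cached_input_tokens: int  # portion of input_tokens served from cache
    output_tokens: int
    tool_calls: int
    seconds: float


def cache_hit_rate(turns: list[TurnMetrics]) -> float:
    """Fraction of all input tokens that were served from cache."""
    total = sum(t.input_tokens for t in turns)
    return sum(t.cached_input_tokens for t in turns) / total if total else 0.0


def top_spenders(turns: list[TurnMetrics], share: float = 0.8) -> list[TurnMetrics]:
    """Smallest set of turns, largest first, covering `share` of total tokens."""
    ranked = sorted(turns, key=lambda t: t.input_tokens + t.output_tokens, reverse=True)
    target = share * sum(t.input_tokens + t.output_tokens for t in turns)
    picked, running = [], 0
    for t in ranked:
        if running >= target:
            break
        picked.append(t)
        running += t.input_tokens + t.output_tokens
    return picked


run = [
    TurnMetrics(1, 1_000, 0, 200, 1, 2.0),
    TurnMetrics(2, 50_000, 0, 300, 3, 9.5),
    TurnMetrics(3, 1_200, 1_000, 100, 1, 1.8),
    TurnMetrics(4, 48_000, 40_000, 400, 2, 7.1),
    TurnMetrics(5, 1_500, 1_200, 150, 1, 2.2),
]
print([t.turn for t in top_spenders(run)])  # [2, 4]
print(round(cache_hit_rate(run), 2))
```

In this invented run, two of five turns carry most of the tokens, and one of them had no cache hits at all, which is where to look first.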
Set a per-run budget and have the agent surface when it's approaching it, so a runaway session halts instead of quietly burning tokens. Combine that with caching, batching, pruning, and routing, and a workflow that cost ten dollars a run routinely falls to a fraction of that — while running faster, because every one of these levers also removes latency.
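A budget guard can be as small as a counter with a warning threshold and a hard stop. The class name, the 80% warning fraction, and the token limit are assumptions for this sketch. Wire `record` into wherever your harness learns how many tokens a turn used.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run has spent more than its token budget."""


class RunBudget:
    def __init__(self, limit_tokens: int, warn_fraction: float = 0.8) -> None:
        self.limit_tokens = limit_tokens
        self.warn_fraction = warn_fraction
        self.spent = 0

    def record(self, tokens: int) -> str:
        """Add a turn's tokens; return 'ok' or 'warn', or raise once over the limit."""
        self.spent += tokens
        if self.spent > self.limit_tokens:
            raise BudgetExceeded(f"spent {self.spent} of {self.limit_tokens} tokens")
        if self.spent >= self.warn_fraction * self.limit_tokens:
            return "warn"
        return "ok"


budget = RunBudget(limit_tokens=1_000)
for turn_tokens in (500, 350, 200):
    try:
        print(budget.record(turn_tokens))
    except BudgetExceeded as err:
        print(f"halting: {err}")
        break
```

The "warn" state is the point at which the agent can be told to wrap up or summarize, so the hard stop becomes a backstop rather than the normal exit.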
Frequently asked questions
What gives the biggest token savings in Claude Code?
Prompt caching, by a wide margin, because agents repeatedly read a large stable prefix of system instructions and tool definitions. Structure the context so stable content comes first and stays unchanged during a session, and the cache discount applies to that prefix on every subsequent turn.
Does the 1M-token context window make runs more expensive?
It can, because every token in context is paid for on every turn it stays there. Treat the window as a budget: prune large tool outputs, summarize long histories, and load files on demand rather than pre-loading everything. The cheapest token is the one you never add.
When should I batch tool calls?
Whenever the actions are independent — reading several files, hitting several endpoints. Issuing them in one turn lets them run in parallel and avoids serialized round-trips, each of which would re-read the entire context. Instruct the agent explicitly to gather independent information in a single turn.
Should I always use the most capable model?
No. Route mechanical steps like classification and formatting to a faster, cheaper model and reserve top-tier reasoning for genuinely hard work. Measure quality per step before downgrading, since a cheap failure that forces an expensive retry costs more than using the strong model once.
Bringing agentic AI to your phone lines
Keeping runs cheap and fast is exactly what makes always-on automation viable, and it's how CallSphere runs voice and chat agents at scale — assistants that answer every call and message, call tools mid-conversation, and book work 24/7 without runaway cost. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.