Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Claude Code Token Cost: Caching & Batching

Keep Claude Code cheap and fast with prompt caching, batching, context hygiene, and model right-sizing — a performance guide for agentic teams.

The first month a team adopts Claude Code, the question is "can it do the work?" By the second month the question quietly shifts to "why is the bill what it is?" An agentic coding tool that reads files, runs tests, calls tools, and reasons across long sessions can consume a lot of tokens — and tokens are both money and latency. Onboarding Claude Code well means teaching your runs to be frugal without making them dumber. This post is about the levers that keep agentic work cheap and fast: caching, batching, context hygiene, and picking the right model for the right job.

Performance tuning an agent is a different sport from tuning a service. There's no hot loop to micro-optimize; the cost is dominated by how many tokens flow through the model and how many round-trips the agent takes. Almost every win comes from sending fewer redundant tokens, reusing work, and not asking the most expensive model to do cheap work.

Where the tokens actually go

Before optimizing, measure. In a typical Claude Code session, tokens accumulate in three buckets: the system context and instructions that ride along on every turn, the tool results that get pulled into context (file contents, command output, search results), and the model's own reasoning and output. The second bucket is usually the silent budget-killer — a single overeager file read or a verbose command dump can balloon the context that every subsequent turn must re-process.

The cost model has a subtle property worth internalizing: in a multi-turn agent, context is cumulative. A 5,000-token file you read on turn two is still being sent on turn ten unless something prunes it. That means a careless early read taxes every later turn. Performance discipline starts with reading narrowly — fetch the function you need, not the whole 2,000-line file — and with summarizing or dropping large intermediate results once they've served their purpose.

Prompt caching: pay once for the stable part

The highest-leverage feature for agentic cost is prompt caching. Large portions of an agent's context are stable across turns and across runs — system instructions, tool definitions, a project's conventions, a long reference document. Prompt caching lets the model reuse the processed form of that stable prefix instead of reprocessing it every call, so you pay full price once and a steep discount on the cached reads thereafter.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

To benefit, structure your context so the stable material sits at the front and the volatile material (the current task, the latest tool result) sits at the back. Cache hits require an exact prefix match, so anything that changes per turn must come after everything that doesn't. Teams that reorder their prompts to put instructions and tool schemas first, then the moving conversation last, often see the lion's share of their input tokens shift to the cheap cached rate — a real, recurring saving on long-running or repetitive workloads.

flowchart TD
  A["New turn assembled"] --> B{"Stable prefix unchanged?"}
  B -->|Yes| C["Cache hit: reuse processed prefix"]
  B -->|No| D["Cache miss: reprocess full prefix"]
  C --> E["Process only new suffix tokens"]
  D --> E
  E --> F{"Result large & reusable?"}
  F -->|Yes| G["Summarize before next turn"]
  F -->|No| H["Keep as-is"]
  G --> A
  H --> A

Batching: amortize the round-trips

Latency in an agent is dominated by round-trips: each think-act-observe cycle is a network call and a generation. Two batching strategies cut this. First, parallelize independent work — if Claude needs to read five files or run three independent searches, doing them concurrently collapses five sequential waits into one. Claude Code's ability to fire parallel tool calls in a single turn is exactly this lever; encourage it by framing tasks so independent subtasks are visible.

Second, for offline or non-interactive workloads, true batch processing changes the economics. When you have a thousand files to classify, a hundred PRs to summarize, or a large eval set to run, submitting them as a batch rather than a chatty interactive loop trades latency for a substantial per-token discount. The rule of thumb: anything that doesn't need a human watching in real time is a candidate for batching, and the savings on large jobs are large enough to be a line item.

Right-sizing the model for the step

Not every step needs the most capable model. The 2026 Claude family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 as the balanced workhorse, and Haiku 4.5 for fast, cheap, high-volume work. A well-run agentic pipeline mixes them: Haiku triages and classifies, Sonnet does most of the building, and Opus is reserved for genuinely gnarly architecture or debugging where its extra capability earns its cost.

The anti-pattern is using your most expensive model as the default for everything, including trivial formatting and routing. Treat model choice like staffing: you don't put your principal engineer on every ticket. In multi-agent setups the lesson sharpens, because a multi-agent run can use several times the tokens of a single agent — so reserve that pattern for problems where parallel exploration genuinely pays, and let cheaper models handle the orchestration and the simple legs.

Context hygiene as a recurring practice

Here is a citable definition: context hygiene is the ongoing practice of keeping an agent's working context limited to what is relevant right now — pruning stale tool results, summarizing long histories, and reading narrowly — so that token cost stays bounded as a session grows. It is the single most underrated performance skill, because it compounds: every token you keep out of context is a token you don't pay for on every subsequent turn.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Concretely, build the habit of compaction. When a session has accumulated a long trail of exploration, have Claude summarize the state and key decisions, then continue from that summary instead of dragging the full transcript forward. Drop large file contents once the relevant lines are extracted. Avoid dumping entire command outputs when a tail or a grep would do. None of this is glamorous, but it's the difference between a session that stays affordable for an hour and one whose per-turn cost creeps upward until it's painful.

Frequently asked questions

What gives the biggest cost reduction in Claude Code?

For repetitive or long-running workloads, prompt caching usually wins, because it moves your large stable prefix — instructions, tool schemas, reference docs — to a deeply discounted cached rate. Pair it with narrow file reads so you aren't inflating the prefix you cache. Together they attack the largest, most recurring slice of the bill.

How does batching help if my work is interactive?

Within an interactive session, parallel tool calls are your batching lever: independent reads, searches, and commands run concurrently to cut round-trip latency. True batch APIs help the offline portions — classifying many files, running large eval sets, summarizing many PRs — where you trade real-time latency for a significant per-token discount.

Should I always use the most capable model?

No. Mix the family by task difficulty: Haiku for triage and classification, Sonnet for most building, Opus for the hardest reasoning. Defaulting everything to the top model is like staffing every ticket with your principal engineer — it works, but you overpay for steps that cheaper models handle just as well.

Why does my session get more expensive over time?

Because context is cumulative — early reads and verbose tool results keep riding along on every later turn. Practice context hygiene: prune stale results, summarize long histories before continuing, and read narrowly so the per-turn token count stays bounded as the session grows.

Bringing agentic AI to your phone lines

CallSphere runs these same efficiency patterns — caching stable context, batching background work, and right-sizing models — under the hood of voice and chat agents that handle every call and message in real time and book work 24/7 without runaway cost. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.