Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Claude Code Token Cost in Large Codebases

Prompt caching, batching, and context discipline that keep Claude Code runs cheap and fast on large codebases without sacrificing agentic work quality.

Run an agentic coding tool against a real monorepo for a week and the bill stops being abstract. A single careless run can read the same 8,000-line file three times, re-send a 40-kilobyte system prompt on every turn, and fan out four subagents that each re-discover the same project layout. The work gets done, but you've paid for the same tokens over and over. The good news: nearly all of that waste is structural, and the structure is fixable.

This post is about making Claude Code runs cheap and fast on large codebases. The three biggest levers are prompt caching, batching, and ruthless context discipline. Get those right and the same task that cost dollars and minutes drops to cents and seconds.

Where the tokens actually go

Before optimizing, look at a real trace. In a large-repo run, token spend clusters in a few places. The static preamble — system prompt, tool definitions, project conventions, loaded skills — is re-sent on every model call, and there can be dozens of calls in one task. Retrieved code is the next big bucket: every file read, grep result, and directory listing lands in context and stays there. Then there's the conversation tail, which grows with each turn until a 40-step run is dragging a huge history into every prediction.

The key insight is that input tokens dominate. Output — the patches Claude writes — is usually a small fraction of the total. So the cheapest run is not the one that writes less code; it's the one that reads each thing once, reuses its stable prefix, and never carries dead context forward. That reframing points straight at the three levers.

Lever one: prompt caching

Prompt caching is the highest-leverage optimization for agentic work, and it is almost free to adopt. Anthropic's API lets you mark a stable prefix of the prompt — system instructions, tool schemas, large reference files — so that on subsequent calls the model reuses the cached computation instead of re-processing those tokens. Cached input tokens are billed at a steep discount and processed faster, which matters enormously in a loop that calls the model 30 times with the same 30-kilobyte preamble.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Build prompt"] --> B["Stable prefix:\nsystem + tools + skills"]
  B --> C{"Cache hit on prefix?"}
  C -->|Yes| D["Reuse cached tokens\n(cheap + fast)"]
  C -->|No| E["Process & write cache"]
  D --> F["Append turn-specific\ncontext & tool results"]
  E --> F
  F --> G["Model produces next action"]
  G --> A

The discipline that makes caching pay off is prompt ordering: put everything stable first and everything volatile last. If you interleave a changing timestamp or a per-turn note into the middle of your system prompt, you invalidate the cache below that point on every call and lose the benefit. Structure the prompt as fixed-prefix then variable-suffix, and keep large, reusable reference material — a coding-standards doc, an architecture overview — inside the cached region so the agent pays for it once per session, not once per turn.

Lever two: batching

Batching attacks a different waste: round-trips. Every time the agent reads one file, thinks, then reads another, you pay the full context cost twice and add latency for two model calls. When the agent already knows it needs five files — say, a component and its four imports — fetching them in one batched operation collapses that into a single round-trip with one shared context pass.

This is where parallel subagents earn their keep, but you have to spend them wisely. Spawning subagents for genuinely independent chunks — "audit the payments module" alongside "audit the auth module" — is true parallelism that shortens wall-clock time. Spawning them for tasks that all need the same shared context just multiplies the preamble cost, since each subagent re-establishes its own working picture. The rule: parallelize across independent regions of the codebase, and keep tightly coupled work inside one agent where the context is already loaded. Multi-agent runs can consume several times the tokens of a single agent, so reach for them deliberately, not reflexively.

Lever three: context discipline

The slowest, most expensive runs are the ones that never let go of anything. Claude reads a giant file to check one function, and now that whole file rides along in context for the next 25 turns. Context discipline means actively keeping the working set small. Read slices, not whole files, when you know the line range. Prefer tools that return structured, compact results — a "find references" call that yields four locations beats a grep that dumps 300 lines. And summarize-then-discard: when a subagent finishes mapping a module, fold its findings into a two-paragraph summary and drop the raw exploration.

Claude Code's large context window is a tool, not a license. A 1M-token window means you can hold a lot, but holding it is what you pay for and what slows each prediction. The fastest agents treat context like a hot cache: small, fresh, and aggressively evicted. When a task naturally splits, hand the next phase to a fresh agent with a clean summary rather than dragging the entire history forward.

Measuring and gating cost

You cannot optimize what you do not measure. Log token usage per run, broken out by cached versus uncached input and by output. Watch the ratio of cache hits — if a long session shows a low hit rate, your prompt ordering is probably invalidating the prefix. Track files-read-per-task; a number that climbs over time signals re-reading and weak context discipline. For CI-driven agent jobs, set a token budget and have a hook halt or escalate when a run blows past it, so a single looping job can't run up a surprise bill overnight.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

None of this trades quality for cost. Caching changes nothing the model sees; batching and context discipline actually improve output by keeping the working picture clean. Cheap and fast and good are, for once, the same direction.

Frequently asked questions

How much can prompt caching save?

It depends on how much of your prompt is stable and how many model calls a task makes, but for agentic loops with a large fixed preamble, cached input tokens are billed at a steep discount and reused on every turn — so multi-turn runs often see most of their input cost collapse onto the cached prefix.

Do parallel subagents save money?

They save wall-clock time, not necessarily tokens. Each subagent establishes its own context, so multi-agent runs typically use several times more tokens than a single agent. Use them for genuinely independent regions of the codebase, and keep coupled work in one agent.

What is prompt caching, exactly?

Prompt caching is a feature that stores the processed form of a stable prompt prefix so repeated calls reuse it at reduced cost and latency instead of reprocessing those tokens. It pays off most when a fixed system prompt and tool set are sent across many turns.

Does keeping context small hurt the agent's reasoning?

No — it usually helps. A smaller, fresher working set keeps the model grounded and reduces loops and hallucinated arguments. Summarize finished exploration and discard the raw output rather than carrying everything forward.

Bringing agentic AI to your phone lines

Cheap, fast, disciplined agent runs are exactly what real-time voice demands. CallSphere brings these same caching and context patterns to voice and chat agents that answer every call, act on tools mid-conversation, and stay responsive at scale. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.