Cutting Claude Code Token Costs: Caching & Batching

Agentic runs have a way of surprising you on the invoice. A single Claude Code workflow that rebuilds a slice of your go-to-market machine can read dozens of files, call tools repeatedly, and carry a fat context through many turns — and every one of those turns re-sends a pile of tokens to the model. The work is genuinely valuable, but if you never look at the cost curve, a workflow that ran for pennies in a demo can quietly cost real money when it runs hundreds of times a day in production. The good news is that most agent cost is structural and very compressible once you understand where the tokens actually go.

This post is about keeping Claude Code runs cheap and fast at the same time — because in agentic systems, cost and latency are the same problem wearing two hats. Tokens you do not send are both money you do not spend and milliseconds you do not wait. We will cover the four biggest levers: prompt caching, batching, context scoping, and model selection, and how they interact when you are rebuilding a team's workflows for the long haul.

Where the tokens actually go

The first instinct is to blame the model's output, but in agent workflows output is rarely the cost driver. The driver is input tokens re-sent across turns. Every time the agent takes an action, the entire conversation so far — system prompt, tool definitions, prior reasoning, every tool result — is sent again so the model can decide the next step. A ten-turn run does not send the context once; it sends an ever-growing context ten times. That accumulation is why a long agent loop can cost far more than its output length suggests.

So the question for cost work is not "how do I make the model talk less" but "how do I stop re-paying full price for the same input on every turn, and how do I keep that input from ballooning in the first place." Prompt caching answers the first; context scoping answers the second.

Prompt caching: stop paying twice for the stable parts

Prompt caching lets you mark the stable prefix of your prompt — system instructions, tool definitions, large reference documents, your skill content — so that on repeat calls the model reuses the already-processed version instead of reprocessing it from scratch. Cached input tokens are billed at a steep discount compared to fresh ones, and they also process faster, which cuts latency. In an agent loop where the system prompt and tool schemas are identical on every single turn, caching that prefix is close to free money.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A["Turn starts"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Reuse cached tokens, cheap & fast"]
  B -->|No| D["Process full prefix, write cache"]
  C --> E["Append only new turn tokens"]
  D --> E
  E --> F["Model decides next action"]
  F --> G{"More turns?"}
  G -->|Yes| A
  G -->|No| H["Run complete"]

The practical rule is to order your prompt from most stable to most variable. Put the fixed system prompt and tool definitions first, then long shared context, then the per-run specifics, then the live conversation. Caching keys off a stable prefix, so anything you change near the top invalidates the cache for everything after it. A common mistake is interpolating a timestamp or a record ID into the system prompt — that one variable token at the top can quietly defeat caching for the whole run. Keep the volatile bits at the bottom where they belong.

Batching: amortize the fixed costs

Batching is the second lever, and it works on two levels. At the data level, if your workflow processes many independent items — scoring a list of leads, classifying a stack of support tickets, drafting a set of follow-ups — do not run a fresh agent per item when one agent can handle a sensible batch in a single context that pays the system-prompt and tool-definition cost once. At the request level, when you have a large volume of non-urgent calls, an asynchronous batch path trades immediacy for a meaningful per-token discount, which is ideal for overnight enrichment or weekly report generation that no human is waiting on.

The judgment call is batch size. Too small and you waste the fixed overhead on every item; too large and a single batch's context grows so big that quality drops and the run becomes fragile — one bad item can derail the rest. For most GTM tasks, batches of a few dozen items per agent context hit a sweet spot, with truly large jobs split across parallel subagents so total wall-clock time stays low even as throughput climbs.

Context scoping: the cheapest token is the one you never send

Caching makes re-sent tokens cheaper; scoping prevents them from existing. The biggest waste in real agent workflows is dragging irrelevant material through the whole run — pasting an entire 200-row export when the agent needs four columns, or keeping verbose tool results in context long after they have been used. Trim tool outputs to what the next decision actually requires. Summarize and discard intermediate results once they have served their purpose. When a workflow has natural phases, consider letting a subagent do a heavy, context-hungry phase and return only a compact summary to the orchestrator, so the expensive context dies with the subagent instead of bloating the main run.

This is also where Agent Skills and MCP pay off on cost, not just capability. A skill loads its detailed instructions only when the task is actually relevant, rather than parking a giant always-on instruction block in every prompt. That lazy loading keeps the baseline context small, which compounds with caching to keep per-turn cost low across a long run.

Model selection: right-size the brain for the step

Not every step deserves your most capable model. The Claude family spans tiers — a high-capability model like Opus for hard reasoning and orchestration, a balanced model like Sonnet for most production work, and a fast, inexpensive model like Haiku for high-volume, well-defined steps like classification, extraction, or routing. A mature GTM workflow mixes them: a strong model plans and handles ambiguity, while cheaper, faster models do the bulk grunt work under its direction. Routing the easy 80 percent of calls to a smaller model and reserving the expensive model for the genuinely hard 20 percent often cuts total cost dramatically with no visible quality loss, because the smaller models are very good at narrow, well-specified jobs.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Measure, then optimize

Do not guess at any of this. Log per-run input tokens, output tokens, cache hit rate, and tool-call counts, and look at the distribution, not the average — a few pathological runs usually dominate the bill. When you find an expensive run, read its transcript and ask which tokens earned their place. Often the fix is unglamorous: a tool returning ten times more data than the model uses, a system prompt that lost cache because of a stray variable, or a loop that should have been three turns and took twelve. Cost optimization in agentic systems is mostly disciplined measurement plus the four levers above, applied where the data says they will pay off.

Frequently asked questions

Does prompt caching change the model's answers?

No. Caching reuses the already-processed representation of identical input tokens; the model sees the same content it would have otherwise. It changes price and speed, not the output, as long as the cached prefix is genuinely identical between calls.

When should I use batch processing instead of real-time calls?

Use the asynchronous batch path whenever no human is waiting on the result — overnight enrichment, scheduled reports, bulk classification. You trade immediate responses for a per-token discount, which is exactly the right trade for background GTM jobs.

What is the single highest-impact cost fix for most agent workflows?

Caching the stable prefix in a multi-turn loop, closely followed by trimming oversized tool results. Together they attack the dominant cost — input tokens re-sent every turn — and usually move the bill more than any prompt rewrite.

Will using a cheaper model for some steps hurt quality?

Not if you route by difficulty. Smaller models excel at narrow, well-specified tasks like extraction and routing; quality only suffers if you hand them open-ended reasoning. Keep the capable model for ambiguity and orchestration and the trade is nearly invisible.

Bringing agentic AI to your phone lines

CallSphere runs these same efficiency tactics on live voice and chat — cached context, right-sized models, and tight tool outputs so agents answer every call fast and at a cost that scales. See it live at callsphere.ai.

Cutting Claude Code Token Costs: Caching & Batching

Where the tokens actually go

Prompt caching: stop paying twice for the stable parts

Batching: amortize the fixed costs

Context scoping: the cheapest token is the one you never send

Model selection: right-size the brain for the step

Measure, then optimize

Frequently asked questions

Does prompt caching change the model's answers?

When should I use batch processing instead of real-time calls?

What is the single highest-impact cost fix for most agent workflows?

Will using a cheaper model for some steps hurt quality?

Bringing agentic AI to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Cowork is heading and how to prepare

Where Claude Code GTM engineering is heading next

How to measure success of Claude Code GTM workflows

Measuring Claude Cowork success: metrics that prove it

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild