Cutting Agent Token Cost: Caching, Batching, and Cheap Runs

An agent that works is not the same as an agent you can afford to run a million times. The gap between a slick demo and a viable product is almost always token economics. A single agentic task might fan out across twenty model turns, each re-sending a growing transcript, and a multi-agent version can use several times more tokens than a single-agent run for the same job. Multiply that by production traffic and a careless design turns a promising feature into a line item nobody wants to defend in a budget review.

The good news: most agent cost is waste, and the waste is structural. You are usually paying to re-process the same system prompt and tool definitions on every turn, dragging stale context forward, and routing trivial steps through your most expensive model. Fixing those three things — without touching task quality — is where the savings live. This post lays out the cost levers that matter in the Claude ecosystem and the order to pull them in.

Where the tokens actually go

Before optimizing, measure. Instrument each run to record input and output tokens per turn, and you will find a characteristic shape: the input grows roughly linearly with each turn because the full conversation, including every prior tool result, is re-sent on the next call. Output tokens are usually small by comparison. So the dominant cost in a long agent run is re-reading its own history, turn after turn after turn.

That observation reframes the whole problem. You are not mainly paying for the model to think; you are paying it to re-read a transcript that is mostly identical to the one it saw a turn ago. Once you internalize that, the optimization strategy becomes obvious: stop paying full price to re-read unchanged tokens, stop dragging forward tokens you no longer need, and stop using a premium model for turns that don't require premium reasoning.

Prompt caching: stop paying to re-read the static parts

The highest-leverage lever is prompt caching. Your system prompt, tool definitions, and large reference documents are identical on every turn of a run — and often across many runs. Prompt caching lets Claude store that processed prefix and reuse it, so subsequent turns pay a sharply reduced rate for the cached portion instead of full input price. For an agent whose system prompt and tool schemas run to thousands of tokens, this alone can cut input cost dramatically, because that prefix is re-sent on every single turn.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A["New turn in agent run"] --> B{"Static prefix cached?"}
  B -->|Yes| C["Reuse cached prefix at reduced rate"]
  B -->|No| D["Process prefix & write to cache"]
  C --> E["Process only new tokens this turn"]
  D --> E
  E --> F{"Run cheaper model for this step?"}
  F -->|Trivial step| G["Route to Haiku"]
  F -->|Hard reasoning| H["Route to Sonnet/Opus"]

To get the benefit, structure the prompt so the stable content sits at the front and the volatile content — the current user turn, the latest tool result — sits at the back. Cache boundaries reward this ordering: anything before the boundary is reused, anything after is reprocessed. A common mistake is interleaving dynamic values into the system prompt, which silently invalidates the cache every turn. Keep the static prefix genuinely static and you keep the cache warm across the entire run.

Prompt caching is a mechanism that stores the model's processed representation of a stable prompt prefix so repeated requests reuse it at a reduced token rate instead of reprocessing it from scratch. It is the difference between paying for a long system prompt once per run versus once per turn.

Batching and model tiering

Not every workload is interactive. If you are running evals, enriching a backlog of records, or generating a thousand summaries, you don't need each result in milliseconds — you need them all done cheaply by morning. The Message Batches API exists for exactly this: submit a large set of requests for asynchronous processing at a substantial discount versus real-time calls. Any agentic workload that isn't blocking a human should default to batch, and the savings compound at scale.

The second tiering lever is model selection within a single run. The Claude family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 for the everyday workhorse, and Haiku 4.5 for fast, cheap, high-volume steps. A well-built agent rarely needs Opus for every turn. Routing classification, extraction, formatting, and routine tool-result interpretation to Haiku while reserving Sonnet or Opus for genuine planning and judgment can cut cost several-fold with no perceptible quality loss — because those cheap steps were never the ones that needed a frontier model.

Context discipline: the quiet cost killer

The subtlest lever is keeping context lean. Because input grows every turn, anything you let accumulate in the transcript gets re-billed on all subsequent turns. A 50KB tool result that mattered for one turn keeps costing you for the rest of the run if it stays in context. Disciplined agents prune: they summarize long tool outputs down to the few fields that matter, they drop intermediate scratch work once a sub-goal is done, and for very long tasks they compact the history into a running summary.

Claude Code's subagent model is a clean architectural answer here. A subagent runs a focused task in its own context window and returns only a distilled result to the orchestrator, so the heavy intermediate tokens never pollute the main thread. You pay for the deep work once, inside the subagent, and the parent agent only ever sees the conclusion. Used deliberately, this is how you get the power of multi-agent fan-out without paying the naive multi-agent token bill on every step.

Putting the levers in order

Pull these in sequence. Start with prompt caching, because it is nearly free to adopt and pays back immediately on every long run. Next, move all non-interactive work to batching. Then add model tiering, routing cheap steps to Haiku. Finally, impose context discipline — pruning, summarizing, and isolating heavy work in subagents. Measure tokens per run before and after each change so you can prove the savings rather than assume them.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

The throughline is that cheap and fast are mostly engineering, not magic. Quality comes from the model and the prompt; cost comes from how you feed that model. A team that treats token accounting as a first-class concern — measured, optimized, and watched in production — ends up with agents that are not just impressive in a demo but defensible in a budget.

Frequently asked questions

Does prompt caching change the model's output?

No. Caching only changes how the static prefix is billed and processed; the model sees the same content and produces the same quality of output. The single requirement is that the cached prefix be byte-stable across requests, so keep dynamic values out of it and place them after the cache boundary.

When should I use the Batches API instead of real-time calls?

Whenever no human is waiting on the result. Evals, backfills, bulk enrichment, and overnight report generation are ideal — you trade latency for a meaningful per-token discount. Keep real-time calls only for interactive paths where a person is actively blocked on the response.

How much can model tiering really save?

It depends on your mix, but most agents have a long tail of trivial turns — classification, extraction, formatting — that don't need a frontier model. Routing those to Haiku while keeping Sonnet or Opus for planning and judgment commonly cuts run cost several-fold, because the expensive steps were a minority of total turns all along.

Why does my agent's cost grow non-linearly with task length?

Because the full transcript is re-sent on every turn, so each new turn pays to re-read all prior turns. A task that takes twice as many turns can cost well more than twice as much. Context discipline — pruning, summarizing, and isolating heavy work in subagents — flattens that curve.

Bringing cheap, fast agents to your phone lines

Keeping runs cheap matters even more in real time, where a voice agent must respond within a breath. CallSphere applies these same caching, tiering, and context-discipline patterns to voice and chat assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Cutting Agent Token Cost: Caching, Batching, and Cheap Runs

Where the tokens actually go

Prompt caching: stop paying to re-read the static parts

Batching and model tiering

Context discipline: the quiet cost killer

Putting the levers in order

Frequently asked questions

Does prompt caching change the model's output?

When should I use the Batches API instead of real-time calls?

How much can model tiering really save?

Why does my agent's cost grow non-linearly with task length?

Bringing cheap, fast agents to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Code GTM engineering is heading next

Where Claude Cowork is heading and how to prepare

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild