Cutting Token Cost in Claude Agents: Caching & Batching
Keep Claude agent orchestration cheap and fast with prompt caching, batching, model routing, and lean context that cut token bills without losing quality.
A multi-agent system built on Claude can do remarkable work, and it can also quietly burn through a budget. A single orchestrated run that fans out to five subagents, each carrying a fat system prompt and re-reading the same documents, can cost an order of magnitude more than the single-agent version of the same task. The capability is worth paying for — but most of that bill is waste, not value. This post is about finding and removing the waste: where the tokens actually go in an orchestration run, and the specific levers that make runs cheaper and faster without dumbing them down.
Where the tokens actually go
Before optimizing, measure. In almost every orchestration system the cost breaks down into three buckets, and intuition usually gets the ratio wrong. The first bucket is repeated context: the same system prompt, tool definitions, and reference documents shipped on every turn and to every subagent. The second is tool result bloat: a search tool returns 8,000 tokens of JSON when the agent needed three fields. The third is over-reasoning: spawning a subagent or running extra turns for work a single focused prompt could have finished.
Instrument cost the way you'd instrument latency. Log input and output tokens per step, attribute them to the agent and tool that caused them, and roll them up per run. Once you can see that 60% of your spend is the same instructions re-sent forty times, the priorities sort themselves out. The biggest win in agent orchestration is almost always eliminating repeated context, and Claude gives you a direct tool for that.
Prompt caching: stop paying for the same tokens twice
Prompt caching is the highest-leverage cost lever in the Claude ecosystem, and it is underused. The idea is simple: the stable prefix of your context — system prompt, tool definitions, long reference material, few-shot examples — gets cached on the first request, and subsequent requests that reuse that exact prefix read from the cache at a steep discount instead of paying full input price again. In an orchestration loop where the same system prompt rides along on every one of twenty turns, this turns a linear cost into something far flatter.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
To benefit, you have to structure the context deliberately. Put everything stable at the front and everything that changes — the latest tool result, the current task state — at the back, so the cacheable prefix stays byte-identical across turns. The classic anti-pattern is interleaving a timestamp or a turn counter near the top of the prompt; it invalidates the cache on every single call and you pay full freight forever. Treat your prompt layout as a cache-design problem: stable, large, and reused content first; volatile content last.
flowchart TD
A["New agent turn"] --> B["Assemble context"]
B --> C{"Stable prefix unchanged?"}
C -->|Yes| D["Cache hit: cheap prefix"]
C -->|No| E["Cache miss: full input cost"]
D --> F["Append volatile suffix only"]
E --> F
F --> G["Call Claude"]
G --> H{"Trim tool output before next turn?"}
H -->|Yes| I["Keep summary, drop raw"] --> ABatching independent work
When subagents perform genuinely independent units of work — enrich fifty leads, classify two hundred tickets, summarize forty documents — running them one at a time wastes both wall-clock time and the fixed overhead of repeated setup. Two batching strategies apply. For latency, fan the independent tasks out concurrently so they overlap instead of queueing; Claude Code's parallel subagents are built for exactly this shape of work. For cost on large non-interactive jobs, the Anthropic platform's asynchronous batch processing trades immediacy for a meaningful discount — ideal for overnight enrichment or backfills where a result in an hour is fine.
The discipline that makes batching pay off is recognizing independence honestly. If task B needs task A's output, they aren't a batch, they're a pipeline, and forcing them concurrent just creates coordination overhead. Map the dependency graph first: the nodes with no edges between them are your batch.
Route models to the work
Not every step deserves the most capable model. A mature orchestration system uses a mix: a strong model like Opus for the orchestrator's planning and the genuinely hard reasoning, a mid model like Sonnet for most subagent execution, and a fast, inexpensive model like Haiku for high-volume mechanical steps — classification, extraction, formatting, routing decisions. The orchestrator picks the model per subagent based on the difficulty of the assigned task.
The trap is reaching for the biggest model by default "to be safe." That habit is where budgets die. Build an eval set first, then downgrade each step to the cheapest model that still passes, and only upgrade the steps that measurably need it. Most teams discover that the majority of their subagent calls run perfectly well on a smaller model, and that the savings compound across every run.
Keep context lean across turns
The final lever is context discipline over the life of a run. A long orchestration accumulates history — every tool result, every intermediate message — and if you naively carry all of it forward, each turn gets more expensive than the last and the model's attention degrades. The fix is to summarize and compact: after a tool returns a large payload and the agent has extracted what it needs, replace the raw payload in the running context with a short summary of the relevant facts. Keep the decisions and the data; drop the verbose intermediate. This keeps both cost and quality up, because a tight context is one the model reasons over more reliably than a sprawling one.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
How much can prompt caching actually save?
It depends entirely on how much of your context is stable and reused, but in orchestration loops — where a large system prompt and tool set ride along on every turn — the cacheable portion is often the majority of input tokens, so the savings on cached reads are substantial. Structure the prompt so the stable prefix never changes and you capture most of that benefit.
When should I use batch processing instead of parallel subagents?
Use parallel subagents when you need the results promptly and want them to overlap in a live run. Use asynchronous batch processing for large, non-time-sensitive jobs where waiting is acceptable in exchange for a lower price — overnight enrichment, backfills, bulk classification.
Is multi-agent always more expensive than single-agent?
Generally yes — fanning out to subagents multiplies the repeated context and coordination overhead, so multi-agent runs typically use several times more tokens. The justification is capability and parallel speed on hard, decomposable tasks, not cost. Reach for it deliberately, and lean on caching and model routing to keep the multiplier in check.
Faster, cheaper agents on your phone lines
CallSphere applies this same cost discipline — cached prefixes, model routing, lean context — to voice and chat agents that answer every call and message and book work 24/7, keeping latency low and runs affordable. See it in action at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.