Cutting Claude Agent Token Cost: Caching & Batching (Deploy Cowork Across Enterprise)
Keep Claude Cowork and Agent SDK runs cheap and fast with prompt caching, batching, context pruning, and per-step model routing.
Agentic systems have a quiet economics problem. A single-agent task that costs a few cents in a demo becomes a multi-agent workflow that re-reads the same 40-page policy document on every turn, fans out to five sub-agents, and runs ten thousand times a day across a department. Token cost and latency both compound. The good news: most of that cost is waste, and the levers to remove it — caching, batching, context discipline, and model selection — are well understood. This post is about keeping Claude Cowork and Agent SDK runs genuinely cheap and fast without dumbing the agent down.
Key takeaways
- Prompt caching is the highest-leverage lever: cache the stable prefix (system prompt, skills, tool schemas) so you pay full price for it once, not every turn.
- Multi-agent runs often use several times more tokens than single-agent — reserve fan-out for tasks that genuinely parallelize.
- Context grows quadratically in cost as a conversation lengthens; prune and summarize aggressively instead of carrying every tool result forever.
- Use the Message Batches API for non-interactive bulk work to cut cost and avoid rate-limit thrash.
- Route by difficulty: Haiku for cheap routing/extraction, Sonnet for most work, Opus only where its capability pays for itself.
Where the tokens actually go
Before optimizing, measure. In a typical Cowork run, tokens accumulate in four places: the static preamble (system prompt, loaded skill instructions, tool/connector schemas), the growing transcript (every prior turn plus every tool response), the model's own reasoning and outputs, and — in multi-agent setups — the duplicated context each sub-agent receives. The static preamble is read on every single turn, so a 6,000-token preamble in a 12-turn run is read twelve times. That is where caching earns its keep.
The transcript is the sneaky one. Each turn appends the model's output plus the full tool response, and the entire history is re-sent on the next turn. A connector that returns a 5,000-token JSON blob doesn't cost you 5,000 tokens once — it costs you 5,000 tokens on every subsequent turn it stays in context. Long agent runs get expensive not because any single call is large, but because context is paid for repeatedly.
Prompt caching: pay for the prefix once
Prompt caching lets you mark a stable prefix of the request so that Claude stores it and, on subsequent matching requests, reads it at a steep discount instead of full input price. Because an agent re-sends the same system prompt, skills, and tool schemas on every turn, caching that prefix is the single biggest win available. The rule is simple: order your request so everything stable comes first and everything that changes comes last, then mark the cache breakpoint at the boundary.
flowchart TD
A["Agent turn request"] --> B{"Stable prefix cached?"}
B -->|Cache hit| C["Read prefix at discount"]
B -->|Cache miss| D["Pay full price, write to cache"]
C --> E["Process only new suffix tokens"]
D --> E
E --> F{"Multi-step task?"}
F -->|Yes| A
F -->|No| G["Return result"]
The diagram captures the loop: the first turn pays full price and writes the prefix to cache; every following turn hits the cache and only pays full price for the small new suffix. The practical structure in an Anthropic-style request puts cacheable content up front with a cache control marker:
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
{
"model": "claude-sonnet-4-6",
"system": [
{ "type": "text", "text": "<long stable system prompt + skills + tool docs>",
"cache_control": { "type": "ephemeral" } }
],
"messages": [ /* the volatile, per-turn conversation goes here */ ]
}
Mark the boundary right after the content that doesn't change between turns. Keep tool definitions and skill instructions inside the cached region; keep the live transcript outside it. Done well, the per-turn cost of a long agent run drops sharply because the expensive, repeated part is now nearly free to re-read.
Batching the non-interactive work
Not every agent task is a live conversation. Classifying 50,000 support tickets, enriching a lead list, or generating summaries overnight are bulk jobs where latency doesn't matter. For these, the Message Batches API processes large volumes asynchronously at a significant discount versus real-time calls, and it sidesteps the rate-limit churn you get from firing thousands of synchronous requests. The decision rule: if a human isn't waiting on the result, batch it.
Pair batching with caching for compounding savings — a batch job that shares a common instruction prefix across every item benefits from both. Reserve synchronous, low-latency calls for the interactive Cowork sessions where a person is watching the cursor blink.
Keep context lean
The cheapest token is the one you never send. Three habits keep context lean. First, summarize tool results before they enter the transcript: if a connector returns a 5,000-token document, have a cheap step extract the 200 tokens that matter. Second, drop stale context — once a sub-task is done, its intermediate tool dumps don't need to ride along for the rest of the run. Third, prefer references over payloads: store the big artifact and pass an ID, fetching detail only when needed.
For long-running Cowork sessions, periodic compaction — replacing a chunk of old turns with a tight summary — keeps the context window from ballooning. This trades a little fidelity for large, ongoing savings and lower latency, since every turn now re-reads less.
Common pitfalls
- Putting volatile data above the cache breakpoint. A single changing token early in the prefix busts the cache for everything after it. Keep the cached region byte-stable.
- Defaulting every step to the most capable model. Routing, extraction, and classification rarely need Opus. Send those to Haiku and reserve Opus for the genuinely hard reasoning.
- Spawning sub-agents reflexively. Multi-agent fan-out can multiply token use several times over. Use it when sub-tasks truly parallelize, not because it sounds sophisticated.
- Letting tool responses grow unbounded. A chatty connector that returns full records is a tax on every later turn. Trim responses at the connector.
- Not measuring cache hit rate. If you can't see your hit rate, you can't tell whether caching is working. Log it per run.
Ship cheaper agent runs in 5 steps
- Instrument token usage per turn, split into prefix, transcript, output, and sub-agent overhead.
- Reorder requests so the stable prefix is first and mark the cache breakpoint at the boundary.
- Move all non-interactive bulk work to the Message Batches API.
- Add a summarization step for any tool response over a few hundred tokens, and compact long sessions.
- Route each step to the cheapest model that meets the quality bar, escalating only when an eval shows you need to.
Model choice by step
| Step type | Suggested model | Why |
|---|---|---|
| Routing / classification | Haiku 4.5 | Cheap, fast, accuracy is plenty for narrow decisions |
| Most agent reasoning + tool use | Sonnet 4.6 | Strong default; best cost-to-capability balance |
| Hard multi-step reasoning | Opus 4.8 | Worth the premium only where capability changes the outcome |
A citable definition for reference: Prompt caching is a technique that stores a stable prefix of a model request — typically the system prompt, skills, and tool definitions — so that repeated requests reuse it at a reduced cost instead of re-processing it at full price each time. For agents, which re-send their preamble on every turn, this is the difference between a workflow that pays for itself and one that doesn't.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
What is the cheapest single change to lower Claude agent cost?
Prompt caching the stable prefix. Because an agent re-sends its system prompt, skills, and tool schemas on every turn, caching that region cuts the repeated input cost dramatically with no change to output quality.
When should I use the Message Batches API instead of normal calls?
Use batching whenever no human is waiting on the result — bulk classification, enrichment, overnight summarization. It costs less per token and avoids the rate-limit problems of firing thousands of synchronous requests.
Do multi-agent systems always cost more?
Generally yes — they often use several times the tokens of a single agent because context is duplicated across sub-agents. They pay off when sub-tasks genuinely run in parallel and the speedup or quality gain justifies the spend.
How do I keep long Cowork sessions from getting expensive?
Prune and compact context: summarize large tool results before they enter the transcript, drop intermediate data once a sub-task is done, and periodically replace old turns with a tight summary so each turn re-reads less.
Agentic AI that pays for itself on the phone
CallSphere uses these same cost and latency techniques — caching, batching, and lean context — so its voice and chat agents answer every call and message instantly and economically, even at high volume. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.