Cut Claude Agent Token Cost: Caching and Batching
Make self-hosted Claude managed agents cheap and fast with prompt caching, batched tool calls, model routing, and context hygiene. Concrete tuning steps.
A managed agent that works is exciting until the bill arrives. Autonomous Claude agents re-send context on every turn, fan out into subagents, and re-read the same files over and over — and each of those habits quietly multiplies token spend. A single multi-turn run that touches a few tools can consume several times the tokens of a one-shot completion, and a multi-agent run several times more again. The good news is that the biggest cost levers are mechanical: prompt caching, batching, model routing, and ruthless context hygiene. Pull them and the same agent gets both cheaper and noticeably faster, because fewer tokens means less time to first token and less to generate.
This post is about making self-hosted managed agents — Claude running in your sandbox, reaching tools over MCP — economical at scale. We'll quantify where the tokens actually go, then walk the four levers that move the number most, with copy-pasteable patterns.
Key takeaways
- Prompt caching is the single biggest lever. Cache the stable prefix — system prompt, tool definitions, long instructions — and cached reads cost a fraction of fresh input tokens.
- The agent loop re-bills context every turn. A 10-turn run pays for the system prompt 10 times unless you cache it.
- Batch independent work with parallel tool calls and the Message Batches API for non-interactive jobs.
- Route by difficulty: Haiku 4.5 for cheap classification and routing, Sonnet 4.6 for most work, Opus 4.8 only for the hard reasoning steps.
- Context hygiene compounds: trim tool results, summarize old turns, and don't paste whole files when a slice will do.
Where the tokens actually go
Before optimizing, instrument. Log tokens_in and tokens_out per turn, split into cached vs. uncached input. In almost every agent we've profiled, the dominant cost is input, not output — and within input, the dominant cost is the part that repeats every turn: the system prompt, the tool schemas, and the accumulating conversation history. An agent that emits 200 tokens of reasoning per turn but re-reads a 6,000-token system prompt and 4,000 tokens of tool definitions on every one of 12 turns is spending the vast majority of its budget on the same unchanging bytes.
That observation is the whole strategy. If the expensive thing is repeated, stable context, then caching it is the highest-ROI change you can make — and it requires no change to the agent's behavior.
Lever 1: prompt caching
Prompt caching lets you mark a stable prefix so Claude reuses it across calls instead of re-processing it. You place a cache breakpoint after the parts that don't change between turns — typically the system prompt and tool definitions — and the model reads them from cache on subsequent calls at a steep discount, with a much lower latency cost too.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
{
"model": "claude-sonnet-4-6",
"system": [
{ "type": "text", "text": "You are an ops agent..." },
{ "type": "text", "text": "<long tool playbook>",
"cache_control": { "type": "ephemeral" } }
],
"tools": [ /* stable tool defs */ ]
}
The rule of thumb: order context from most-stable to least-stable and put the cache breakpoint at the boundary. System prompt and tools first (cached), then the dynamic conversation. Because caches are short-lived, this pays off most when an agent makes many calls in quick succession — which is exactly what a managed-agent loop does. For a 12-turn run, caching the stable prefix can take total input cost from "prefix billed 12 times" to "prefix billed once, then cheap cache reads."
How cost flows through an agent run
flowchart TD
A["New task arrives"] --> B{"Stable prefix cached?"}
B -->|No| C["Pay full input for prompt + tools"]
B -->|Yes| D["Cheap cache read"]
C --> E["Pick model by difficulty"]
D --> E
E -->|Routing/simple| F["Haiku 4.5"]
E -->|Default| G["Sonnet 4.6"]
E -->|Hard reasoning| H["Opus 4.8"]
F --> I["Batch independent tool calls"]
G --> I
H --> I
The two cheapest decisions in the whole graph are the early ones: hitting cache instead of re-billing the prefix, and routing simple turns to a smaller model. Get those right before you touch anything else.
Lever 2: batch independent work
Agents waste time and tokens doing serially what could be done in parallel. Two patterns help. First, encourage parallel tool calls: when an agent needs three independent lookups, it can request all three in a single turn rather than three sequential turns, collapsing latency and avoiding three round-trips of context. Second, for non-interactive workloads — overnight enrichment, bulk classification, eval runs — use the Message Batches API, which processes large volumes asynchronously at a significant discount versus real-time calls. If a job doesn't need a human waiting on it, batch it.
The distinction matters: parallel tool calls optimize a single live run; batching optimizes throughput across many runs. A well-tuned agent platform uses both — interactive runs cache aggressively and parallelize tools, while background jobs flow through the batch endpoint.
Lever 3: route by difficulty
Not every step in an agent run needs the most capable model. Classification, routing, short extraction, and "which tool should I use" decisions run well on Haiku 4.5 at a fraction of the cost. Reserve Opus 4.8 for the genuinely hard reasoning — planning a multi-step migration, reconciling conflicting data, writing tricky code — and let Sonnet 4.6 handle the broad middle. A cheap router step that reads the task and dispatches to the right model can cut blended cost substantially while keeping quality where it matters, because you stop paying premium rates for trivial decisions.
Lever 4: context hygiene
Every token you carry forward gets re-billed on every subsequent turn, so trimming context has compounding value. Three habits matter most: trim tool results before feeding them back — a database query that returns 5,000 rows of JSON when the agent needs 3 fields is pure waste, so project only the needed columns at the MCP server. Summarize old turns once a conversation grows long instead of carrying the full transcript. And slice files — give the agent the relevant function, not the whole 2,000-line module. None of these change what the agent can do; they change what it has to pay to keep thinking.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Common pitfalls
- Caching nothing because "the prompt changes." Even if part changes every turn, the system prompt and tools usually don't — cache that prefix and keep dynamic content after the breakpoint.
- Using Opus 4.8 for everything. Premium model on routing decisions is the most common avoidable cost. Route by difficulty.
- Dumping raw tool output back into context. Unbounded tool results balloon every following turn. Project and truncate at the server.
- Running batchable jobs in real time. Overnight and bulk work belongs on the Message Batches API at a discount, not on the interactive path.
- Optimizing output tokens first. In most agents, input dominates. Measure before you tune; chasing terse outputs while ignoring a re-billed prefix saves little.
Trim agent cost in 5 steps
- Instrument per-turn cached/uncached input and output tokens; find the largest repeated chunk.
- Add a cache breakpoint after the system prompt and tool definitions; verify cache-read tokens appear in usage.
- Add a Haiku 4.5 router step that dispatches simple turns away from Sonnet/Opus.
- Project and truncate tool results at the MCP server so only needed fields return.
- Move non-interactive workloads to the Message Batches API and parallelize independent tool calls in live runs.
Lever comparison
| Lever | Best for | Effort | Typical impact |
|---|---|---|---|
| Prompt caching | Multi-turn loops with stable prefix | Low | Large |
| Model routing | Mixed easy/hard steps | Medium | Large |
| Batching | Non-interactive bulk jobs | Low | Medium-large |
| Context hygiene | Long-running conversations | Medium | Medium, compounding |
Frequently asked questions
How much can prompt caching actually save on an agent?
It depends on how repetitive the context is, but the dynamic is straightforward: cached input tokens cost a fraction of fresh ones, and an agent loop re-sends its stable prefix every turn. The more turns a run takes and the larger the fixed prompt and tool definitions, the larger the saving — multi-turn agents are the ideal case for caching.
When should I use the Message Batches API instead of live calls?
Use batching whenever no human is waiting on the result: nightly enrichment, bulk classification, large eval runs, backfills. It trades latency for a meaningful per-token discount. Keep interactive agent turns on the standard endpoint where responsiveness matters, and lean on caching and parallel tool calls there instead.
Does routing to Haiku hurt quality?
Only if you route hard work to it. The point is to match model to difficulty: Haiku 4.5 for routing, classification, and short extraction; Sonnet 4.6 for the default workload; Opus 4.8 for genuinely hard reasoning. A cheap router that reads the step and picks the model preserves quality on the parts that need it while cutting cost on the parts that don't.
Why is input cost usually bigger than output cost in agents?
Because the agent loop carries and re-sends context every turn — system prompt, tool schemas, and growing history — while output is typically a short burst of reasoning and a tool call. The repeated input dominates, which is why caching and context hygiene move the bill far more than terse outputs do.
Bringing agentic AI to your phone lines
CallSphere runs these same cost levers — caching, model routing, batching, context hygiene — behind voice and chat agents that answer every call and message and use tools mid-conversation, keeping each interaction fast and cheap at scale. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.