Cut Claude Agent Cost: Caching, Batching, Fast Runs
Cut Claude agent cost and latency with prompt caching, batching, model routing, and context discipline — techniques for cheaper, faster runs.
Agentic systems are expensive in a way that single prompts are not. Every turn re-sends the entire conversation, so a 15-turn agent pays for its system prompt fifteen times. Multi-agent runs compound this — they typically burn several times more tokens than a single agent doing the same job. The good news is that the biggest wins in agent economics come from a handful of mechanical techniques, and the largest of them is prompt caching. This post is about making Claude agents genuinely cheap and fast in production, with concrete numbers on where the tokens go and how to claw them back.
Key takeaways
- Prompt caching is the single biggest lever: a cache read costs a fraction of a fresh input token, and agents re-read the same prefix every turn.
- Order your prompt static-to-dynamic: system prompt and tools first, conversation last, so the cacheable prefix stays stable.
- Batch independent, non-interactive work to halve cost on jobs that can tolerate latency.
- Route by difficulty — Haiku for cheap classification, Sonnet for most agent work, Opus only where capability pays for itself.
- Context discipline (summaries, pruning, retrieval) keeps the per-turn input from growing unboundedly.
Where the tokens actually go
Before optimizing, measure. In a typical tool-using agent, the input dwarfs the output: a few hundred output tokens per turn against thousands of input tokens that grow every turn. The drivers are the system prompt, the tool definitions, and the accumulating message history. Read the usage block on every response — input_tokens, cache_creation_input_tokens, cache_read_input_tokens, and output_tokens — and sum them across a full run. Most teams are shocked to find that 80% or more of their spend is re-reading a prefix that never changed.
That observation is the whole strategy. If the prefix is stable and you are paying full price for it on every turn, you are leaving most of your budget on the table. Prompt caching exists precisely to fix this.
Prompt caching: the highest-leverage move
Prompt caching lets Claude store a prefix of your prompt and reuse it across calls. A cache write costs slightly more than a normal input token (you pay a premium to store), but every subsequent cache read costs a small fraction of the normal input price. For an agent that re-sends a 4,000-token system-plus-tools prefix on every one of fifteen turns, caching that prefix turns fourteen full-price reads into fourteen cheap ones.
A practical definition: prompt caching is a mechanism that stores a fixed prefix of a prompt so repeated requests reuse it at a reduced read cost instead of reprocessing it from scratch. The key constraint is that the cached portion must be byte-identical across calls, and it must be a prefix — everything before your cache breakpoint.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Incoming agent turn"] --> B{"Prefix unchanged & cached?"}
B -->|Yes| C["Cache READ — fraction of input price"]
B -->|No| D["Cache WRITE — small premium, stores prefix"]
C --> E["Process only the new tail tokens"]
D --> E
E --> F{"Run interactive?"}
F -->|No| G["Send via batch — extra discount"]
F -->|Yes| H["Return inline"]
To use it, mark a cache breakpoint after your stable content. The order matters: put the things that never change first.
{
"model": "claude-sonnet-4-6",
"system": [
{ "type": "text", "text": "<long stable instructions>",
"cache_control": { "type": "ephemeral" } }
],
"tools": [ /* stable tool defs — also above the breakpoint */ ],
"messages": [ /* dynamic conversation — below the breakpoint */ ]
}
The cache has a short default lifetime that refreshes on each hit, so a busy agent keeps its cache warm naturally. Idle agents may need a longer cache window if your provider tier supports one. The cardinal rule: never let a dynamic value — a timestamp, a user ID, a per-request note — sneak above the breakpoint, or you convert every read into a write.
Batching: discount for work that can wait
Not every agent call needs to answer in real time. Overnight enrichment, bulk classification, evaluation runs, and document processing can go through the Message Batches API, which trades latency for a meaningful per-token discount. Batching composes with caching: a nightly job that processes ten thousand records against the same cached instruction prefix gets both the batch discount and the cache discount. The rule of thumb is simple — if a human is not waiting on the response, batch it.
Model routing: stop paying Opus prices for Haiku work
Using your most capable model for everything is the most common overspend. Claude's lineup is tiered for a reason: Haiku 4.5 is fast and cheap for classification, extraction, and routing; Sonnet 4.6 handles the bulk of real agent reasoning at a strong price-performance point; Opus 4.8 is reserved for the hardest planning and synthesis. A clean pattern is a cheap router model that triages each request and dispatches to the right tier. Many teams find that a Haiku pre-classifier plus Sonnet execution handles the large majority of traffic, with Opus invoked only for the genuinely hard minority.
Context discipline: keep the per-turn input flat
Even with caching, the dynamic tail of your conversation grows every turn, and you pay full price for that growth. Three habits keep it in check. First, summarize: when the message history crosses a threshold, replace old turns with a compact summary the agent can still reason over. Second, prune tool results: a search tool that returns 50 KB of JSON should be trimmed to the fields the agent actually needs before you append it. Third, retrieve instead of stuff: rather than pasting a whole document into context, give the agent a tool to fetch the relevant slice on demand.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Common pitfalls
- Putting a timestamp in the system prompt. It changes every call and silently disables caching. Move volatile values into the message body.
- Caching tiny prefixes. The cache write premium only pays off if the prefix is large enough and reused enough. Cache substantial, stable content, not a 200-token header.
- Batching latency-sensitive calls. Putting an interactive chat turn through the batch API destroys the user experience for a discount you should not want there.
- Defaulting everything to Opus. Capability you do not need is just cost. Profile which requests actually require it.
- Ignoring output tokens in multi-agent fan-out. Each subagent both reads context and writes a report; ten subagents multiply both. Spawn deliberately.
Make an agent cheap in 5 steps
- Instrument every call and sum the four token counters across a full run to find your real cost drivers.
- Reorder the prompt static-to-dynamic and place a cache breakpoint after the stable system-plus-tools block.
- Confirm cache hits by watching
cache_read_input_tokensclimb andinput_tokensfall on repeat turns. - Route by difficulty: add a Haiku triage step and reserve Opus for the hard minority.
- Move non-interactive jobs to the Batches API and add summarization plus result-pruning to flatten the dynamic tail.
| Lever | Best for | Typical effect |
|---|---|---|
| Prompt caching | Any multi-turn or repeated-prefix workload | Largest single saving on input cost |
| Batching | Non-interactive bulk jobs | Per-token discount, higher latency |
| Model routing | Mixed-difficulty traffic | Cut cost on the easy majority |
| Context pruning | Long conversations, fat tool results | Keeps per-turn input from ballooning |
Frequently asked questions
How much does prompt caching actually save?
It depends on reuse, but the structure of the saving is dramatic: a cache read costs only a fraction of a normal input token, and an agent re-reads its prefix on every turn. For a long-running agent with a large stable prefix, caching commonly removes the majority of input cost. Measure your own ratio with cache_read_input_tokens over total input.
Does prompt caching change the model's output?
No. Caching only changes how the prefix is processed and billed; the model sees the same tokens and produces the same quality of response. It is a pure cost-and-latency optimization, not a behavior change.
When should I batch instead of streaming?
Batch whenever no human is waiting on the result — nightly enrichment, bulk classification, and eval runs. Keep interactive chat and anything user-facing on the standard real-time path, since batching trades latency for cost.
Is multi-agent always more expensive?
Generally yes — multi-agent runs use several times more tokens than a single agent because each subagent reads context and writes output. Use them when the parallelism or specialization genuinely improves the result, and cache the shared instructions so every subagent reads them cheaply.
Bringing agentic AI to your phone lines
CallSphere runs these cost and latency techniques under the hood so voice and chat agents stay fast and affordable at scale — cached prefixes, tiered models, and tight context, answering every call 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.