Cut Claude Agent Cost: Caching, Batching, Fast Runs

Agentic systems are expensive in a way that single prompts are not. Every turn re-sends the entire conversation, so a 15-turn agent pays for its system prompt fifteen times. Multi-agent runs compound this — they typically burn several times more tokens than a single agent doing the same job. The good news is that the biggest wins in agent economics come from a handful of mechanical techniques, and the largest of them is prompt caching. This post is about making Claude agents genuinely cheap and fast in production, with concrete numbers on where the tokens go and how to claw them back.

Key takeaways

Prompt caching is the single biggest lever: a cache read costs a fraction of a fresh input token, and agents re-read the same prefix every turn.
Order your prompt static-to-dynamic: system prompt and tools first, conversation last, so the cacheable prefix stays stable.
Batch independent, non-interactive work to halve cost on jobs that can tolerate latency.
Route by difficulty — Haiku for cheap classification, Sonnet for most agent work, Opus only where capability pays for itself.
Context discipline (summaries, pruning, retrieval) keeps the per-turn input from growing unboundedly.

Where the tokens actually go

Before optimizing, measure. In a typical tool-using agent, the input dwarfs the output: a few hundred output tokens per turn against thousands of input tokens that grow every turn. The drivers are the system prompt, the tool definitions, and the accumulating message history. Read the usage block on every response — input_tokens, cache_creation_input_tokens, cache_read_input_tokens, and output_tokens — and sum them across a full run. Most teams are shocked to find that 80% or more of their spend is re-reading a prefix that never changed.

That observation is the whole strategy. If the prefix is stable and you are paying full price for it on every turn, you are leaving most of your budget on the table. Prompt caching exists precisely to fix this.

Prompt caching: the highest-leverage move

Prompt caching lets Claude store a prefix of your prompt and reuse it across calls. A cache write costs slightly more than a normal input token (you pay a premium to store), but every subsequent cache read costs a small fraction of the normal input price. For an agent that re-sends a 4,000-token system-plus-tools prefix on every one of fifteen turns, caching that prefix turns fourteen full-price reads into fourteen cheap ones.

A practical definition: prompt caching is a mechanism that stores a fixed prefix of a prompt so repeated requests reuse it at a reduced read cost instead of reprocessing it from scratch. The key constraint is that the cached portion must be byte-identical across calls, and it must be a prefix — everything before your cache breakpoint.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A["Incoming agent turn"] --> B{"Prefix unchanged & cached?"}
  B -->|Yes| C["Cache READ — fraction of input price"]
  B -->|No| D["Cache WRITE — small premium, stores prefix"]
  C --> E["Process only the new tail tokens"]
  D --> E
  E --> F{"Run interactive?"}
  F -->|No| G["Send via batch — extra discount"]
  F -->|Yes| H["Return inline"]

To use it, mark a cache breakpoint after your stable content. The order matters: put the things that never change first.

{
  "model": "claude-sonnet-4-6",
  "system": [
    { "type": "text", "text": "<long stable instructions>",
      "cache_control": { "type": "ephemeral" } }
  ],
  "tools": [ /* stable tool defs — also above the breakpoint */ ],
  "messages": [ /* dynamic conversation — below the breakpoint */ ]
}

The cache has a short default lifetime that refreshes on each hit, so a busy agent keeps its cache warm naturally. Idle agents may need a longer cache window if your provider tier supports one. The cardinal rule: never let a dynamic value — a timestamp, a user ID, a per-request note — sneak above the breakpoint, or you convert every read into a write.

Batching: discount for work that can wait

Not every agent call needs to answer in real time. Overnight enrichment, bulk classification, evaluation runs, and document processing can go through the Message Batches API, which trades latency for a meaningful per-token discount. Batching composes with caching: a nightly job that processes ten thousand records against the same cached instruction prefix gets both the batch discount and the cache discount. The rule of thumb is simple — if a human is not waiting on the response, batch it.

Model routing: stop paying Opus prices for Haiku work

Using your most capable model for everything is the most common overspend. Claude's lineup is tiered for a reason: Haiku 4.5 is fast and cheap for classification, extraction, and routing; Sonnet 4.6 handles the bulk of real agent reasoning at a strong price-performance point; Opus 4.8 is reserved for the hardest planning and synthesis. A clean pattern is a cheap router model that triages each request and dispatches to the right tier. Many teams find that a Haiku pre-classifier plus Sonnet execution handles the large majority of traffic, with Opus invoked only for the genuinely hard minority.

Context discipline: keep the per-turn input flat

Even with caching, the dynamic tail of your conversation grows every turn, and you pay full price for that growth. Three habits keep it in check. First, summarize: when the message history crosses a threshold, replace old turns with a compact summary the agent can still reason over. Second, prune tool results: a search tool that returns 50 KB of JSON should be trimmed to the fields the agent actually needs before you append it. Third, retrieve instead of stuff: rather than pasting a whole document into context, give the agent a tool to fetch the relevant slice on demand.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Common pitfalls

Putting a timestamp in the system prompt. It changes every call and silently disables caching. Move volatile values into the message body.
Caching tiny prefixes. The cache write premium only pays off if the prefix is large enough and reused enough. Cache substantial, stable content, not a 200-token header.
Batching latency-sensitive calls. Putting an interactive chat turn through the batch API destroys the user experience for a discount you should not want there.
Defaulting everything to Opus. Capability you do not need is just cost. Profile which requests actually require it.
Ignoring output tokens in multi-agent fan-out. Each subagent both reads context and writes a report; ten subagents multiply both. Spawn deliberately.

Make an agent cheap in 5 steps

Instrument every call and sum the four token counters across a full run to find your real cost drivers.
Reorder the prompt static-to-dynamic and place a cache breakpoint after the stable system-plus-tools block.
Confirm cache hits by watching cache_read_input_tokens climb and input_tokens fall on repeat turns.
Route by difficulty: add a Haiku triage step and reserve Opus for the hard minority.
Move non-interactive jobs to the Batches API and add summarization plus result-pruning to flatten the dynamic tail.

Lever	Best for	Typical effect
Prompt caching	Any multi-turn or repeated-prefix workload	Largest single saving on input cost
Batching	Non-interactive bulk jobs	Per-token discount, higher latency
Model routing	Mixed-difficulty traffic	Cut cost on the easy majority
Context pruning	Long conversations, fat tool results	Keeps per-turn input from ballooning

Frequently asked questions

How much does prompt caching actually save?

It depends on reuse, but the structure of the saving is dramatic: a cache read costs only a fraction of a normal input token, and an agent re-reads its prefix on every turn. For a long-running agent with a large stable prefix, caching commonly removes the majority of input cost. Measure your own ratio with cache_read_input_tokens over total input.

Does prompt caching change the model's output?

No. Caching only changes how the prefix is processed and billed; the model sees the same tokens and produces the same quality of response. It is a pure cost-and-latency optimization, not a behavior change.

When should I batch instead of streaming?

Batch whenever no human is waiting on the result — nightly enrichment, bulk classification, and eval runs. Keep interactive chat and anything user-facing on the standard real-time path, since batching trades latency for cost.

Is multi-agent always more expensive?

Generally yes — multi-agent runs use several times more tokens than a single agent because each subagent reads context and writes output. Use them when the parallelism or specialization genuinely improves the result, and cache the shared instructions so every subagent reads them cheaply.

Bringing agentic AI to your phone lines

CallSphere runs these cost and latency techniques under the hood so voice and chat agents stay fast and affordable at scale — cached prefixes, tiered models, and tight context, answering every call 24/7. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Cut Claude Agent Cost: Caching, Batching, Fast Runs

Key takeaways

Where the tokens actually go

Prompt caching: the highest-leverage move

Batching: discount for work that can wait

Model routing: stop paying Opus prices for Haiku work

Context discipline: keep the per-turn input flat

Common pitfalls

Make an agent cheap in 5 steps

Frequently asked questions

How much does prompt caching actually save?

Does prompt caching change the model's output?

When should I batch instead of streaming?

Is multi-agent always more expensive?

Bringing agentic AI to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Cowork is heading and how to prepare

Where Claude Code GTM engineering is heading next

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild