Cheaper, Faster Claude Agents: Caching and Batching

The first production agent most teams ship is correct and ruinously expensive. It re-sends the same fifty-page system prompt on every turn, fans out subagents for tasks a single call could handle, and re-reads files it already has in context. The bill arrives, someone runs the numbers, and suddenly the conversation shifts from "does it work" to "can we afford to leave it on." Performance is not a polish step for Claude Managed Agents — it is the difference between a demo and a product.

Token cost and latency are two sides of the same coin: every token the model reads or writes costs money and time. This post covers the three levers that move both at once (prompt caching, request batching, and disciplined context management) and how to apply them without sacrificing the quality that made the agent worth building.

Key takeaways

Prompt caching is the highest-leverage optimization: stable prefixes (system prompt, tool definitions) read from cache cost a fraction of full input tokens.
Order your context for cache hits — static content first, dynamic content last — or you will invalidate the cache every turn.
Batch independent work instead of fanning out subagents; a single call over a list often beats many parallel agents that each reload context.
Multi-agent runs cost several times more tokens than single-agent runs, so reserve them for genuinely parallelizable, high-value tasks.
Match the model to the job: Haiku for routing and extraction, Sonnet for most work, Opus only where its reasoning earns the cost.

Where the tokens actually go

Before optimizing, find out where the spend lives. In a typical agent run, the dominant cost is not the model's output — it is the input read on every turn. An agent that takes fifteen tool-calling turns re-reads the entire growing conversation fifteen times. If your system prompt and tool definitions are large, you pay for them again on every single turn. This compounding re-read is why agents feel disproportionately expensive compared to a single chat completion.

The implication is counterintuitive: shaving output tokens usually moves the bill far less than shrinking and stabilizing the input you re-read on every turn, even though output tokens are typically priced higher per token. The two biggest wins both target that re-read: caching it so the repeated portion is cheap, and trimming the context so there is less of it to re-read in the first place. Batching and model tiering build on the same idea by reducing how often, and at what price, that input is read.
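To make the compounding concrete, here is a back-of-the-envelope sketch. Every number in it (a 20,000-token prefix, 500 tokens of new conversation per turn, 200 output tokens per turn, fifteen turns) is an illustrative assumption, not a measurement from a real agent.

```typescript
// Illustrative numbers only: every figure below is an assumption.
interface RunShape {
  prefixTokens: number;        // system prompt + tool definitions, re-sent every turn
  tokensAddedPerTurn: number;  // conversation content appended by each turn
  outputTokensPerTurn: number; // tokens the model writes per turn
  turns: number;
}

// Total input tokens read across a run: each turn re-reads the prefix
// plus everything appended by the turns before it.
function totalInputTokens(run: RunShape): number {
  let total = 0;
  for (let turn = 0; turn < run.turns; turn++) {
    total += run.prefixTokens + turn * run.tokensAddedPerTurn;
  }
  return total;
}

function totalOutputTokens(run: RunShape): number {
  return run.outputTokensPerTurn * run.turns;
}

const exampleRun: RunShape = {
  prefixTokens: 20000,
  tokensAddedPerTurn: 500,
  outputTokensPerTurn: 200,
  turns: 15,
};

const inputTotal = totalInputTokens(exampleRun);   // 352,500 input tokens
const outputTotal = totalOutputTokens(exampleRun); // 3,000 output tokens
```

Under these assumptions the run reads more than a hundred input tokens for every output token it writes. Output tokens usually cost more each, so the cost ratio is smaller than the token ratio, but the re-read input still dominates.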

Lever one: prompt caching

Prompt caching lets the API store a processed prefix of your request and reuse it on subsequent calls at a steep discount instead of reprocessing it from scratch. For an agent that sends the same system prompt and tool schemas on every turn, this is transformative: once the first request has written the cache, the expensive, unchanging part of every later request is billed at a fraction of the normal input price.

flowchart TD
  A["Agent turn begins"] --> B["Assemble request"]
  B --> C["Static prefix: system prompt + tools"]
  B --> D["Dynamic suffix: latest messages"]
  C --> E{"Prefix in cache?"}
  E -->|Yes| F["Read prefix from cache (cheap)"]
  E -->|No| G["Process prefix in full (write cache)"]
  F --> H["Process only the dynamic suffix"]
  G --> H
  H --> I["Model decides next action"]

The rule that makes or breaks caching is ordering. Caches match on an exact prefix, so anything that changes invalidates everything after it. Put the most stable content first (tool definitions and system instructions, then long reference documents) and let the volatile content (the latest user message, the most recent tool result) come last. In Anthropic's Messages API the cacheable prefix is assembled as tools, then system, then messages, so editing a tool definition also invalidates the cached system prompt. A single timestamp or a per-turn counter placed near the top of the prompt will silently destroy your cache hit rate, and you will wonder why caching "does nothing."
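The effect of ordering can be seen with a toy comparison. The sketch below flattens a prompt into a single string and measures how much of it two consecutive turns share from the start. Real requests are structured rather than flat strings, and the prompt builders and sample text are hypothetical, but the exact-prefix behavior is the same idea.

```typescript
// Length of the shared leading run of two strings: a stand-in for how much
// of a prompt an exact-prefix cache could reuse between two turns.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Stand-in for a long, stable system prompt.
const SYSTEM = "You are a support agent. Follow the policies below. ".repeat(20);

// Hypothetical prompt builders: same content, different ordering.
const volatileFirst = (stamp: string, msg: string): string =>
  `Now: ${stamp}\n${SYSTEM}\n${msg}`;
const stableFirst = (stamp: string, msg: string): string =>
  `${SYSTEM}\nNow: ${stamp}\n${msg}`;

// Timestamp at the top: the two turns diverge within the first line.
const badReuse = sharedPrefixLength(
  volatileFirst("2025-01-01T10:00:00Z", "Where is my order?"),
  volatileFirst("2025-01-01T10:00:07Z", "It was placed Monday."),
);

// Timestamp after the stable block: the whole system prompt is shared.
const goodReuse = sharedPrefixLength(
  stableFirst("2025-01-01T10:00:00Z", "Where is my order?"),
  stableFirst("2025-01-01T10:00:07Z", "It was placed Monday."),
);
```

With the timestamp first, the shared prefix ends inside the timestamp itself. With the stable block first, the shared prefix covers the entire system prompt.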

Mark your cache breakpoints deliberately at the end of large, stable blocks. A practical layout for a Claude Managed Agent is: cache the system prompt block, cache the tool-definition block, optionally cache a large static knowledge document, and leave the newest conversation turns uncached since they change every turn. With that structure, a fifteen-turn run pays the cache-write price for the heavy prefix once and the discounted read price on the other fourteen turns, provided each turn arrives before the cached entry expires.
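In Anthropic's Messages API, a breakpoint is expressed by attaching a `cache_control` field of type `ephemeral` to the last block you want included in the cached prefix. The sketch below only builds a request body as a plain object and sends nothing. The model ID, prompt text, knowledge document, and the `lookup_order` tool are placeholders invented for illustration.

```typescript
// Plain-object sketch of a Messages API request body; nothing is sent.
// MODEL_ID, SYSTEM_PROMPT, KNOWLEDGE_DOC, and the tool are placeholders.
type CacheControl = { type: "ephemeral" };

interface TextBlock { type: "text"; text: string; cache_control?: CacheControl }
interface ToolDef {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: CacheControl;
}
interface Message { role: "user" | "assistant"; content: string }

const MODEL_ID = "your-model-id";
const SYSTEM_PROMPT = "You are a support agent. (long, stable instructions go here)";
const KNOWLEDGE_DOC = "(large static reference document goes here)";

const tools: ToolDef[] = [
  {
    name: "lookup_order",
    description: "Look up an order by ID.",
    input_schema: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
  },
];

function buildRequest(conversation: Message[]) {
  const breakpoint: CacheControl = { type: "ephemeral" };

  // A breakpoint on the last tool covers the whole tool-definition block.
  const cachedTools = tools.map((tool, i): ToolDef =>
    i === tools.length - 1 ? { ...tool, cache_control: breakpoint } : tool,
  );

  // One breakpoint after the system prompt, one after the static document.
  const system: TextBlock[] = [
    { type: "text", text: SYSTEM_PROMPT, cache_control: breakpoint },
    { type: "text", text: KNOWLEDGE_DOC, cache_control: breakpoint },
  ];

  // The running conversation changes every turn, so it carries no breakpoint here.
  return { model: MODEL_ID, max_tokens: 1024, tools: cachedTools, system, messages: conversation };
}

const request = buildRequest([{ role: "user", content: "Where is order 1234?" }]);
```

Two caveats apply. The API limits how many breakpoints a request may carry, and a prefix shorter than the model's minimum cacheable length is processed normally rather than cached. Check the current documentation for both limits.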

Lever two: batch instead of fan out

Multi-agent orchestration is powerful and seductive, but it is the most common source of runaway cost. Because each subagent carries its own context — its own copy of instructions and relevant background — a run that spawns several subagents can consume several times the tokens of a single well-structured agent. Sometimes that is worth it; often it is not.

The discipline is to ask whether the work is genuinely parallel and genuinely independent. Classifying a hundred support tickets is parallelizable, but it does not need a hundred subagents — it needs one call that processes a batch, or an asynchronous batch job if latency is not critical. Reserve true subagent fan-out for tasks where each branch does deep, divergent work that would pollute a shared context, such as researching three unrelated subsystems in parallel. If the subagents would all do the same kind of work over different data, batch it instead.

// runAgent, BIG_PROMPT, and tickets are placeholders for your own agent wrapper and data.

// Expensive: one agent invocation per item, full context each time
for (const ticket of tickets) {
  await runAgent({ system: BIG_PROMPT, input: ticket });   // N full reads of BIG_PROMPT
}

// Cheaper: one call classifies the whole batch
await runAgent({
  system: BIG_PROMPT,                                       // 1 read of BIG_PROMPT
  input: { task: "classify each ticket", tickets },
});

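The second form reads the heavy prompt once instead of N times, so the prompt's share of the input bill shrinks by roughly a factor of N; the tickets themselves cost the same to read either way. For very large lists, split the work into a few chunks so each call stays within the context window. When throughput rather than latency is the goal, push the whole job through an asynchronous batch interface and accept results later at a further discount.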
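A batched call still has to fit in the context window, so a common middle ground is a handful of fixed-size chunks. The sketch below shows one way to do that. The `Ticket` shape, the task wording, and the chunk size of 25 are assumptions for illustration, and no agent call is made.

```typescript
// Split a list into fixed-size chunks so each call stays within the context window.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

interface Ticket { id: string; text: string }

// Hypothetical input shape for one batched classification call.
// Asking for results keyed by id makes it easy to map answers back to tickets.
function buildBatchInput(batch: Ticket[]) {
  return {
    task: "Classify each ticket. Return one JSON object per ticket, keyed by id.",
    tickets: batch,
  };
}

// Sample data standing in for real support tickets.
const sampleTickets: Ticket[] = Array.from({ length: 100 }, (_, i): Ticket => ({
  id: `T-${i + 1}`,
  text: `Ticket body ${i + 1}`,
}));

// 100 tickets at 25 per call: 4 reads of the heavy prompt instead of 100.
const batchInputs = chunk(sampleTickets, 25).map(buildBatchInput);
```

The right chunk size depends on ticket length and on how reliably the model handles long lists, so treat it as a parameter to tune against your own accuracy checks.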

Lever three: keep the context lean

Every token you carry forward is a token you re-read on every future turn, so context is a recurring cost, not a one-time one. Long-running agents must actively manage their working memory. The two main techniques are summarization — compacting old turns into a short synopsis once they are no longer needed verbatim — and externalization, where the agent writes intermediate results to a file or store and keeps only a pointer in context.
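Summarization can be sketched as a small compaction step. In the version below, the token estimate is a crude four-characters-per-token guess and the `summarize` callback is a hypothetical stand-in for what would normally be a model call. Both are assumptions for illustration.

```typescript
interface Turn { role: "user" | "assistant"; content: string }

// Very rough token estimate (about 4 characters per token). This is an
// assumption, not a tokenizer; prefer real usage numbers where available.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Replace the oldest turns with one synopsis once the history exceeds a
// token budget, keeping the most recent turns verbatim.
function compact(
  history: Turn[],
  budgetTokens: number,
  keepRecent: number,
  summarize: (old: Turn[]) => string, // hypothetical: in practice a model call
): Turn[] {
  const total = history.reduce((sum, t) => sum + estimateTokens(t.content), 0);
  if (total <= budgetTokens || history.length <= keepRecent) return history;

  const cut = history.length - keepRecent;
  const synopsis: Turn = {
    role: "user",
    content: `Summary of earlier turns: ${summarize(history.slice(0, cut))}`,
  };
  return [synopsis, ...history.slice(cut)];
}

// Ten turns of 400 characters each: roughly 1,000 estimated tokens.
const longHistory: Turn[] = Array.from({ length: 10 }, (_, i): Turn => ({
  role: i % 2 === 0 ? "user" : "assistant",
  content: "x".repeat(400),
}));

// Over a 500-token budget, so the six oldest turns collapse into one synopsis.
const compacted = compact(longHistory, 500, 4, (old) => `${old.length} earlier turns condensed`);
```

Compaction rewrites the start of the conversation, which changes the prefix a cache would match on. Compacting occasionally, at a threshold, tends to fit caching better than compacting on every turn.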

Externalization deserves emphasis because it is underused. If a subagent produces a large research document, the orchestrator does not need the full text in its context — it needs to know the document exists and where to find it. Passing a reference instead of the payload keeps the orchestrator's context small and its per-turn cost flat even as the work accumulates. This is also why returning a file path from a tool is often better than returning the file's contents.
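The reference-instead-of-payload pattern might look like the sketch below. `ArtifactStore` is a hypothetical in-memory stand-in for a file system or object store, and the `artifact://` reference format is invented for this example.

```typescript
// What a tool hands back to the orchestrator: a pointer plus a little metadata.
interface ArtifactRef { ref: string; chars: number; preview: string }

// Hypothetical in-memory stand-in for a file system or object store.
class ArtifactStore {
  private items = new Map<string, string>();
  private nextId = 1;

  // Store the full payload and return only a small reference to it.
  put(name: string, content: string): ArtifactRef {
    const ref = `artifact://${this.nextId++}/${name}`;
    this.items.set(ref, content);
    return { ref, chars: content.length, preview: content.slice(0, 80) };
  }

  // Fetch the full payload only when a later step actually needs it.
  get(ref: string): string {
    const content = this.items.get(ref);
    if (content === undefined) throw new Error(`unknown artifact: ${ref}`);
    return content;
  }
}

const store = new ArtifactStore();

// Stand-in for a large research document produced by a subagent.
const report = "Findings on subsystem A. ".repeat(2000);
const toolResult = store.put("subsystem-a.md", report);

// What the orchestrator carries in context: the reference, not the payload.
const contextEntry = JSON.stringify(toolResult);
```

The orchestrator's context grows by the size of the reference no matter how large the document is. The short preview gives the model enough to decide whether fetching the full text is worth a later turn.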

Match the model to the task

Not every step needs your most capable model. A well-tuned agent routes work to the cheapest model that can do it reliably. Use a fast, inexpensive model for routing, simple extraction, and classification; use a mid-tier model for the bulk of tool-calling work; reserve the most capable model for the genuinely hard reasoning steps where its quality changes the outcome. A single agent can call different models for different subtasks.

Workload	Suggested tier	Why
Routing, intent detection	Haiku-class	Cheap, fast, accuracy is sufficient
Extraction, classification, summarizing	Haiku / Sonnet	Structured tasks rarely need top reasoning
General tool-calling agent work	Sonnet-class	Best balance of cost and capability
Hard multi-step reasoning, planning	Opus-class	Quality gain justifies the premium
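The table above can be encoded as a small routing function. In this sketch the tier labels are placeholders to map onto concrete model IDs in your own configuration, and the one-step escalation rule is an assumption added for illustration, not something the table prescribes.

```typescript
type Workload =
  | "routing"
  | "extraction"
  | "classification"
  | "summarizing"
  | "tool_calling"
  | "planning";

type Tier = "haiku-class" | "sonnet-class" | "opus-class";

// Default tier per workload, following the table above. Rows listed as
// "Haiku / Sonnet" start on the cheaper tier.
const DEFAULT_TIER: Record<Workload, Tier> = {
  routing: "haiku-class",
  extraction: "haiku-class",
  classification: "haiku-class",
  summarizing: "haiku-class",
  tool_calling: "sonnet-class",
  planning: "opus-class",
};

const TIER_ORDER: Tier[] = ["haiku-class", "sonnet-class", "opus-class"];

// Pick a tier for a step. Passing escalate = true moves up one tier, for
// cases where the cheaper model has proven unreliable on this step.
function pickTier(workload: Workload, escalate = false): Tier {
  const base = DEFAULT_TIER[workload];
  if (!escalate) return base;
  const nextIndex = Math.min(TIER_ORDER.indexOf(base) + 1, TIER_ORDER.length - 1);
  return TIER_ORDER[nextIndex] ?? base;
}

const routingTier = pickTier("routing");                 // "haiku-class"
const escalatedExtraction = pickTier("extraction", true); // "sonnet-class"
```

Keeping the mapping in one table-like constant makes it cheap to re-measure: change one entry, rerun your evaluation, and compare cost and accuracy.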

Common pitfalls

Putting volatile data at the top of the prompt. A timestamp or request ID near the start invalidates the cache for the entire request. Keep the prefix byte-for-byte stable.
Fanning out subagents for uniform work. If every subagent does the same task over different rows, batch it into one call instead of paying for N context copies.
Carrying the whole transcript forever. Summarize or externalize old turns; otherwise per-turn cost grows linearly with run length, and total cost grows roughly with its square.
Using your most expensive model for everything. Running routing and extraction on a top-tier model is pure waste; tier your models per subtask.
Optimizing output length while ignoring input re-reads. Input dominates agent cost; focus there first.

Cut agent cost in 5 steps

1. Instrument a real run and attribute tokens to system prompt, tools, conversation, and output so you know where the money goes.
2. Reorder the prompt: stable content first, volatile content last, and place cache breakpoints after the large stable blocks.
3. Replace subagent fan-out with batched single calls wherever the work is uniform; reserve fan-out for divergent deep work.
4. Add context management: summarize old turns and externalize large artifacts to references.
5. Tier your models per subtask and re-measure cost and latency against the original baseline.
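The first step, attributing tokens, can start as a simple tally. The sketch below assumes you already log per-turn token counts for each section of the request; how you obtain those counts is left open, and the sample numbers are invented.

```typescript
type Bucket = "system" | "tools" | "conversation" | "output";

// Per-turn token counts by section, assumed to come from your own logging.
type TurnUsage = Record<Bucket, number>;

interface BucketReport { tokens: number; share: number }

// Sum per-turn counts into buckets and report each bucket's share of the total.
function attribute(turns: TurnUsage[]): Record<Bucket, BucketReport> {
  const totals: TurnUsage = { system: 0, tools: 0, conversation: 0, output: 0 };
  for (const turn of turns) {
    totals.system += turn.system;
    totals.tools += turn.tools;
    totals.conversation += turn.conversation;
    totals.output += turn.output;
  }
  const grand = totals.system + totals.tools + totals.conversation + totals.output;
  const report = (tokens: number): BucketReport => ({
    tokens,
    share: grand === 0 ? 0 : tokens / grand,
  });
  return {
    system: report(totals.system),
    tools: report(totals.tools),
    conversation: report(totals.conversation),
    output: report(totals.output),
  };
}

// Invented sample: a three-turn run with a fixed prefix and a growing conversation.
const usageReport = attribute([
  { system: 6000, tools: 3000, conversation: 500, output: 200 },
  { system: 6000, tools: 3000, conversation: 1000, output: 200 },
  { system: 6000, tools: 3000, conversation: 1500, output: 200 },
]);
```

In this sample the system prompt and tools account for most of the tokens, which points at caching as the first thing to fix. A run dominated by the conversation bucket would point at context management instead.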

Frequently asked questions

How much can prompt caching actually save?

For agents with large, stable prefixes that get re-read across many turns, the cached portion costs a small fraction of the full input price. Since input re-reads dominate agent cost, the effective savings on a long, prefix-heavy run are substantial — often the single biggest lever available.
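A rough calculation shows the shape of the savings. The price multipliers below (a cache write at 1.25 times the normal input price, a cache read at 0.1 times) are assumptions for illustration, as are the prefix size and turn count. Check current pricing before relying on any of them.

```typescript
// Assumed price multipliers relative to normal input tokens. Illustrative
// only; check your provider's current pricing.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Cost of sending a stable prefix on every turn of a run, expressed in
// "normal input token" units. Assumes every turn after the first is a
// cache hit, meaning the cached entry never expires mid-run.
function prefixCost(prefixTokens: number, turns: number, cached: boolean): number {
  if (turns <= 0) return 0;
  if (!cached) return prefixTokens * turns;
  const write = prefixTokens * CACHE_WRITE_MULTIPLIER;              // first turn writes the cache
  const reads = prefixTokens * CACHE_READ_MULTIPLIER * (turns - 1); // later turns read it
  return write + reads;
}

const uncachedCost = prefixCost(20000, 15, false); // 300,000 units
const cachedCost = prefixCost(20000, 15, true);    // about 53,000 units
const savedFraction = 1 - cachedCost / uncachedCost; // roughly 0.82
```

Under these assumptions the prefix costs a bit over a sixth of what it would uncached. The saving applies only to the cached prefix; the growing conversation and the output are billed as usual.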

Is multi-agent always more expensive than single-agent?

On a per-task basis, effectively yes. Each subagent carries its own context, so multi-agent runs commonly use several times the tokens of a comparable single-agent run. The justification is capability rather than cost: when subtasks are deep, divergent, and parallelizable, the extra spend buys quality and speed that a single agent cannot match.

Does caching hurt quality?

No. Caching changes how the prefix is processed for cost purposes, not what the model sees. The model reads the same content; you simply pay less to re-process the unchanged part. Quality is identical to an uncached request with the same context.

When should I use asynchronous batching?

Use it when throughput matters more than immediate response — nightly classification, bulk enrichment, large-scale evaluation. You submit many independent requests and collect results later at a further discount, which is ideal for offline agent workloads where a few minutes or hours of latency is acceptable.
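For Anthropic's Message Batches API, a submission is a list of independent requests, each with a `custom_id` used to match results back to inputs and a `params` object shaped like a normal Messages request. The sketch below only builds that payload and sends nothing. The model ID, classifier prompt, and ticket data are placeholders.

```typescript
interface BatchRequest {
  custom_id: string; // used to match each result back to its input
  params: {
    model: string;
    max_tokens: number;
    system: string;
    messages: { role: "user"; content: string }[];
  };
}

const BATCH_MODEL_ID = "your-model-id"; // placeholder
const CLASSIFIER_PROMPT = "Classify the ticket as billing, bug, or other."; // placeholder

// Build one independent request per item; results arrive later and may be
// returned in a different order, so custom_id is the join key.
function buildBatch(items: { id: string; text: string }[]): { requests: BatchRequest[] } {
  return {
    requests: items.map((item): BatchRequest => ({
      custom_id: `ticket-${item.id}`,
      params: {
        model: BATCH_MODEL_ID,
        max_tokens: 64,
        system: CLASSIFIER_PROMPT,
        messages: [{ role: "user", content: item.text }],
      },
    })),
  };
}

// Invented sample data.
const batchPayload = buildBatch([
  { id: "101", text: "I was charged twice this month." },
  { id: "102", text: "The export button crashes the app." },
  { id: "103", text: "How do I change my avatar?" },
]);
```

Because each request is independent, a failure in one does not block the others. Plan for partial results when you collect the output.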

Bringing agentic AI to your phone lines

CallSphere applies the same cost discipline — cached prefixes, batched work, and tiered models — so its voice and chat agents stay fast and affordable while answering every call, using tools mid-conversation, and booking work 24/7. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
