Cheaper, Faster Claude Agents: Caching and Batching
Reduce token cost and latency in Claude Managed Agents with prompt caching, batching over fan-out, lean context, and matching the model to each task.
The first production agent most teams ship is correct and ruinously expensive. It re-sends the same fifty-page system prompt on every turn, fans out subagents for tasks a single call could handle, and re-reads files it already has in context. The bill arrives, someone runs the numbers, and suddenly the conversation shifts from "does it work" to "can we afford to leave it on." Performance is not a polish step for Claude Managed Agents — it is the difference between a demo and a product.
Token cost and latency are two sides of the same coin: every token the model reads or writes costs money and time. This post covers the three levers that move both at once (prompt caching, request batching, and disciplined context management) and how to apply them without sacrificing the quality that made the agent worth building.
Key takeaways
- Prompt caching is the highest-leverage optimization: stable prefixes (system prompt, tool definitions) read from cache cost a fraction of full input tokens.
- Order your context for cache hits — static content first, dynamic content last — or you will invalidate the cache every turn.
- Batch independent work instead of fanning out subagents; a single call over a list often beats many parallel agents that each reload context.
- Multi-agent runs can cost several times more tokens than single-agent runs, so reserve them for genuinely parallelizable, high-value tasks.
- Match the model to the job: Haiku for routing and extraction, Sonnet for most work, Opus only where its reasoning earns the cost.
Where the tokens actually go
Before optimizing, find out where the spend lives. In a typical agent run, the dominant cost is not the model's output — it is the input read on every turn. An agent that takes fifteen tool-calling turns re-reads the entire growing conversation fifteen times. If your system prompt and tool definitions are large, you pay for them again on every single turn. This compounding re-read is why agents feel disproportionately expensive compared to a single chat completion.
The implication is counterintuitive: shaving output tokens usually matters far less than shrinking and stabilizing the input you re-read on every turn. The two biggest wins both target that re-read: caching it so the repeated portion is cheap, and trimming the context so there is less of it to re-read in the first place. Most other optimizations are small by comparison.
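
To make the compounding re-read concrete, here is a minimal sketch of the arithmetic. The token counts are invented for illustration, and the model is deliberately simple: it assumes a fixed prefix and a fixed number of new conversation tokens per turn, with no caching or trimming.

```typescript
// Hypothetical run shape; the numbers are illustrative, not measurements.
interface RunShape {
  prefixTokens: number;  // system prompt + tool definitions, sent every turn
  tokensPerTurn: number; // new conversation tokens appended on each turn
  turns: number;
}

// Total input tokens read across a run when nothing is cached or trimmed.
// On turn t (1-based) the model reads the prefix plus all t increments so far.
function totalInputTokens({ prefixTokens, tokensPerTurn, turns }: RunShape): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += prefixTokens + tokensPerTurn * t;
  }
  return total;
}

const exampleRun: RunShape = { prefixTokens: 20000, tokensPerTurn: 1500, turns: 15 };
console.log(totalInputTokens(exampleRun)); // 480000
```

In this example the prefix alone accounts for 300,000 of the 480,000 input tokens, and the conversation term grows with the square of the turn count, which is why both caching and trimming pay off.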
Lever one: prompt caching
Prompt caching lets the model store a prefix of your request and reuse it on subsequent calls at a steep discount instead of reprocessing it from scratch. For an agent that sends the same system prompt and tool schemas on every turn, this is transformative — the expensive, unchanging part of every request becomes nearly free after the first read.
    flowchart TD
        A["Agent turn begins"] --> B["Assemble request"]
        B --> C["Static prefix: system prompt + tools"]
        B --> D["Dynamic suffix: latest messages"]
        C --> E{"Prefix in cache?"}
        E -->|Yes| F["Read prefix from cache (cheap)"]
        E -->|No| G["Process prefix in full (write cache)"]
        F --> H["Process only the dynamic suffix"]
        G --> H
        H --> I["Model decides next action"]
The rule that makes or breaks caching is ordering. Caches match on an exact prefix, so anything that changes invalidates everything after it. Put the most stable content first — system instructions, then tool definitions, then long reference documents — and let the volatile content (the latest user message, the most recent tool result) come last. A single timestamp or a per-turn counter placed near the top of the prompt will silently destroy your cache hit rate and you will wonder why caching "does nothing."
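
The effect of ordering can be sketched without calling any API. The snippet below models a request as a list of text segments and counts how many leading segments two consecutive turns share, which is the part an exact-prefix cache could reuse. The segment model, the function names, and the sample strings are all assumptions made for illustration; real caches match on tokens, not on whole segments.

```typescript
interface TurnInput {
  systemPrompt: string;
  toolDefinitions: string[];
  referenceDocs: string[];
  latestMessages: string[];
  timestamp: string; // volatile: differs on every turn
}

// Cache-friendly order: stable segments first, volatile segments last.
function stableFirst(input: TurnInput): string[] {
  return [
    input.systemPrompt,
    ...input.toolDefinitions,
    ...input.referenceDocs,
    `Current time: ${input.timestamp}`,
    ...input.latestMessages,
  ];
}

// Anti-pattern: a per-turn timestamp at the very top of the prompt.
function volatileFirst(input: TurnInput): string[] {
  return [
    `Current time: ${input.timestamp}`,
    input.systemPrompt,
    ...input.toolDefinitions,
    ...input.referenceDocs,
    ...input.latestMessages,
  ];
}

// Number of leading segments that are identical in both requests.
function sharedPrefixLength(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

const stableParts = {
  systemPrompt: "You are a support agent.",
  toolDefinitions: ["tool: lookup_order", "tool: refund"],
  referenceDocs: ["Refund policy v3"],
};
const turnOne: TurnInput = { ...stableParts, latestMessages: ["Where is my order?"], timestamp: "10:00:01" };
const turnTwo: TurnInput = { ...stableParts, latestMessages: ["Where is my order?", "It shipped."], timestamp: "10:00:09" };

console.log(sharedPrefixLength(stableFirst(turnOne), stableFirst(turnTwo)));     // 4
console.log(sharedPrefixLength(volatileFirst(turnOne), volatileFirst(turnTwo))); // 0
```

With the stable content first, all four heavy segments are reusable; with the timestamp on top, nothing is.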
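Mark your cache breakpoints deliberately at the end of large, stable blocks. A practical layout for a Claude Managed Agent is: cache the system prompt block, cache the tool-definition block, optionally cache a large static knowledge document, and leave the newest turns of the running conversation as the uncached tail, since that part changes every turn. With that structure, a fifteen-turn run pays the full processing price for the heavy prefix once, when the cache is written (cache writes can carry a small premium over normal input), and the discounted read price on the other fourteen turns, provided the turns arrive close enough together that the cached entry has not expired.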
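
As a rough sketch of that layout, the function below builds a plain request object with breakpoints on the tool block, the system prompt, and a knowledge document. The `cache_control: { type: "ephemeral" }` marker follows the shape used by Anthropic's Messages API, where the cached prefix is assembled as tools, then system, then messages. Whether your Managed Agents setup exposes the same fields is an assumption to verify, and the function name, model id, and sample tool are placeholders.

```typescript
type CacheControl = { type: "ephemeral" };

interface TextBlock { type: "text"; text: string; cache_control?: CacheControl }
interface ToolDef {
  name: string;
  description: string;
  input_schema: { type: "object"; properties: Record<string, unknown> };
  cache_control?: CacheControl;
}
interface ChatMessage { role: "user" | "assistant"; content: string }

// Build a request body with breakpoints after each large, stable block.
function buildCachedRequest(
  model: string,
  systemPrompt: string,
  knowledgeDoc: string,
  tools: ToolDef[],
  conversation: ChatMessage[],
) {
  const breakpoint: CacheControl = { type: "ephemeral" };
  // A breakpoint on the last tool covers the whole tool-definition block.
  const cachedTools: ToolDef[] = tools.map((tool, i) =>
    i === tools.length - 1 ? { ...tool, cache_control: breakpoint } : tool,
  );
  const system: TextBlock[] = [
    { type: "text", text: systemPrompt, cache_control: breakpoint },
    { type: "text", text: knowledgeDoc, cache_control: breakpoint },
  ];
  // The conversation is passed through unmarked: it is the volatile tail.
  return { model, max_tokens: 1024, tools: cachedTools, system, messages: conversation };
}
```

The function copies the last tool instead of mutating it, so the same tool list can be reused across turns without accumulating markers.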
Lever two: batch instead of fan out
Multi-agent orchestration is powerful and seductive, but it is the most common source of runaway cost. Because each subagent carries its own context — its own copy of instructions and relevant background — a run that spawns several subagents can consume several times the tokens of a single well-structured agent. Sometimes that is worth it; often it is not.
The discipline is to ask whether the work is genuinely parallel and genuinely independent. Classifying a hundred support tickets is parallelizable, but it does not need a hundred subagents — it needs one call that processes a batch, or an asynchronous batch job if latency is not critical. Reserve true subagent fan-out for tasks where each branch does deep, divergent work that would pollute a shared context, such as researching three unrelated subsystems in parallel. If the subagents would all do the same kind of work over different data, batch it instead.
    // Expensive: one agent invocation per item, full context each time
    for (const ticket of tickets) {
      await runAgent({ system: BIG_PROMPT, input: ticket }); // N full reads
    }

    // Cheaper: one call classifies the whole batch
    await runAgent({
      system: BIG_PROMPT, // 1 read
      input: { task: "classify each ticket", tickets },
    });
The second form reads the heavy prompt once and lets the model handle the list, often at a tiny fraction of the cost. When throughput rather than latency is the goal, push the whole job through an asynchronous batch interface and accept results later at a further discount.
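
When the list is too long for one call, a middle path is to batch in chunks. The sketch below assumes a hypothetical `runAgent` callback that returns one label per ticket in order; the chunk size is something you would tune against your own context and output limits.

```typescript
type AgentRequest = { system: string; input: unknown };
// Hypothetical agent runner: resolves to one label per submitted ticket.
type RunAgent = (req: AgentRequest) => Promise<string[]>;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1 || Math.floor(size) !== size) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// One agent call per chunk: the heavy prompt is read ceil(N / batchSize)
// times instead of N times.
async function classifyInBatches(
  runAgent: RunAgent,
  system: string,
  tickets: string[],
  batchSize: number,
): Promise<string[]> {
  const labels: string[] = [];
  for (const batch of chunk(tickets, batchSize)) {
    const result = await runAgent({
      system,
      input: { task: "classify each ticket", tickets: batch },
    });
    if (result.length !== batch.length) {
      throw new Error("expected one label per ticket");
    }
    labels.push(...result);
  }
  return labels;
}
```

With 100 tickets and a batch size of 25, the heavy prompt is read 4 times instead of 100. The length check guards against the model silently dropping or merging items, which is the main quality risk of batching.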
Lever three: keep the context lean
Every token you carry forward is a token you re-read on every future turn, so context is a recurring cost, not a one-time one. Long-running agents must actively manage their working memory. The two main techniques are summarization — compacting old turns into a short synopsis once they are no longer needed verbatim — and externalization, where the agent writes intermediate results to a file or store and keeps only a pointer in context.
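
A minimal sketch of the summarization half might look like this. The `summarize` callback is hypothetical (in practice it would likely be a call to a small, cheap model), and keeping the last few turns verbatim is one possible policy among several.

```typescript
interface Turn { role: "user" | "assistant" | "tool"; content: string }

// Replace everything except the most recent `keepRecent` turns with a
// single synopsis turn. `summarize` is a hypothetical callback.
function compactHistory(
  history: Turn[],
  keepRecent: number,
  summarize: (old: Turn[]) => string,
): Turn[] {
  const keep = Math.max(0, Math.floor(keepRecent));
  if (history.length <= keep) return history; // nothing old enough to compact
  const cut = history.length - keep;
  const synopsis: Turn = {
    role: "user",
    content: `Summary of earlier turns: ${summarize(history.slice(0, cut))}`,
  };
  return [synopsis, ...history.slice(cut)];
}
```

One design note: compaction rewrites the start of the conversation, so any cached prefix that covered those turns stops matching. Compacting occasionally, in large steps, is likely to interact better with caching than compacting a little on every turn.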
Externalization deserves emphasis because it is underused. If a subagent produces a large research document, the orchestrator does not need the full text in its context — it needs to know the document exists and where to find it. Passing a reference instead of the payload keeps the orchestrator's context small and its per-turn cost flat even as the work accumulates. This is also why returning a file path from a tool is often better than returning the file's contents.
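
The reference-instead-of-payload idea can be sketched with an in-memory store standing in for a file system or object store. The class name, the `artifact://` reference scheme, and the fields on the returned pointer are all invented for this example.

```typescript
interface ArtifactPointer { ref: string; label: string; chars: number }

// In-memory stand-in for a file system or object store (an assumption).
class ArtifactStore {
  private readonly artifacts = new Map<string, string>();
  private nextId = 1;

  // Store the payload and return a compact pointer for the context window.
  put(label: string, payload: string): ArtifactPointer {
    const ref = `artifact://${this.nextId++}`;
    this.artifacts.set(ref, payload);
    return { ref, label, chars: payload.length };
  }

  // Fetch the full payload only when a step actually needs it.
  get(ref: string): string {
    const payload = this.artifacts.get(ref);
    if (payload === undefined) throw new Error(`unknown artifact: ${ref}`);
    return payload;
  }
}
```

A subagent would call `put` with its research document and hand the orchestrator only the pointer, which stays a few dozen characters long however large the document grows.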
Match the model to the task
Not every step needs your most capable model. A well-tuned agent routes work to the cheapest model that can do it reliably. Use a fast, inexpensive model for routing, simple extraction, and classification; use a mid-tier model for the bulk of tool-calling work; reserve the most capable model for the genuinely hard reasoning steps where its quality changes the outcome. A single agent can call different models for different subtasks.
| Workload | Suggested tier | Why |
|---|---|---|
| Routing, intent detection | Haiku-class | Cheap, fast, accuracy is sufficient |
| Extraction, classification, summarizing | Haiku- or Sonnet-class | Structured tasks rarely need top reasoning |
| General tool-calling agent work | Sonnet-class | Best balance of cost and capability |
| Hard multi-step reasoning, planning | Opus-class | Quality gain justifies the premium |
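
The table translates naturally into a small routing function. This is a sketch only: the workload labels and tier names are invented here, and mapping a tier to a concrete model id is left to your own configuration. The `escalate` flag is one simple way to model the "Haiku or Sonnet" row, by starting cheap and moving up one tier when the cheap attempt is not reliable enough.

```typescript
type Workload =
  | "routing"
  | "extraction"
  | "classification"
  | "summarizing"
  | "tool_calling"
  | "planning";
type Tier = "haiku-class" | "sonnet-class" | "opus-class";

const TIER_ORDER: Tier[] = ["haiku-class", "sonnet-class", "opus-class"];

// Default tier per workload, following the table above.
const DEFAULT_TIER: Record<Workload, Tier> = {
  routing: "haiku-class",
  extraction: "haiku-class",
  classification: "haiku-class",
  summarizing: "haiku-class",
  tool_calling: "sonnet-class",
  planning: "opus-class",
};

// Pick the cheapest suitable tier; `escalate` moves up one tier, capped at the top.
function pickTier(workload: Workload, escalate = false): Tier {
  const base = DEFAULT_TIER[workload];
  if (!escalate) return base;
  const next = Math.min(TIER_ORDER.indexOf(base) + 1, TIER_ORDER.length - 1);
  return TIER_ORDER[next] ?? base;
}
```

For example, `pickTier("extraction")` returns `"haiku-class"`, and `pickTier("extraction", true)` returns `"sonnet-class"`.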
Common pitfalls
- Putting volatile data at the top of the prompt. A timestamp or request ID near the start invalidates the cache for the entire request. Keep the prefix byte-for-byte stable.
- Fanning out subagents for uniform work. If every subagent does the same task over different rows, batch it into one call instead of paying for N context copies.
- Carrying the whole transcript forever. Summarize or externalize old turns; otherwise per-turn cost grows linearly with run length.
- Using your most expensive model for everything. Routing and extraction on a top-tier model is pure waste; tier your models per subtask.
- Optimizing output length while ignoring input re-reads. Input dominates agent cost; focus there first.
Cut agent cost in 5 steps
1. Instrument a real run and attribute tokens to system prompt, tools, conversation, and output so you know where the money goes.
2. Reorder the prompt: stable content first, volatile content last, and place cache breakpoints after the large stable blocks.
3. Replace subagent fan-out with batched single calls wherever the work is uniform; reserve fan-out for divergent deep work.
4. Add context management: summarize old turns and externalize large artifacts to references.
5. Tier your models per subtask and re-measure cost and latency against the original baseline.
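
The instrumentation step can start as a simple tally. The sketch below assumes you can already obtain per-turn token counts split into four buckets (for instance from a token-counting pass over each part of the request); the field names are hypothetical and do not correspond to any particular API's usage object.

```typescript
// Hypothetical per-turn breakdown; how you measure each bucket is up to you.
interface TurnUsage {
  systemTokens: number;
  toolTokens: number;
  conversationTokens: number;
  outputTokens: number;
}
type Bucket = keyof TurnUsage;

const BUCKETS: Bucket[] = ["systemTokens", "toolTokens", "conversationTokens", "outputTokens"];

// Sum each bucket across the run and report its share of all tokens.
function attributeTokens(turns: TurnUsage[]): Record<Bucket, { tokens: number; share: number }> {
  const totals: TurnUsage = { systemTokens: 0, toolTokens: 0, conversationTokens: 0, outputTokens: 0 };
  for (const turn of turns) {
    for (const bucket of BUCKETS) totals[bucket] += turn[bucket];
  }
  const grandTotal = BUCKETS.reduce((sum, bucket) => sum + totals[bucket], 0);
  const report = {} as Record<Bucket, { tokens: number; share: number }>;
  for (const bucket of BUCKETS) {
    report[bucket] = {
      tokens: totals[bucket],
      share: grandTotal === 0 ? 0 : totals[bucket] / grandTotal,
    };
  }
  return report;
}
```

Running this over a baseline run before any changes, and again after each step, gives you the before-and-after comparison the last step asks for.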
Frequently asked questions
How much can prompt caching actually save?
For agents with large, stable prefixes that get re-read across many turns, the cached portion costs a small fraction of the full input price. Since input re-reads dominate agent cost, the effective savings on a long, prefix-heavy run are substantial — often the single biggest lever available.
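
For a back-of-the-envelope estimate, the sketch below compares the cost of the prefix with and without caching. The multipliers are parameters, not facts baked into the code: the example values (a write at 1.25 times normal input price and a read at 0.1 times) are illustrative and should be checked against current pricing. The model also assumes every turn after the first is a cache hit.

```typescript
interface CachePricing {
  writeMultiplier: number; // first, cache-writing read, relative to normal input
  readMultiplier: number;  // later cache hits, relative to normal input
}

// Cost of the cached prefix over a run, as a fraction of the uncached cost.
// Assumes one write on turn 1 and a cache hit on every later turn.
function prefixCostRatio(turns: number, pricing: CachePricing): number {
  if (turns < 1 || Math.floor(turns) !== turns) {
    throw new RangeError("turns must be a positive integer");
  }
  const cached = pricing.writeMultiplier + (turns - 1) * pricing.readMultiplier;
  const uncached = turns; // one full-price read per turn
  return cached / uncached;
}

// Illustrative multipliers only; substitute your provider's current rates.
const examplePricing: CachePricing = { writeMultiplier: 1.25, readMultiplier: 0.1 };
console.log(prefixCostRatio(15, examplePricing).toFixed(2)); // "0.18"
```

Under those assumptions a fifteen-turn run pays roughly 18% of the uncached prefix cost. A single-turn run pays more than the uncached price because of the write premium, so caching only helps once the prefix is actually re-read.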
Is multi-agent always more expensive than single-agent?
In practice, yes, on a per-task basis: each subagent carries its own context, so multi-agent runs commonly use several times the tokens of a comparable single-agent run. The justification is not cost but capability: when subtasks are deep, divergent, and parallelizable, the extra spend buys quality and speed that a single agent cannot match.
Does caching hurt quality?
No. Caching changes how the prefix is processed for cost purposes, not what the model sees. The model reads the same content; you simply pay less to re-process the unchanged part. Quality is identical to an uncached request with the same context.
When should I use asynchronous batching?
Use it when throughput matters more than immediate response — nightly classification, bulk enrichment, large-scale evaluation. You submit many independent requests and collect results later at a further discount, which is ideal for offline agent workloads where a few minutes or hours of latency is acceptable.
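
As a sketch of what "submit many independent requests" can look like, the function below builds a batch payload in the shape used by Anthropic's Message Batches API, where each entry pairs a unique `custom_id` with ordinary message parameters. The function name, the `ticket-` id prefix, the token limit, and the model id you pass in are assumptions; submitting the payload and polling for results are left out.

```typescript
interface Ticket { id: string; text: string }
interface BatchEntry {
  custom_id: string; // used to match results back to inputs
  params: {
    model: string;
    max_tokens: number;
    system: string;
    messages: { role: "user"; content: string }[];
  };
}

// Build one batch entry per ticket. Ticket ids are assumed to be short
// alphanumeric strings and must be unique within the batch.
function buildClassificationBatch(
  model: string,
  system: string,
  tickets: Ticket[],
): { requests: BatchEntry[] } {
  const seen = new Set<string>();
  const requests = tickets.map((ticket): BatchEntry => {
    if (seen.has(ticket.id)) throw new Error(`duplicate ticket id: ${ticket.id}`);
    seen.add(ticket.id);
    return {
      custom_id: `ticket-${ticket.id}`,
      params: {
        model,
        max_tokens: 64, // a label is short; keep the output cap small
        system,
        messages: [{ role: "user", content: ticket.text }],
      },
    };
  });
  return { requests };
}
```

Because batch results may come back in any order, the `custom_id` is what ties each label to its ticket, which is why the sketch rejects duplicates up front.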
Bringing agentic AI to your phone lines
CallSphere applies the same cost discipline — cached prefixes, batched work, and tiered models — so its voice and chat agents stay fast and affordable while answering every call, using tools mid-conversation, and booking work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.