Cheaper, Faster Claude Agents: Caching & Batching
Cut Claude agent cost and latency with prompt caching, batching, and model routing. Practical levers, code, and a five-step plan to keep runs cheap.
One of the quieter signals in the Anthropic Economic Index is volume. As Claude moves from novelty to a tool people reach for during the workday, the number of agent runs per team climbs steeply — and so does the bill. A single multi-agent run can consume several times the tokens of a one-shot prompt, and at organizational scale those multipliers stop being a rounding error. Performance engineering for agents is no longer optional; it's the difference between an agent that ships and one that gets switched off because finance flagged the spend.
The good news is that most agent cost is wasted, not essential. The same context gets re-sent every turn, tasks that could run in parallel run in sequence, and Opus does work Haiku could handle. This post walks through the levers that actually move the needle — prompt caching, request batching, model routing, and context discipline — with concrete numbers where they're honest.
Key takeaways
- Agent cost is dominated by re-sent context: the system prompt and tool definitions get paid for on every single turn unless you cache them.
- Prompt caching can cut input cost dramatically for the stable prefix of an agent — the part that never changes between turns.
- Running independent tasks or subagent calls in parallel cuts wall-clock latency, and moving non-interactive bulk work to the batch endpoint lowers the per-request cost.
- Model routing (Haiku for cheap classification, Sonnet for most work, Opus for the hard reasoning) is often the biggest single cost win.
- Trim the context: long, growing transcripts are both slower and more expensive every turn.
Where the tokens actually go
Engineers tend to assume the model's output is the expensive part. In agents, it usually isn't. The expensive part is input, and specifically the input you re-send every turn. A Claude agent's context typically holds a large system prompt, a block of tool definitions, accumulated conversation history, and the latest tool result. On turn 12 of a run, you are re-sending the system prompt and tool schemas for the twelfth time.
Because every input token is billed on every request, that stable prefix dominates the bill. The fix is structural: keep the unchanging parts of your prompt at the front, keep the volatile parts at the back, and let caching pay for the prefix once instead of every turn.
The Economic Index's picture of high-frequency, repeated use makes this worse, not better — the same agent runs hundreds of times a day with nearly identical prefixes. That repetition is precisely what caching is built to exploit.
It's worth building an intuition for the arithmetic. Suppose your system prompt and tool definitions total a few thousand tokens, and a typical run takes a dozen turns. Without caching, you pay for that fixed block roughly twelve times per run; with caching you pay roughly full price once (the first, cache-writing call carries a modest surcharge) and a small fraction of that on each later turn. Multiply that by hundreds of runs a day and the fixed prefix is, for many agents, the majority of the total input bill. This is why "the model's answers are short, so cost should be low" is such a common and expensive misconception: the output is cheap, but the re-sent input is where the money quietly goes.
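The arithmetic above can be sketched in a few lines. This is a rough model, not a pricing calculator: the 3,000-token prefix, the 500-token per-turn delta, and the cache write and read multipliers (1.25 and 0.10) are all assumptions for illustration, so check them against current pricing before relying on the result. Costs are expressed in units of "one input token at the base price".

```python
def input_cost_units(prefix_tokens, delta_tokens_per_turn, turns,
                     cached, write_mult=1.25, read_mult=0.10):
    """Input cost of one run, in units of one base-price input token.

    write_mult and read_mult are ASSUMED cache pricing multipliers.
    The per-turn delta is treated as constant, which is a simplification.
    """
    delta = delta_tokens_per_turn * turns
    if not cached:
        # The fixed prefix is re-sent and billed in full on every turn.
        return prefix_tokens * turns + delta
    # First turn writes the cache; every later turn reads it at a discount.
    prefix = prefix_tokens * write_mult + prefix_tokens * read_mult * (turns - 1)
    return prefix + delta


uncached = input_cost_units(3000, 500, 12, cached=False)
cached = input_cost_units(3000, 500, 12, cached=True)
savings = 1 - cached / uncached
print(f"uncached={uncached:.0f} cached={cached:.0f} savings={savings:.0%}")
```

With these assumed numbers the uncached run costs 42,000 token-units of input and the cached run about 13,050, roughly 69% less. The model assumes every turn after the first hits the cache, which holds only while calls arrive before the cached prefix expires.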
The cost-control decision flow
flowchart TD
A["Incoming agent task"] --> B{"Stable prefix reused?"}
B -->|Yes| C["Cache system prompt & tools"]
B -->|No| D["Send full prompt once"]
C --> E{"Tasks independent?"}
D --> E
E -->|Yes| F["Batch in parallel"]
E -->|No| G["Run sequentially"]
F --> H{"How hard is the step?"}
G --> H
H -->|Simple| I["Route to Haiku"]
H -->|Standard| J["Route to Sonnet"]
H -->|Hard reasoning| K["Route to Opus"]
Prompt caching: pay for the prefix once
Prompt caching lets you mark a stable prefix of your prompt so Claude reuses it across calls instead of reprocessing it. For an agent, the natural cache boundary is everything before the conversation: the system prompt, the persona, the rules, and the full block of tool definitions. Those don't change between turns, so they should be cached and the per-turn delta should be tiny.
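Here's the shape of a cached agent request using the Anthropic API's cache control. The cache_control marker sets a cache breakpoint: everything up to and including the marked block is treated as a reusable prefix. Cached prefixes are short-lived (by default the ephemeral type expires after about five minutes without a hit), so the savings come from calls that arrive close together, as the turns of an agent run do.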
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "<agent rules + all tool docs here>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "latest turn only" }
  ]
}
The discipline that makes this work: never interleave volatile content into the cached block. If you splice the current timestamp or a per-turn variable into the middle of your system prompt, you bust the cache every time. Keep volatile values in the messages, not the prefix.
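One way to enforce that discipline is to build every request through a single function and fingerprint the cached block. The sketch below only assembles the request body as a dictionary; it does not call the API. The helper names (`build_request`, `prefix_fingerprint`), the `volatile_context` parameter, and the `max_tokens` value are hypothetical. Because the Messages API is stateless, the conversation history is still sent in `messages` on each call; only the marked prefix is cached here.

```python
import hashlib
import json

# Placeholder text, as in the request shape above. Must never change mid-run.
STABLE_SYSTEM_TEXT = "<agent rules + all tool docs here>"


def build_request(history, latest_user_text, volatile_context=None,
                  model="claude-sonnet-4-6"):
    """Assemble a request body: stable prefix in `system`, volatile data in `messages`."""
    user_text = latest_user_text
    if volatile_context:
        # Timestamps, session IDs and similar values go in the latest message,
        # never in the cached system block.
        user_text = f"{volatile_context}\n\n{latest_user_text}"
    return {
        "model": model,
        "max_tokens": 1024,  # illustrative value
        "system": [{
            "type": "text",
            "text": STABLE_SYSTEM_TEXT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": list(history) + [{"role": "user", "content": user_text}],
    }


def prefix_fingerprint(request):
    """Hash the cached block so a test can assert it is identical across turns."""
    blob = json.dumps(request["system"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


turn_1 = build_request([], "Where is my order?", "now=2025-01-01T10:00:00Z")
turn_2 = build_request(turn_1["messages"], "And the refund?", "now=2025-01-01T10:00:07Z")
print(prefix_fingerprint(turn_1) == prefix_fingerprint(turn_2))
```

A fingerprint check like this is cheap to run in CI: if someone later interpolates a per-turn value into the system text, the two hashes diverge and the test fails before the cache-miss shows up on the bill.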
Batching and model routing
When an orchestrator fans out work to several subagents — say, researching five sources — those calls are independent and should run concurrently, not one after another. Parallel execution collapses wall-clock time from the sum of the calls to the slowest single call. For non-interactive bulk jobs, the Message Batches API trades immediacy for a lower per-request rate, which is ideal for overnight evals or large content runs.
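The fan-out described above can be sketched with `asyncio.gather`. To keep the example self-contained, `run_subagent` is a hypothetical stand-in that sleeps instead of calling a model; in a real orchestrator it would wrap an async API call. The point is only the timing: five concurrent calls take about as long as the slowest one, not the sum.

```python
import asyncio
import time


async def run_subagent(source, delay_s):
    """Hypothetical stand-in for one subagent call; sleeps instead of calling a model."""
    await asyncio.sleep(delay_s)
    return f"summary of {source}"


async def fan_out(sources, delay_s=0.1):
    """Run independent subagent calls concurrently; results keep the input order."""
    return await asyncio.gather(*(run_subagent(s, delay_s) for s in sources))


def timed_fan_out(sources, delay_s=0.1):
    """Return (results, elapsed_seconds) for one concurrent fan-out."""
    start = time.perf_counter()
    results = asyncio.run(fan_out(sources, delay_s))
    return results, time.perf_counter() - start


results, elapsed = timed_fan_out(["source-1", "source-2", "source-3",
                                  "source-4", "source-5"])
print(len(results), f"{elapsed:.2f}s")
```

In practice you would likely also cap concurrency (for example with `asyncio.Semaphore`) so a large fan-out does not run into rate limits.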
Model routing is the other big lever, and it's often the largest. Not every step needs your most capable model. A cheap, fast model can classify intent, extract a field, or decide which subagent to invoke; reserve the expensive model for the steps that genuinely require deep reasoning. A simple router that sends easy steps to Haiku, default work to Sonnet, and only the hard reasoning to Opus routinely cuts cost without a noticeable quality drop.
A practical way to find your routing boundaries is to start everything on Sonnet, then watch the trace. Steps that are obviously mechanical — "is this a refund request, yes or no?" or "extract the order ID" — are candidates to demote to Haiku. Steps where Sonnet visibly struggles, backtracks, or produces shallow plans are candidates to promote to Opus. You don't guess the routing table up front; you derive it from real traffic. And because the difficulty of a step is often knowable before you call the model — from the task type or a quick classifier — the router can usually pick the right tier in a single cheap upfront decision rather than discovering it the hard way mid-run.
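A routing table derived this way can be very small. The sketch below is one possible shape, not a recommended configuration: the task-type names are hypothetical, and the tiers are plain labels that you would map to whatever model IDs you actually deploy.

```python
# Hypothetical task types mapped to tiers, as derived from trace review.
TIER_BY_TASK = {
    "classify_intent": "haiku",   # mechanical yes/no decisions
    "extract_field": "haiku",     # e.g. pull an order ID out of text
    "long_horizon_plan": "opus",  # steps where the default model struggled
}
DEFAULT_TIER = "sonnet"

# One step up the ladder, used when a step visibly fails on its current tier.
NEXT_TIER = {"haiku": "sonnet", "sonnet": "opus", "opus": "opus"}


def route(task_type):
    """Pick a tier before calling any model; unknown task types get the default."""
    return TIER_BY_TASK.get(task_type, DEFAULT_TIER)


def promote(tier):
    """Escalate one tier; the top tier stays where it is."""
    return NEXT_TIER[tier]


print(route("extract_field"), route("draft_reply"), promote(route("draft_reply")))
```

Keeping the table as data rather than branching logic makes it easy to revise as the traces change: demoting a step is a one-line edit.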
Common pitfalls
- Busting the cache with volatile prefixes. Injecting a timestamp or session ID into the cached system block invalidates it on every call. Keep the prefix byte-stable.
- Letting context grow unbounded. A transcript that keeps appending makes every subsequent turn slower and pricier. Summarize or prune old turns.
- Using Opus for everything. Defaulting your whole pipeline to the most capable model is the most common source of avoidable spend.
- Sequential subagents. Running independent subagent calls one at a time wastes wall-clock time you could reclaim for free with parallelism.
- No per-run token budget. Without a ceiling, a single misbehaving agent can quietly become your biggest line item.
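Two of these pitfalls, unbounded context and a missing budget, lend themselves to small guards. The following is a minimal sketch under stated assumptions: `RunBudget`, `BudgetExceeded`, and `prune_history` are hypothetical names, the default summary is just a placeholder string, and a real pruner would also need to avoid separating a tool call from its result.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than its ceiling allows."""


class RunBudget:
    """Hard per-run ceiling on total (input + output) tokens."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens, output_tokens):
        """Record one turn's usage and stop the run once the ceiling is crossed."""
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


def prune_history(messages, keep_last=6, summarize=None):
    """Replace all but the last `keep_last` messages with a single summary message."""
    if len(messages) <= keep_last:
        return list(messages)
    old, recent = messages[:-keep_last], messages[-keep_last:]
    # `summarize` is an optional callable (for example, a cheap model call).
    text = summarize(old) if summarize else f"[{len(old)} earlier messages omitted]"
    return [{"role": "user", "content": text}] + list(recent)


history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
print(len(prune_history(history, keep_last=4)))
```

Note that rewriting old turns changes the prompt after the cached prefix only, so the cached system block stays valid as long as pruning never touches it.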
Ship cheaper agents in five steps
- Measure first: log input vs. output tokens per turn so you know where the money actually goes.
- Move all stable content (system prompt, tool docs) to a cached prefix and keep it byte-stable.
- Route by difficulty — Haiku for trivial steps, Sonnet for the default, Opus only for hard reasoning.
- Parallelize independent subagent calls and move bulk jobs to the batch endpoint.
- Cap each run with a token budget and prune context once it crosses a threshold.
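The first step, measuring, can start as a small aggregation over the per-turn usage that the API reports. The sketch below assumes each turn's usage has been saved as a dictionary with the fields `input_tokens`, `output_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`, and that `input_tokens` counts only the uncached part of the prompt; verify both against the responses you actually receive. The `summarize_usage` name and the sample numbers are made up for illustration.

```python
USAGE_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def summarize_usage(turn_usages):
    """Total the per-turn usage dicts and compute the share of input served from cache."""
    totals = {field: 0 for field in USAGE_FIELDS}
    for usage in turn_usages:
        for field in USAGE_FIELDS:
            # Missing or null fields count as zero.
            totals[field] += usage.get(field) or 0
    all_input = (totals["input_tokens"]
                 + totals["cache_creation_input_tokens"]
                 + totals["cache_read_input_tokens"])
    totals["all_input_tokens"] = all_input
    totals["cache_hit_ratio"] = (
        totals["cache_read_input_tokens"] / all_input if all_input else 0.0
    )
    return totals


# Made-up usage for a two-turn run: the first turn writes the cache, the second reads it.
run = [
    {"input_tokens": 100, "output_tokens": 50,
     "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"input_tokens": 200, "output_tokens": 40,
     "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
]
print(summarize_usage(run))
```

A cache hit ratio that stays near zero across a multi-turn run is a strong hint that something volatile has crept into the prefix.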
What each lever buys you
| Lever | Cuts cost? | Cuts latency? | Best for |
|---|---|---|---|
| Prompt caching | Yes (input) | Yes | Long, stable prefixes reused every turn |
| Batching and parallel calls | Yes (batch endpoint) | Yes (parallel fan-out only) | Independent or bulk non-interactive tasks |
| Model routing | Yes (large) | Yes | Pipelines mixing easy and hard steps |
| Context pruning | Yes | Yes | Long-running multi-turn agents |
Frequently asked questions
What is prompt caching in Claude?
Prompt caching is a feature that lets Claude reuse a marked, stable prefix of your prompt across requests instead of reprocessing it each time. For agents, you cache the system prompt and tool definitions so you pay roughly full price for them once and a small fraction on later turns, rather than full price on every turn.
Why do multi-agent systems cost more?
A multi-agent run uses several times the tokens of a single-agent run because each subagent carries its own context and the orchestrator coordinates across them. Use multi-agent designs deliberately, and parallelize plus cache aggressively to keep the multiplier in check.
When should I use the batch API over real-time calls?
Use batching for non-interactive bulk work — overnight evals, large content generation, mass classification — where you can tolerate higher latency in exchange for a lower per-request rate. Keep interactive, user-facing turns on the standard real-time endpoint.
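For a sense of what a bulk job looks like, the sketch below builds the body of a batch submission: a list of requests, each pairing a `custom_id` with the same parameters a normal message request would carry. It only constructs the dictionary and does not submit anything. The `build_batch_body` helper, the `item-N` ID scheme, and the `max_tokens` value are hypothetical; check the request shape against the Message Batches API reference before using it.

```python
def build_batch_body(prompts, model="claude-sonnet-4-6", max_tokens=256):
    """Build a batch submission body with one request per prompt."""
    requests = []
    for index, prompt in enumerate(prompts):
        requests.append({
            # custom_id is how results are matched back to inputs later,
            # so it must be unique within the batch.
            "custom_id": f"item-{index}",
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        })
    return {"requests": requests}


body = build_batch_body([
    "Is this a refund request, yes or no? 'I want my money back.'",
    "Is this a refund request, yes or no? 'Where is my parcel?'",
])
print(len(body["requests"]), body["requests"][0]["custom_id"])
```

Batches are processed asynchronously, so do not rely on results coming back in submission order; key everything on `custom_id`.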
How do I pick which model to route to?
Match model capability to step difficulty: Haiku for cheap classification and extraction, Sonnet for most agent work, Opus for genuinely hard reasoning or long-horizon planning. Routing by difficulty is usually the single biggest cost win.
Agentic voice that scales without the runaway bill
The caching, routing, and batching discipline that keeps a Claude pipeline cheap is exactly what makes high-volume voice automation viable. CallSphere brings these agentic-AI patterns to voice and chat — fast, tool-using assistants that answer every call and message affordably at scale. Try it at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.