Cutting Claude agent costs: caching, batching, fast cheap runs
Slash Claude agent token cost and latency with prompt caching, model routing, batching, and lean context — without losing quality.
The first invoice from a production Claude agent is a rite of passage. An agent that felt nearly free in development suddenly costs real money at scale, because every step re-reads a long system prompt, re-loads tool definitions, and drags an ever-growing conversation history through the model. Multi-agent setups make it worse — an orchestrator with several subagents can use several times the tokens of a single agent doing the same job. The good news: most agent spend is waste, and waste is fixable. This post is about making Claude runs cheap and fast without sacrificing the quality that made you build the agent in the first place.
Key takeaways
- Prompt caching is the single biggest lever — cache your stable system prompt and tool definitions so you stop paying full price to re-read them every turn.
- Route by difficulty: use Haiku for routing and simple steps, Sonnet for the bulk, and Opus only where reasoning depth pays for itself.
- Batch independent, non-interactive work to cut per-request overhead and unlock lower throughput-oriented pricing.
- Context is a cost multiplier — trim history aggressively and summarize long-running sessions instead of carrying every message.
- Measure tokens per task, not per request; a cheaper-looking step that triggers ten extra turns is not cheaper.
Where the tokens actually go
Before optimizing, instrument. Most teams are shocked to learn that the bulk of their spend is not the user's question or the model's answer — it is the fixed overhead repeated on every single turn. A typical agent turn re-sends the full system prompt, the entire tool schema, and the complete prior conversation, then generates a short tool call. If your system prompt and tool definitions total a few thousand tokens and the agent takes fifteen turns, you have paid for that overhead fifteen times.
So the mental model is simple: agent cost is dominated by repeated input tokens, not output tokens. Once you see that, the optimization strategy almost writes itself — stop paying full price for the parts that don't change, and shrink the parts that grow.
It helps to attach a real dollar figure to a single representative task before and after each change, because intuition is a poor guide here. A step that looks cheap in isolation can be the one that triggers three extra tool round-trips, while a step that looks expensive can be the one that lets the agent finish in half the turns. Track the full task end to end, log the token breakdown per turn, and let the numbers — not a hunch — decide which lever to pull next. Without that instrumentation you will optimize the wrong thing and feel busy while the bill stays flat.
Lever 1: prompt caching
Prompt caching is the highest-return optimization available to a Claude agent. You mark the stable prefix of your request — system prompt, tool definitions, long reference material — as cacheable, and on subsequent calls Claude reads it from cache at a steep discount instead of reprocessing it from scratch. For an agent that loops many times over the same fixed instructions, this turns the dominant cost line into a rounding error.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Incoming agent turn"] --> B{"Stable prefix cached?"}
B -->|Yes| C["Read prefix from cache (cheap)"]
B -->|No| D["Process prefix fully (full price)"]
D --> E["Write prefix to cache"]
C --> F["Process only new tokens"]
E --> F
F --> G["Model responds"]
G --> H{"More turns?"}
H -->|Yes| A
H -->|No| I["Done"]The key discipline is ordering: put everything stable at the front and everything volatile at the back. Cache breaks at the first byte that changes, so if you interleave dynamic content into your system prompt, you lose the benefit. Here is the shape of a cached request using the Anthropic SDK:
messages.create(
model="claude-sonnet-4-6",
system=[{
"type": "text",
"text": LONG_STABLE_INSTRUCTIONS,
"cache_control": {"type": "ephemeral"}
}],
tools=TOOL_DEFS, # also stable — place before dynamic content
messages=conversation
)The cache_control marker tells Claude where the reusable prefix ends. For a long-running agent, this one annotation routinely removes the majority of input-token cost.
Lever 2: route to the right model
Not every step needs your most capable model. A well-designed agent treats model choice as a per-step decision. Use Haiku 4.5 for cheap, high-volume work — classifying intent, deciding which subagent to invoke, extracting a field. Use Sonnet 4.6 as the workhorse for most tool-using steps. Reserve Opus 4.8 for the genuinely hard reasoning where its depth changes the outcome. A common pattern is a Haiku router in front of a Sonnet executor, escalating to Opus only on flagged-hard tasks.
The mistake is using a single top-tier model for everything "to be safe." That is like sending a senior architect to every standup. Match the model to the cognitive load of the step and you often cut cost dramatically with no quality loss on the easy majority.
Lever 3: batch the non-interactive work
A lot of agentic work is not interactive at all — enriching ten thousand records, classifying a backlog, generating summaries overnight. For anything that does not need a real-time answer, batch it. Submitting work as a batch reduces per-request overhead and is priced for throughput rather than latency, which can roughly halve the cost of large offline jobs. The rule of thumb: if a human is not waiting on the result this second, it should probably be a batch.
Lever 4: keep context lean
The other cost that grows is conversation history. Every turn carries the full prior dialogue, so a long session pays for its own past over and over. Two tactics help. First, summarize: when a session crosses a length threshold, compress the older turns into a short running summary and drop the raw messages. Second, scope tightly — give each subagent only the slice of context it needs rather than the entire shared history. A subagent that summarizes a document does not need the orchestrator's full plan.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
A complementary tactic is to lower the cost of the history you do keep by being deliberate about what enters it in the first place. Verbose tool results are a silent killer: a tool that dumps a 5,000-token raw API response into the conversation makes every subsequent turn carry that payload. Have your tools return only the fields the agent actually needs, paginate large results, and push bulky artifacts to a store the agent can reference by ID rather than inlining them. The cheapest token is the one you never put in the context window.
Common pitfalls
- Caching nothing. Leaving prompt caching off is the most expensive mistake; it is also the easiest to fix.
- Volatile content in the cached prefix. A timestamp or session ID in your system prompt breaks the cache on every call. Keep the prefix byte-stable.
- One big model for everything. Routing simple steps to Haiku or Sonnet saves money the easy majority of the time.
- Optimizing per request, not per task. A cheaper step that causes extra turns can raise total cost. Measure end-to-end tokens per completed task.
- Unbounded multi-agent fan-out. Spawning subagents you don't need multiplies token use. Use multi-agent deliberately, not by default.
Cut your agent bill in 5 steps
- Instrument tokens per completed task, splitting input vs. output and overhead vs. payload.
- Move all stable instructions and tool definitions to the front and add prompt caching.
- Introduce model routing: Haiku for routing/simple, Sonnet for the bulk, Opus only where it pays.
- Move every non-interactive job to batch processing.
- Add context summarization for long sessions and scope subagent context tightly.
| Lever | Best for | Typical effect |
|---|---|---|
| Prompt caching | Looping agents with fixed prompts | Largest input-cost reduction |
| Model routing | Mixed-difficulty steps | Lower cost on the easy majority |
| Batching | Offline, non-interactive jobs | Throughput pricing, less overhead |
| Context trimming | Long-running sessions | Stops paying for old history |
Frequently asked questions
What is prompt caching and why does it matter for agents?
Prompt caching lets Claude reuse a previously processed, stable prefix of your request — system prompt, tool definitions, reference text — at a steep discount instead of reprocessing it every turn. For agents that loop over fixed instructions, it removes the dominant repeated-input cost and is usually the single biggest savings available.
When should I use a batch instead of a normal request?
Use batching whenever no human is waiting on the result in real time — bulk classification, enrichment, overnight summarization. Batches reduce per-request overhead and are priced for throughput, so large offline jobs cost meaningfully less than the same work run as individual interactive calls.
Does using a cheaper model always save money?
Not necessarily. A smaller model can fail a hard step and trigger extra retries or turns, raising total token use. Measure cost per completed task, not per request, and route by difficulty so the cheap model only handles work it can do reliably.
How do multi-agent systems affect cost?
A multi-agent system typically uses several times more tokens than a single agent doing the same job, because each subagent carries its own context and there is coordination overhead. Use multi-agent patterns deliberately for genuinely parallel or specialized work, and scope each subagent's context tightly.
Fast, affordable agents on the phone
Keeping runs cheap and quick is what makes always-on voice automation viable. CallSphere applies the same caching, routing, and lean-context discipline to voice and chat agents that answer every call and message at scale without a runaway bill. See it working at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.