Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Claude Agent Cost: Caching, Batching, Fast Runs

Token-cost engineering for Claude agents: prompt caching, batching, model routing, and context discipline to keep runs cheap and fast.

An agent that works is only half the battle. The other half shows up on the invoice. A single Claude Cowork plugin that re-reads a 30,000-token knowledge base on every turn, runs five subagents in parallel, and never reuses a cached prefix can cost more per task than the human work it replaced. Multiply that across an enterprise rollout and the economics quietly invert. Performance and token cost are not an afterthought you optimize later — they are a design constraint you build around from the first prototype.

The good news is that agent cost is highly compressible. Most expensive agents are expensive for boring, fixable reasons: they resend the same context every turn, they call the largest model for trivial steps, and they run sequentially when they could batch. This post walks through the levers that actually move the bill, roughly in order of impact.

Where the tokens actually go

Before optimizing, measure. Instrument every run to record input and output tokens per turn, and you will almost always find the cost is dominated by input tokens, not output. An agent that takes twelve turns to finish a task pays for its entire growing context twelve times. The system prompt, the tool definitions, the conversation history, and every tool result accumulate, and each new turn ships the whole pile back to the model.

This is why agent cost scales super-linearly with task length. A two-turn task is cheap; a twenty-turn task is not twenty times more expensive but closer to a hundred, because each turn carries a bigger payload. The single most valuable number to track is average context size per turn over a run. If it is ballooning, your cost problem is a context-management problem, and the fixes below target exactly that.

Prompt caching: the highest-leverage lever

Prompt caching is the first thing to reach for. Claude lets you cache stable prefixes — your system prompt, tool definitions, and large reference documents — so that on subsequent turns the model reads them from cache at a steep discount instead of reprocessing every token. For an agent whose first 25,000 tokens never change across a run, this alone can cut input cost dramatically.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["New agent turn"] --> B{"Stable prefix cached?"}
  B -->|Yes| C["Read prefix from cache (cheap)"]
  B -->|No| D["Process full prefix & write cache"]
  C --> E["Process only new turn tokens"]
  D --> E
  E --> F{"Context near limit?"}
  F -->|No| G["Send to Claude"]
  F -->|Yes| H["Compact old turns, then send"]
  H --> G

The structural rule is to order your context from most stable to least stable. Put unchanging material — system instructions, tool schemas, fixed knowledge — at the very top so the cached prefix stays valid as long as possible. Volatile content like the latest tool result goes last. If you interleave stable and volatile content, you fragment the cacheable region and lose most of the benefit. Caching is free money, but only if your context is laid out to earn it.

Model routing: stop using Opus for everything

The second lever is matching the model to the step. Claude's family spans Opus for the hardest reasoning, Sonnet for the balanced middle, and Haiku for fast, cheap, high-volume work. Teams default everything to the most capable model out of caution, then wonder why runs are slow and pricey.

A better pattern is tiered routing. Use a smaller, faster model for classification, extraction, routing, and simple tool-call decisions, and reserve the most capable model for genuine multi-step reasoning or final synthesis. In a multi-agent setup, the orchestrator might run on a strong model while narrow subagents run on a cheaper one. The savings compound: the cheap-model steps are usually the most frequent ones. Route by difficulty, not by habit, and re-check the routing as the cheaper models keep getting better at tasks that used to require the flagship.

Batching and parallelism done right

Batching attacks latency and throughput. When an agent needs to process many independent items — score fifty leads, summarize thirty documents — running them one message at a time is the slowest and not the cheapest path. If the work is not interactive, the Message Batches API processes large volumes asynchronously at a meaningful discount, ideal for overnight enrichment jobs and bulk classification a Cowork plugin might queue.

For interactive work, parallelism is the lever, but a careful one. Claude Code can run subagents concurrently, which collapses wall-clock time when tasks are truly independent. The catch is that multi-agent runs typically consume several times more tokens than a single agent, because each subagent carries its own context. Parallelize when the speedup justifies the token multiplier and the subtasks do not share state; keep it a single agent when the work is sequential or the coordination overhead would erase the gain.

Context discipline keeps long runs cheap

For long-running agents, the durable win is context discipline. Do not let history grow unbounded. Summarize and compact: once a sub-goal is complete, replace its verbose tool exchanges with a short note capturing the outcome. Strip large tool outputs down to the fields the agent actually needs — returning a 5,000-token API response when the agent needs three values is pure waste that every later turn re-pays for.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Treat the context window as a budget you actively manage, not a bucket you fill until it overflows. Trim tool results at the source, summarize completed phases, and keep only what future turns will reference. Token-cost optimization for Claude agents is the practice of caching stable context, routing each step to the cheapest capable model, batching independent work, and trimming the context window so long runs stay both fast and affordable.

Frequently asked questions

What is the single biggest way to cut Claude agent costs?

Prompt caching. Most agent cost is input tokens, and a large stable prefix — system prompt, tool definitions, reference docs — gets reprocessed on every turn unless cached. Caching that prefix and ordering your context most-stable-first can cut input cost dramatically with almost no behavior change.

Does running subagents in parallel save money?

It saves time, not money. Parallel subagents collapse wall-clock latency but typically use several times more tokens than a single agent because each carries its own context. Use parallelism when speed matters and subtasks are independent; stay single-agent when work is sequential.

When should I use a smaller Claude model?

For classification, extraction, routing, and simple tool decisions — high-frequency steps that do not need deep reasoning. Reserve the most capable model for genuine multi-step reasoning and final synthesis. Tiered routing saves the most because the cheap-model steps are usually the most common.

How do I keep a long-running agent from getting expensive?

Manage context actively. Trim large tool outputs to the fields you need, summarize and compact completed phases into short notes, and cap history growth. Cost scales super-linearly with run length because each turn re-pays for the whole accumulated context.

Lower-cost agentic AI on your phone lines

CallSphere applies this same cost-and-speed engineering — caching, smart model routing, lean context — to voice and chat agents that handle every call and message in real time without runaway bills. See it working at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.