Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Claude Agent Token Cost: Caching & Batching

Make Claude agents cheap and fast with prompt caching, batching, context pruning, and smart model routing across Opus, Sonnet, and Haiku.

The first time you put a Claude agent in front of real traffic, the bill teaches you something the demo never did: agents are token-hungry. A single multi-turn run can replay the entire system prompt, every tool definition, and the growing transcript on each turn — and a multi-agent run multiplies that several times over. Left unoptimized, a useful agent can cost more per task than the human it was meant to assist. The good news is that most of that cost is waste, and the techniques to remove it are well understood. This post is a practical guide to keeping Claude agent runs cheap and fast without dumbing them down.

The mental model to hold throughout: every token in the context window is paid for on every turn it survives. Performance and cost are two views of the same thing — fewer tokens processed means lower spend and lower latency at once. So the work divides into three questions: how do we avoid re-paying for tokens that don't change, how do we keep the context window from bloating, and how do we route each unit of work to the cheapest model that can do it.

Prompt caching: stop paying for the same prefix

The largest single lever is prompt caching. In an agent loop, the system prompt, the tool definitions, and any long reference material are identical on every turn — yet a naive implementation re-sends and re-processes them each time. Prompt caching lets you mark a stable prefix so the model reuses the already-processed version on subsequent calls, charging a small fraction of the normal input rate for the cached portion. For an agent that runs ten or twenty turns over a large fixed prompt, this routinely cuts input cost by a wide margin.

The practical rule is to order your context from most stable to most volatile: put the system prompt and tool schemas first (cache them), then long-lived task context, then the turn-by-turn transcript that changes constantly. Cache the boundary as far down the stable region as you can. The same idea applies to skills and large documents — if a reference file is consulted across many turns, caching its tokens turns a recurring cost into a near-free lookup.

flowchart TD
  A["Incoming task"] --> B{"Prefix in cache?"}
  B -->|Yes| C["Reuse cached prompt & tools"]
  B -->|No| D["Process full prefix, write cache"]
  C --> E{"Simple or hard task?"}
  D --> E
  E -->|Simple| F["Route to Haiku"]
  E -->|Hard| G["Route to Sonnet / Opus"]
  F --> H["Run loop, prune stale context"]
  G --> H
  H --> I["Return result + log token cost"]

Batching independent work

When you have many similar, independent tasks — classify a thousand support tickets, summarize five hundred documents — running them one synchronous request at a time wastes both money and wall-clock time. Batching submits the whole set as a group for asynchronous processing, which is offered at a meaningful discount over real-time calls and removes per-request overhead. The trade is latency: batch results come back over a window rather than instantly, so batch the work that can tolerate it and reserve synchronous calls for interactive paths.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Inside an agentic system, batching also applies to parallel subagents. When an orchestrator fans out independent subtasks to several subagents, run them concurrently rather than sequentially. You'll still pay for the tokens, but you collapse the latency and you can route each subagent to an appropriately sized model. The discipline is to only fan out when subtasks are genuinely independent — parallel agents that need each other's output serialize anyway and just add coordination cost.

Keeping the context window lean

Context bloat is the silent cost multiplier. Every tool result the agent accumulates stays in the window and gets re-billed each turn unless you prune it. A long run that dumps full API responses into context can balloon to tens of thousands of tokens, slowing every subsequent turn. Three habits keep it lean. First, have tools return only what the agent needs — a summarized or field-filtered result instead of a raw payload. Second, compact the transcript: once a sub-goal is done, replace its verbose tool exchanges with a short summary of the outcome. Third, externalize memory — write large intermediate results to a file or store and keep only a reference in context, letting the agent re-read on demand.

Claude Code's large context window is a convenience, not a license to fill it. Treat context as a scarce, metered resource. A well-run agent keeps the window focused on what's relevant to the current step, which improves both cost and answer quality, since a smaller, cleaner context yields sharper reasoning.

Routing work to the right model

Not every step needs your most capable model. The Claude family spans Opus for the hardest reasoning, Sonnet for the balanced default, and Haiku for fast, cheap, high-volume work. A cost-aware agent routes: use Haiku for classification, extraction, routing decisions, and simple tool-call formatting; reserve Sonnet or Opus for genuine planning, ambiguous reasoning, and synthesis. In a multi-agent setup the orchestrator can run on a stronger model while many narrow subagents run on Haiku.

Routing pays off most when you measure it. Tag each model call with its purpose and token count, then look at where the spend actually goes. Teams are often surprised that a cheap, high-frequency step dominates the bill while the expensive Opus calls are rare. Once you can see the distribution, you can move the high-frequency steps down a model tier and reclaim most of the cost with no quality loss.

Measuring cost like a first-class metric

You cannot optimize what you don't measure. Instrument every run to emit tokens in, tokens out, cached tokens, model used, and turn count, then aggregate cost per task type. Set a budget per task and alert when a run exceeds it — a single runaway loop can cost more than a thousand normal runs, so a cost cap doubles as a reliability guard. Track cost-per-successful-task, not just raw spend, so that a cheaper configuration that fails more often doesn't look like a win.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Put together, these techniques compound: caching removes the fixed-prefix tax, batching discounts the bulk work, context pruning shrinks every turn, and model routing right-sizes each call. Most teams find they can cut agent cost substantially while making runs faster, because the same tokens drive both bills.

Frequently asked questions

How much can prompt caching save on a Claude agent?

Savings scale with how much of your context is stable and how many turns reuse it. An agent with a large fixed system prompt and tool set running many turns can cut input cost dramatically, because cached tokens are billed at a small fraction of the normal input rate. Order context stable-first to maximize the cached prefix.

When should I use the Batch API instead of real-time calls?

Use batching for large volumes of independent tasks that can tolerate results arriving over a window rather than instantly — bulk classification, summarization, or enrichment. It's offered at a discount and removes per-request overhead. Keep interactive, latency-sensitive paths on synchronous calls.

Why do multi-agent runs cost so much more?

Each subagent carries its own context and turns, so a multi-agent run typically uses several times the tokens of a single agent. Use multi-agent patterns deliberately for genuinely parallel or specialized work, route narrow subagents to Haiku, and only fan out when subtasks are truly independent.

What's the simplest first optimization to make?

Turn on prompt caching for your stable prefix and start logging tokens per run. Those two steps reveal where the cost actually lives and capture the biggest, easiest saving before you touch anything else.

Bringing agentic AI to your phone lines

Cheap, fast agent runs matter even more in real time — a voice caller won't wait while you re-process the same prompt every turn. CallSphere applies caching, model routing, and lean context to voice and chat agents that answer every call and message and book work 24/7. See it live at callsphere.ai.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.