Skip to content
Agentic AI
Agentic AI9 min read0 views

Cutting Claude Agent Token Costs: Caching & Batching (Cowork Enterprise Ready)

Make Claude agents cheap and fast with prompt caching, the Message Batches API, model routing across Opus/Sonnet/Haiku, and tight context discipline.

An agent that works but costs four dollars per run is not a product — it's a demo with a billing problem. The gap between a prototype that impresses in a meeting and a system you can run a million times a month is almost entirely about token economics: how much context you resend, how often you pay full price for the same prefix, how many round trips you make, and whether you reach for the most expensive model when a cheaper one would do. The good news is that the levers are concrete, measurable, and stack on top of each other.

This post walks through the techniques that move the needle most for Claude agents built on Claude Code, the Agent SDK, and the Claude API: prompt caching, request batching, model routing across the Opus/Sonnet/Haiku family, and ruthless context discipline. Each one is something you can ship this week and measure on your next bill.

Key takeaways

  • Prompt caching turns a long, stable prefix (system prompt, tools, docs) into a cheap cache read on every subsequent turn — often the single biggest saving in an agent loop.
  • Batching independent requests removes per-call overhead and, for non-urgent work, the Message Batches API trades latency for a large discount.
  • Route by difficulty: Haiku 4.5 for triage and extraction, Sonnet 4.6 for most work, Opus 4.8 only for the genuinely hard reasoning steps.
  • Context discipline — trimming history, summarizing, and not stuffing every document into every turn — compounds with everything else.
  • Multi-agent runs can use several times the tokens of a single agent, so reserve fan-out for tasks that truly parallelize.

Where the tokens actually go in an agent loop

Before optimizing, look at the shape of the spend. In a typical tool-using loop, the model resends the entire conversation on every turn: the system prompt, all the tool definitions, every prior tool call and result, and the running message history. By turn eight, you might be paying to process the same large system prompt eight times. The cost isn't dominated by the model's short replies — it's dominated by the ever-growing input that rides along on each step.

That single observation explains why prompt caching is the highest-leverage change for most agents. If the front of your prompt — system instructions, tool schemas, reference material — is identical across turns, you should be paying full input price for it once and a steep discount thereafter.

Prompt caching: pay once for the stable prefix

Prompt caching lets you mark a stable prefix so that repeated requests hit a cached version instead of reprocessing every token. Writing to the cache costs slightly more than a normal input token; reading from it costs a fraction of one. In an agent that loops a dozen times over the same system prompt and tool set, that asymmetry is decisive.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
{
  "model": "claude-sonnet-4-6",
  "system": [
    {
      "type": "text",
      "text": "<long stable instructions + tool usage guide>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "tools": [ ... ],
  "messages": [ ... ]
}

The cache_control marker tells the API to cache everything up to that point. Order your prompt from most stable to most variable — system instructions and tool definitions first, then retrieved documents, then the live conversation — so the cacheable prefix is as long as possible. The diagram below shows the decision an agent runtime should make on every turn.

flowchart TD
  A["New agent turn"] --> B{"Stable prefix unchanged?"}
  B -->|Yes| C["Cache read: pay fraction of input"]
  B -->|No| D["Cache write: pay slight premium once"]
  C --> E{"Request urgent?"}
  D --> E
  E -->|Yes| F["Synchronous call"]
  E -->|No| G["Queue to Message Batches API"]
  F --> H["Pick model by difficulty"]
  G --> H
  H --> I["Return result & log token usage"]

Note the cache has a limited lifetime, so it helps most when turns come in reasonably close together — exactly the pattern inside a single agent run. Structure your application so a session reuses the same cached prefix rather than rebuilding it from scratch each turn.

Batching: amortize overhead and trade latency for discount

There are two distinct things people mean by "batching," and both save money. The first is application-level: when an agent needs to classify forty documents or score fifty leads, don't make forty sequential calls with all the round-trip and prefix overhead — group the work so the stable context is shared. The second is the Message Batches API, which is built for high-volume, latency-tolerant workloads. You submit a batch of requests, they're processed asynchronously within a window, and you pay a substantial discount versus synchronous calls.

The rule of thumb: if a human is waiting on the answer, call synchronously and lean on caching. If the work is offline — nightly enrichment, bulk summarization, eval runs over a dataset — push it through the batch API and take the discount. Many teams run their eval suites and content-generation jobs this way and cut those line items dramatically.

Route by difficulty across the model family

Using Opus 4.8 for every step is like sending a principal engineer to reset passwords. The 2026 Claude family is designed for routing: Haiku 4.5 is fast and inexpensive and handles classification, extraction, routing decisions, and simple tool selection well; Sonnet 4.6 is the balanced default for most agent reasoning and coding; Opus 4.8 is the most capable and belongs on the genuinely hard steps — deep multi-step reasoning, tricky planning, ambiguous judgment.

A clean pattern is a cheap triage step that classifies the incoming task and dispatches to the right model. The triage call itself runs on Haiku, so it's nearly free relative to the work it routes.

def route(task):
    tier = classify_with_haiku(task)   # cheap, fast
    return {
        "trivial": "claude-haiku-4-5",
        "standard": "claude-sonnet-4-6",
        "hard":     "claude-opus-4-8",
    }[tier]

This small function can change your blended cost per task by a large factor, because in most real workloads the long tail of trivial and standard tasks vastly outnumbers the genuinely hard ones.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Context discipline: the saving that compounds

Even with caching, the variable part of your prompt grows every turn, and you pay for all of it. Keep the live context lean. Summarize older turns once they're no longer load-bearing. Don't inject an entire knowledge base when retrieval of three relevant chunks would do. Strip large raw tool outputs down to the fields the model actually needs before feeding them back. With Claude's large context windows it's tempting to dump everything in; resist it, because a 200K-token prompt processed on every one of ten turns is two million input tokens for one task.

Common pitfalls

  • Putting variable content before stable content. If anything near the front of your prompt changes per turn, your cache prefix is short and the savings evaporate. Order stable-to-variable.
  • Caching across long idle gaps. The cache expires; relying on it for requests minutes apart means paying the write premium repeatedly. Keep turns within a session close, or accept the miss.
  • Defaulting to Opus everywhere. It's the most expensive model. Route trivial and standard work to Haiku and Sonnet and reserve Opus for steps that actually need it.
  • Using multi-agent fan-out for everything. Parallel subagents can multiply token use several times over. Use them when the task genuinely decomposes, not as a default.
  • Optimizing without measurement. Log input, output, cache-read, and cache-write tokens per call. Without that breakdown you're guessing at where the money goes.

Trim your agent's bill in 5 steps

  1. Instrument every call to record model, input/output tokens, and cache hits, then find your most expensive task type.
  2. Move all stable content — system prompt, tool schemas, reference docs — to the front and mark the prefix with cache_control.
  3. Add a Haiku triage step that routes each task to the cheapest model that can do it well.
  4. Push offline, latency-tolerant work to the Message Batches API for the bulk discount.
  5. Summarize and prune conversation history each turn so the variable context stays small.

Cost levers at a glance

LeverBest forTypical effect
Prompt cachingLong stable prefix reused across turnsLarge input-cost cut inside a loop
Message Batches APIOffline, latency-tolerant bulk jobsSubstantial per-request discount
Model routingMixed difficulty workloadsLower blended cost per task
Context pruningLong multi-turn sessionsCompounding savings on every turn

Frequently asked questions

What is prompt caching?

Prompt caching is a feature that stores a stable prefix of your request so that repeated calls reuse the cached tokens at a steep discount instead of reprocessing them at full input price. It is most valuable in agent loops, where the same system prompt and tool definitions are sent on every turn.

When should I use the Message Batches API instead of synchronous calls?

Use the batch API for high-volume, non-urgent work where no human is waiting on an individual response — bulk classification, dataset enrichment, eval runs, content generation. You trade immediate latency for a significant cost reduction. For interactive, human-in-the-loop requests, call synchronously and rely on caching for savings.

Does using cheaper models hurt quality?

Only if you route the wrong tasks to them. Haiku and Sonnet handle the large majority of real agent steps — extraction, classification, routine reasoning, most coding — at high quality. The trick is a cheap triage step that sends only the genuinely hard reasoning to Opus, so you pay top price only where it earns its keep.

Why do multi-agent systems cost so much more?

Each subagent carries its own context, system prompt, and tool definitions, and the orchestrator coordinates and merges their outputs, so a fan-out of several subagents can use several times the tokens of a single agent doing the work sequentially. Use multi-agent patterns when the task truly parallelizes and the speed or quality gain is worth the multiplier.

Fast, affordable agents on the phone

CallSphere brings this same cost discipline — caching, routing, and lean context — to voice and chat, so always-on agents can handle every call and message economically while still using tools and booking work in real time. See it running at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.