Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Claude Opus token cost in Claude Code: cache & batch

Keep Claude Opus agents cheap and fast — prompt caching, the Batches API, effort tuning, and the silent invalidators that quietly break your cache.

Opus 4.8 is the most capable model in the Claude family, and at $5 per million input tokens and $25 per million output it is also the one whose bills get attention. The good news is that a well-built agent spends most of its tokens at roughly a tenth of list price, because the bulk of an agentic run is the same context read over and over. The difference between a cheap agent and an expensive one is rarely the model — it's whether your prompt-building code lets the cache do its job, whether you batch what doesn't need to be live, and whether you've tuned how hard the model thinks. This post is about keeping runs both cheap and fast without dropping to a weaker model.

The one invariant: caching is a prefix match

Prompt caching is the single biggest lever, and it rests on one rule: the cache is a prefix match, and any byte change anywhere in the prefix invalidates everything after it. The API renders your request in a fixed order — tools, then system prompt, then messages — and derives the cache key from the exact bytes up to each cache_control breakpoint. Get the ordering right and caching mostly works for free; get it wrong and no number of cache markers will save you.

What this means in practice: stable content goes first, volatile content goes last. A large system prompt and a deterministic tool list belong at the front, with a breakpoint on the last system block so tools and system cache together. The per-turn question, the timestamp, the request ID — anything that changes every call — goes after the last breakpoint. Cache reads cost about 0.1x base input price and writes cost about 1.25x for the default five-minute TTL, so on any agent that reuses context across turns, the economics are decisively in your favor after the second request.

flowchart TD
  A["Build request"] --> B["Tools (sorted, frozen)"]
  B --> C["System prompt (no timestamps)"]
  C --> D["cache_control breakpoint"]
  D --> E["Conversation history"]
  E --> F["Volatile suffix: this turn's question"]
  F --> G{"cache_read_input_tokens > 0?"}
  G -->|Yes| H["Paying ~0.1x — cache working"]
  G -->|No| I["Silent invalidator — audit the prefix"]

Hunting silent invalidators

The most expensive bug in agent code is a cache that never hits and never errors. You only notice it on the invoice. The way to catch it is to read usage.cache_read_input_tokens on your responses — if it's zero across repeated requests with what you believe is an identical prefix, something is quietly changing the bytes. The usual suspects: a datetime.now() or a UUID interpolated into the system prompt, a json.dumps() of a tool schema without sort_keys=True producing non-deterministic key order, a tool set that's rebuilt per user, or a conditional system section that differs by flag. Each one shifts the prefix and torches every cache entry downstream.

Two architectural habits prevent most of them. Freeze the system prompt — never interpolate "current date" or "user name" into it; inject that context later in the messages instead, where it invalidates nothing before its position. And serialize tools deterministically, sorting by name, so the same logical tool set always produces the same bytes. When you need to change context mid-conversation, append a role: "system" message to the messages array rather than editing the top-level system prompt — it preserves the cached prefix instead of rebuilding it.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Batch everything that isn't interactive

Not every Claude call needs to happen the instant you ask. Evals, bulk classification, document processing, offline analysis — none of these are latency-sensitive, and all of them run at half price through the Message Batches API. You submit up to 100,000 requests at once, most batches finish within an hour, and every token — input, output, cached — is billed at 50% of standard. For an agent platform, the pattern that pays off is splitting your workload: keep the live, user-facing turns on the standard endpoint with caching, and push the offline work — nightly evals, backfills, summarization of completed sessions — through batches.

Batching composes with caching, too. If a thousand batch requests share a large preamble — the same instructions, the same retrieved document — put the breakpoint at the end of that shared portion and let every request read it. You pay the 50% batch discount and the cache read discount on the shared bytes simultaneously.

Tune effort, don't downgrade the model

The instinct when costs rise is to swap Opus for a cheaper model. Before you do, reach for the effort parameter, which controls how deeply the model thinks and how many tokens it spends per response. On Opus 4.8 you have low, medium, high, xhigh, and max; lower effort means fewer and more consolidated tool calls, less preamble, and terser confirmations. The relationship between effort and total cost isn't even monotonic on agentic work — higher effort up front often reduces the number of turns, so a run at high can finish cheaper than the same task fumbled across many turns at low.

The discipline is to sweep medium, high, and xhigh on your own eval set and pick per route. Reserve max for the genuinely hard, latency-insensitive cases. For long agentic loops, pair effort with a task budget so the model self-moderates its total spend across the whole run. This keeps you on the most capable model while still controlling the bill — which is almost always better than trading away intelligence you'll wish you had.

Token accounting that actually adds up

One last trap: input_tokens in the usage object is the uncached remainder only. The true prompt size is input_tokens + cache_creation_input_tokens + cache_read_input_tokens. Teams sometimes look at a small input_tokens after a multi-hour run and assume the agent barely used the model, when in fact almost everything was served from cache. Sum the three fields to understand real usage, and when you need an exact count before a request — to estimate cost or check a context limit — use the count_tokens endpoint with the same model ID, never a third-party tokenizer, which will mis-estimate Claude tokens badly.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Frequently asked questions

How do I know if my prompt cache is actually working?

Read usage.cache_read_input_tokens on repeated requests with the same prefix. A consistent non-zero value means the cache is hitting. A persistent zero means a silent invalidator — a timestamp, a UUID, or non-deterministic JSON serialization — is changing your prefix bytes every call.

When should I use the Batches API instead of live calls?

Any time latency doesn't matter: evals, bulk classification, document processing, nightly backfills. You get a flat 50% discount on all token usage, and it stacks with prompt caching on shared preambles. Keep only the interactive, user-facing turns on the standard endpoint.

Is dropping from Opus to Sonnet the right way to cut cost?

Try effort tuning first. Lowering effort from high to medium, or adding a task budget, often recovers the savings without giving up Opus-level capability. Higher effort can even cost less on agentic work by reducing turn count. Downgrade the model only when an eval shows the cheaper model meets your quality bar.

Why does my agent's input_tokens look so low?

Because input_tokens counts only the uncached portion. The cached prefix is reported separately in cache_read_input_tokens. Add all three usage fields together to see the real prompt size and your true cost picture.

Bringing agentic AI to your phone lines

CallSphere runs these same cost disciplines — aggressive prompt caching, batched offline work, tuned effort — under voice and chat agents that handle every call and message, call tools mid-conversation, and book work 24/7 at production economics. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.