Skip to content
Agentic AI
Agentic AI7 min read0 views

Claude Agent Token Costs: Caching, Batching, Speed

Cut Claude agent costs with prompt caching, the Batches API, effort tuning, and model routing — keep runs cheap and fast without losing quality.

An agent that works but costs $4 per run is not a product — it's a demo with a billing problem. The gap between a Claude agent that ships and one that quietly bankrupts a feature is almost entirely about how you manage tokens. And token economics for agents are different from single calls: a multi-step run re-sends its entire growing history on every turn, so a ten-turn agent can pay for the same context ten times if you're not careful.

This post is about the four levers that actually move the needle on cost and latency — prompt caching, batching, effort tuning, and model routing — and the order in which to reach for them. None of them require sacrificing the output quality your users notice; they trade away spend you were wasting.

Where the tokens actually go

Before optimizing, measure. The usage object on every Claude response breaks tokens into four buckets: input_tokens (uncached, full price), cache_creation_input_tokens (written to cache, ~1.25× price), cache_read_input_tokens (served from cache, ~0.1× price), and output_tokens. The trap is reading input_tokens in isolation — if it shows 4K but your agent processed a 200K-token transcript, the rest was cached. Total prompt size is the sum of all three input fields.

For an agent, the dominant cost is usually re-processing the conversation prefix on every turn. A run with a 30K-token system prompt and tool set, over twelve turns, re-sends that 30K twelve times — 360K tokens of pure repetition. Caching that prefix is the single highest-leverage change you can make, and it's why we start there.

Lever one: prompt caching

Prompt caching is a prefix match: the API caches the rendered prompt up to a cache_control breakpoint, and any byte change before that point invalidates everything after it. The render order is toolssystemmessages, so a breakpoint on your last system block caches both your tool definitions and your system prompt together.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Agent request"] --> B{"Prefix byte-identical\nto a cached entry?"}
  B -->|Yes| C["cache_read ~0.1x price"]
  B -->|No| D{"Why the miss?"}
  D -->|datetime/UUID in system| E["Move volatile data\nafter breakpoint"]
  D -->|tool set changed| F["Freeze & sort tools"]
  D -->|first request| G["cache_creation ~1.25x\n(pay once)"]
  E --> B
  F --> B

The fix list is short and mechanical. Freeze the system prompt — never interpolate datetime.now(), a session UUID, or a per-request flag into it, because that changes the prefix every call and your cache hit rate drops to zero. Serialize tool definitions deterministically (sort by name) so they're byte-identical run to run. Put volatile content — the user's varying question, timestamps — after the last breakpoint. Then verify: if cache_read_input_tokens is zero across repeated requests with the same prefix, a silent invalidator is at work, and you find it by diffing the rendered prompt bytes between two requests.

For long-running agents, place the breakpoint on the last content block of the most recently appended turn, so each request reuses the entire prior conversation as a cached prefix. One caveat specific to agentic loops: each breakpoint only walks back 20 content blocks to find a prior cache entry, and a single turn with many tool_use/tool_result pairs can blow past that — drop an intermediate breakpoint every ~15 blocks in long turns.

Lever two: batching the work that can wait

Not every agent run is interactive. Overnight enrichment, bulk classification, evaluation sweeps, and report generation are latency-tolerant — and the Batches API processes them at 50% of standard price. You submit up to 100,000 requests per batch, most complete within an hour (24-hour ceiling), and every Messages API feature works inside a batch, including prompt caching.

Batching composes beautifully with caching for fan-out workloads: a shared system prompt with a large document, cached once, reused across thousands of batched requests that each ask a different question. You pay the cache write a single time and the 0.1× read price on every subsequent request, on top of the 50% batch discount. For any workload where the user isn't watching a spinner, this is close to free money.

Lever three: tune effort, don't just lower it

On Opus 4.8, the effort parameter (output_config: {effort: "low"|"medium"|"high"|"xhigh"|"max"}) controls how much the model thinks and acts. Lower effort means fewer, more consolidated tool calls, less preamble, terser confirmations. But effort is a dimension to test, not a knob to crank down blindly — on agentic work, higher effort up front often reduces total turn count and total cost by planning better, even though each turn is more expensive.

The practical approach is to sweep medium, high, and xhigh on your own eval set and pick per route. Use high as the default, xhigh for coding and complex agentic loops, and reserve max for genuinely hard, latency-insensitive tasks. Pair it with adaptive thinking (thinking: {type: "adaptive"}), which lets Claude decide per-request how much to reason. For runaway-cost protection on long loops, Task Budgets let the model see a token countdown and self-moderate — distinct from max_tokens, which is an enforced ceiling the model never sees.

Lever four: route the cheap work to a cheaper model

A multi-agent pattern is the cleanest way to spend less without dumbing down the main loop. Keep the orchestrator on Opus 4.8 for the decisions that matter, and spawn subagents on Haiku 4.5 for bounded, well-scoped tasks — file searches, simple extractions, parallel reads. Claude Code's exploration subagents work exactly this way. The catch worth knowing: switching models mid-conversation invalidates the cache (caches are model-scoped), which is precisely why you isolate the cheap model in a separate subagent call rather than swapping models inside one loop.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

A citable summary: keeping a Claude agent cheap and fast is the practice of paying for each unique token once — caching the stable prefix, batching the latency-tolerant work, tuning effort per route, and routing scoped subtasks to a smaller model — rather than re-processing the same context on every turn.

Frequently asked questions

Why is my Claude cache hit rate zero?

Something in your prefix changes every request. The usual culprits are a timestamp or UUID interpolated into the system prompt, a non-deterministically serialized tool list, or a varying tool set. Diff the rendered prompt bytes between two requests to find the invalidator, then move the volatile part after the last cache_control breakpoint.

When should I use the Batches API instead of regular calls?

Whenever the user isn't waiting on the result in real time — bulk classification, enrichment jobs, eval runs, scheduled reports. You get a 50% discount, and most batches finish within an hour.

Does lower effort always save money on agents?

No. On multi-step agentic work, higher effort often reduces total cost by cutting turn count through better planning. Sweep medium/high/xhigh on your eval set and choose per route rather than defaulting to the lowest.

How do I use a cheaper model without losing quality?

Keep the orchestrator on Opus and delegate bounded subtasks to Haiku subagents. Isolate the model swap in a separate subagent call, since switching models mid-loop invalidates the prompt cache.

Bringing agentic AI to your phone lines

CallSphere runs these same economics on live voice and chat — caching stable prompts, routing simple turns to faster models, and keeping latency low enough that callers never notice they're talking to an agent. See the cost-aware version in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.