Skip to content
Agentic AI
Agentic AI8 min read0 views

Cutting Claude Agent Token Cost: Caching, Batching, Speed

Lower Claude agent cost and latency with prompt caching, batching, context pruning, and model routing — concrete tactics and honest tradeoffs.

An agent that works is the easy part. An agent that works and doesn't quietly cost you four figures a day is the part that separates a demo from production. Agentic runs are token-hungry by nature: every turn re-sends the system prompt, the tool definitions, the conversation so far, and the growing pile of tool results. Multi-agent setups multiply that by the number of subagents. If you don't actively manage tokens and latency, an agent that handles ten tasks fine will fall over at ten thousand.

The good news is that the levers are concrete and most of them stack. This post walks through prompt caching, batching, context discipline, and model routing — and where each one actually pays off versus where it's a rounding error.

Key takeaways

  • Prompt caching is the highest-leverage win for agents because the stable prefix (system prompt + tools + skills) repeats on every turn.
  • The Message Batches API cuts cost substantially for non-interactive, parallelizable work like evals and bulk processing.
  • Context grows quadratically in cost if you let every tool result accumulate — prune and summarize aggressively.
  • Model routing — Haiku for cheap classification, Sonnet for the workhorse, Opus for the hard reasoning — often beats one-model-for-everything.
  • Measure cost per completed task, not cost per token; a cheaper model that loops is more expensive overall.

Where the tokens actually go

Start by understanding the shape of agent spend. In a typical tool-using run, the input tokens dwarf the output tokens, because each turn re-submits everything that came before. By turn fifteen, you may be paying to re-read fourteen turns of tool results the model has already digested. The cost curve bends upward as the conversation grows. That single fact tells you where to aim: shrink and stabilize the repeated input.

The two biggest structural costs are the unchanging prefix (system prompt, tool definitions, loaded skills) and the accumulating suffix (conversation history and tool outputs). Caching attacks the prefix; pruning and summarization attack the suffix. Do both.

Prompt caching: the prefix you re-read every turn

Prompt caching lets you mark a stable portion of your prompt so the model reuses a cached representation instead of reprocessing it. For agents this is enormous, because the system prompt, the full set of tool definitions, and any loaded skill content are identical on every single turn of a run. Mark them as cacheable and you pay the discounted cache-read rate for that prefix on turns two onward, while the cache stays warm.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

The rule that trips people up: caching works on a prefix. Anything before your cache breakpoint must be byte-identical between calls. So order your prompt stable-to-volatile — system instructions and tools first, then the dynamic conversation. Put the cache breakpoint after the stable block. If you interleave a changing timestamp into the system prompt, you've just invalidated the cache on every turn.

client.messages.create(
    model="claude-sonnet-4-6",
    system=[{
        "type": "text",
        "text": LONG_STABLE_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}
    }],
    tools=TOOLS,   # also stable — cache the tool block too
    messages=conversation,
)
flowchart TD
  A["Agent turn"] --> B{"Stable prefix cached?"}
  B -->|Cache hit| C["Pay cache-read rate"]
  B -->|Miss / expired| D["Pay full input + write cache"]
  C --> E["Process volatile suffix"]
  D --> E
  E --> F{"Context too long?"}
  F -->|Yes| G["Summarize old turns"]
  F -->|No| H["Emit next tool_use"]
  G --> H

Batching: trade latency for a lower bill

Not every agent task needs an answer in two seconds. Evals, nightly document processing, bulk classification, and offline enrichment can all tolerate minutes of delay. The Message Batches API exists exactly for this: you submit many requests as one batch, accept asynchronous completion within a generous window, and pay a meaningfully lower rate per request than synchronous calls.

The decision is simple. If a human is waiting on the result, go synchronous. If a queue or a cron job is waiting, batch it. The most common mistake is running an entire evaluation suite through the synchronous endpoint at full price when it could have gone through batches overnight for a fraction of the cost. Batching also smooths out rate-limit pressure, since you're not hammering the live endpoint.

Context discipline: stop paying for stale history

Every tool result you keep in the conversation gets re-billed on every subsequent turn. A 5,000-token API dump from turn three is still costing you on turn twenty even though the model only needed one field from it. The fix is to treat context as a budget you actively manage.

Three tactics. First, summarize: after a tool returns a large blob, replace it in history with a compact extract of what mattered. Second, drop: once a sub-task is complete, remove its intermediate tool results entirely. Third, fetch lazily: instead of dumping a whole document into context, give the agent a tool to retrieve the specific section it asks for. For multi-agent systems this matters even more — give each subagent only the slice of context it needs, and have it return a tight summary to the orchestrator rather than its full working transcript.

Model routing: right-size every call

Using your most capable model for everything is the equivalent of taking a freight truck to buy groceries. A lot of agent work is cheap: classifying intent, extracting a field, deciding which tool to call. Route those to Haiku. Use Sonnet as the default workhorse for most tool-using turns. Reserve Opus for genuinely hard planning and reasoning steps where the quality difference pays for itself.

A practical pattern is a cheap router: a Haiku call classifies the request, and only the hard branch escalates to a bigger model. You can also run the bulk of an agent on Sonnet and call Opus for a single critical decision. The point is that model choice is a per-call decision, not a per-project one.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Routing pairs naturally with caching and context discipline rather than competing with them. The cheap router classifies against a small, stable prompt that caches well, and the escalated reasoning step runs on a pruned context so the expensive model is not paying to wade through stale tool dumps. Stack all three and the savings compound — a run that once burned through a large model on every turn now spends most of its turns on Haiku against cached prefixes, escalating only where the difficulty genuinely warrants it.

Common pitfalls

  • Invalidating the cache by accident. A timestamp, a per-user ID, or a reordered tool list in the prefix kills your cache hit rate. Keep the prefix byte-stable.
  • Optimizing tokens while ignoring loops. A cheaper model that loops twice as often costs more, not less. Measure cost per completed task.
  • Letting context grow unbounded. Without pruning, long agent runs degrade in both cost and quality as the model wades through stale results.
  • Synchronous everything. Running evals and batch jobs through the live endpoint at full price when batches would do.
  • Multi-agent by default. Subagents multiply tokens. Use them when the parallelism genuinely helps, not because the architecture looks impressive.

Cut agent cost in 5 steps

  1. Add a cache breakpoint after your stable system prompt and tool definitions.
  2. Move all non-interactive work to the Message Batches API.
  3. Summarize or drop large tool results once they've been used.
  4. Route cheap turns to Haiku and reserve Opus for hard reasoning only.
  5. Track cost per completed task and watch it after every change.

When to reach for each lever

LeverBest forMain tradeoff
Prompt cachingAny multi-turn agentPrefix must stay byte-stable
BatchingEvals, bulk, offline jobsAsynchronous latency
Context pruningLong-running agentsRisk of dropping needed detail
Model routingMixed-difficulty workloadsAdded routing complexity

Frequently asked questions

What is prompt caching and why does it help agents most?

Prompt caching reuses a precomputed representation of a stable prompt prefix so the model doesn't reprocess it each call, charging a lower cache-read rate instead. Agents benefit most because their system prompt, tools, and skills repeat unchanged on every turn of a run.

When should I use the Message Batches API?

Use batching whenever no human is waiting on the result — evals, bulk document processing, classification jobs — to pay a lower per-request rate in exchange for asynchronous completion within a longer window.

Does using a cheaper model always save money?

No. If a cheaper model loops, retries, or fails the task, the total cost can exceed a more capable model that succeeds on the first pass. Always measure cost per completed task rather than per token.

How do I keep token cost from growing during long runs?

Summarize large tool results after use, drop intermediate results once a sub-task finishes, and fetch document sections lazily through a tool instead of dumping everything into context.

Fast, frugal agents on the phone

CallSphere applies these same cost and latency tactics — caching, routing, and lean context — to voice and chat agents that answer every call in real time without a runaway bill. See the live version at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.