Skip to content
Agentic AI
Agentic AI7 min read0 views

Cut Claude Agent Token Cost: Caching & Batching

Make Claude agents cheaper and faster with prompt caching, batching, context trimming, and model routing. Practical token economics for production agents.

An agent that works is only half the job. The other half is making it cheap and fast enough to run thousands of times a day without the bill becoming the reason the project gets cancelled. Agentic systems are token-hungry by design: every turn resends the growing conversation, multi-agent runs fan out into several parallel contexts, and a single user request can trigger a dozen model calls. Left unmanaged, a Claude agent's cost grows roughly with the square of its conversation length, because each new turn pays to reprocess everything before it.

The good news is that token economics is one of the most controllable parts of an agentic stack. You do not need a cheaper model to cut cost in half — you need to stop paying for the same tokens over and over. This post walks through the levers that matter in production on Claude: caching, batching, context discipline, and model routing.

Where the tokens actually go

Before optimizing, measure. Instrument every model call to record input tokens, output tokens, and which were cached. You will almost always find that input tokens dominate — the system prompt, tool definitions, and accumulated history are resent on every turn, while the model's own output is comparatively small. A ten-turn agent with a 4,000-token system prompt and tool block pays that 4,000 tokens ten times if you do nothing about it.

This is why the biggest wins come from the stable, repeated portion of your prompt, not the dynamic part. The fixed scaffolding — instructions, tool schemas, few-shot examples, reference documents — is identical across turns and often across users. That is exactly what caching is built to exploit.

Prompt caching: pay once for the parts that repeat

Prompt caching lets Claude store a prefix of your prompt and reuse it on subsequent calls at a steep discount instead of reprocessing it from scratch. The mechanics are simple but order-sensitive: you mark a cache breakpoint after the stable content, and everything before the breakpoint can be served from cache on the next request as long as it is byte-for-byte identical.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Build prompt"] --> B["Stable prefix: system + tools + docs"]
  B --> C{"Cache breakpoint hit before?"}
  C -->|Yes| D["Reuse cached prefix (cheap)"]
  C -->|No| E["Process full prefix (full price)"]
  E --> F["Write prefix to cache"]
  D --> G["Process only new suffix"]
  F --> G
  G --> H["Model responds"]
  H --> I["Append turn, keep prefix stable"]
  I --> A

The practical rules follow directly from the diagram. Put everything stable at the front of the prompt — system instructions, then tool definitions, then any large reference documents — and put the volatile conversation at the end. Never interleave a changing timestamp or per-request ID into the cached region, because a single changed byte invalidates the whole prefix and you pay full price again. In a long agent loop, the cached prefix can save the large majority of input cost, which is often the difference between a viable and an unviable feature.

One caveat to design around: cache entries are short-lived, so caching helps most when requests for the same prefix arrive close together — exactly the case inside one active agent run, or across many concurrent users hitting the same shared system prompt.

Batching: throughput over latency

Not every agent task is interactive. Overnight evals, bulk document classification, generating embeddings of context, or backfilling summaries are all jobs where you care about cost and throughput, not millisecond latency. For these, batch processing runs many requests asynchronously at a significant discount versus real-time calls.

The design pattern is to split your pipeline into a synchronous path and an asynchronous one. Anything a user is waiting on stays real-time. Anything that can tolerate a delay — nightly regression evals, re-summarizing yesterday's transcripts, precomputing tool documentation — goes into a batch queue. Teams routinely cut a large chunk of their total model spend just by moving non-interactive work off the hot path.

Context discipline: the cheapest token is the one you never send

Caching makes repeated tokens cheap, but the most reliable savings come from sending fewer tokens in the first place. The instinct to stuff the entire conversation and every tool result back into the model is the number-one cause of runaway cost. Curate the context like it is expensive, because it is.

Three habits do most of the work. First, summarize tool results before they re-enter the loop: a database query that returns 5,000 rows should become a 200-token summary, not a 50,000-token JSON dump. Second, compact long-running conversations — when history grows large, replace older turns with a concise running summary the agent can still reason over. Third, scope tools per task; if a sub-agent only needs three of your twenty tools, give it only three, since every tool definition is input tokens on every turn. Claude's large context window is a capability, not an instruction to fill it.

Model routing: match the model to the task

You do not need your most capable model for every step. A well-built agentic system routes work across the Claude family: use a fast, inexpensive model like Haiku for classification, routing, and extraction; reserve a mid-tier model like Sonnet for most reasoning and tool use; and call the most capable Opus-class model only for the genuinely hard planning or synthesis steps. A common orchestrator–subagent pattern does this naturally — a smart orchestrator delegates narrow, well-specified subtasks that a cheaper model can handle reliably.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

The trap to avoid is over-routing: if your cheap model gets a task wrong and triggers a retry on the expensive model, you have paid for both. Route by measured success rate per task type, not by guesswork, and let your eval suite tell you where the cheaper model is good enough.

Frequently asked questions

What is prompt caching and when does it help?

Prompt caching stores a stable prefix of your prompt so Claude can reuse it on later requests at a large discount instead of reprocessing it. It helps most when the same prefix — system prompt, tool definitions, reference docs — is sent repeatedly in quick succession, such as within a single multi-turn agent run or across many users sharing one system prompt.

Why do my agent costs grow so fast as the conversation gets longer?

Because each turn resends the entire prior conversation as input, cost grows roughly with the square of the number of turns if you do nothing. Combat this with caching for the stable prefix and aggressive context compaction — summarizing old turns and large tool results so the resent payload stays small.

When should I use batch processing instead of real-time calls?

Use batching for any work no user is actively waiting on: nightly evals, bulk classification, re-summarizing transcripts, or precomputation. Batch runs cost meaningfully less than synchronous calls, so moving non-interactive work to an async queue cuts total spend without hurting user-facing latency.

Is using a cheaper model the best way to save money?

Not first. Cutting wasted input tokens through caching and context discipline usually saves more than swapping models, and without the quality risk. Use model routing as a second lever — cheap models for classification and extraction, capable models for hard reasoning — guided by your eval suite rather than guesswork.

Efficient agents on every call

CallSphere applies this same token economics — cached prompts, scoped tools, and model routing — to voice and chat agents that run at the scale of every inbound call, where shaving cost per conversation compounds fast. Hear efficient agentic AI in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.