Skip to content
Agentic AI
Agentic AI7 min read0 views

Cutting Token Cost in Claude Agents: Caching & Batching (Building AI Agents For Enterprise)

Keep enterprise Claude agents cheap and fast with prompt caching, transcript pruning, batching, and model routing — without sacrificing answer quality.

An enterprise agent that works in a demo can become a line item nobody budgeted for once it hits real volume. The reason is structural: agents are loops, and every turn re-sends a growing context to the model. A single user request might involve eight model calls, each carrying the full system prompt, tool definitions, and a transcript that keeps expanding. Multiply that by thousands of requests a day and the token bill — and the latency — climb fast. The good news is that agent cost is highly engineerable. With a few deliberate techniques on Claude, teams routinely cut spend by large margins while making runs noticeably faster.

Token cost in an agent is the total number of input and output tokens billed across every model call in a run, and it scales with both the number of turns and the size of the context carried on each turn. Optimizing it means attacking both factors: send fewer, smaller payloads per turn, and finish runs in fewer turns. Latency tracks cost closely because larger contexts take longer to process, so the same moves usually make the agent feel snappier too.

Prompt caching is the highest-leverage lever

The biggest, easiest win is prompt caching. Claude lets you mark stable prefixes of your input — the system prompt, tool definitions, retrieved reference documents, few-shot examples — as cacheable. On subsequent calls within the cache window, those cached tokens are served at a steep discount instead of being re-billed at full rate. In an agent loop where the same long system prompt and tool schema ride along on every single turn, this is enormous, because that prefix is exactly the part that repeats unchanged.

The trick is ordering. Put your most stable content first and your most volatile content last. A typical layout is: system prompt, then tool definitions, then long-lived reference material, then the cache breakpoint, then the conversation that changes each turn. Get this ordering wrong — interleave a volatile timestamp into the middle of an otherwise-stable prefix — and you bust the cache on every call, paying full price while believing you are saving. Audit your actual request payloads to confirm the cacheable prefix is byte-identical across turns.

Map where the tokens actually go

Before optimizing, measure. Instrument each model call to log input tokens, output tokens, cached tokens, and which turn of which run they belong to. Almost every team is surprised by the breakdown — usually a bloated system prompt or an over-eager retrieval step is responsible for far more spend than the model's actual reasoning.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Incoming request"] --> B["Build context"]
  B --> C{"Stable prefix cached?"}
  C -->|Yes| D["Reuse cached tokens (cheap)"]
  C -->|No| E["Pay full prefix cost"]
  D --> F{"Simple task?"}
  E --> F
  F -->|Yes| G["Route to Haiku"]
  F -->|No| H["Route to Sonnet or Opus"]
  G --> I["Prune transcript & respond"]
  H --> I

Once you can see the breakdown, the optimization order is obvious: fix the largest contributor first. If the system prompt is 4,000 tokens and rides on every one of ten turns, trimming it to 2,000 and caching the rest saves more than any clever reasoning tweak. Measurement turns cost optimization from guesswork into a ranked to-do list.

Prune the transcript before it balloons

The conversation transcript grows with every turn, and naively carrying the full history means later turns are dramatically more expensive than early ones. Tool results are the usual culprit — a search tool that returns 6,000 tokens of raw JSON pollutes the context for the rest of the run even though the agent only needed three fields. Compress tool outputs to the fields the agent actually uses before appending them to the transcript.

For long-running agents, summarize and compact. When the transcript crosses a threshold, replace older turns with a concise summary of what was learned and decided, keeping recent turns verbatim. The Claude Agent SDK supports this compaction pattern, and it keeps per-turn cost roughly flat instead of letting it grow linearly with run length. A well-compacted agent on a twenty-turn task can cost a fraction of a naive one because the heavy early context never gets re-sent at full size.

Batch the parallelizable work

Not every step in an agent's job is sequential. When a task fans out into independent sub-questions — enrich ten leads, classify fifty tickets, summarize a dozen documents — do not loop them through one conversational agent one at a time. Issue them as parallel independent requests, or use the Message Batches API for large non-interactive jobs, which processes high volumes asynchronously at a meaningful discount over real-time calls.

The judgment call is interactivity. Batching trades latency for cost, so it shines for background work — overnight enrichment, bulk classification, report generation — where nobody is waiting on the result. For a live agent talking to a user, keep the critical path synchronous and only batch the genuinely independent side-work. Mixing these up either makes a live agent feel sluggish or makes a background job needlessly expensive.

Route by difficulty instead of always reaching for Opus

The most common cost mistake is sending everything to the most capable model. Claude offers a range — Opus for the hardest reasoning, Sonnet for the broad middle, Haiku for fast high-volume work — and the right architecture routes by task difficulty. A classifier deciding which tool to call, a step that extracts a few fields, or a yes/no gate rarely needs Opus. Run those on Haiku or Sonnet and reserve Opus for the genuinely hard planning steps.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

A clean pattern is a cheap triage step that classifies incoming work, then dispatches simple cases to a smaller model and hard ones to a larger one. Because the triage itself is a simple task, it runs on a cheap model too, so the routing overhead is negligible. Combined with caching, pruning, and batching, difficulty-based routing is what keeps an enterprise agent fast and affordable at scale without sacrificing quality on the requests that actually need it.

Frequently asked questions

How much can prompt caching actually save on an agent?

It depends on how much of your per-turn payload is stable, but in agent loops the stable prefix — system prompt, tool definitions, reference docs — is large and repeats on every turn, so cached tokens are billed at a steep discount. The savings compound with run length, which is why caching is usually the first optimization to reach for.

When should I batch instead of calling Claude in real time?

Batch when the work is independent and nobody is waiting on the result — bulk enrichment, classification, overnight reports. The Message Batches API processes large asynchronous jobs at a discount but trades away latency, so keep anything on a live user's critical path synchronous.

Does pruning the transcript hurt answer quality?

Done well, no. Compress tool outputs to the fields the agent uses and summarize old turns rather than dropping them, preserving the decisions and facts that matter. Quality only suffers if you prune information the agent still needs, so test on your regression set after changing compaction rules.

Should I just always use the cheapest model to save money?

No — a cheap model that fails and triggers retries or wrong actions costs more than doing it right once. Route by difficulty: smaller models for classification and extraction, the most capable model for hard planning. The goal is the lowest total cost of a correct outcome, not the lowest per-token price.

Bringing agentic AI to your phone lines

CallSphere runs these cost and latency patterns on live voice and chat agents — caching stable context, routing by difficulty, and keeping every call fast and affordable at scale. See it in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.