---
title: "Cutting Claude Agent Token Costs: Caching & Batching"
description: "Keep agentic Claude runs cheap and fast with prompt caching, per-step model routing, batching, and context trimming — without losing quality."
canonical: https://callsphere.ai/blog/cutting-claude-agent-token-costs-caching-batching
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "token cost", "performance", "batching", "claude agent sdk"]
author: "CallSphere Team"
published: 2026-04-29T11:23:11.000Z
updated: 2026-06-06T21:47:43.137Z
---

# Cutting Claude Agent Token Costs: Caching & Batching

> Keep agentic Claude runs cheap and fast with prompt caching, per-step model routing, batching, and context trimming — without losing quality.

Agentic systems have a cost shape that surprises teams the first time they read the bill. A single-turn chatbot sends one prompt and gets one answer. An agent built on Claude Code or the Claude Agent SDK sends the entire growing context on every turn, and a multi-step task can run a dozen turns or more. Multi-agent designs multiply that again, because each subagent carries its own context and several subagents run per task. The result is that token cost in an agent scales with the square of how much context accumulates, not linearly. Performance engineering for agents is mostly about controlling that accumulation.

## Where the tokens actually go

Before optimizing, measure. Instrument every run to record input tokens, output tokens, and the cache-read versus cache-write split per turn, broken down by which model handled it. The usual revelation is that output tokens are a small fraction of the bill and the real cost is input: the same long system prompt, tool definitions, and conversation history resent on every single turn. A ten-turn agent with a large static preamble can spend most of its budget re-reading instructions it already received.

The second common surprise is tool-result bloat. An agent that pulls a 50-page document or a giant JSON payload into context pays for those tokens on every subsequent turn until they fall out of the window. One careless tool that dumps raw data can dominate a run's cost. So the two levers that matter most are: stop paying repeatedly for static prefixes, and stop dragging large results through the whole conversation.

## Prompt caching: stop paying for the same prefix

Prompt caching is the single biggest win for agentic workloads. Anthropic's API lets you mark a stable prefix of the prompt as cacheable; subsequent calls that share that exact prefix read it from cache at a steep discount instead of reprocessing it. For an agent, the cacheable prefix is everything that does not change across turns: the system prompt, the tool definitions, and any fixed reference material. Because an agent resends that prefix on every turn, caching it turns a recurring full-price cost into a recurring near-free one.

The discipline is to keep your prompt structured so the stable parts come first and the volatile parts — the latest user message and recent tool results — come last. Anything you append or reorder above a cache breakpoint invalidates the cache below it, so resist the urge to inject dynamic timestamps or shuffled context near the top. Cache entries are short-lived, so caching pays off most when turns happen close together, which is exactly the agentic case.

```mermaid
flowchart TD
  A["New agent turn"] --> B{"Stable prefix cached & fresh?"}
  B -->|Yes| C["Read prefix from cache (cheap)"]
  B -->|No| D["Process full prefix (write cache)"]
  C --> E["Process only new suffix"]
  D --> E
  E --> F{"Large tool result returned?"}
  F -->|Yes| G["Summarize before appending"]
  F -->|No| H["Append raw result"]
  G --> I["Continue loop"]
  H --> I
```

## Right-sizing the model per step

Not every step needs your most capable model. The Claude 4.x family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 for balanced everyday work, and Haiku 4.5 for fast, cheap, high-volume steps. A well-tuned agent routes deliberately: use Opus for planning and thorny decisions, Sonnet for the bulk of tool-using turns, and Haiku for mechanical sub-tasks like classification, extraction, or formatting a result. In a multi-agent setup, the orchestrator might run on a stronger model while narrow subagents run on a cheaper one.

The trap is using one premium model for everything because it is simplest. A classification step that runs thousands of times a day on Opus when Haiku would answer identically is pure waste. Profile your steps, find the ones that are high-volume and low-difficulty, and downshift them. Keep an eval in place so you can prove the cheaper model holds quality before you switch — the goal is cheaper at equal quality, not cheaper at any cost.

## Batching and parallelism done right

When you have many independent items to process — a hundred records to classify, a batch of documents to summarize — do not run them in a serial agent loop. For non-interactive bulk work, the Message Batches API processes large request sets asynchronously at a meaningful discount and is ideal for offline jobs where latency does not matter. For interactive work, parallelism is about latency rather than price: independent subagents or tool calls that do not depend on each other should run concurrently so the user waits once, not N times.

The pitfall is parallelizing dependent work. If step B needs step A's real output, running them concurrently just produces a hallucinated argument or a wasted call. Map the dependency graph first: fan out only the genuinely independent branches, and join them before any step that consumes their results. And remember that multi-agent fan-out spends several times the tokens of a single agent, so reserve it for tasks where the breadth genuinely pays for itself.

## Trimming context as the run grows

Even with caching, a long-running agent accumulates conversation history that grows the volatile suffix every turn. Two techniques keep it bounded. First, summarize large tool results before appending them: feed the agent a tight summary plus a handle to fetch detail on demand, rather than the raw payload. Second, compact the running history periodically — replace a long stretch of resolved tool calls with a short note of what was learned, preserving decisions and discarding transcript noise. Claude Code-style agents do this automatically as context fills, and you can apply the same idea in custom SDK loops.

Set a context budget and enforce it. When the conversation approaches a threshold, trigger compaction rather than letting context grow until it is both expensive and degraded — overstuffed context hurts quality as well as cost. The cheapest token is the one you never resend, so the engineering goal is a run whose context stays lean from the first turn to the last.

## Frequently asked questions

### What gives the biggest cost reduction in a Claude agent?

Prompt caching of the stable prefix — system prompt, tool definitions, fixed reference text. Because an agent resends that prefix on every turn, caching it converts a recurring full-price cost into a near-free cache read. Keep stable content first and volatile content last so cache hits survive.

### When should I use the Message Batches API versus parallel subagents?

Use Message Batches for offline bulk jobs where latency does not matter — it processes large request sets asynchronously at a discount. Use parallel subagents or concurrent tool calls for interactive tasks where you want to reduce the user's wait on independent branches; that is a latency win, not a price win.

### How do I decide which model to use per step?

Match capability to difficulty and volume. Use Opus 4.8 for hard planning and reasoning, Sonnet 4.6 for the bulk of tool-using turns, and Haiku 4.5 for high-volume mechanical steps like classification or extraction. Keep an eval running so you can prove a cheaper model holds quality before switching.

### How do I stop context from ballooning?

Summarize large tool results before appending them, and compact resolved history into short notes once the conversation passes a budget threshold. This keeps the volatile suffix lean, which lowers both cost and the quality loss that comes from overstuffed context.

## Bringing agentic AI to your phone lines

Fast, cheap runs matter even more on a live call, where latency is audible and volume is constant. CallSphere uses prompt caching, per-step model routing, and tight context to keep **voice and chat** agents both responsive and economical while they answer every call and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-claude-agent-token-costs-caching-batching
