---
title: "Claude Agent Token Cost: Caching, Batching, Speed"
description: "Cut Claude agent cost and latency with prompt caching, the batches API, context discipline, and model routing — keep production runs cheap and fast."
canonical: https://callsphere.ai/blog/claude-agent-token-cost-caching-batching-speed
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "token cost", "performance", "batching"]
author: "CallSphere Team"
published: 2026-04-02T11:23:11.000Z
updated: 2026-06-06T21:47:43.828Z
---

# Claude Agent Token Cost: Caching, Batching, Speed

> Cut Claude agent cost and latency with prompt caching, the batches API, context discipline, and model routing — keep production runs cheap and fast.

Agents are expensive in a way single-shot prompts are not. A single completion costs you one set of input and output tokens. An agent that takes fifteen turns re-sends the entire growing conversation on every turn, so by turn ten you might be paying to process the same system prompt and tool schemas ten times over. Multiply that by a multi-agent system where an orchestrator spawns several subagents, and a workflow that looked cheap in a notebook becomes a line item someone in finance asks about. Performance engineering for Claude agents is mostly about not paying for the same tokens twice and not waiting on work that could run in parallel.

## Where the money actually goes

Before optimizing, measure. Instrument every Claude call to record input tokens, output tokens, cache-read tokens, cache-write tokens, model name, and wall-clock latency, then aggregate by run. The first time teams do this they are usually surprised: the bulk of spend is rarely the clever reasoning at the end. It is the fixed overhead — the system prompt, the tool definitions, the accumulated tool results — re-billed on every single turn of a long trajectory. Output tokens cost more per token, but input tokens dominate the total because agents read far more than they write.

Once you can see the breakdown, the levers become obvious. The fixed overhead is a caching problem. The growing context is a context-management problem. The turn count is an orchestration problem. The model choice is a routing problem. We will take them in turn.

## Prompt caching: stop re-billing the stable prefix

Prompt caching is the single highest-leverage optimization for Claude agents. The idea: mark the stable prefix of your prompt — system instructions, tool schemas, large reference documents — as cacheable, and Anthropic stores the processed form for a short window. Subsequent requests that reuse that exact prefix read from the cache at a steep discount and far lower latency instead of reprocessing every token.

```mermaid
flowchart TD
  A["New agent turn"] --> B{"Prefix matches cache?"}
  B -->|Yes| C["Cache read: cheap & fast"]
  B -->|No| D["Cache write: full price once"]
  C --> E["Append new turn tokens"]
  D --> E
  E --> F["Claude responds"]
  F --> G{"Prefix still stable?"}
  G -->|Yes| B
  G -->|No, prefix changed| D
```

The rule that makes caching pay off is **prefix stability**. Caching only helps when the beginning of your prompt is byte-for-byte identical across turns. So order your prompt from most-stable to least-stable: fixed system instructions and tool definitions first, then slowly-changing context, then the volatile per-turn conversation last. A common and costly mistake is injecting a timestamp or a freshly shuffled list near the top of the system prompt — it invalidates the cache on every call and you pay full price forever. Keep the volatile bits at the end where they belong.

## Batching independent work

Not all agent work is sequential. When you have many independent items to process — classify a thousand documents, summarize each ticket in a queue, evaluate every transcript from yesterday — there is no reason to wait on each one. The Anthropic Message Batches API lets you submit a large set of requests for asynchronous processing at a significant discount versus synchronous calls, with results delivered when the batch completes. If your workload tolerates minutes rather than milliseconds of latency, batching is close to free money.

Within a single agent run, the parallel equivalent is fanning out subagents. If a research task decomposes into five independent lookups, an orchestrator can spawn five subagents that run concurrently rather than one agent doing them in series. This trades tokens for latency — multi-agent runs typically consume several times more tokens than a single agent — so reserve it for genuinely parallel, high-value work rather than reaching for it by default.

## Context discipline: the cheapest token is the one you never send

Every token in the context window is a token you pay to process on the next turn, so the most reliable cost control is keeping context small. Three habits help. First, summarize and compact: when a tool returns a 50KB blob, extract the few fields the agent needs and drop the rest before it lands in history. Second, prune dead branches: once a sub-task is done, its intermediate scratch work does not need to ride along for the rest of the run. Claude Code and the Agent SDK support compaction precisely because long-running agents otherwise drown in their own history.

Third, do not over-provide. It is tempting to stuff every possibly-relevant document into the system prompt "just in case," but that overhead is re-billed every turn. Prefer giving the agent a tool to fetch a document on demand over pre-loading all of them. Retrieval at the moment of need is usually cheaper than carrying everything for the whole run.

## Model routing: match the model to the step

Not every step needs your most capable model. The Claude 4.x family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 for balanced everyday work, and Haiku 4.5 for fast, cheap, high-volume steps. A well-tuned agent routes: use Haiku for classification, extraction, and routing decisions; reserve Opus for the genuinely hard planning or synthesis steps. A common pattern is a cheap model triaging incoming work and only escalating the ambiguous cases to a stronger model — the same idea as a human team where junior staff handle routine tickets and escalate the tricky ones.

The mistake to avoid is premature downgrading. If a cheaper model fails the step, retries it, and eventually escalates anyway, you have paid for both attempts plus the extra latency. Route based on measured success rates per step, not on a hunch. Run your eval suite against each candidate model and let the data tell you the cheapest model that still passes.

## Putting it together

The cheapest, fastest agent is the one that caches its stable prefix, keeps its context lean, batches independent work, parallelizes only where parallelism is real, and routes each step to the smallest model that can do it well. None of these are exotic; they are bookkeeping. But they compound. Teams that instrument spend and apply these levers routinely see their per-run cost fall by a large multiple while latency improves at the same time, because cache reads are not just cheaper — they are faster too.

## Frequently asked questions

### What is prompt caching and how much does it save?

Prompt caching stores the processed form of a stable prompt prefix so repeated requests reuse it at a steep discount and lower latency instead of reprocessing every token. For agents that re-send a large fixed system prompt and tool schemas every turn, it is usually the single biggest cost reduction available.

### When should I use the Message Batches API?

Use it for large volumes of independent requests that can tolerate asynchronous, minutes-scale turnaround — bulk classification, summarization, or offline evals. It processes requests asynchronously at a meaningful discount versus synchronous calls, so it is ideal whenever you do not need an answer in real time.

### Does running multiple subagents save money?

No — it saves latency, not money. Multi-agent runs typically use several times more tokens than a single agent because each subagent carries its own context. Use fan-out for genuinely parallel, high-value work where speed matters, not as a default.

### How do I keep an agent's context from growing too expensive?

Compact aggressively: summarize large tool outputs before they enter history, prune finished sub-task scratch work, and fetch documents on demand with a tool rather than pre-loading everything. The cheapest token is the one you never send, since every token in context is re-billed on the next turn.

## Bringing efficient agents to your phone lines

CallSphere runs these same cost and latency disciplines on **voice and chat** agents — cached prompts and right-sized models keep responses fast and runs cheap while every call and message gets answered 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/claude-agent-token-cost-caching-batching-speed
