---
title: "Cutting Claude Agent Cost: Caching and Batching"
description: "Make Claude agents cheap and fast: prompt caching, batching, model routing across Opus, Sonnet, Haiku, and context discipline. Tactics and a flowchart."
canonical: https://callsphere.ai/blog/cutting-claude-agent-cost-caching-and-batching
category: "Agentic AI"
tags: ["agentic ai", "claude", "prompt caching", "token cost", "performance", "batch api", "enterprise ai"]
author: "CallSphere Team"
published: 2026-03-20T11:23:11.000Z
updated: 2026-06-07T01:28:22.579Z
---

# Cutting Claude Agent Cost: Caching and Batching

> Make Claude agents cheap and fast: prompt caching, batching, model routing across Opus, Sonnet, Haiku, and context discipline. Tactics and a flowchart.

Agents are expensive in a way that single prompts are not. A one-shot completion costs you one round trip. An agent loops: every turn re-sends the entire growing conversation, every tool result lands back in context, and a multi-agent run can multiply that several times over. The bill that looked trivial in a prototype becomes the line item your CFO circles once you put the agent in front of real volume. Worse, the same bloat that costs money also costs latency: longer context means slower turns, and slower turns mean a sluggish experience.

The good news is that agent cost is highly controllable, and most of the savings come from a handful of techniques that are well understood in 2026. The trick is knowing which lever applies to which workload. Prompt caching crushes the cost of stable context. Batching slashes the cost of throughput you don't need in real time. Model routing makes sure you're not paying Opus prices for Haiku work. And plain context discipline (not stuffing the window with junk) quietly compounds all of them, because a token you never send costs nothing at any tier. This post walks through each lever and when to pull it.

## Key takeaways

- **Prompt caching** is the highest-leverage win: put your big stable prefix (system prompt, tool defs, docs) first and reuse the cache across turns and requests.
- **Batching** turns non-urgent work into a much cheaper async job — use it for evals, backfills, and overnight jobs.
- **Route by difficulty:** Haiku for cheap classification, Sonnet for most agent work, Opus only for the hard reasoning steps.
- The cheapest token is the one you never send — prune tool results and summarize long histories before they bloat context.
- Measure cost per *successful task*, not per token; a cheaper model that fails and retries is more expensive overall.

## Where the tokens actually go in an agent run

Before optimizing, instrument. In a typical Claude agent, the dominant cost is not the model's output — it's the input tokens re-sent on every turn. A ten-turn agent with a 5,000-token system prompt and tool definitions re-sends those 5,000 tokens ten times. Add tool results, which can be huge if a tool dumps a raw API response, and the input side dwarfs everything else. The first move is always to log input vs. output tokens per turn so you can see your real shape.
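As a minimal sketch of that instrumentation, the helper below sums per-turn usage records into run-level totals. The `TurnUsage` shape mirrors the `usage` object returned on a Messages API response; the helper name `summarizeRun`, the `RunTotals` shape, and the example numbers (400 tokens of new history per turn, 300 output tokens) are assumptions made for illustration.

```typescript
// Mirrors the `usage` object on a Messages API response.
interface TurnUsage {
  input_tokens: number;                  // uncached input tokens
  output_tokens: number;
  cache_creation_input_tokens?: number;  // tokens written to the cache this turn
  cache_read_input_tokens?: number;      // tokens served from the cache this turn
}

// Hypothetical summary shape for one agent run.
interface RunTotals {
  turns: number;
  inputTokens: number;  // uncached + cache writes + cache reads
  outputTokens: number;
  inputShare: number;   // fraction of all tokens that were input
}

// Sum per-turn usage so the input/output split of a whole run is visible.
function summarizeRun(usages: TurnUsage[]): RunTotals {
  let inputTokens = 0;
  let outputTokens = 0;
  for (const u of usages) {
    inputTokens +=
      u.input_tokens +
      (u.cache_creation_input_tokens ?? 0) +
      (u.cache_read_input_tokens ?? 0);
    outputTokens += u.output_tokens;
  }
  const total = inputTokens + outputTokens;
  return {
    turns: usages.length,
    inputTokens,
    outputTokens,
    inputShare: total === 0 ? 0 : inputTokens / total,
  };
}

// Example: ten turns, each re-sending a 5,000-token prefix plus a growing history.
const exampleUsages: TurnUsage[] = Array.from({ length: 10 }, (_, turn) => ({
  input_tokens: 5000 + turn * 400, // assumed: 400 tokens of new history per turn
  output_tokens: 300,              // assumed
}));
const exampleTotals = summarizeRun(exampleUsages);
console.log(exampleTotals.inputTokens, exampleTotals.outputTokens); // 68000 3000
```

With these assumed numbers, input accounts for roughly 96% of all tokens in the run, which is the shape the paragraph above describes.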

Once you see the shape, the optimization order becomes obvious. The stable prefix that repeats every turn is the prime candidate for caching. The fat tool results are the prime candidate for pruning. The expensive reasoning turns are the prime candidate for model routing. You attack in that order because the stable prefix usually carries the most repeated weight.

```mermaid
flowchart TD
  A["Incoming agent task"] --> B{"Latency-sensitive?"}
  B -->|No| C["Send via Batch API — cheaper async"]
  B -->|Yes| D{"Stable prefix reused?"}
  D -->|Yes| E["Mark prefix as cache breakpoint"]
  D -->|No| F["Send normally"]
  E --> G{"Step difficulty?"}
  F --> G
  G -->|Simple| H["Route to Haiku"]
  G -->|Standard| I["Route to Sonnet"]
  G -->|Hard reasoning| J["Route to Opus"]
```
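The flowchart translates almost directly into a routing function. This is a sketch only: the `AgentTask` fields, the `RoutingDecision` shape, and the function name `routeTask` are hypothetical, and how you decide whether a task is latency-sensitive or hard is left to your own classifier.

```typescript
type Difficulty = "simple" | "standard" | "hard";
type ModelTier = "haiku" | "sonnet" | "opus";

// Hypothetical description of an incoming task.
interface AgentTask {
  latencySensitive: boolean;
  stablePrefixReused: boolean;
  difficulty: Difficulty;
}

// Either the task goes to the batch path, or it runs in real time
// with a caching decision and a model tier.
type RoutingDecision =
  | { transport: "batch" }
  | { transport: "realtime"; cachePrefix: boolean; model: ModelTier };

const tierByDifficulty: Record<Difficulty, ModelTier> = {
  simple: "haiku",
  standard: "sonnet",
  hard: "opus",
};

// Follows the flowchart top to bottom: latency check, cache check, difficulty.
function routeTask(task: AgentTask): RoutingDecision {
  if (!task.latencySensitive) {
    return { transport: "batch" };
  }
  return {
    transport: "realtime",
    cachePrefix: task.stablePrefixReused,
    model: tierByDifficulty[task.difficulty],
  };
}

console.log(routeTask({ latencySensitive: true, stablePrefixReused: true, difficulty: "standard" }));
```

The sketch stops at the batch branch exactly where the flowchart does; in practice a batch job still needs a model, and the same difficulty mapping could be applied there.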

## Prompt caching: pay once for the stable part

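Prompt caching lets you mark a stable prefix of your request so that subsequent requests reusing that exact prefix read it from cache at a steep discount instead of reprocessing it. For agents this is transformational, because the system prompt, the tool definitions, and any reference documents are identical on every turn of a run, and often identical across thousands of runs. Cached reads are dramatically cheaper than fresh input tokens (roughly a tenth of the base input rate in Anthropic's published pricing at the time of writing), and they're faster too. Two caveats keep the math honest: writing to the cache costs slightly more than a normal input token, and a cache entry expires after a few minutes without a hit by default, so the discount only materializes when the prefix is actually reused while the entry is still warm.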

The rule that makes caching work is ordering: put everything stable at the very front of your context and everything variable at the back. Caching matches on an exact prefix, so a single changing token near the top invalidates everything cached after it. The API assembles that prefix in a fixed order (tool definitions, then the system prompt, then messages), so concretely you keep tool schemas and the system prompt byte-for-byte stable, keep anything variable in the conversation that follows, and mark a cache breakpoint at the end of the stable block.

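```typescript
// Conceptual shape: stable content first, marked for caching.
// BIG_STABLE_INSTRUCTIONS, STABLE_TOOL_DEFS, and conversationSoFar are defined elsewhere.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,                              // required by the Messages API
  system: [
    { type: "text", text: BIG_STABLE_INSTRUCTIONS,
      cache_control: { type: "ephemeral" } }   // cache breakpoint
  ],
  tools: STABLE_TOOL_DEFS,                       // precedes system in the prefix, so also cached
  messages: conversationSoFar                    // the only part that changes
});
```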

The payoff scales with run length. The longer your agent loops, the more turns reuse the cached prefix, and the bigger your savings. Teams running long agentic sessions often find caching is the single change that takes a workload from uneconomical to comfortably profitable.
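To make "scales with run length" concrete, here is a small cost model for the stable prefix alone, measured in full-price input-token equivalents. The multipliers (1.25 for a cache write, 0.1 for a cache read) are assumptions for illustration; check current pricing before relying on them. The model also assumes the cache never expires between turns, and the function names are hypothetical.

```typescript
// Assumed pricing ratios relative to a normal input token. Illustrative only.
interface CacheAssumptions {
  writeMultiplier: number; // cost of writing one token to the cache
  readMultiplier: number;  // cost of reading one cached token
}

// Without caching, the prefix is billed at full price on every turn.
function prefixCostWithoutCache(prefixTokens: number, turns: number): number {
  return prefixTokens * turns;
}

// With caching, the first turn writes the prefix and every later turn reads it.
// Assumes no cache expiry between turns.
function prefixCostWithCache(
  prefixTokens: number,
  turns: number,
  a: CacheAssumptions,
): number {
  if (turns <= 0) return 0;
  return prefixTokens * (a.writeMultiplier + (turns - 1) * a.readMultiplier);
}

const assumedPricing: CacheAssumptions = { writeMultiplier: 1.25, readMultiplier: 0.1 };

// The ten-turn, 5,000-token example from earlier in the post.
const uncachedCost = prefixCostWithoutCache(5000, 10);
const cachedCost = Math.round(prefixCostWithCache(5000, 10, assumedPricing));
console.log(uncachedCost, cachedCost);
```

Under these assumptions the ten-turn run pays for about 10,750 token-equivalents of prefix instead of 50,000. A single-turn run pays more with caching than without (the write premium is never recovered), so the saving starts at the second reuse and grows with every turn after that.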

## Batching: cheaper throughput when you don't need it now

A large fraction of "agent" work is not interactive. Nightly evals, document backfills, bulk classification, regenerating summaries for a knowledge base: none of these need a sub-second response. For exactly this kind of work, the Batch API processes requests asynchronously at a substantial discount versus real-time calls (50% off standard token prices at the time of writing). You submit a job, results arrive when processing finishes, usually well within the 24-hour window after which a batch expires, and you pay materially less per token.

The decision is simple: if a human is waiting on the response, keep it real-time; if the results can wait hours, batch it. The most common mistake is running an entire eval suite or overnight enrichment job through the synchronous API out of habit, paying full price for latency nobody needs. Routing those workloads to batch is often a quiet, painless cost cut with zero user-facing impact.
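A sketch of what moving an eval suite to batch can look like: the helper below turns a list of prompts into batch request entries, each with a `custom_id` for matching results back to inputs and a `params` object shaped like a normal Messages API request. The helper name `buildEvalBatch`, the `eval-N` ID scheme, and the default `max_tokens` are assumptions. The block deliberately stops short of the network call; with the official SDK the resulting array is typically submitted via `client.messages.batches.create({ requests })`, which is worth confirming against the current SDK documentation.

```typescript
// One entry in a Message Batches request.
interface BatchRequest {
  custom_id: string; // used to match each result back to its input
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user" | "assistant"; content: string }[];
  };
}

// Build one batch entry per prompt. IDs follow a hypothetical "eval-N" scheme.
function buildEvalBatch(
  prompts: string[],
  model: string,
  maxTokens = 512,
): BatchRequest[] {
  return prompts.map((prompt, i): BatchRequest => ({
    custom_id: `eval-${i}`,
    params: {
      model,
      max_tokens: maxTokens,
      messages: [{ role: "user", content: prompt }],
    },
  }));
}

const evalRequests = buildEvalBatch(
  ["Classify this ticket: 'refund not received'", "Classify this ticket: 'cannot log in'"],
  "claude-sonnet-4-6",
);
console.log(evalRequests.map((r) => r.custom_id));
```

Results come back keyed by `custom_id` rather than in submission order, so a stable, meaningful ID scheme matters more here than in synchronous code.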

## Model routing: stop overpaying for easy steps

The Claude 4.x family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 as the capable workhorse, and Haiku 4.5 for fast, cheap, high-volume work. The expensive anti-pattern is using one model for everything — usually the most capable one, "to be safe." In an agent, most turns are routine: parsing a tool result, deciding the obvious next step, formatting output. Those don't need your top model.

Practical routing means matching model to step. Use Haiku for classification, extraction, and routing decisions. Use Sonnet for the bulk of agent orchestration and tool use. Reserve Opus for the genuinely hard reasoning steps where a wrong answer is costly. You can even run a cheap model as a planner that escalates to a stronger one only when it detects ambiguity. The savings are real, but measure quality at each tier — a downgrade that increases failed-task rate can erase the savings through retries.
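The retry effect can be put into a one-line formula. If a step is retried until it succeeds and every attempt succeeds independently with the same probability (a simplifying assumption), the expected cost per successful task is the cost per attempt divided by the success rate. The function name and the dollar figures below are hypothetical, chosen only to show how a cheaper model can lose.

```typescript
// Expected cost to get one success when retrying until it works.
// Assumes independent attempts with a constant success rate.
function costPerSuccessfulTask(costPerAttempt: number, successRate: number): number {
  if (successRate <= 0 || successRate > 1) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  return costPerAttempt / successRate;
}

// Hypothetical numbers: a cheap model that often fails vs. a pricier one that rarely does.
const cheapTierCost = costPerSuccessfulTask(0.01, 0.3);
const strongTierCost = costPerSuccessfulTask(0.03, 0.95);
console.log(cheapTierCost.toFixed(4), strongTierCost.toFixed(4));
```

With these made-up inputs the model that is three times cheaper per attempt ends up slightly more expensive per completed task, before counting the added latency of the retries.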

## Common pitfalls

- **Variable content before stable content.** Putting a timestamp or user name above your system prompt invalidates the cache every time. Stable first, always.
- **Dumping raw tool output into context.** A tool that returns a 20,000-token API blob poisons every subsequent turn. Extract only what the agent needs.
- **Optimizing tokens instead of successful tasks.** A cheaper model that retries twice costs more than the right model once. Track cost per completed task.
- **Synchronous evals.** Running your whole eval suite in real time forfeits the batch discount on work no user is waiting for.
- **Never trimming history.** Long multi-agent runs balloon. Summarize or checkpoint old turns instead of carrying every word forever.
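Two of these pitfalls, raw tool output and untrimmed history, have simple mechanical fixes. The sketch below shows one possible shape for each: `pickFields` keeps only named fields from a tool response, and `compactHistory` replaces older turns with a single summary turn. Both function names and the `ChatTurn` type are hypothetical, and the `summarize` callback stands in for whatever summarizer you use (often a cheap model call).

```typescript
type JsonObject = Record<string, unknown>;

// Keep only the fields the agent actually needs from a raw tool response.
function pickFields(raw: JsonObject, keep: string[]): JsonObject {
  const out: JsonObject = {};
  for (const key of keep) {
    if (key in raw) out[key] = raw[key];
  }
  return out;
}

interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

// Replace everything except the last `keepLast` turns with one summary turn.
function compactHistory(
  turns: ChatTurn[],
  keepLast: number,
  summarize: (older: ChatTurn[]) => string,
): ChatTurn[] {
  if (turns.length <= keepLast) return turns;
  const cut = turns.length - keepLast;
  const older = turns.slice(0, cut);
  const recent = turns.slice(cut);
  return [
    { role: "user", content: `Summary of earlier turns: ${summarize(older)}` },
    ...recent,
  ];
}

// Example: a bloated tool response reduced to the two fields the agent uses.
const rawToolResponse: JsonObject = {
  orderId: "A-1001",
  orderState: "shipped",
  debugTrace: "x".repeat(20000),
};
console.log(pickFields(rawToolResponse, ["orderId", "orderState"]));
```

Pruning at the tool boundary is the cheaper of the two, since the blob never enters context at all; compaction is for the history that has already accumulated.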

## Make your agent cheap in five steps

1. Instrument input vs. output tokens per turn so you can see where the cost concentrates.
2. Move all stable content (system prompt, tools, reference docs) to the front and add a cache breakpoint.
3. Audit tool results and prune anything the agent doesn't need before it enters context.
4. Route non-interactive jobs (evals, backfills) to the Batch API.
5. Assign the cheapest model that passes your quality bar to each step, and measure cost per successful task.

| Lever | Best for | Typical effect |
| --- | --- | --- |
| Prompt caching | Long runs with stable prefixes | Large input-cost cut, lower latency |
| Batch API | Evals, backfills, async jobs | Substantial per-token discount |
| Model routing | Mixed-difficulty steps | Pay top-tier only when needed |
| Context pruning | Tool-heavy agents | Fewer tokens, faster turns |

## Frequently asked questions

### What gives the biggest cost reduction for Claude agents?

For most agents, prompt caching of the stable prefix is the single biggest win, because the system prompt and tool definitions are re-sent on every turn. Caching them turns repeated full-price input into cheap cached reads.

### When should I use the Batch API instead of real-time calls?

Whenever no human is waiting on the result and the job can tolerate an asynchronous turnaround of hours (up to about a day in the worst case): evals, bulk classification, document enrichment, and overnight jobs. You get a meaningful discount for trading away immediacy you don't need.

### How do I choose between Opus, Sonnet, and Haiku in an agent?

Match model to step difficulty: Haiku for cheap classification and routing, Sonnet for most orchestration and tool use, and Opus only for the hard reasoning steps where errors are expensive. Measure quality per tier before committing.

### Does caching change the model's answers?

No. Caching only affects how the input is processed and billed, not the content. The model sees the same tokens; you just pay less to reprocess the stable prefix.

## Bringing agentic AI to your phone lines

Cost discipline is why CallSphere can run **voice and chat** agents at real call volume — caching stable instructions, routing easy turns to cheaper models, and keeping every conversation fast enough to feel human. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-claude-agent-cost-caching-and-batching