---
title: "Cutting Claude Agent Token Costs: Caching & Batching (Building Agents With Agent SDK)"
description: "Make Claude Agent SDK runs cheap and fast with prompt caching, context pruning, parallel tool calls, batching, and per-step model selection."
canonical: https://callsphere.ai/blog/cutting-claude-agent-token-costs-caching-batching-building-agents-with
category: "Agentic AI"
tags: ["agentic ai", "claude", "claude agent sdk", "prompt caching", "token cost", "performance"]
author: "CallSphere Team"
published: 2026-03-18T11:23:11.000Z
updated: 2026-06-06T21:47:44.337Z
---

# Cutting Claude Agent Token Costs: Caching & Batching (Building Agents With Agent SDK)

> Make Claude Agent SDK runs cheap and fast with prompt caching, context pruning, parallel tool calls, batching, and per-step model selection.

The moment an agent goes from prototype to production, the bill stops being theoretical. A single agentic loop can fire a dozen model calls, and each one re-sends the entire growing transcript — system prompt, tool definitions, every prior tool result — back to Claude. By turn ten, you're paying to re-process the same thousands of tokens over and over. Token cost in agents is rarely about the final answer; it's about the redundant context you drag along on every step.

This post is a cost-and-latency playbook for agents built on the Claude Agent SDK. The good news is that the biggest wins come from a handful of mechanical changes — prompt caching, disciplined context management, batching, and per-step model selection — none of which require sacrificing quality.

## Where the tokens actually go

In a typical agent loop, input tokens dominate the bill, not output. Every iteration re-sends the full conversation so the model can decide the next action. A system prompt of 2,000 tokens plus ten tool definitions plus accumulated tool results can easily mean 15,000 input tokens per turn — multiplied by every turn in the run. Output, by contrast, is often just a short tool call. So the first principle of cheap agents is: **shrink and stabilize the input you re-send.**

A useful definition to anchor cost reasoning: prompt caching is a mechanism where a stable prefix of your request — system prompt, tool definitions, long reference documents — is stored server-side after the first call so subsequent calls that reuse that exact prefix are billed and processed at a steep discount. It is the single highest-leverage lever in agent economics, because the cached prefix is exactly the part you re-send on every loop iteration.

## Prompt caching: the highest-leverage win

Caching only helps if your prefix is byte-stable. The classic mistake is interpolating a timestamp, a request ID, or a freshly shuffled tool order into the system prompt — any of which breaks the cache on every call and silently doubles your cost. Structure your request so the immutable parts come first and in a fixed order: system instructions, then tool definitions, then long static context, and only then the volatile conversation. Mark the boundary so the cache covers the largest possible stable prefix.

```mermaid
flowchart TD
  A["Incoming request"] --> B["Stable prefix: system + tools + static docs"]
  B --> C{"Prefix cached?"}
  C -->|Yes| D["Reuse cached prefix at discount"]
  C -->|No| E["Process full prefix & store it"]
  D --> F["Append volatile conversation"]
  E --> F
  F --> G["Claude decides next action"]
  G --> H{"More turns?"}
  H -->|Yes| C
  H -->|No| I["Return final answer"]
```

The diagram makes the payoff visible: every loop after the first reuses the cached prefix, so the cost of a ten-turn run looks far more like one full call plus nine cheap ones than ten full calls. Keep your tool catalog stable across turns within a run, and resist the urge to dynamically rewrite the system prompt mid-conversation.

## Pruning context before it bloats

Caching helps with the stable prefix, but the conversation tail grows every turn — and verbose tool results are the worst offenders. If a search tool returns a 4,000-token JSON blob and the model only needed three fields, you're now paying to re-send those 4,000 tokens on every subsequent turn. Trim tool outputs at the source: return only the fields the agent actually uses, paginate large result sets, and summarize long documents before they enter the transcript.

For long-running agents, add a compaction step. When the transcript crosses a threshold, replace older turns with a concise summary of what was decided and learned, keeping the recent turns verbatim. This bounds context growth so a 50-turn session doesn't cost quadratically more than a 5-turn one. The Claude Agent SDK's long context window — up to a million tokens — means you *can* let context grow, but "can" and "should" are different bills.

## Batching and parallel tool calls

If your agent needs three independent lookups — check inventory, fetch the customer record, and pull shipping rates — don't make it do them in three sequential turns. Claude can request multiple tool calls in a single turn when they don't depend on each other; execute them in parallel in your runtime and return all the results together. This collapses three model round-trips into one, cutting both latency and the input-token tax of re-sending context between each call.

The same logic applies across requests. If you're processing a backlog of independent tasks rather than a live conversation — classifying a thousand support tickets, say — batch them through an asynchronous batch path rather than firing them as real-time calls. Throughput-oriented batch processing trades immediacy for a meaningful per-token discount, which is exactly the right trade for offline work where nobody is waiting on the response.

## Right-size the model per step

Not every step in an agent deserves your most expensive model. A run might use Opus 4.8 for the hard planning decision, then hand off routine extraction and formatting steps to Sonnet 4.6 or Haiku 4.5. Routing by step is one of the most underused cost levers: classification, simple tool routing, and short summarization are well within Haiku's reach at a fraction of the price, while you reserve the flagship model for the genuinely ambiguous reasoning.

Measure before you optimize, though. Instrument per-turn token counts and per-step model usage, then look at where the spend concentrates. Often a single chatty tool or one bloated reference document accounts for most of the bill, and fixing that one thing beats a dozen micro-optimizations. Cheap agents come from knowing your token budget the way you'd know a latency budget — as a number you watch, not a surprise you discover on the invoice.

## Frequently asked questions

### Does prompt caching change the agent's answers?

No. Caching only affects how the input prefix is billed and processed — the model sees the same tokens and produces the same distribution of responses. It's a pure cost-and-latency optimization, which is why it should be the first thing you enable on any production agent.

### What breaks prompt caching without me noticing?

Any change to the cached prefix: a timestamp or request ID in the system prompt, reordered or dynamically generated tool definitions, or per-user text injected before the static section. Keep everything volatile *after* the stable prefix and verify your cache-hit rate in logs rather than assuming it works.

### When should I use batch processing instead of live calls?

Use the asynchronous batch path whenever no human is waiting on the result in real time — bulk classification, backfilling enrichment, nightly summarization. You trade immediacy for a per-token discount, which is the right call for offline throughput work but wrong for interactive agents.

### How do I decide which model to use for each step?

Start by measuring per-step token cost and difficulty. Route ambiguous planning and multi-step reasoning to Opus 4.8, and push deterministic extraction, classification, and formatting to Haiku 4.5 or Sonnet 4.6. Validate the cheaper model against your evals before locking it in for that step.

## Bringing agentic AI to your phone lines

CallSphere runs these same cost disciplines — cached prefixes, pruned context, and parallel tool calls — under the hood of **voice and chat** agents that answer every call and message and book work 24/7 without the runaway bill. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/cutting-claude-agent-token-costs-caching-batching-building-agents-with
