Skip to content
Agentic AI
Agentic AI8 min read0 views

Cutting Claude Agent Token Cost: Caching and Batching

Keep Claude Code agent runs cheap and fast with prompt caching, batched tool calls, leaner tool results, and matching the right model to each step.

An agent that works is the first milestone. An agent that works cheaply is the one you can actually ship to every user. The gap between those two is mostly tokens — how many you feed the model per turn, how many turns a task takes, and how often you pay full price for context the model has already seen. When a single Claude Code session can read dozens of files and call tools across many turns, small inefficiencies compound into real money and real latency. This post is about keeping agent runs fast and inexpensive without dumbing them down.

Where the tokens actually go

Before optimizing anything, find out where your spend lives. In a typical agentic run the cost breaks into three buckets: the system and Skill instructions resent on every turn, the accumulated conversation (prior tool calls and their results) that grows with each step, and the tool result payloads themselves, which are often the largest single contributor. A tool that dumps a 40 KB JSON blob into context on every call will quietly dominate your bill.

Multi-agent designs multiply this. A multi-agent system is one where an orchestrator delegates subtasks to separate subagents that each run their own context window. That isolation is great for focus, but it means multi-agent runs typically burn several times more tokens than a single agent doing the same work, because the orchestrator and each subagent carry overlapping context. Reach for multiple agents when the parallelism genuinely pays off, not by default.

The practical move is to instrument first. Log input and output tokens per turn and per tool. Once you can see that 60% of your spend is one verbose tool, the optimization writes itself: that tool should return less.

Prompt caching is the highest-leverage win

Prompt caching is the single biggest lever for agentic cost. The idea is simple: the stable prefix of your prompt — system instructions, Skill content, tool definitions, large reference docs — doesn't change between turns, so the model provider can cache it and charge a fraction of the normal input price to reuse it. Cached reads are dramatically cheaper than fresh input tokens, and on a long agent run where that prefix is resent every single turn, the savings are enormous.

To benefit, you have to keep the cacheable part stable and at the front. Put the unchanging material — system prompt, Skill instructions, tool schemas, static reference data — at the very beginning of the context, and let the volatile material (the latest user turn, fresh tool results) come after. The moment you mutate something near the top, you invalidate the cache for everything below it and pay full price again.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["New turn in agent run"] --> B{"Stable prefix unchanged?"}
  B -->|Yes| C["Reuse cached prefix at low cost"]
  B -->|No| D["Reprocess prefix at full price"]
  C --> E["Process only new tokens"]
  D --> E
  E --> F{"Tool result large?"}
  F -->|Yes| G["Summarize or paginate before adding"]
  F -->|No| H["Append result"]
  G --> I["Model continues"]
  H --> I

A common self-inflicted wound is putting a timestamp or a per-turn counter near the top of the system prompt. It feels harmless, but it busts the cache on every turn. Keep volatile values out of the cached region entirely.

Make tools return less

Tool results are where careless designs bleed tokens. An agent rarely needs an entire API response; it needs the few fields relevant to the decision at hand. Design tools to return compact, purpose-built payloads: the fields that matter, not the raw upstream object. If a tool can return 200 rows, give it pagination and a sensible default limit so the model asks for more only when it needs to.

The same logic applies to file reads in coding agents. Reading a 3,000-line file to change one function is wasteful. Prefer targeted reads — a search that returns matching lines with a little surrounding context — over loading whole files into the window. Claude Code leans on exactly this pattern, and you should mirror it in your own Skills: tell the model to search and read narrowly, not to slurp everything "just in case."

When a large payload is genuinely needed mid-run, summarize it before it lands in context. A short, structured digest of a long document costs a fraction of the original and usually preserves everything the model's next decision depends on.

Batch the work, don't serialize it

Latency, not just cost, is shaped by how many sequential turns a task takes. Every turn is a full round trip to the model. If your Skill nudges the agent to do one tiny thing per turn, you pay for that round-trip overhead repeatedly. Encourage batching: when several independent tool calls have no dependency between them, issue them together in a single turn rather than one at a time. Claude can request multiple tool calls at once, and reading three files in parallel is far faster than three separate turns.

Write this into the Skill explicitly: "If you need to read several files and they don't depend on each other, request them in one step." The model won't always batch on its own; a direct instruction reliably collapses a five-turn sequence into one or two. Independent work fanned out, dependent work sequenced — that's the rule.

Match the model to the step

Not every step needs the most capable model. The Claude 4.x family spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 as a strong all-rounder, and Haiku 4.5 for fast, cheap, high-volume work. A well-tuned pipeline uses them deliberately: a small, frequent classification or routing step can run on Haiku, while the gnarly planning or code-generation step that actually needs deep reasoning runs on Opus or Sonnet.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

In a multi-agent setup this is especially powerful. The orchestrator that decomposes a task may warrant a stronger model, while narrow, well-specified subagents can often run on a smaller one. You pay top-tier prices only where they buy you something. The art is knowing which steps are genuinely hard — over-downgrade and you'll spend more on retries and loops than you saved on tokens.

Measure, then guard the gains

Cost optimization isn't a one-time pass; it's a number you have to defend. Add token and latency tracking to your runs and watch it across releases, because a single innocent-looking change — a tool that now returns one extra field, a Skill edit that moves a volatile value into the cached region — can silently undo weeks of savings. Treat a cost regression like a performance regression: something to catch in review, not in the bill at month's end.

The teams that keep agents cheap aren't the ones who optimized once. They're the ones who made cost observable, kept the cacheable prefix stable, trimmed every tool to its essential output, batched independent work, and routed each step to the smallest model that could do the job. None of those moves are exotic. They just have to become habits.

Frequently asked questions

What is prompt caching and why does it matter for agents?

Prompt caching lets the provider reuse the unchanging prefix of your prompt — system instructions, Skill content, tool schemas — at a fraction of the normal input cost. On long agent runs that resend the same prefix every turn, it's the single largest cost saving available, as long as you keep that prefix stable and at the front.

Why are multi-agent runs so expensive?

Each subagent runs its own context window, and the orchestrator plus subagents carry overlapping context, so a multi-agent run typically uses several times more tokens than one agent doing the same task. Use multiple agents only when the parallelism genuinely outweighs the extra cost.

How do I reduce tokens spent on tool results?

Design tools to return compact, purpose-built payloads instead of raw upstream objects, paginate large result sets, and summarize big documents before they enter context. For coding agents, prefer targeted searches over reading whole files.

Should every step use the most capable model?

No. Route cheap, high-volume steps like routing or classification to a smaller, faster model, and reserve the strongest model for steps that truly need deep reasoning. Matching model to step is one of the cleanest ways to cut cost without losing quality.

Bringing agentic AI to your phone lines

Voice agents live or die on latency and cost — a caller won't wait, and the math has to work at scale. CallSphere applies these same efficiency patterns to voice and chat: cached context, lean tool calls, and the right model per step, so agents answer instantly around the clock. See it live at callsphere.ai.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.