Cutting Claude agent costs: caching, batching, fast runs
Keep Claude Skills and agent runs cheap and fast: prompt caching, batching, progressive disclosure, and routing across Opus, Sonnet, and Haiku.
An agent that works is only half the battle; an agent that works without quietly costing you a fortune is the other half. The moment you wrap Claude in Skills, tools, and multi-step loops, token consumption stops being a single prompt and becomes a running meter — every tool result, every re-read of the Skill instructions, every reasoning step adds to the bill and to latency. The good news is that most agentic cost is not inherent; it is waste, and waste is fixable. This post walks through the levers that actually move the needle on cost and speed when you build with Claude.
Start with a precise definition so we are optimizing the right thing. Prompt caching is a mechanism that lets Claude reuse the already-processed prefix of a request — system prompt, tool definitions, loaded Skill content — across calls, so you pay full price to process that prefix once and a steep discount to reuse it. In an agent loop where the same Skill and tool schemas ride along on every turn, that reused prefix is often the largest part of each request, which makes caching the single highest-leverage optimization available.
Why agent runs get expensive
Three forces inflate agentic cost. First, context growth: each turn appends the previous tool calls and results, so by step ten the model is re-reading nine prior steps. Second, redundant static content: the system prompt, the full Skill instructions, and every tool schema are resent on each turn even though they never change. Third, model mismatch: running a frontier model like Opus on trivial routing or formatting steps pays premium rates for work a smaller model does just as well.
Quantify before you optimize. Log input and output tokens per turn, tag each by which Skill and which tools were active, and total it per completed task. You will almost always find a small number of steps dominate spend — typically the ones that paste large tool results into context. Optimization without measurement is guesswork; one good cost dashboard pays for itself in a day.
Caching: the highest-leverage lever
Structure every request so the unchanging parts come first: system prompt, then tool definitions, then loaded Skill content, then the conversation. That ordering lets the cache hit a long stable prefix on every turn of a loop. The discount on cached prefix tokens is large enough that a chatty ten-step agent can drop its effective input cost dramatically just by making the prefix cacheable and keeping it byte-stable.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
The trap is cache invalidation. Any change to the cached prefix — even reordering tools or injecting a timestamp into the system prompt — busts the cache and you pay full price again. So keep volatile data (current time, per-request ids) out of the prefix and down in the user turn. Treat your system prompt, tool list, and Skill bundle as immutable for the duration of a session.
flowchart TD
A["Agent turn begins"] --> B{"Prefix unchanged & cached?"}
B -->|Yes| C["Reuse cached prefix at discount"]
B -->|No| D["Process full prefix at full price"]
C --> E{"Step trivial?"}
D --> E
E -->|Yes| F["Route to Haiku/Sonnet"]
E -->|No| G["Route to Opus"]
F --> H["Append only new tool result"]
G --> H
H --> A
Batching independent work
Agents waste time and tokens when they do sequentially what could be done in parallel. If a Skill needs to enrich fifty records, calling a tool fifty times in a serial loop pays the prefix and round-trip cost fifty times. Where the operations are independent, design the tool to accept a batch — pass an array of ids, return an array of results — so one call replaces fifty. The model reasons once, the network round-trips once, and the cached prefix amortizes over the whole batch.
For genuinely large offline jobs where latency does not matter, use asynchronous batch processing rather than a live agent loop. Submitting many requests as a batch typically comes at a substantial discount over real-time calls, which is ideal for backfills, bulk classification, or nightly enrichment that a Skill orchestrates but does not need to watch in real time.
Parallel subagents are a related lever with a catch. Fanning work out to several subagents speeds wall-clock time, but multi-agent runs typically consume several times more tokens than a single agent doing the same work, because each subagent carries its own context and prompts. Use parallelism when the speedup is worth the multiplier — not by default.
Progressive disclosure with Skills
Skills are a cost optimization in their own right when used correctly. The whole point of dynamic loading is that Claude only pulls a Skill's full instructions into context when the task is relevant — so a library of fifty Skills costs almost nothing until one fires. Lean into this: keep each Skill's top-level description short and load heavy reference material (long examples, schemas, lookup tables) only when needed, by referencing files the agent reads on demand rather than pasting everything inline.
The anti-pattern is the mega-Skill that front-loads ten thousand words of instructions "just in case." That content rides along on every turn, inflating the cached prefix and the per-turn cost. Split it. A small always-loaded core plus on-demand detail keeps the steady-state context lean while preserving capability.
Model routing across the Claude family
Not every step deserves the most expensive model. The 2026 Claude lineup spans Opus 4.8 for the hardest reasoning, Sonnet 4.6 for balanced work, and Haiku 4.5 for fast cheap calls. A well-designed agent routes by difficulty: Opus plans and handles ambiguous reasoning, while Haiku or Sonnet does extraction, formatting, classification, and routing. You can implement this as a router step that tags each subtask with a difficulty and dispatches accordingly, or by giving cheap deterministic steps to smaller models in a multi-stage pipeline.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Tighten output, too. Output tokens cost more than input tokens, so instruct Skills to return structured, terse results — ids and statuses rather than restated prose — and reserve long natural-language generation for the final user-facing turn. An agent that narrates every internal step in flowery sentences burns output budget on text no one reads.
Frequently asked questions
Does prompt caching change the model's answers?
No. Caching reuses already-processed prefix tokens to save cost and latency; the computation and the output are identical to an uncached call. It is purely an efficiency mechanism, not a behavior change.
When should I use batch processing instead of a live agent?
When the work is large, independent, and not latency-sensitive — backfills, bulk classification, nightly enrichment. Asynchronous batches usually come at a meaningful discount over real-time calls, so reserve live loops for interactive work.
Are multi-agent systems cheaper because they parallelize?
Faster in wall-clock time, but typically more expensive in tokens — often several times more than a single agent — because each subagent carries its own context. Parallelize when speed justifies the multiplier, not by default.
How do Skills reduce cost?
Through progressive disclosure: Claude loads a Skill's full instructions only when the task is relevant, so a large Skill library stays nearly free until used. Keep descriptions short and load heavy reference material on demand to keep the per-turn context lean.
Bringing agentic AI to your phone lines
CallSphere runs these efficiency patterns — cached prefixes, batched tool calls, and model routing — under the hood of voice and chat agents that answer every call and message and book work 24/7 without runaway cost. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.