Cutting Token Cost in Agent Skills: Cache & Batch

Make Claude Agent Skills cheap and fast with prompt caching, batched tool calls, context trimming, and per-step model routing — measured, not guessed.

A Skill that works is only half the job. The other half is making it run cheaply and quickly enough to use in production. The same agentic loop that solves a hard task can also burn through tokens — re-reading the same files, re-sending a 3,000-token Skill on every turn, calling tools one at a time when it could batch them. Multi-agent setups make this worse: spawning subagents can multiply token usage several times over a single agent. If you don't measure and tune cost, a Skill that looked great in a demo becomes a line item nobody wants to defend.

This post is about the levers that move cost and latency the most when refining a Claude Agent Skill: prompt caching, batching, context discipline, and per-step model selection. Each one is concrete and measurable, and each can be applied without degrading output quality as long as you re-measure quality along with cost.

Key takeaways

  • Prompt caching is the single biggest win: on the Anthropic API, cached prefix tokens are read at roughly a tenth of the base input price, in exchange for a modest premium on the first write.
  • Order your prompt so stable content (system, Skill, tools) comes first and the cache hits more often.
  • Batch independent tool calls and reads instead of running them in serial round-trips.
  • Trim what re-enters context every turn; long histories cost on every subsequent call.
  • Route cheap steps to Haiku and reserve Opus for the genuinely hard reasoning turns.

Prompt caching is a feature where the model provider stores a prefix of your prompt so that repeated, identical leading content is billed and processed at a steep discount on later calls instead of being re-read from scratch every time. For an agent that sends the same system prompt and Skill on every turn, this is where most of the savings live.
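To make the discount concrete, here is a back-of-the-envelope sketch. The multipliers (cache writes at about 1.25x the base input price, cache reads at about 0.1x) and the idea of a single write followed by reads are assumptions for illustration; check current pricing and cache lifetime for your model before relying on the numbers.

```python
def prefix_cost(prefix_tokens: int, turns: int, base_price_per_mtok: float,
                cached: bool, write_mult: float = 1.25,
                read_mult: float = 0.10) -> float:
    """Estimate dollars spent on the stable prefix across a run.

    write_mult and read_mult are assumed pricing multipliers, not
    values taken from this article.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    per_token = base_price_per_mtok / 1_000_000
    if not cached:
        # Every turn pays full price for the whole prefix.
        return prefix_tokens * turns * per_token
    # Turn 1 writes the cache; every later turn reads it
    # (assumes the cache never expires mid-run).
    write = prefix_tokens * write_mult * per_token
    reads = prefix_tokens * read_mult * per_token * (turns - 1)
    return write + reads
```

With a 3,000-token prefix, 20 turns, and an assumed base price of $3 per million input tokens, this gives about $0.18 uncached versus about $0.028 cached. On a single-turn run the cached variant is slightly more expensive, because the write premium is never recovered.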

Where does the money actually go?

Before optimizing, measure. For one representative run, log per-turn input tokens, output tokens, cached tokens, and the wall-clock time of each tool call. You will almost always find one of three culprits dominating: a large static prefix re-sent uncached every turn, a context window that grows unbounded as history accumulates, or serial tool calls that add round-trip latency (and another full pass over the context) for results that could have arrived together.
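A minimal sketch of that instrumentation might look like the following. RunLog and TurnMetrics are hypothetical names, and the sketch assumes you pass in each response's usage data as a plain dict keyed by the Anthropic usage field names.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class TurnMetrics:
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    tool_seconds: dict[str, float] = field(default_factory=dict)


class RunLog:
    """Collects per-turn token counts and tool latency for one run."""

    def __init__(self) -> None:
        self.turns: list[TurnMetrics] = []

    def record_turn(self, usage: dict) -> TurnMetrics:
        # `usage` is assumed to be the response's usage block as a dict.
        metrics = TurnMetrics(
            input_tokens=usage.get("input_tokens") or 0,
            output_tokens=usage.get("output_tokens") or 0,
            cached_tokens=usage.get("cache_read_input_tokens") or 0,
        )
        self.turns.append(metrics)
        return metrics

    @contextmanager
    def time_tool(self, name: str):
        """Time one tool call and attribute it to the latest turn."""
        if not self.turns:
            raise RuntimeError("record a turn before timing its tools")
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            bucket = self.turns[-1].tool_seconds
            bucket[name] = bucket.get(name, 0.0) + elapsed

    def totals(self) -> dict:
        return {
            "turns": len(self.turns),
            "input_tokens": sum(t.input_tokens for t in self.turns),
            "output_tokens": sum(t.output_tokens for t in self.turns),
            "cached_tokens": sum(t.cached_tokens for t in self.turns),
            "tool_seconds": sum(sum(t.tool_seconds.values()) for t in self.turns),
        }
```

Wrap each tool invocation in `with log.time_tool("read_file"):` and call `log.totals()` at the end of the run; the per-turn list shows whether input tokens climb steadily (growing history) or stay high and uncached (a prefix that misses the cache).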

The cheapest token is the one you never send. So the optimization order is: first stop re-sending stable content (cache it), then stop re-sending stale content (trim it), then stop waiting on serial calls (batch them), and only then consider a smaller model for parts of the work.

flowchart TD
  A["Incoming turn"] --> B{"Prefix unchanged\nfrom last turn?"}
  B -->|Yes| C["Cache hit:\ncheap, fast prefix"]
  B -->|No| D["Cache miss:\nfull prefix billed"]
  C --> E{"Need multiple\ntool results?"}
  D --> E
  E -->|Yes| F["Batch independent calls"]
  E -->|No| G["Single call"]
  F --> H{"Step is simple?"}
  G --> H
  H -->|Yes| I["Route to Haiku"]
  H -->|No| J["Route to Opus/Sonnet"]

How do I structure a prompt so caching actually hits?

Caching keys on an exact prefix match. The rule that follows is simple but easy to violate: put everything stable at the front and everything volatile at the back. Your system prompt, the loaded Skill, and your tool definitions should appear first and byte-for-byte identical across turns. The dynamic conversation and the latest user input come last. If you inject a timestamp or a per-turn note into the system prompt, you invalidate the cache on every single turn and pay full price forever.
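One way to catch an unstable prefix before it costs you is to fingerprint the stable content on every turn and compare. The sketch below is an illustration only: prefix_fingerprint is a hypothetical helper, and hashing a local JSON serialization approximates, rather than reproduces, what the provider matches on.

```python
import hashlib
import json


def prefix_fingerprint(system_text: str, tools: list[dict]) -> str:
    """Hash the content that is supposed to stay identical across turns.

    Keys are deliberately NOT sorted and list order is preserved, so a
    reshuffled tool list or an edited system prompt changes the hash.
    """
    canonical = json.dumps(
        {"tools": tools, "system": system_text},
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Compute the fingerprint once at the start of a run and assert it is unchanged before each request. If it differs, something volatile (a timestamp, a per-turn note, a reordered tool) has leaked into the prefix and belongs in the latest user message instead.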

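With the Anthropic API you mark cache boundaries explicitly. The cached prefix is assembled in the order tools, then system, then messages, so a breakpoint on the last stable system block also covers the tool definitions that precede it. The read_files tool below is only an illustrative placeholder:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    { "name": "read_files",
      "description": "Read several files in one call.",
      "input_schema": { "type": "object",
        "properties": { "paths": { "type": "array", "items": { "type": "string" } } },
        "required": ["paths"] } }
  ],
  "system": [
    { "type": "text", "text": "<your stable system + Skill instructions>",
      "cache_control": { "type": "ephemeral" } }
  ],
  "messages": [
    { "role": "user", "content": "<the volatile, per-turn content goes here>" }
  ]
}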

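After deploying this, check the usage fields in each response. On the first turn you should see cache_creation_input_tokens populated as the prefix is written; on later turns cache_read_input_tokens should be large relative to input_tokens, which means caching is working. If cache_read_input_tokens stays near zero, one of three things is happening: something upstream of your breakpoint is changing between turns, the gap between calls exceeds the cache lifetime (five minutes by default, refreshed on each hit), or the prefix is shorter than the model's minimum cacheable length. Hunt it down.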
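The check can be reduced to a single ratio per turn. This sketch assumes the usage block is available as a dict and that input_tokens counts only the uncached portion of the prompt, which is my reading of how the usage fields are reported; cache_hit_ratio is a hypothetical helper name.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of this turn's prompt tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    # An empty usage block yields 0.0 rather than a division error.
    return read / total if total else 0.0
```

A first turn that writes the cache should score 0.0, and later turns with a stable 3,000-token prefix and a few hundred volatile tokens should score above 0.9. A ratio that drops back to zero mid-run is the signal that the prefix changed or the cache expired.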

When should I batch versus run serially?

Batch when calls are independent; keep serial when one call's result feeds the next. If a Skill needs to read five config files to build a picture, asking for all five in one turn is faster and avoids five round-trips of model latency. But if it must read a file, decide based on its contents, then read another, batching would force a guess. The decision rule is purely about data dependency.

You encourage batching in the Skill itself: "When you need several files whose paths you already know, request them together in a single turn rather than one at a time." Pair that with tools that accept arrays — a read_files that takes a list beats five calls to read_file. The tokens are similar; the latency and overhead are much lower.
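As a rough sketch of the array-accepting tool, the read_files handler below reads every requested path concurrently and returns one combined result. The function names follow the text; the thread pool, the error-string convention, and the max_workers default are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: str) -> str:
    """Single-file tool: one call, one round-trip."""
    return Path(path).read_text(encoding="utf-8")


def read_files(paths: list[str], max_workers: int = 8) -> dict[str, str]:
    """Batched tool: read independent files concurrently in one call."""

    def _safe_read(path: str) -> str:
        try:
            return read_file(path)
        except (OSError, UnicodeDecodeError) as exc:
            # Report the failure per file so one bad path
            # does not sink the whole batch.
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        contents = list(pool.map(_safe_read, paths))
    return dict(zip(paths, contents))
```

Returning per-file errors inside the result keeps the batch to a single turn even when one path is wrong, which is usually cheaper than failing the call and letting the model retry file by file.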

How do I keep context from ballooning?

Every turn re-sends the accumulated history, so a long run pays for its own past on every future call. Three disciplines keep it bounded. First, summarize completed phases: once a sub-task is done, replace its verbose transcript with a short result note. Second, don't dump entire files into context when a targeted slice or a search result would do. Third, for long-horizon work, use subagents that each carry their own focused context and return only a compact result to the orchestrator, rather than one agent dragging an ever-growing history.
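The first discipline, replacing a finished phase with a short result note, can be sketched as below. It assumes each message in your local history carries a hypothetical "phase" tag used only for bookkeeping (strip it before sending), and that the summary text is produced elsewhere, for example by a cheap model call.

```python
def compact_history(messages: list[dict], phase: str, summary: str) -> list[dict]:
    """Replace every message tagged with `phase` by one short result note.

    The note is placed where the phase began, so the remaining
    messages keep their original order.
    """
    compacted: list[dict] = []
    note_added = False
    for msg in messages:
        if msg.get("phase") == phase:
            if not note_added:
                compacted.append(
                    {"role": "user", "content": f"[Result of {phase}] {summary}"}
                )
                note_added = True
            continue  # drop the verbose transcript for this phase
        compacted.append(msg)
    return compacted
```

Compaction rewrites the message history, so it does not disturb a cache breakpoint placed on the system and tool prefix, but it does change anything cached further down in the messages.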

Common pitfalls

  • Putting a timestamp in the system prompt. It silently busts the cache every turn. Keep volatile values out of the cached prefix entirely.
  • Reordering tools between turns. A reshuffled tool list changes the prefix and misses the cache. Keep tool order stable across a run.
  • Reading whole files to answer narrow questions. A 4,000-line file in context costs on every subsequent turn. Search first, read the slice you need.
  • Reaching for multi-agent by default. Subagents multiply token usage several times over; use them when the work genuinely parallelizes, not as a reflex.
  • Optimizing before measuring. You'll spend hours shaving a step that was 2% of cost while a re-sent 3,000-token prefix quietly dominates. Profile first.

Make a Skill cheap in 6 steps

  1. Instrument one real run: capture per-turn input, output, and cached tokens plus tool latency.
  2. Move all stable content to the front and add a cache breakpoint after it.
  3. Verify cache_read_input_tokens is high on later turns; if not, find what's changing.
  4. Convert independent serial tool calls into batched ones and add array-accepting tools.
  5. Add summarization of finished phases so history stops growing unbounded.
  6. Route trivial steps to Haiku, keep Opus for the hard turns, and re-measure cost and quality together.

Model choice by step type

Step | Model | Why
Classify / route intent | Haiku 4.5 | Cheap, fast, simple decision
Extract / reformat data | Haiku 4.5 | Low reasoning, high volume
Plan multi-step work | Sonnet 4.6 | Balanced cost and reasoning
Hard debugging / synthesis | Opus 4.8 | Worth the cost on hard turns
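The table translates directly into a lookup. In this sketch the step keys ("classify", "extract", "plan", "debug") are hypothetical labels your orchestrator would assign, the values are the table's model names rather than API model IDs, and falling back to the middle tier for unknown steps is an assumption, not a recommendation from the table.

```python
# Step type -> model label, copied from the table above.
MODEL_BY_STEP = {
    "classify": "Haiku 4.5",
    "extract": "Haiku 4.5",
    "plan": "Sonnet 4.6",
    "debug": "Opus 4.8",
}

# Assumed default: unknown steps go to the balanced tier.
DEFAULT_MODEL = "Sonnet 4.6"


def pick_model(step_type: str) -> str:
    """Return the model label for a step, tolerating case and whitespace."""
    return MODEL_BY_STEP.get(step_type.strip().lower(), DEFAULT_MODEL)
```

Keeping the mapping in one place makes it easy to re-route a step type after a quality check, and to map each label to whichever concrete model ID your account uses.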

Frequently asked questions

Does caching change the output?

No. Caching affects how the prefix is processed and billed, not the tokens the model sees, so output quality is unchanged. It is one of the few pure wins available.

How much can caching realistically save?

It depends on how much of your prompt is stable versus volatile, but for agents that re-send a large system prompt and Skill on every turn — most of them — the repeated portion gets a steep discount, which compounds fast over a long run.

Is mixing models in one Skill worth the complexity?

If your run has a clear split between trivial and hard steps, yes. Routing classification to Haiku and reserving Opus for synthesis can cut cost substantially, usually without a visible quality loss, because the cheap steps didn't need the expensive model. Confirm that on your own runs by re-measuring quality alongside cost.

How do I stop history from exploding on long runs?

Summarize finished phases into short notes and offload parallelizable work to subagents that return compact results. The orchestrator should hold conclusions, not full transcripts.

Bringing agentic AI to your phone lines

CallSphere runs these same efficiency patterns — cached prefixes, batched tool calls, the right model per step — on voice and chat agents that answer every call and message, use tools live, and book work 24/7, fast and affordably. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.