Cutting Claude Agent Token Cost: Caching & Batching (Security Program AI Accelerated Offense)
Keep Claude agent runs fast and cheap with prompt caching, batching, model tiering, and context discipline — cut cost and latency by half without losing quality.
An agent that reasons well but costs forty cents a run is a science project. An agent that reasons well and costs four cents a run is infrastructure you can deploy across an entire security program. The difference between those two numbers is rarely the model — it is everything around the model: how much context you resend, which model handles which step, and how aggressively you cache and batch. As AI-accelerated attackers raise the volume of events your defenses must process, the economics of your own agents decide whether you can afford to run them on everything or only on a lucky few alerts.
This post is about making Claude agent runs cheap and fast on purpose. The techniques are concrete and stackable, and most teams find that combining three or four of them cuts both cost and latency by more than half without touching output quality.
Where the tokens actually go
Before optimizing, measure. In a typical multi-turn agent, the dominant cost is not the model's output — it is the input tokens, resent on every single turn. Each turn includes the system prompt, every tool definition, and the entire growing conversation history. A twenty-turn run with a large system prompt can resend that prompt twenty times. The output the model generates is often a tenth of what you pay for.
This matters because the cheapest optimization is almost always reducing repeated input, not shortening output. Once you internalize that input dominates, the priority list writes itself: cache the static parts, prune the growing parts, and avoid sending tokens to an expensive model when a cheap one will do.
Prompt caching: pay once for the stable prefix
Prompt caching is the single highest-leverage lever for agents. Claude lets you mark a stable prefix of the prompt — your system instructions, tool definitions, and any large reference material — as cacheable. On the first call you pay full price to write the cache; on subsequent calls within the cache lifetime, reading that prefix costs a small fraction of the normal input price. Because an agent resends the same system prompt and tool block on every turn, caching that prefix turns your biggest recurring cost into a near-free read.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Agent turn N"] --> B{"Stable prefix\nin cache?"}
B -->|Yes| C["Read cache:\n~1/10 input price"]
B -->|No| D["Write cache:\nfull price once"]
C --> E["Append only new\nturn tokens"]
D --> E
E --> F{"Output > threshold\nor task done?"}
F -->|Continue| A
F -->|Done| G["Return result"]The practical rules: order your prompt so everything stable comes first and everything volatile comes last, place the cache breakpoint at the boundary, and keep that prefix byte-for-byte identical across turns — a single changed character invalidates the cache. Teams new to caching often sabotage it by injecting a timestamp or a per-turn counter into the system prompt; move that volatile data into the user message instead and your hit rate jumps to nearly one hundred percent.
Model tiering: don't send everything to Opus
Not every step in an agent needs the most capable model. Claude offers a tiered family — Opus 4.8 for the hardest reasoning, Sonnet 4.6 for the broad middle, and Haiku 4.5 for fast, cheap, high-volume work. A well-designed agent routes by difficulty. Use Haiku to classify an incoming alert's severity and extract entities; use Sonnet for the main triage loop; reserve Opus for the genuinely ambiguous cases that get escalated. In a multi-agent setup, the orchestrator can run on a stronger model while parallel subagents doing narrow, well-defined work run on a cheaper one.
The savings are large because the volume distribution favors the cheap tiers. The vast majority of events are routine and handled by Haiku at a fraction of the cost; only the long tail of hard cases ever reaches Opus. Pick the model per step based on the step's actual difficulty, not the agent's overall ambition.
Batching: amortize the overhead
When you have many independent items to process — overnight log enrichment, bulk classification of a backlog of alerts, regenerating embeddings — you rarely need them answered in real time. Anthropic's batch processing accepts a large set of requests and returns them asynchronously at a meaningful discount versus synchronous calls. For any workload that is not latency-sensitive, batching is free money: the same model, the same quality, a lower bill, in exchange for waiting minutes instead of milliseconds.
The architectural move is to separate your interactive path from your bulk path. Real-time triage stays synchronous and cached; nightly sweeps, retrospective re-analysis, and report generation go through the batch path. Most security programs have far more bulk work than they realize, and routing it correctly is one of the easiest cost wins available.
Context discipline: prune the transcript
Even with caching, an unbounded conversation history grows expensive and, worse, dilutes the model's attention. Long-running agents need active context management. Summarize completed sub-tasks into a compact note and drop the verbose intermediate turns. Trim tool results to the fields that matter — a SIEM query might return fifty columns when the agent needs three. For retrieval-heavy agents, fetch only the top relevant chunks rather than dumping whole documents into context.
The Claude Agent SDK and Claude Code support patterns for this, including compaction of older turns and selective context windows. The goal is a transcript that grows roughly with the meaningful state of the task, not linearly with the number of turns. A disciplined agent doing a long job should plateau in context size, not climb forever.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Measuring cost as a first-class metric
Cheap agents stay cheap only if you watch them. Emit per-run metrics for input tokens, output tokens, cache read versus write tokens, and the model tier used at each step. Set a budget alert per run and per day. When a run blows its budget, that is a debugging signal — usually a cache miss, a loop, or a step accidentally routed to Opus. Treating cost as a monitored SLO, not an end-of-month surprise, is what keeps an agentic security program economically sustainable as your event volume grows.
Frequently asked questions
How much can prompt caching save on a Claude agent?
Because agents resend a large stable prefix — system prompt plus tool definitions — on every turn, and cache reads cost a small fraction of normal input price, caching that prefix often cuts total input cost dramatically on multi-turn runs. The key is keeping the cached prefix byte-for-byte identical across turns so it actually hits.
When should I use batch processing instead of synchronous calls?
Use batching for any workload that is not latency-sensitive — overnight enrichment, bulk classification, retrospective analysis, report generation. You get the same model and quality at a discount in exchange for asynchronous, minutes-not-milliseconds delivery. Keep interactive triage on the synchronous, cached path.
Should every agent step use the most capable model?
No. Route by difficulty: Haiku for fast classification and extraction, Sonnet for the main loop, and Opus only for genuinely ambiguous escalations. Because most events are routine, the cheap tiers handle the volume and only the long tail reaches the expensive model, which is where most of the savings come from.
What's the simplest way to keep context from ballooning?
Summarize completed sub-tasks into compact notes and drop verbose intermediate turns, trim tool results to the fields you actually use, and retrieve only the top relevant chunks instead of whole documents. The aim is for context to track meaningful task state rather than growing linearly with turn count.
Bringing agentic AI to your phone lines
Caching, model tiering, and tight context discipline are exactly what make real-time voice agents economical at scale. CallSphere applies these agentic-AI patterns to voice and chat — assistants that answer every call and message, call tools mid-conversation, and book work 24/7 while keeping each interaction fast and cheap. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.