Cutting Token Cost in Claude Code Threat Agents
Caching, batching, model cascades, and per-run budgets that keep a Claude Code threat-detection agent fast and cheap without losing investigation quality.
A threat-detection agent that triages ten thousand alerts a day has a brutal economic constraint that a chatbot never faces: every alert costs tokens, and most alerts are noise. If each triage burns 40,000 tokens of context and reasoning, the bill and the latency both spiral until the security team starts sampling alerts instead of inspecting them — which defeats the purpose. Performance engineering for Claude Code agents is therefore not a nice-to-have; it is what makes continuous, full-coverage detection affordable. This post walks through the levers that matter, in roughly the order they pay off.
Understand where the tokens actually go
Before optimizing, measure. In a typical triage run, tokens fall into four buckets: the system prompt and tool definitions (sent every turn), the alert payload, the tool results pulled during enrichment, and the model's own reasoning. The surprising finding for most teams is that the first and third buckets dominate. A large, static system prompt re-sent on every one of ten thousand alerts is pure waste, and verbose tool results — dumping an entire JSON log when the agent needs three fields — bloat context fast.
So the first move is to log per-run token counts split by bucket. Claude Code surfaces input and output token usage; attribute them to phases with your hooks. Once you can see that 60% of input tokens are the unchanging system prompt and tool schema, the highest-leverage fix becomes obvious.
Prompt caching: the single biggest win
Anthropic's prompt caching lets you mark stable prefixes of a request so they are processed once and reused across calls at a large discount, rather than re-billed in full every time. For a detection agent, the system prompt, the tool definitions, and any static playbook text are identical across thousands of alerts — exactly the content caching is built for. Place all of it at the front of the context, mark it as cacheable, and only the per-alert tail varies.
The discipline that makes caching work is prefix stability. Cache hits require a byte-identical prefix, so anything that changes per request — a timestamp, the alert ID, the current host — must live after the cached block, never interleaved into it. Teams that sprinkle dynamic values through their system prompt get near-zero cache hits and wonder why caching "didn't help." Structure the context as: stable cached header, then the variable alert body.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Alert arrives"] --> B{"Cheap pre-filter matches?"}
B -->|Known-benign| C["Auto-close, no model call"]
B -->|Needs triage| D["Build request: cached prefix + alert"]
D --> E{"Cache hit on prefix?"}
E -->|Yes| F["Reuse cached tokens, cheap"]
E -->|No| G["Process full prefix once"]
F --> H["Haiku triage pass"]
G --> H
H --> I{"Ambiguous or high-risk?"}
I -->|Yes| J["Escalate to Sonnet/Opus"]
I -->|No| K["Emit verdict"]
Don't send what the model doesn't need to read
The cheapest token is the one you never send. A huge share of waste in detection agents comes from feeding raw tool output straight into context. When the agent queries auth logs, don't return 500 rows of every field; return a pre-summarized view — count, distinct source IPs, the three most recent events. Do the aggregation in the tool, in plain code, and hand the model a compact, decision-ready summary. This is the highest-quality-preserving optimization there is, because plain code is free and deterministic while model tokens are neither.
The same logic applies to the alert itself. Many SIEM alerts are enormous JSON blobs with dozens of irrelevant fields. Project them down to the fields your playbook actually uses before they ever reach Claude. A 2,000-token alert trimmed to 300 tokens, multiplied across thousands of alerts, is real money and real latency.
Route by difficulty: a model cascade
Not every alert needs your most capable model. A model cascade triages cheaply and escalates rarely. Run a first pass with Haiku 4.5 — fast and inexpensive — on every alert to handle the obvious benign and obvious-malicious cases. Only when Haiku reports low confidence or a high potential blast radius do you escalate the same context to Sonnet 4.6 or Opus 4.8 for deeper investigation. Because the hard cases are a small fraction of total volume, the blended cost per alert drops sharply while quality on the cases that matter stays high.
Pair the cascade with a non-AI pre-filter. A large fraction of alerts match known-benign patterns that simple rules can close without any model call at all. Spend model tokens only on genuine ambiguity. The cheapest agentic system is the one that knows when not to invoke an agent.
Batching and parallelism without runaway cost
For backfills and periodic sweeps where latency is not critical, batch processing of many alerts at off-peak schedules amortizes overhead and is well-suited to asynchronous processing. For live triage, use Claude Code's parallel subagents deliberately: spinning up several enrichment subagents to gather context concurrently cuts wall-clock latency, but remember that multi-agent runs typically consume several times more tokens than a single agent. Parallelize for speed when an analyst is waiting; stay single-agent for bulk overnight work where throughput-per-dollar wins.
Cap the runaway run
The worst cost events are not the average run; they are the pathological one that loops twenty times on a confusing alert. Set hard budgets per run — a maximum tool-call count and a maximum token ceiling enforced in a hook — and when a run hits the ceiling, stop and flag it for a human rather than letting it grind. A single uncapped agent chewing through context on a malformed alert can cost more than a thousand well-behaved runs. Budgets turn a tail-risk catastrophe into a logged, bounded event.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
What is prompt caching and how much does it save?
Prompt caching marks a stable prefix of your request — system prompt, tool definitions, static playbooks — so Claude processes it once and reuses it across many calls at a steep discount instead of re-billing it every time. For high-volume detection where that prefix is identical across thousands of alerts, it is usually the single largest cost reduction available, provided the cached prefix stays byte-identical.
How do I keep a Claude Code agent fast as well as cheap?
Trim what you send (project alerts and tool results down to decision-relevant fields), cache the stable prefix, and run a cheap first pass on Haiku that only escalates ambiguous or high-risk cases to Sonnet or Opus. Use parallel subagents to cut latency when a human is waiting, but keep bulk sweeps single-agent.
When should I use multiple agents versus one?
Use parallel subagents when wall-clock latency matters and the work is genuinely independent, since multi-agent runs typically cost several times more tokens. For overnight backfills and bulk triage where throughput-per-dollar dominates, a single agent with batched, pre-filtered input is the cheaper choice.
How do I prevent one bad run from blowing the budget?
Enforce per-run ceilings on tool-call count and total tokens inside a hook. When a run hits the cap, halt it and route to a human instead of letting it loop. This converts the rare pathological run from an unbounded cost into a bounded, logged event.
Bringing agentic AI to your phone lines
CallSphere runs these same cost disciplines — caching, cheap-first routing, and tight token budgets — under live voice and chat agents that answer every call and message and book work 24/7 without runaway bills. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.