Skip to content
Agentic AI
Agentic AI8 min read0 views

Cutting Token Cost in an LLM Code Security Scanner

Use prompt caching, batching, and diff-scoped context to keep a Claude-powered source-code security agent fast and cheap without losing coverage.

An LLM that reviews source code for security flaws has an appetite. Point it at a 200-file service and a naive implementation will read every file in full, re-send the entire system prompt on every turn, and re-scan unchanged code on every pull request — and your bill, and your latency, will reflect all of it. The good news is that source-code review is unusually amenable to optimization, because most of a repository is stable, most pull requests touch a tiny fraction of it, and Claude offers concrete mechanisms — prompt caching, batching, and disciplined context scoping — to exploit that stability. This post is about making a security scanner that is both cheap and fast without making it dumber.

Cost and quality are not as opposed as they first seem. A scanner that reads only the diff plus its blast radius is not just cheaper than one that reads the whole repo — it is often better, because it spends its limited attention on the code that actually changed instead of drowning in unchanged boilerplate. The art is cutting tokens you do not need while protecting the context you do.

Where the tokens actually go

Before optimizing anything, measure. In a typical Claude security agent the token budget splits across four buckets: the system prompt and tool definitions (sent every turn), the code the agent reads, the accumulated tool results in the running conversation, and the model's own reasoning and output. The first bucket is sneaky-expensive because it is constant per turn — a 4,000-token system prompt across a 30-turn review is 120,000 tokens before you have read a single line of code. The second bucket is the obvious one. The third is the silent killer: tool results pile up in context and you pay to re-send all of them on every subsequent turn.

Instrument your harness to log input and output tokens per turn, broken down by these buckets. You almost always find one dominating bucket, and that is where you optimize first. Optimizing the others is wasted effort.

Prompt caching: stop paying for the same prefix

Prompt caching is the highest-leverage lever for a security scanner, because so much of the input is identical across turns and across runs. Claude's prompt caching lets you mark a stable prefix — your system prompt, your security guidelines, your tool definitions, and any reference material like a list of known-dangerous APIs — so that repeated requests reuse the cached prefix at a steep discount instead of reprocessing it from scratch.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["New PR triggers scan"] --> B{"Diff in changed-files cache?"}
  B -->|Unchanged files| C["Reuse prior findings, skip"]
  B -->|Changed files| D["Build request with cached prefix"]
  D --> E{"Stable prefix cached?"}
  E -->|Hit| F["Reuse prompt cache, pay only new tokens"]
  E -->|Miss| G["Process full prefix, write cache"]
  F --> H["Batch independent files into parallel calls"]
  G --> H
  H --> I["Aggregate findings, emit report"]
  C --> I

The structural rule is: order your prompt from most stable to least stable. Put the unchanging security policy and tool definitions at the very front so they form a long cacheable prefix, and put the volatile per-file code at the end. If you interleave stable and volatile content, you fragment the cacheable region and lose most of the benefit. For a scanner that runs on every commit, a well-structured cached prefix can be the difference between an affordable tool and one finance asks you to turn off.

Batching: review independent files in parallel

Most files in a service are independent from a security standpoint — a vulnerability in the email-templating module rarely depends on the contents of the payment module. That independence is a license to parallelize. Instead of one long serial trajectory that reads file after file, fan the work out: spawn several subagents, each responsible for a bounded slice of the changed files, and let them run concurrently. Claude Code's parallel subagents are built for exactly this orchestrator-and-workers shape.

Batching wins on two axes at once. It cuts wall-clock latency because the slices run concurrently instead of in sequence, and it keeps each subagent's context short, which improves quality and lowers per-call cost. A subagent reviewing eight files stays sharp; a single agent grinding through eighty does not. The orchestrator's job is small — partition the changed files, dispatch the slices, and merge the findings — so it stays cheap even as the workers do the heavy lifting.

Scope the context to the diff and its blast radius

The biggest single saving is simply not reading what you do not need. On a pull request, the security-relevant universe is the changed lines plus their blast radius — the functions that call the changed code and the functions it calls. You rarely need the whole repository in context. Compute the diff, expand it to include immediate callers and callees, and feed the agent that focused slice rather than the entire tree.

Model choice is part of scoping too. Not every step needs your most capable model. Use a faster, cheaper model like Haiku for mechanical triage — "does this file even contain any sinks worth a closer look?" — and reserve a more capable model like Sonnet or Opus for the files that survive triage and demand real reasoning about exploitability. This tiered approach, sometimes called a model cascade, routinely cuts cost substantially because most files are boring and never need the expensive model. The cheap pass filters; the expensive pass reasons.

Cache findings, not just prompts

There is a second cache worth keeping: a content-addressed cache of findings keyed by file hash. If a file has not changed since the last scan, its security verdict has not changed either, so reuse the prior result and skip re-reviewing it. On a busy repository where most pull requests touch a handful of files, finding-level caching means you re-review only what actually changed, turning every-commit scanning from a luxury into a default. Invalidate a file's cached findings whenever its content hash changes, and invalidate everything when the agent's prompt or rules change, since a smarter scanner might catch what an older one missed.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Watch for the false economies

Optimization has failure modes of its own. Scope too aggressively and the agent misses a vulnerability that lives in the interaction between changed and unchanged code — an injection where the tainted input enters in a file you skipped. Cache too eagerly and you serve a stale clean bill of health after a dependency bumped underneath the code. Downgrade to a cheap model too broadly and subtle exploitability reasoning quietly degrades while your dashboards still look green. The discipline is to measure quality alongside cost: track findings against a labeled benchmark every time you tighten the budget, and only keep the optimization if the recall holds. Cheap and fast is the goal; cheap, fast, and blind is a regression wearing a cost-savings costume.

Frequently asked questions

What is prompt caching and how does it help a code scanner?

Prompt caching lets you mark a stable prefix of a request — system prompt, security rules, tool definitions — so repeated requests reuse the cached prefix at a large discount instead of reprocessing it. For a scanner that runs on every commit and re-sends the same policy every time, this is usually the single biggest cost saving available.

Should I review the whole repo or just the diff?

For per-pull-request scanning, review the diff plus its blast radius — the immediate callers and callees of the changed code — rather than the entire repository. This is cheaper, faster, and often more accurate because the agent's attention stays on code that actually changed instead of unchanged boilerplate.

Does using a cheaper model hurt security coverage?

It depends where you use it. A cheap model like Haiku is fine for mechanical triage — deciding which files even contain risky patterns — while a more capable model handles exploitability reasoning on the survivors. The danger is downgrading the reasoning step itself, so always validate recall against a labeled benchmark before committing to a cheaper model on that step.

How do I avoid serving stale security results from a cache?

Key your finding cache on each file's content hash and invalidate a file's cached verdict the moment its hash changes. Also invalidate the entire cache when the agent's prompt or rule set changes, since an improved scanner may catch issues the previous version missed.

Bringing agentic AI to your phone lines

CallSphere applies these same efficiency patterns — cached context, batched work, and right-sized models — to voice and chat agents that stay fast and affordable while handling every call and message around the clock. See it in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.