Evals for an LLM Code Security Agent: Gating Releases

Every team that builds an LLM source-code security agent eventually hits the same wall: they make a change to the prompt, it feels better on the one example they tried, and they have no idea whether it is actually better across the thousand cases they did not try. Without evals, you are tuning by vibes, and vibes do not catch the regression where your new prompt now misses authentication bypasses while finding more cross-site scripting. This post is about building the eval loop that turns a security agent from a clever demo into a system you can confidently change and ship.

Evals matter more for security than for almost any other agent task, because the failure is asymmetric and invisible. A missed vulnerability does not raise an error or look wrong; it looks exactly like a clean review. The only way to know your agent caught the bugs is to run it against code where you already know the answer and measure.

What you are actually measuring

A code security eval is fundamentally a classification measurement, and the two numbers that matter are precision and recall. Recall is the fraction of real vulnerabilities the agent found — miss too many and the tool is dangerous, because it ships exploitable code with a green check. Precision is the fraction of the agent's findings that are real — too low and engineers drown in false positives, stop reading the output, and the tool dies of irrelevance. The two trade off, and where you set that balance is a product decision, not a technical one.
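To make the two definitions concrete, here is a minimal sketch that computes both numbers from graded counts. The function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not a convention from the original post.

```python
def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from graded finding counts.

    Assumption: an empty denominator yields 0.0. Other conventions exist.
    """
    reported = true_positives + false_positives   # everything the agent flagged
    actual = true_positives + false_negatives     # every labeled vulnerability
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Illustrative numbers: 40 labeled bugs, 50 findings, 36 of them real.
precision, recall = precision_recall(true_positives=36,
                                     false_positives=14,
                                     false_negatives=4)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.72 recall=0.90
```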

For a blocking pre-merge gate you usually want high precision so you do not cry wolf and erode trust. For a deep nightly audit you can tolerate lower precision in exchange for higher recall, because a human triages the results anyway. Decide which regime each deployment is in before you start tuning, because optimizing the wrong metric makes the agent worse for its actual job.

Building the benchmark

An eval is only as good as its labeled dataset, and for security that dataset has a particular shape. You need code samples with known vulnerabilities, ideally drawn from three sources: synthetic examples you write to cover specific categories like SQL injection and path traversal, real historical vulnerabilities from your own codebase pulled from past security fixes, and crucially, clean code that contains no vulnerabilities at all. That last category is what measures precision — an agent that flags everything has perfect recall and is useless, and only clean samples expose it.

flowchart TD
  A["Agent change proposed"] --> B["Run agent over labeled benchmark"]
  B --> C["Compare findings to ground-truth labels"]
  C --> D{"Recall >= threshold?"}
  D -->|No| E["Block release: missed vulns"]
  D -->|Yes| F{"Precision >= threshold?"}
  F -->|No| G["Block release: too noisy"]
  F -->|Yes| H["Check no known-good regressed"]
  H -->|Regression| E
  H -->|Clean| I["Promote new agent version"]

Label each sample with the specific vulnerabilities present — file, line, and category — so you can grade findings precisely rather than just counting them. Keep the benchmark in version control next to the agent, and grow it every time the agent misses something in production: each real miss becomes a new permanent test case, so the same bug can never silently slip through twice. Over time this corpus becomes one of your most valuable assets, encoding hard-won knowledge of exactly where your agent fails.
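One way to represent such labels is sketched below. The class names, field names, sample IDs, and file paths are all hypothetical; the only thing taken from the text is that each label carries a file, a line, and a category, and that clean samples carry no labels.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Label:
    """One known vulnerability inside a benchmark sample."""
    file: str
    line: int
    category: str


@dataclass
class Sample:
    """A benchmark case; an empty label list means the code is known-clean."""
    sample_id: str
    source: str  # assumed values: "synthetic", "historical", "clean"
    labels: list[Label] = field(default_factory=list)

    @property
    def is_clean(self) -> bool:
        return not self.labels


def add_production_miss(benchmark: list[Sample], sample_id: str, label: Label) -> None:
    """Turn a real-world miss into a permanent benchmark case."""
    benchmark.append(Sample(sample_id, "historical", [label]))


# Hypothetical starting corpus: one synthetic bug and one clean sample.
benchmark = [
    Sample("sqli-001", "synthetic", [Label("app/db.py", 42, "sql-injection")]),
    Sample("clean-001", "clean"),
]
add_production_miss(benchmark, "traversal-001",
                    Label("app/files.py", 17, "path-traversal"))
print(len(benchmark), [s.is_clean for s in benchmark])
# prints: 3 [False, True, False]
```

Storing the corpus as plain data like this keeps it easy to diff in version control alongside the agent.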

Grading: matching findings to ground truth

Grading a security eval is harder than grading a math problem because findings are fuzzy. The agent might report the right vulnerability at a slightly different line, or describe it in different words than your label. So your grader needs a matching policy: a finding counts as a true positive if it identifies the same vulnerability class in the same file, within a small line-number window of a labeled bug. For the fuzzier semantic cases, an LLM-as-judge can decide whether the agent's description and a ground-truth label refer to the same underlying flaw. Claude is well suited to this adjudication role because it can reason about whether two descriptions mean the same thing.
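The line-window matching policy might be sketched as follows. The window of 3 lines, the requirement that the file also match, and the rule that each label can be matched only once (so a duplicate report of the same bug counts as a false positive) are assumptions for this sketch, not choices stated in the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Used here for both agent findings and ground-truth labels."""
    file: str
    line: int
    category: str


LINE_WINDOW = 3  # assumed tolerance; tune it for your own corpus


def matches(finding: Finding, label: Finding, window: int = LINE_WINDOW) -> bool:
    """Same file, same vulnerability class, and close enough in line number."""
    return (finding.file == label.file
            and finding.category == label.category
            and abs(finding.line - label.line) <= window)


def grade(findings: list[Finding], labels: list[Finding],
          window: int = LINE_WINDOW) -> tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    Greedy matching: each label is consumed by the first finding that hits it.
    """
    unmatched = list(labels)
    tp = fp = 0
    for finding in findings:
        hit = next((lb for lb in unmatched if matches(finding, lb, window)), None)
        if hit is None:
            fp += 1
        else:
            unmatched.remove(hit)
            tp += 1
    return tp, fp, len(unmatched)


labels = [Finding("auth.py", 10, "auth-bypass"), Finding("views.py", 88, "xss")]
findings = [Finding("auth.py", 12, "auth-bypass"), Finding("views.py", 40, "xss")]
print(grade(findings, labels))
# prints: (1, 1, 1)
```

Findings that fail this structural match but look semantically close are the ones worth routing to a judge rather than scoring automatically.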

Be disciplined about the judge, though. An LLM judge is itself a model that can be wrong, so validate it against human-labeled disagreements periodically and keep its rubric tight and specific. A vague "is this a good finding?" prompt produces a noisy judge; a precise "do these two descriptions refer to the same vulnerability at the same location?" prompt produces a reliable one. The judge is infrastructure, and infrastructure gets tested too.
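Validating the judge can start as simply as comparing its verdicts with human verdicts on the same cases. The sketch below assumes both sets of verdicts are already collected as booleans keyed by a case ID; it makes no model call, and the function name and case IDs are hypothetical.

```python
def judge_agreement(judge_verdicts: dict[str, bool],
                    human_verdicts: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (agreement rate, sorted IDs where judge and human disagree).

    Only cases present in both mappings are compared.
    """
    shared = sorted(judge_verdicts.keys() & human_verdicts.keys())
    if not shared:
        raise ValueError("no overlapping cases to compare")
    disagreements = [case for case in shared
                     if judge_verdicts[case] != human_verdicts[case]]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Verdict meaning: True = "same vulnerability at the same location".
judge = {"c1": True, "c2": False, "c3": True, "c4": True}
human = {"c1": True, "c2": False, "c3": False, "c4": True}
print(judge_agreement(judge, human))
# prints: (0.75, ['c3'])
```

The list of disagreeing cases is usually more useful than the rate itself, since each one is a concrete prompt for tightening the rubric.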

Gating releases with the eval loop

The payoff of all this measurement is a gate. Every proposed change to the agent — a new prompt, a new tool, a model upgrade — runs against the full benchmark before it can ship, and it ships only if recall stays above your floor, precision stays above your floor, and no previously-caught vulnerability regressed to a miss. That last check is the one teams skip and regret: a change can lift your aggregate numbers while quietly breaking a category you used to handle, and only a per-case regression check catches it.
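A minimal sketch of the three-part gate described above, assuming each eval run is summarized as a precision, a recall, and the set of labeled vulnerability IDs it caught. The class, the floor values, and the IDs are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    precision: float
    recall: float
    caught: set[str] = field(default_factory=set)  # IDs of labeled bugs found


def gate(candidate: EvalResult, baseline: EvalResult,
         recall_floor: float, precision_floor: float) -> list[str]:
    """Return the reasons to block the release; an empty list means promote."""
    reasons = []
    if candidate.recall < recall_floor:
        reasons.append(f"recall {candidate.recall:.2f} below floor {recall_floor:.2f}")
    if candidate.precision < precision_floor:
        reasons.append(
            f"precision {candidate.precision:.2f} below floor {precision_floor:.2f}")
    # Per-case regression check: anything the baseline caught must still be caught.
    regressed = sorted(baseline.caught - candidate.caught)
    if regressed:
        reasons.append("previously caught, now missed: " + ", ".join(regressed))
    return reasons


# Illustrative numbers only: the aggregates improve while one case regresses.
baseline = EvalResult(0.80, 0.85, {"sqli-001", "auth-003"})
candidate = EvalResult(0.85, 0.90, {"sqli-001", "xss-007", "xss-008"})
print(gate(candidate, baseline, recall_floor=0.80, precision_floor=0.75))
# prints: ['previously caught, now missed: auth-003']
```

The example is deliberately the case the aggregates hide: both headline numbers go up, and the gate still blocks.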

Wire this into CI so the gate is automatic and unskippable. A pull request that modifies the agent triggers the eval suite, and a failing suite blocks the merge exactly like a failing unit test would. This is the same release discipline you apply to ordinary code, applied to a probabilistic system — and it is what lets you iterate on the agent quickly without fear, because the benchmark catches the regression you did not anticipate. Without the gate, every prompt tweak is a gamble; with it, every tweak is a measured experiment.
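Most CI systems treat a non-zero exit status as a failed step, so one plausible wiring is a small entry point that turns the eval report into an exit code. The report file name and its `passed` and `reasons` keys are hypothetical; the sketch fails closed, so a missing or unreadable report blocks the merge rather than letting it through.

```python
import json
import sys
from pathlib import Path

DEFAULT_REPORT = Path("eval_report.json")  # hypothetical report location


def main(argv: list[str]) -> int:
    """Return 0 only when the report exists and explicitly says it passed."""
    path = Path(argv[1]) if len(argv) > 1 else DEFAULT_REPORT
    try:
        report = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # Fail closed: no readable report means no merge.
        print(f"eval gate: cannot read report {path}: {exc}", file=sys.stderr)
        return 1
    if not isinstance(report, dict) or report.get("passed") is not True:
        reasons = report.get("reasons", []) if isinstance(report, dict) else []
        for reason in reasons:
            print(f"eval gate: {reason}", file=sys.stderr)
        return 1
    return 0


# In the CI step, the script would end with:
#     raise SystemExit(main(sys.argv))
```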

The pitfalls that quietly ruin evals

Eval suites rot in predictable ways. The first is a benchmark too small to be meaningful: twenty samples give you noise, not signal, and you need enough cases per vulnerability category to trust the per-category numbers. The second is overfitting: if you tune the agent against the same fixed benchmark forever, you eventually optimize for the test rather than for real security, so refresh the corpus with genuinely new cases regularly. The third is letting the labels drift from reality: a sample mislabeled as clean when it actually hides a bug scores a correct finding as a false positive and a miss as a success, which trains you to accept a real miss. Audit the labels periodically. An eval suite is a living thing; neglected, it slowly stops measuring what you think it measures, and a gate that no longer reflects reality is worse than no gate, because it grants false confidence.
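To keep the first pitfall visible, per-category recall can be reported together with the number of cases behind it. In this sketch the minimum of 30 cases per category is an arbitrary placeholder rather than a recommendation, and the IDs and category names are hypothetical.

```python
from collections import Counter

MIN_CASES_PER_CATEGORY = 30  # placeholder threshold; pick your own


def per_category_recall(labeled: list[tuple[str, str]],
                        caught: set[str]) -> dict[str, tuple[float, int]]:
    """Map category -> (recall, number of labeled cases).

    `labeled` holds (vulnerability_id, category) pairs from the benchmark.
    """
    totals = Counter(category for _, category in labeled)
    hits = Counter(category for vuln_id, category in labeled if vuln_id in caught)
    return {category: (hits[category] / n, n) for category, n in totals.items()}


def underpowered(report: dict[str, tuple[float, int]],
                 min_cases: int = MIN_CASES_PER_CATEGORY) -> list[str]:
    """Categories with too few cases for their recall to be trusted."""
    return sorted(cat for cat, (_, n) in report.items() if n < min_cases)


labeled = [("sqli-001", "sql-injection"), ("sqli-002", "sql-injection"),
           ("auth-001", "auth-bypass")]
report = per_category_recall(labeled, caught={"sqli-001", "auth-001"})
print(report)
# prints: {'sql-injection': (0.5, 2), 'auth-bypass': (1.0, 1)}
print(underpowered(report, min_cases=2))
# prints: ['auth-bypass']
```

A perfect score on a single case, as with `auth-bypass` here, is exactly the kind of number that should not be trusted yet.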

Frequently asked questions

What is the difference between precision and recall here?

Recall is the fraction of real vulnerabilities the agent actually found; low recall means it ships exploitable code with a clean verdict. Precision is the fraction of the agent's reported findings that are genuine; low precision means engineers drown in false positives and stop trusting the tool. The two trade off, and the right balance depends on whether the agent is a blocking gate or a triaged audit.

Why include clean code with no vulnerabilities in the benchmark?

Clean samples are what measure precision. An agent that flags everything has perfect recall and is useless, and only code known to be vulnerability-free reveals how often it cries wolf. Without clean cases, your benchmark rewards over-flagging and hides the false-positive problem that actually kills adoption.

Can I use an LLM to grade my security evals?

Yes — an LLM-as-judge works well for the fuzzy task of deciding whether a finding and a ground-truth label describe the same vulnerability, and Claude handles this adjudication reliably. Keep the judge's rubric tight and specific, and periodically validate it against human-labeled disagreements, because the judge is itself a model that can be wrong.

How do I stop a change from silently regressing one category?

Add a per-case regression check to your gate: beyond aggregate precision and recall, confirm that no vulnerability the previous version caught is now missed. Aggregate metrics can improve while a specific category breaks, so only a case-by-case comparison catches that failure before it ships.

Bringing agentic AI to your phone lines

CallSphere runs the same eval-gated release discipline behind its voice and chat agents, measuring quality on labeled conversations before any change reaches a live call. Explore it at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
