Build a Claude Code Security Review Agent Step by Step
A step-by-step build for a Claude security reviewer that scans diffs, traces taint with tools, verifies in a sandbox, and gates pull requests in CI.
Reading about LLM security agents is one thing; standing one up that actually blocks a risky pull request is another. This post is the build. By the end you will understand the moving parts of a Claude-based security reviewer that runs on every diff, follows untrusted data through real call paths, verifies what it finds, and writes a pass/fail comment back to the PR. I will keep the scope to a single service and a single language so the steps stay concrete, and I will be explicit about the decisions that trip people up.
Step 1: Scope the agent to the diff, not the repo
The first decision is what the agent reviews. Reviewing the entire repository on every push is slow and expensive, and it produces the same findings over and over. Instead, drive the agent off the pull-request diff. Compute the changed hunks with git diff origin/main...HEAD (the three-dot form diffs against the merge base, so only the branch's own changes appear), then expand each hunk into the full enclosing function plus a few lines of surrounding context. Changed lines are your seeds; the agent's job is to decide whether any of those changes introduces or exposes a security-relevant path.
This scoping has a nice property: it bounds cost to the size of the change, and it focuses attention where risk was just introduced. A new route handler, a new query, a new deserialization call — those are exactly the diffs worth deep review, and they surface naturally because they appear as added lines.
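Here is a minimal Python sketch of that scoping step. It assumes the service is written in Python and that the CI job has origin/main fetched. It runs the three-dot diff with --unified=0 so each hunk covers only changed lines, parses the +start,count half of each hunk header, and widens a hunk to its innermost enclosing function with the standard-library ast module. The helper names (git_diff, changed_hunks, expand_to_function) are illustrative and not part of any tool.

```python
import ast
import re
import subprocess

# Hunk header in a unified diff, e.g. "@@ -3,0 +4,2 @@ def get_user(db, name):".
# Group 1 is the first line of the hunk in the new file; group 2 is its line count.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def git_diff(base="origin/main"):
    """Three-dot diff against the merge base, with zero context lines."""
    return subprocess.run(
        ["git", "diff", "--unified=0", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout


def changed_hunks(diff_text):
    """Map each changed file to the (start, end) line ranges added in the new version."""
    hunks = {}
    path = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            target = line[4:]
            # "+++ /dev/null" marks a deleted file: nothing left to review.
            path = None if target == "/dev/null" else target.removeprefix("b/")
        elif path is not None:
            match = HUNK_HEADER.match(line)
            if match:
                start = int(match.group(1))
                count = int(match.group(2)) if match.group(2) is not None else 1
                if count > 0:  # a count of 0 is a pure deletion
                    hunks.setdefault(path, []).append((start, start + count - 1))
    return hunks


def expand_to_function(source, start, end, context=3):
    """Widen a hunk to its innermost enclosing Python function, plus context lines."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.lineno <= start and node.end_lineno >= end:
                if best is None or node.end_lineno - node.lineno < best[1] - best[0]:
                    best = (node.lineno, node.end_lineno)
    lo, hi = best if best is not None else (start, end)
    total = len(source.splitlines())
    return max(1, lo - context), min(total, hi + context)
```

Feed each (path, start, end) from changed_hunks through expand_to_function with that file's new source, and the resulting line ranges become the seed regions the agent receives. A hunk outside any function (module-level code) falls back to its own lines plus context.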
Step 2: Give the agent tools, not just text
A reviewer that can only see the diff is blind to the code the diff calls into. So before the loop starts, register a small toolset. At minimum you want search_code(query) for symbol and string search, read_definition(symbol) to fetch a function or class body, find_callers(symbol) to walk up toward trust boundaries, and run_in_sandbox(script) to execute a proof. With the Claude Agent SDK you declare each tool with a name, a JSON schema for its arguments, and a handler; the model then chooses when to call them.
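A sketch of that toolset follows. It uses the plain name / description / input_schema dictionary shape that the Anthropic Messages API accepts for tool definitions, not any specific Agent SDK helper. The handlers are deliberately naive standard-library implementations for a Python codebase: substring search, ast-based definition lookup, and call-site matching by name. The tool names match the ones above. Everything else, including the dispatch helper, is hypothetical.

```python
import ast
import os


def _schema(field, description):
    """JSON Schema for a tool that takes a single required string argument."""
    return {
        "type": "object",
        "properties": {field: {"type": "string", "description": description}},
        "required": [field],
    }


# Tool definitions in the name / description / input_schema shape; the model
# decides when to call each one.
TOOLS = [
    {"name": "search_code", "description": "Substring search over the repo; returns path:line hits.",
     "input_schema": _schema("query", "Text to search for")},
    {"name": "read_definition", "description": "Return the source of a function or class.",
     "input_schema": _schema("symbol", "Function or class name")},
    {"name": "find_callers", "description": "List path:line call sites of a function or method.",
     "input_schema": _schema("symbol", "Function or method name")},
    {"name": "run_in_sandbox", "description": "Run a proof-of-concept script in an isolated container.",
     "input_schema": _schema("script", "Python source of the proof")},
]


def _python_files(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8") as fh:
                    yield path, fh.read()


def search_code(root, query):
    return [f"{path}:{n}" for path, text in _python_files(root)
            for n, line in enumerate(text.splitlines(), start=1) if query in line]


def read_definition(root, symbol):
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for _path, text in _python_files(root):
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, defs) and node.name == symbol:
                return ast.get_source_segment(text, node)
    return None


def find_callers(root, symbol):
    hits = []
    for path, text in _python_files(root):
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Call):
                # foo(...) has a Name callee; obj.foo(...) has an Attribute callee.
                func = node.func
                name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
                if name == symbol:
                    hits.append(f"{path}:{node.lineno}")
    return hits


# Each handler receives the input dict from the model's tool call.
HANDLERS = {
    "search_code": lambda root, args: search_code(root, args["query"]),
    "read_definition": lambda root, args: read_definition(root, args["symbol"]),
    "find_callers": lambda root, args: find_callers(root, args["symbol"]),
    # run_in_sandbox is deliberately not wired here; it belongs to the isolated runner.
}


def dispatch(root, name, args):
    """Execute one tool call; unknown or unwired tools raise KeyError."""
    return HANDLERS[name](root, args)
```

Matching callers by name over-approximates (every method called execute matches). That is the safe direction for a reviewer walking toward trust boundaries. A production setup might swap in a language server or a proper call graph.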
```mermaid
flowchart TD
A["CI trigger on PR"] --> B["git diff: changed hunks"]
B --> C["Expand to full functions"]
C --> D{"Claude: any tainted path here?"}
D -->|Trace data| E["find_callers / read_definition"]
E --> D
D -->|Suspected bug| F["run_in_sandbox: PoC test"]
F -->|Reproduces| G["Post PR comment, fail check"]
F -->|No repro| H["Note as low-confidence, pass"]
```
The order in the diagram matters. The agent does not jump straight to running code; it first decides whether a changed line plausibly handles untrusted input, then traces the data with read and caller tools, and only spends a sandbox execution once it has a specific hypothesis worth testing. That ordering keeps the loop cheap on the common case where a diff is benign.
Step 3: Write a system prompt that defines the job, not the answer
The system prompt is where most homegrown reviewers go wrong. Do not enumerate vulnerability classes as a checklist and ask the model to grep for them — that recreates a pattern scanner with worse latency. Instead, define the method: "For each changed region, identify whether it sits on a path from an external input to a sensitive operation. Use the tools to confirm the path is reachable. Before reporting, attempt to reproduce the issue with run_in_sandbox. Report only issues you reproduced or can specify a precise exploit for." Then state the output contract: a JSON array of findings, each with a file, a line, a tainted-path description, a severity, and either a proof or an explicit reason it could not be proven.
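The prompt and its output contract might look like the following minimal sketch. The wording of SYSTEM_PROMPT paraphrases the method above. The field names (tainted_path, proof, unproven_reason) are assumptions chosen for illustration. Validating the model's final message against the contract before it reaches the CI gate means a malformed reply fails loudly instead of being read as "no findings".

```python
import json

# A method-first system prompt: it describes how to review, not what to find.
SYSTEM_PROMPT = """\
You are reviewing the changed regions of a pull request for security issues.
For each changed region, identify whether it sits on a path from an external
input to a sensitive operation. Use the tools to confirm the path is reachable.
Before reporting, attempt to reproduce the issue with run_in_sandbox.
Report only issues you reproduced or can specify a precise exploit for.
Reply with only a JSON array. Each finding has "file", "line", "tainted_path",
"severity" ("high", "medium" or "low"), and exactly one of "proof" or
"unproven_reason".
"""

REQUIRED = ("file", "line", "tainted_path", "severity")
SEVERITIES = {"high", "medium", "low"}


def parse_findings(raw):
    """Parse the model's final reply and reject anything that breaks the contract."""
    findings = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(findings, list):
        raise ValueError("expected a JSON array of findings")
    for i, f in enumerate(findings):
        if not isinstance(f, dict):
            raise ValueError(f"finding {i} is not an object")
        missing = [k for k in REQUIRED if k not in f]
        if missing:
            raise ValueError(f"finding {i} is missing {missing}")
        if f["severity"] not in SEVERITIES:
            raise ValueError(f"finding {i} has unknown severity {f['severity']!r}")
        if not isinstance(f["line"], int) or isinstance(f["line"], bool):
            raise ValueError(f"finding {i} needs an integer line number")
        # Either a proof or an explicit reason it could not be proven, not both.
        if ("proof" in f) == ("unproven_reason" in f):
            raise ValueError(f"finding {i} needs exactly one of proof / unproven_reason")
    return findings
```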
Give the model a few worked examples of the difference between a real finding and a non-finding — for instance, raw string interpolation into SQL versus the same shape going through a parameterized API. Examples calibrate judgment far better than rules, and they cut the false-positive rate sharply.
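Below is a worked-example pair of the kind described above, written as runnable Python against an in-memory sqlite3 database. The table and function names are made up for illustration. In the interpolated version the payload rewrites the WHERE clause. In the parameterized version it is matched as an ordinary string. That is exactly the distinction the examples should teach, and the same script doubles as the shape of a sandbox proof.

```python
import sqlite3


def setup():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (name TEXT, email TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)",
                   [("alice", "a@example.com"), ("bob", "b@example.com")])
    return db


# Real finding: untrusted input is interpolated into the SQL text itself.
def find_user_unsafe(db, name):
    return db.execute(f"SELECT email FROM users WHERE name = '{name}'").fetchall()


# Non-finding: same shape, but the value is bound as a query parameter.
def find_user_safe(db, name):
    return db.execute("SELECT email FROM users WHERE name = ?", (name,)).fetchall()


if __name__ == "__main__":
    db = setup()
    payload = "' OR '1'='1"
    print(len(find_user_unsafe(db, payload)))  # 2: the WHERE clause now matches every row
    print(len(find_user_safe(db, payload)))    # 0: no user has that literal name
```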
Step 4: Run the loop with a budget and a sandbox
Now wire the loop. The Agent SDK runs Claude, surfaces tool calls, executes your handlers, and feeds results back until the model emits its final findings. Two guardrails are non-negotiable. First, a budget: cap tool calls and total tokens per PR so a pathological diff cannot run forever. Second, isolation: run_in_sandbox must execute in a throwaway container with no network egress, no production credentials, and a hard timeout, because you are letting a model run code it just wrote. Treat whatever runs in that sandbox as hostile by construction.
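The Agent SDK runs this loop for you, so the sketch below only shows where the budget checks sit. It assumes a hypothetical model_step callable that returns either a tool call or final text, along with a token count. The caps of 40 tool calls and 200,000 tokens are placeholders, not recommendations. Raising BudgetExceeded instead of returning findings keeps an exhausted review from looking like a clean one. How the CI job reports that case is a policy choice left open here.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a review uses more tool calls or tokens than allowed."""


@dataclass
class Budget:
    max_tool_calls: int = 40      # placeholder cap per PR
    max_tokens: int = 200_000     # placeholder cap on total tokens per PR
    tool_calls: int = 0
    tokens: int = 0

    def charge(self, tokens=0, tool_calls=0):
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted ({self.tokens} > {self.max_tokens})")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(
                f"tool-call budget exhausted ({self.tool_calls} > {self.max_tool_calls})")


def review_loop(model_step, handlers, budget, transcript):
    """Alternate model turns and tool executions until the model returns findings.

    model_step is hypothetical: given the history, it returns either
    {"tokens": int, "final": str} or
    {"tokens": int, "tool_call": {"name": str, "input": dict}}.
    """
    history = []
    while True:
        reply = model_step(history)
        budget.charge(tokens=reply["tokens"])
        if "final" in reply:
            transcript.append({"final": reply["final"]})
            return reply["final"]
        call = reply["tool_call"]
        budget.charge(tool_calls=1)
        result = handlers[call["name"]](**call["input"])
        # Record every call, argument and result for the PR comment and for debugging.
        transcript.append({"tool": call["name"], "input": call["input"], "result": result})
        history.append({"tool_call": call, "result": result})
```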
A practical detail: capture the full transcript — every tool call, argument, and result. You will need it both for the PR comment (so a human can see how a finding was reached) and for debugging when the agent is wrong. Persist it as an artifact keyed by commit SHA.
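Persisting the transcript can be as simple as writing one JSON object per line to a file named after the commit. In this minimal sketch the review-artifacts directory name is an assumption. The SHA check exists so a malformed value cannot escape the artifact directory. Uploading the file is left to your CI system's artifact step (for example, actions/upload-artifact on GitHub Actions).

```python
import json
import os
import re


def save_transcript(transcript, sha, artifact_dir="review-artifacts"):
    """Write one JSON object per line to <artifact_dir>/<sha>.jsonl and return the path."""
    # Accept abbreviated SHA-1 through full SHA-256 hex; reject anything path-like.
    if not re.fullmatch(r"[0-9a-f]{7,64}", sha):
        raise ValueError(f"not a commit SHA: {sha!r}")
    os.makedirs(artifact_dir, exist_ok=True)
    path = os.path.join(artifact_dir, f"{sha}.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for event in transcript:
            # default=str keeps non-JSON tool results (paths, objects) readable.
            fh.write(json.dumps(event, default=str) + "\n")
    return path


def load_transcript(path):
    """Read a saved transcript back for debugging or for rendering the PR comment."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```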
Step 5: Turn findings into a CI gate
The final step is policy. Map severities to an action: high-severity verified findings fail the check and block merge; medium findings post a comment but pass; low-confidence or unverified notes are collapsed so they do not nag. Post the result as a single PR comment with each finding's tainted path, severity, and reproduction, and set the commit status accordingly. Crucially, let the gate block on verified findings only. Never block a merge on an unproven hypothesis, or developers will learn to ignore the bot.
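The severity policy above is small enough to write down directly. This minimal sketch assumes findings follow the contract from Step 3 and that a finding counts as verified when it carries a proof field (a hypothetical field name). The state strings match GitHub's commit-status states, and render_comment builds the single PR comment. Posting the comment and setting the status are left to your CI platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    state: str                                     # commit status: "success" or "failure"
    blocking: list = field(default_factory=list)   # verified, high severity
    comments: list = field(default_factory=list)   # verified, medium severity
    collapsed: list = field(default_factory=list)  # low severity or unverified


def gate(findings):
    """Apply the severity policy; only verified findings (those with a proof) can block."""
    result = GateResult(state="success")
    for f in findings:
        if "proof" not in f:
            result.collapsed.append(f)  # an unproven hypothesis never blocks a merge
        elif f["severity"] == "high":
            result.blocking.append(f)
        elif f["severity"] == "medium":
            result.comments.append(f)
        else:
            result.collapsed.append(f)
    if result.blocking:
        result.state = "failure"
    return result


def render_comment(result):
    """Build the single PR comment: location, severity, path and reproduction per finding."""
    headline = "blocking findings" if result.blocking else "no blocking findings"
    lines = [f"Security review: {headline}"]
    for label, group in (("BLOCKING", result.blocking), ("note", result.comments)):
        for f in group:
            lines.append(f"- [{label}] {f['file']}:{f['line']} ({f['severity']}) {f['tainted_path']}")
            lines.append(f"  Reproduction: {f['proof']}")
    if result.collapsed:
        lines.append(f"({len(result.collapsed)} low-confidence note(s) collapsed)")
    return "\n".join(lines)
```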
Add an escape hatch: a labeled override that a security reviewer can apply to merge despite a finding, which records who overrode and why. This keeps the agent from becoming a hard wall that teams route around, while preserving an audit trail.
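Here is one way to model the override. It assumes a hypothetical security-override label and an allow-list of reviewers supplied by your CI configuration. The check flips to passing only when an authorized person applies the label with a written reason, and every override appends who, why, and when to an audit log.

```python
from datetime import datetime, timezone

OVERRIDE_LABEL = "security-override"  # hypothetical label name


def apply_override(state, labels, actor, reason, reviewers, audit_log):
    """Turn a failing security check green only for an authorized, explained override.

    labels: labels currently on the PR; actor: who applied the override label;
    reviewers: the (hypothetical) set of people allowed to override.
    """
    if state != "failure" or OVERRIDE_LABEL not in labels:
        return state
    if actor not in reviewers:
        raise PermissionError(f"{actor} is not allowed to override the security check")
    if not reason or not reason.strip():
        raise ValueError("an override must include a reason")
    audit_log.append({
        "actor": actor,
        "reason": reason.strip(),
        "label": OVERRIDE_LABEL,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return "success"
```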
What to watch once it is live
Two failure modes tend to show up in the first week. The first is latency creep, when the agent over-uses the sandbox on benign diffs. Fix it by tightening the prompt's "only run a proof once you have a specific hypothesis" instruction and lowering the tool budget. The second is drift in the false-positive rate after a framework upgrade, because the model's assumptions about auto-escaping or parameterization go stale. Fix it by supplying framework facts (for example, which template engine and ORM the service uses and whether they escape or parameterize by default) in configuration the agent retrieves at review time, rather than relying on what the model remembers. Track precision on a fixed set of known-good and known-bad PRs so you notice regressions before your developers do.
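Tracking precision on the fixed benchmark set needs only the bot's block/pass decision per PR and the PR's known label. In this minimal sketch the 0.05 tolerance in regressed is an arbitrary placeholder. Recall is computed alongside precision because a tightened prompt can raise precision simply by blocking less.

```python
def precision_recall(results):
    """results: (known_bad, blocked) pairs, one per PR in a fixed benchmark set.

    Returns (precision, recall); a value is None when its denominator is zero.
    """
    tp = fp = fn = 0
    for known_bad, blocked in results:
        if blocked and known_bad:
            tp += 1
        elif blocked:
            fp += 1  # blocked a known-good PR
        elif known_bad:
            fn += 1  # let a known-bad PR through
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def regressed(current, baseline, tolerance=0.05):
    """Flag a drop of more than `tolerance` below the recorded baseline."""
    return current is not None and baseline is not None and current < baseline - tolerance
```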
Frequently asked questions
Why review the diff instead of the whole codebase?
Diff-scoped review bounds cost to the size of the change and concentrates attention where risk was just introduced. Whole-repo scans repeat the same findings, run slowly, and bury new issues in old noise. You expand each changed hunk to its enclosing function so the agent still has enough local context to reason.
Is it safe to let the agent run code it wrote?
Only inside a hardened sandbox: a throwaway container with no network egress, no real credentials, and a hard timeout. The reproduction step is what eliminates false positives, but the code being executed is model-generated, so isolation is mandatory rather than optional.
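Those requirements map fairly directly onto container flags. This minimal sketch assumes Docker is the container runtime. The image, resource limits, and 60-second timeout are illustrative. Note the explicit docker rm -f on timeout: killing the docker client process does not stop the container it started.

```python
import os
import subprocess
import uuid


def sandbox_command(image, script_path, name):
    """docker run arguments for a throwaway, network-less, credential-free container."""
    return [
        "docker", "run", "--rm", "--name", name,
        "--network", "none",                   # no network egress
        "--read-only", "--tmpfs", "/tmp",      # immutable root fs, scratch /tmp only
        "--cap-drop", "ALL",                   # no Linux capabilities
        "--security-opt", "no-new-privileges",
        "--pids-limit", "128", "--memory", "512m",
        # No -e/--env-file flags and no secret mounts: host credentials never enter.
        "-v", f"{os.path.abspath(script_path)}:/poc/poc.py:ro",
        image, "python", "/poc/poc.py",
    ]


def run_in_sandbox(image, script_path, timeout_s=60):
    """Run a proof-of-concept script and enforce a hard wall-clock timeout."""
    name = f"poc-{uuid.uuid4().hex[:12]}"
    try:
        proc = subprocess.run(sandbox_command(image, script_path, name),
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Killing the docker client does not stop the container; remove it explicitly.
        subprocess.run(["docker", "rm", "-f", name], capture_output=True)
        return {"timed_out": True, "exit_code": None, "stdout": "", "stderr": ""}
    return {"timed_out": False, "exit_code": proc.returncode,
            "stdout": proc.stdout, "stderr": proc.stderr}
```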
How do I keep developers from ignoring the bot?
Block merges only on verified, high-severity findings and present every blocking finding with a concrete reproduction. Collapse low-confidence notes, and provide a logged override label for security reviewers. A bot that blocks on guesses gets routed around; one that blocks only on proven exploits earns trust.
Bringing agentic AI to your phone lines
CallSphere applies the same build-loop-verify approach to voice and chat: agents that take an inbound call, use tools to check a calendar or CRM mid-conversation, and confirm before booking — running 24/7 with the same discipline a good CI gate uses. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.