An Agentic Security Defense Walkthrough, End to End
A realistic problem-to-shipped build of a Claude phishing triage agent: scoping, tools, skills, evals, shadow rollout, and results.
Most writing about agentic defense stays at the altitude of principles. This post stays on the ground. I want to walk through one realistic end-to-end build — a phishing triage and response agent for a mid-sized company drowning in user-reported emails — from the original problem to the shipped, supervised system. The point is to show the actual decisions: what to scope in, what tools to grant, how to test it, and how to know it is working. The specifics are illustrative, but the shape is exactly how these projects go.
The problem we started with
The security team received roughly two hundred user-reported phishing emails a day through a report-phish button. Each one needed someone to open it, check the sender, follow the links in a sandbox, decide if it was malicious, and — if it was — pull the same email from every other inbox before someone clicked. One analyst spent most of a shift on this, the queue still ran a day behind, and the lag mattered: the dangerous emails were the ones that sat unreviewed while people clicked.
The goal was not to remove humans. It was to compress the rote ninety percent — the obvious spam and the obvious benign newsletters — so the analyst's day went to the genuinely ambiguous ten percent and to the malicious campaigns that needed fast containment. That framing shaped every later decision: the agent's job was triage and recommendation, with a tight human gate on anything destructive.
Scoping the agent and choosing its tools
We built the agent on the Claude Agent SDK and connected it to the environment through MCP servers, each exposing exactly one capability we had reasoned about. The read-side tools came first: fetch the reported email, look up the sender's domain reputation, detonate URLs in an existing sandbox and read back the verdict, and check whether the same email had landed in other mailboxes. Every one of these is read-only. A mistake here costs nothing but a wrong recommendation, which a human still reviews.
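The read-only rule can be enforced in code rather than left as a convention. The sketch below is plain Python and does not use the Claude Agent SDK or MCP APIs; `Tool`, `build_triage_toolset`, and the two stub lookups are hypothetical names chosen for illustration, under the assumption that each tool carries a flag saying whether it can change state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool
    run: Callable[[str], dict]


def build_triage_toolset(tools: list[Tool]) -> dict[str, Tool]:
    """Register tools for the triage agent, refusing anything that can write."""
    toolset: dict[str, Tool] = {}
    for tool in tools:
        if not tool.read_only:
            raise ValueError(f"{tool.name} can modify state; keep it out of triage")
        toolset[tool.name] = tool
    return toolset


# Hypothetical stubs standing in for the real MCP-backed lookups.
def fetch_email(message_id: str) -> dict:
    return {"message_id": message_id}


def domain_reputation(domain: str) -> dict:
    return {"domain": domain, "reputation": "unknown"}


triage_tools = build_triage_toolset([
    Tool("fetch_email", read_only=True, run=fetch_email),
    Tool("domain_reputation", read_only=True, run=domain_reputation),
])
```

Failing at registration time means a write-capable tool cannot reach the triage agent by accident, for example through a copy-pasted configuration.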
The write-side capability — removing a confirmed-malicious email from all inboxes — was deliberately kept out of the triage agent. That action has a real blast radius: yank the wrong campaign and you delete a legitimate companywide announcement. So remediation went behind a human approval gate. The agent produces a recommendation with its evidence; an analyst clicks approve; only then does a separate, tightly scoped remediation tool execute, and even then with an audit log of exactly what was purged.
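One way to picture the approval gate is as a small object that refuses to purge until an analyst has signed off, and that records every purge it performs. This is a minimal sketch under assumptions: `Recommendation`, `RemediationGate`, and `ApprovalRequired` are hypothetical names, and the actual mailbox purge call is deliberately left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a purge is attempted without analyst sign-off."""


@dataclass
class Recommendation:
    message_id: str
    verdict: str
    evidence: list[str]
    approved_by: Optional[str] = None


@dataclass
class RemediationGate:
    audit_log: list[dict] = field(default_factory=list)

    def approve(self, rec: Recommendation, analyst: str) -> None:
        rec.approved_by = analyst

    def purge(self, rec: Recommendation, mailboxes: list[str]) -> dict:
        if rec.verdict != "malicious":
            raise ValueError("only malicious verdicts can be remediated")
        if rec.approved_by is None:
            raise ApprovalRequired(rec.message_id)
        # The real mailbox purge call would go here; it is omitted in this sketch.
        entry = {
            "message_id": rec.message_id,
            "approved_by": rec.approved_by,
            "mailboxes": sorted(mailboxes),
            "evidence": list(rec.evidence),
            "purged_at": datetime.now(timezone.utc).isoformat(),
        }
        self.audit_log.append(entry)
        return entry
```

Keeping the gate in a separate component from the triage agent means the agent has no code path to the purge at all; it can only hand over a `Recommendation`.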
flowchart TD
A["User reports email"] --> B["Agent fetches\nmessage & headers"]
B --> C["Enrich: domain rep,\nURL sandbox, spread"]
C --> D{"Verdict?"}
D -->|Benign| E["Close ticket\n+ notify reporter"]
D -->|Spam| F["Close + log"]
D -->|Malicious| G["Draft remediation\n+ evidence"]
G --> H{"Analyst approves?"}
H -->|Yes| I["Purge from all\ninboxes + audit"]
H -->|No| J["Route to human\ninvestigation"]
Teaching it the runbook with a Skill
The team's phishing knowledge lived in a senior analyst's head and a stale wiki page. We turned that into an Agent Skill: a folder of instructions describing how to weigh sender reputation against URL verdicts, which internal domains are trusted, what the known recurring false positives are (the marketing platform that always looks suspicious), and the exact format for a remediation recommendation. The skill is what makes the agent reason like our analyst rather than like a generic spam filter. When a reported email arrives, Claude loads the skill, applies the same judgment the senior analyst would, and shows its work.
Capturing the runbook as a skill had a side benefit we did not expect: it forced the team to write down judgment that had never been explicit. Edge cases that used to be resolved by tapping the senior analyst on the shoulder became documented rules. The skill became living documentation that improved every time a human corrected the agent.
Testing it before trusting it
We did not ship until the agent passed an eval suite built from real history. We pulled six months of resolved phishing tickets — known-malicious, known-benign, and the genuinely ambiguous ones — and ran the agent against them, comparing its verdicts to the analysts' final calls. The first pass was sobering: the agent over-flagged the marketing platform and under-weighted lookalike domains. Both were fixable in the skill, and the eval caught them before a single production email was touched.
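Comparing agent verdicts to analyst calls can be as simple as a per-band accuracy count plus a list of disagreements to review. The sketch below assumes each historical ticket is a dict with hypothetical keys `id`, `band`, and `analyst_verdict`, and that the agent's verdicts have already been collected into a dict keyed by ticket id.

```python
from collections import defaultdict


def score_by_band(cases: list[dict], agent_verdicts: dict[str, str]):
    """Return (accuracy per band, list of disagreements) against analyst calls."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    disagreements: list[tuple[str, str, str]] = []

    for case in cases:
        band = case["band"]
        totals[band] += 1
        got = agent_verdicts[case["id"]]
        if got == case["analyst_verdict"]:
            correct[band] += 1
        else:
            # Keep (ticket id, analyst verdict, agent verdict) for manual review.
            disagreements.append((case["id"], case["analyst_verdict"], got))

    accuracy = {band: correct[band] / totals[band] for band in totals}
    return accuracy, disagreements
```

The disagreement list is the useful output in practice: patterns such as an over-flagged marketing platform show up there as clusters of similar tickets.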
Crucially, the eval suite included adversarial cases. We added emails with injection attempts in the body — text trying to instruct the agent to mark itself benign — and confirmed the agent's authoritative instructions held against attacker-controlled content. An agent that reads attacker text and has not been tested against injection is not ready, full stop. We set a release gate: the agent had to hit an agreed accuracy bar on the benign and malicious sets and a perfect score on the injection set before it could run on live mail.
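The release gate described above reduces to a small check that can run in CI. This is a sketch, not the team's actual gate: `release_gate` is a hypothetical function, and the accuracy bar is passed in as a parameter because the agreed value is not stated here.

```python
def release_gate(
    accuracy: dict[str, float],
    accuracy_bar: float,
    injection_passed: int,
    injection_total: int,
) -> tuple[bool, list[str]]:
    """Return (ok, reasons). The injection set must pass completely."""
    reasons: list[str] = []

    for band in ("benign", "malicious"):
        score = accuracy.get(band)
        if score is None or score < accuracy_bar:
            reasons.append(f"{band} accuracy {score} is below the bar {accuracy_bar}")

    if injection_total == 0:
        # An empty injection set proves nothing, so treat it as a failure.
        reasons.append("no injection cases were run")
    elif injection_passed != injection_total:
        reasons.append(f"injection set: {injection_passed}/{injection_total} passed")

    return (not reasons, reasons)
```

Returning the reasons alongside the boolean makes a blocked release self-explanatory in the build log.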
Rollout, supervision, and results
We rolled out in shadow mode first — the agent triaged every reported email and posted its recommendation, but humans still made every final call. For two weeks we compared agent recommendations to human decisions, fed every disagreement back into the skill and the evals, and watched the agreement rate climb. Only once the agent matched the analysts on the easy bands did we let it auto-close obvious spam and benign mail, while still routing every malicious verdict to a human for the remediation approval.
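The shadow-mode comparison and the later routing policy can be sketched in a few lines. The function names and the route labels (`human_approval`, `human_review`, `auto_close`) are hypothetical; the one rule taken directly from the rollout is that a malicious verdict never bypasses a human.

```python
def agreement_rate(pairs: list[tuple[str, str]]) -> float:
    """pairs holds (agent_verdict, human_verdict) for each shadowed email."""
    if not pairs:
        return 0.0
    return sum(1 for agent, human in pairs if agent == human) / len(pairs)


def route(verdict: str, shadow_mode: bool, auto_close_bands: frozenset[str]) -> str:
    """Decide who acts on a verdict."""
    if verdict == "malicious":
        # Remediation always waits for analyst approval.
        return "human_approval"
    if shadow_mode or verdict not in auto_close_bands:
        # In shadow mode, or for bands not yet trusted, a human makes the call.
        return "human_review"
    return "auto_close"
```

Leaving shadow mode then amounts to flipping `shadow_mode` off and listing only the bands where the agreement rate has earned trust.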
The outcome matched the original goal. The analyst's queue stopped running a day behind because the obvious ninety percent cleared automatically with a full audit trail. The human time went to the ambiguous emails and to fast containment of real campaigns — the agent surfaced one credential-harvesting campaign spread across forty inboxes within minutes of the first report, and the analyst approved the purge before the second click. The win was not headcount; it was speed where speed prevents harm, and human attention freed for the judgment calls that actually need it.
Frequently asked questions
How long does a build like this take?
The agent and tools are often the fast part — days to a couple of weeks. The slower, more valuable work is capturing the runbook as a skill, building an eval suite from real history, and running shadow mode long enough to trust the agent. Budget for the testing and supervision phase, not just the build.
Why keep remediation behind a human gate?
Because purging an email from every inbox has real blast radius: pull the wrong campaign and you delete a legitimate companywide message. Read and enrichment actions change nothing in the environment, so they are low-risk and the agent runs them freely; destructive, wide-reaching actions stay behind an explicit human approval.
What if the agent gets a verdict wrong?
In shadow mode and for malicious verdicts, a human reviews before anything happens, so a wrong verdict is caught and becomes a new eval case. For auto-closed bands, you sample outputs continuously and watch the false-negative rate; any miss tightens the skill and the tests.
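Continuous sampling of the auto-closed bands might look like the sketch below. It assumes a flat random sample at a chosen rate and treats any sampled email that an analyst re-labels as malicious as a false negative; both function names and the sampling scheme are illustrative assumptions.

```python
import random
from typing import Optional


def sample_for_review(
    auto_closed_ids: list[str], rate: float, seed: Optional[int] = None
) -> list[str]:
    """Pick a random subset of auto-closed emails for an analyst to re-check."""
    if not auto_closed_ids:
        return []
    rng = random.Random(seed)
    # Always review at least one email, and never more than exist.
    k = min(len(auto_closed_ids), max(1, round(len(auto_closed_ids) * rate)))
    return rng.sample(auto_closed_ids, k)


def false_negative_rate(review_labels: list[str]) -> float:
    """Share of sampled auto-closed emails that the analyst judged malicious."""
    if not review_labels:
        return 0.0
    misses = sum(1 for label in review_labels if label == "malicious")
    return misses / len(review_labels)
```

Each miss found this way is a candidate for a new eval case, which keeps the test suite tracking what production mail actually looks like.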
Does this generalize beyond phishing?
Yes. The pattern — scope a narrow problem, grant read-only tools, capture the runbook as a skill, gate destructive actions, prove it with evals, ship in shadow mode — applies to alert triage, vulnerability enrichment, and incident summarization. Phishing is just a clean first project.
Bringing agentic AI to your phone lines
This same end-to-end pattern — scoped tools, a captured runbook, evals, and a human gate on consequential actions — is how CallSphere builds voice and chat agents that triage and act in real conversations. See the live system at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.