
From Alert Storm to Shipped Fix: A Claude Code Walkthrough

A realistic walkthrough of building a Claude Code triage agent — from a noisy alert problem to an evaluated, gated, shipped production workflow.

Abstract architecture diagrams are easy to nod along to and hard to learn from. So this post does something different: it walks one concrete problem all the way through, from the messy reality that kicks it off to the agent running in production weeks later. The scenario is one almost every security team recognizes — a single noisy detection that buries analysts in alerts — and the goal is to show exactly how you turn Claude Code into a triage agent that handles the volume without losing the plot.

The point of the walkthrough is the decisions, not the destination. At each step there is a fork where teams go wrong, and seeing the right turn in context is worth more than any checklist. By the end you will have a clear mental model of the full loop: problem, prototype, tools, skill, evals, gated rollout, and the feedback that keeps it healthy.

The problem: one detection, ten thousand alerts

Start with the trigger. A detection for anomalous outbound connections fires constantly — thousands of alerts a week, the overwhelming majority benign, a tiny handful genuinely dangerous. Analysts have learned to skim and close, which means the real ones occasionally slip through. Tuning the rule has hit a wall: tighten it and you miss attacks, loosen it and the flood returns. This is the classic recall-versus-analyst-fatigue trap, and it is the perfect first candidate for an agent because the investigation each alert needs is repetitive but genuinely context-dependent.

The first decision is scope. The temptation is to build an agent that handles every detection in the environment. Resist it. Pick this one noisy detection, define what a senior analyst actually does when they investigate it — which data sources they check, what makes them escalate versus dismiss — and build an agent that does exactly that and nothing else. Narrow scope is what makes the rest of the project tractable and the failure modes containable.

The prototype: pairing with the agent on real alerts

Before writing a single production skill, the lead analyst spends a few days running Claude Code interactively against real, already-resolved alerts. They paste in an alert, let the agent investigate using whatever tools are available, and watch where it does well and where it goes off the rails. This is the most important and most skipped step. It is how you learn that the agent over-trusts a particular reputation feed, or that it needs the asset-owner database to avoid escalating known-noisy service accounts.

```mermaid
flowchart TD
  A["Noisy detection fires"] --> B["Prototype: pair with agent on past alerts"]
  B --> C["Wire MCP tools: logs, reputation, assets"]
  C --> D["Write triage skill from analyst playbook"]
  D --> E["Run evals on labeled incidents"]
  E -->|Recall too low| D
  E -->|Passes| F["Shadow mode beside humans"]
  F --> G["Gated rollout: auto-dismiss low risk"]
  G --> H["Feed mistakes back into evals"]
  H --> E
```

The prototype phase produces two things: a concrete list of the tools the agent needs, and a much sharper understanding of the investigation logic. You finish it knowing that the agent must be able to query connection logs, look up domain and IP reputation, resolve asset ownership, and check whether the destination is a known business partner. You also finish it knowing the agent's failure modes, which is what makes the skill you write next actually robust.
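
One way to pin down that tool list is to write it as an interface before wiring anything real. The sketch below is illustrative only: `TriageTools`, `Enrichment`, `enrich`, and `FakeTools` are hypothetical names, the reputation scale is an assumption, and the in-memory fake exists purely so the example runs without any data source.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class TriageTools(Protocol):
    """Read-only lookups the agent needs. All method names are hypothetical."""

    def prior_connections(self, src_host: str, dest: str) -> int: ...
    def reputation(self, dest: str) -> Optional[float]: ...
    def asset_owner(self, src_host: str) -> Optional[str]: ...
    def is_business_partner(self, dest: str) -> bool: ...


@dataclass(frozen=True)
class Enrichment:
    prior_connections: int       # times this host reached this destination before
    reputation: Optional[float]  # assumed scale: 0.0 clean to 1.0 malicious; None if unknown
    owner: Optional[str]         # None if the asset inventory has no record
    partner: bool                # destination is a known business partner


def enrich(tools: TriageTools, src_host: str, dest: str) -> Enrichment:
    """Gather the four enrichments for one alert. Performs lookups only."""
    return Enrichment(
        prior_connections=tools.prior_connections(src_host, dest),
        reputation=tools.reputation(dest),
        owner=tools.asset_owner(src_host),
        partner=tools.is_business_partner(dest),
    )


class FakeTools:
    """In-memory stand-in so the sketch runs without any real data source."""

    def prior_connections(self, src_host: str, dest: str) -> int:
        return 42 if dest == "partner.example" else 0

    def reputation(self, dest: str) -> Optional[float]:
        return {"partner.example": 0.05}.get(dest)

    def asset_owner(self, src_host: str) -> Optional[str]:
        return {"build-01": "platform-team"}.get(src_host)

    def is_business_partner(self, dest: str) -> bool:
        return dest == "partner.example"
```

Keeping "unknown" as an explicit `None` rather than a default score matters later: the skill can then say what to do when a lookup comes back empty instead of silently treating it as clean.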

Wiring the tools and writing the skill

Now the build. The tools become MCP servers — one for log queries, one for reputation lookups, one for the asset inventory — each scoped to read-only, because triage never needs to change anything. The skill is the heart of the work: a folder that teaches the agent the investigation, written as the explicit version of what the analyst does tacitly. It says, in effect, gather these four enrichments, weigh them this way, dismiss when all of these conditions hold, and escalate with a written rationale otherwise.
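
The "dismiss when all of these conditions hold, escalate otherwise" shape is easy to get subtly wrong, so it can help to write it out as plain logic first. This is a minimal sketch, not the skill itself: the function name `decide`, both thresholds, and the specific conditions are assumptions made for illustration, and a real skill would state values tuned to your environment.

```python
from typing import Optional, Tuple

# Hypothetical thresholds, chosen only to make the example concrete.
MAX_REPUTATION_RISK = 0.2
MIN_PRIOR_CONNECTIONS = 10


def decide(
    reputation: Optional[float],
    prior_connections: int,
    owner: Optional[str],
    partner: bool,
) -> Tuple[str, str]:
    """Return (verdict, rationale). Dismiss only when every condition holds."""
    failed = []
    if reputation is None:
        failed.append("reputation unknown")
    elif reputation > MAX_REPUTATION_RISK:
        failed.append(f"reputation risk {reputation:.2f} above {MAX_REPUTATION_RISK}")
    if prior_connections < MIN_PRIOR_CONNECTIONS:
        failed.append(f"only {prior_connections} prior connections")
    if owner is None:
        failed.append("source asset has no known owner")
    if not partner:
        failed.append("destination is not a known business partner")

    if failed:
        # Any failed or unanswerable check escalates, with the reasons written down.
        return "escalate", "; ".join(failed)
    return "dismiss", "all dismissal conditions held"
```

The design choice worth noticing is the default: escalation is what happens unless every check passes, so a missing lookup can never produce a dismissal by accident.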

The craft here is in being specific about thresholds and explicit about uncertainty. A weak skill says "check if the destination is suspicious." A strong skill says exactly what makes a destination suspicious in your environment, what data answers that question, and what to do when the data is missing or conflicting. You also instruct the agent to treat all telemetry as untrusted content, never as instructions, so a crafted log field cannot steer it. The skill ends by requiring the agent to produce a structured verdict — dismiss or escalate — with the evidence attached, so a human can audit any decision in seconds.
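
A structured verdict is only auditable if something checks that the agent actually produced one. The sketch below assumes the agent returns JSON with `verdict`, `evidence`, and `rationale` keys; that schema and the `parse_verdict` name are illustrative assumptions, not a format Claude Code prescribes. Anything malformed fails closed to an escalation.

```python
import json
from typing import Any, Dict

ALLOWED_VERDICTS = ("dismiss", "escalate")


def parse_verdict(raw: str) -> Dict[str, Any]:
    """Validate the agent's output; anything malformed fails closed to escalate."""
    fallback = {
        "verdict": "escalate",
        "evidence": [],
        "rationale": "unparseable agent output",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("verdict") not in ALLOWED_VERDICTS:
        return fallback
    evidence = data.get("evidence")
    if not isinstance(evidence, list) or not evidence:
        # A verdict without evidence cannot be audited, so a human must look.
        return {**fallback, "rationale": "verdict had no evidence attached"}
    return {
        "verdict": data["verdict"],
        "evidence": evidence,
        "rationale": str(data.get("rationale", "")),
    }
```

For example, `parse_verdict('{"verdict": "dismiss", "evidence": []}')` comes back as an escalation, because a dismissal nobody can audit is not one you want applied automatically.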

Evals, shadow mode, and the gated rollout

The skill does not go to production because it looks good. It goes to production because it passes an eval suite built from a few hundred labeled past alerts, a mix of confirmed-benign and confirmed-malicious. You measure two things relentlessly: what fraction of the benign alerts the agent dismisses (its true-negative rate, which keeps escalations precise and buys back analyst time) and whether it escalates every malicious one (recall, which is non-negotiable). If recall on known attacks is anything less than complete, the skill goes back for revision. A triage agent that misses real attacks is worse than no agent, because it manufactures false confidence.
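
The gate itself can be a few lines of code. The following is a sketch under stated assumptions: the agent is modeled as any callable that returns `"dismiss"` or `"escalate"`, the names `run_evals` and `passes_gate` are hypothetical, and the 0.5 floor on benign dismissals is an arbitrary example value rather than a recommendation.

```python
from typing import Any, Callable, Dict, List, Tuple

# (alert, label) pairs, where label is "benign" or "malicious".
LabeledAlert = Tuple[Dict[str, Any], str]


def run_evals(
    agent: Callable[[Dict[str, Any]], str],
    cases: List[LabeledAlert],
) -> Dict[str, float]:
    """Score an agent that returns "dismiss" or "escalate" for each labeled alert."""
    malicious = [alert for alert, label in cases if label == "malicious"]
    benign = [alert for alert, label in cases if label == "benign"]
    if not malicious or not benign:
        raise ValueError("eval set needs both benign and malicious cases")
    caught = sum(1 for alert in malicious if agent(alert) == "escalate")
    dismissed = sum(1 for alert in benign if agent(alert) == "dismiss")
    return {
        "recall": caught / len(malicious),
        "benign_dismissal_rate": dismissed / len(benign),
    }


def passes_gate(metrics: Dict[str, float], min_benign_dismissal: float = 0.5) -> bool:
    """Recall must be complete; the dismissal floor is a hypothetical example value."""
    return (
        metrics["recall"] == 1.0
        and metrics["benign_dismissal_rate"] >= min_benign_dismissal
    )
```

The asymmetry is deliberate: recall is compared against exactly 1.0, while the benign dismissal rate only needs to clear a floor. An agent that dismisses fewer benign alerts is merely less useful; one that dismisses a single known attack does not ship.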

Once it passes, run it in shadow mode: the agent triages every alert in parallel with the humans, but its verdicts only get logged, not acted on. You compare the agent's calls to the analysts' calls for a couple of weeks. Where they disagree, you learn something: either the agent is wrong and the skill needs work, or the agent caught something the tired human missed. Only after shadow mode looks clean do you flip the gate: let the agent auto-dismiss the lowest-risk alerts on its own, while every escalation and every borderline case still goes to a human. The blast radius of a wrong auto-dismiss is bounded twice over: only the lowest-risk alerts are eligible, and the eval gate and shadow run have already demonstrated recall on known attacks.
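
Both halves of this step, the shadow comparison and the gate, reduce to small functions. The sketch below is illustrative: the record shape, the `risk_score` input, the 0.1 cutoff, and the names `compare_shadow` and `route` are all assumptions rather than anything the tooling defines.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

ShadowRecord = Tuple[str, str, str]  # (alert_id, agent_verdict, human_verdict)

MISSED = "agent_dismissed_human_escalated"  # the dangerous direction
EXTRA = "agent_escalated_human_dismissed"   # costs time, or a human miss


def compare_shadow(records: Iterable[ShadowRecord]) -> Tuple[Counter, Dict[str, List[str]]]:
    """Tally agreement and collect the alert ids behind each kind of disagreement."""
    counts: Counter = Counter()
    review: Dict[str, List[str]] = {MISSED: [], EXTRA: []}
    for alert_id, agent, human in records:
        if agent == human:
            counts["agree"] += 1
        else:
            kind = MISSED if agent == "dismiss" else EXTRA
            counts[kind] += 1
            review[kind].append(alert_id)
    return counts, review


def route(agent_verdict: str, risk_score: float, auto_dismiss_below: float = 0.1) -> str:
    """Only low-risk dismissals skip the human queue; everything else is reviewed."""
    if agent_verdict == "dismiss" and risk_score < auto_dismiss_below:
        return "auto_dismiss"
    return "human_review"
```

Splitting disagreements by direction keeps the review honest. Alerts the agent dismissed but a human escalated each need a root cause before the gate opens; alerts the agent escalated but a human dismissed are where you may find the attacks a tired analyst closed.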

The shipped outcome and the loop that sustains it

What does success look like? The analysts no longer see the flood. They see a stream of escalations that arrive pre-investigated, each with enrichment and a written rationale, plus a periodic sample of auto-dismissals they spot-check to keep the agent honest. The volume of human decisions drops sharply; the quality of each rises. The detection that was a source of fatigue becomes one of the calmer parts of the queue.
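
The periodic spot-check needs some rule for which auto-dismissals get pulled. One option, sketched here as an assumption rather than a description of any particular team's setup, is to hash the alert id so the same alert always lands in or out of the sample. The function name and the percentage parameter are hypothetical.

```python
import hashlib
from typing import Iterable, List


def spot_check_sample(alert_ids: Iterable[str], percent: int) -> List[str]:
    """Pick a stable, roughly percent-sized share of auto-dismissed alerts for review."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    picked = []
    for alert_id in alert_ids:
        digest = hashlib.sha256(alert_id.encode("utf-8")).digest()
        # Map the id to a bucket from 0 to 99; ids in low buckets are sampled.
        bucket = int.from_bytes(digest[:4], "big") % 100
        if bucket < percent:
            picked.append(alert_id)
    return picked
```

A hash-based sample is reproducible, which makes audits easy to rerun. The trade-off is that it is predictable to anyone who can influence alert ids, so a random draw is the safer choice if that is a concern.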

The work is not done — it is now a loop. Every alert the agent mishandles becomes a new eval case. Every change to the skill re-runs the suite before it ships. The asset and reputation data the agent depends on gets monitored for staleness. This is the part teams forget: a shipped agent is a living system that decays without maintenance, exactly like a detection rule. The teams that win treat the eval suite and the skill as code under version control, reviewed and tested on every change, for as long as the agent runs.
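
Two pieces of that loop are mechanical enough to sketch: turning a mishandled alert into a permanent eval case, and flagging data sources that have gone stale. The function names, the case format, and the source names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional, Tuple

EvalCase = Tuple[Dict[str, Any], str]  # (alert, correct_label)


def add_regression_case(cases: List[EvalCase], alert: Dict[str, Any], correct_label: str) -> None:
    """Record a mishandled alert with its correct label, skipping exact duplicates."""
    case = (alert, correct_label)
    if case not in cases:
        cases.append(case)


def stale_sources(
    last_refreshed: Dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of data sources not refreshed within max_age.

    Timestamps are expected to be timezone-aware.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, ts in last_refreshed.items() if now - ts > max_age)
```

With a reputation feed last refreshed nine days ago, an asset inventory refreshed yesterday, and a three-day limit, `stale_sources` returns only the reputation feed. That is the signal to stop trusting dismissals that leaned on it.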

Frequently asked questions

How long does an end-to-end build like this take?

For a single well-scoped detection, many teams get to a shadow-mode prototype in a few weeks and a gated production rollout shortly after. The biggest time sink is rarely the agent itself; it is assembling the labeled eval set and wiring up reliable read-only access to your log and asset data.

Why run shadow mode instead of going straight to auto-dismiss?

Shadow mode lets you compare the agent's verdicts against human decisions on live traffic while nothing acts on those verdicts, so you catch disagreements before they cause harm. It is the cheapest way to build confidence and to surface the skill's remaining gaps under real conditions rather than in a test set.

What is the one decision teams get wrong most often?

Scoping too broadly. Trying to build a single agent that handles every detection at once produces something untestable and unsafe. The teams that succeed pick one noisy, repetitive detection, ship an agent that does only that well, and earn the right to expand from a working foundation.

Bringing agentic AI to your phone lines

This same problem-to-production loop — prototype, evaluate, shadow, gated rollout — is how CallSphere builds agents for voice and chat: assistants that answer every call and message, investigate and act mid-conversation, and book work 24/7 with humans in the loop where it counts. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
