A skill-equipped Claude agent: problem to shipped

A realistic end-to-end walkthrough of building a Claude agent with Skills — from a messy support backlog to a deployed, trusted, production agent.

Most writing about agents stays at the level of architecture diagrams. This post does the opposite: it follows one realistic project from the day someone said "we have a problem" to the day a skill-equipped Claude agent was quietly resolving real work. The domain is deliberately ordinary — an inbound support and scheduling queue for a mid-sized service business — because the ordinary case is where most teams will actually build, and where the unglamorous details decide success.

I will keep the company anonymous and the numbers generic, but the shape is true to how these projects go. The point is not the specific outcome; it is the sequence of decisions, the places it nearly went wrong, and what "shipped" actually meant.

The problem, stated honestly

The starting situation was a backlog. Inbound requests arrived through several channels, a small team triaged them by hand, and about a dozen recurring request types accounted for the overwhelming majority of the volume, among them rescheduling, basic account questions, status checks, and intake for new work. The team was not slow because the work was hard. They were slow because it was repetitive and never-ending, and the genuinely tricky cases waited behind a wall of routine ones.

The wrong framing would have been "build an AI that handles support." That is unbounded and untestable. The framing that worked was narrower: automate the handful of request types that are high-volume, low-judgment, and reversible, and route everything else to a human with a clean summary. That single sentence shaped every decision that followed, because it defined what the agent should and should not touch.

Designing the skills, not the agent

The team resisted the urge to build one giant prompt. Instead they decomposed the routine work into discrete skills, each a folder Claude loads when the situation calls for it: a rescheduling skill, an account-lookup skill, a status-check skill, and an intake skill. Each skill bundled its own instructions and the small scripts that turned an instruction into a concrete action against the booking and account systems.

flowchart TD
  A["Inbound request"] --> B{"Request type recognized?"}
  B -->|No| C["Summarize & route to human"]
  B -->|Yes| D["Load matching skill"]
  D --> E["Run skill script via scoped tool"]
  E --> F{"Action within limits?"}
  F -->|No| C
  F -->|Yes| G["Execute & confirm to customer"]
  G --> H["Log outcome for eval"]

The design insight here is that the agent itself stayed thin. Its job was recognition and routing; the procedures lived in skills, and the irreversible operations lived behind scoped tools with limits. The thinness was deliberate and load-bearing: a thin agent with thick, reviewable skills is far easier to reason about than a thick agent that hides its procedures in a sprawling prompt. This separation is what made the system reviewable. A domain expert could read the rescheduling skill and confirm it matched the real policy, without reading any agent code at all.
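
To make the thin-agent idea concrete, here is a minimal sketch of the recognize-and-route loop from the flowchart above. It is not the team's code: the `Skill` class, the `handle` function, the request dictionary shape, and the outcome labels are all hypothetical names chosen for illustration. The point is only that the routing layer holds no procedure logic of its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    """Hypothetical stand-in for one skill folder and its scoped tool."""
    name: str
    within_limits: Callable[[dict], bool]  # the "Action within limits?" check
    execute: Callable[[dict], str]         # runs the action, returns a confirmation


def summarize(request: dict) -> str:
    """Build the short summary a human sees when a request is routed."""
    return f"[{request.get('type', 'unknown')}] {request.get('text', '')[:80]}"


def handle(request: dict, skills: dict[str, Skill], log: list[dict]) -> dict:
    """Recognize, route, and log. No procedure logic lives here."""
    skill = skills.get(request.get("type", ""))
    if skill is None or not skill.within_limits(request):
        # Unrecognized type or action outside limits: hand off with a summary.
        result = {"outcome": "routed_to_human", "summary": summarize(request)}
    else:
        result = {
            "outcome": "resolved",
            "skill": skill.name,
            "confirmation": skill.execute(request),
        }
    log.append(result)  # this sketch logs every outcome, including routed ones
    return result
```

Because `handle` only looks up a skill and checks its limits, changing a policy means editing one skill rather than the router, which is the property the domain-expert review depended on.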

Choosing the boundaries between skills took real thought. The team's first instinct was to merge account-lookup and status-check into one skill, since they touched the same system. They split them after realizing the two had different risk profiles: a lookup that exposed the wrong field was a privacy issue, while a status check was harmless. Different risk profiles deserve different skills, because they deserve different review scrutiny and different action limits. That principle — let risk, not convenience, draw the skill boundaries — saved them repeatedly as the system grew.

The first build, and where it nearly broke

The first working version handled the happy path well and the edges badly. The rescheduling skill assumed a slot was always available; when none was, the agent improvised and offered a time that did not exist. The account-lookup skill returned data correctly but occasionally surfaced a field it should not have, because the skill author had not specified what not to share. These were not model failures. They were specification gaps — the skills were silent on cases the humans handled by instinct.

Fixing them was a matter of going back to the skill files and making the implicit explicit: what to do when no slot exists, exactly which fields are safe to share, when to stop and escalate. Each fix was a small edit to a versioned file, reviewed like code and re-run against the growing eval set. Within a couple of weeks the edge cases that mattered were covered, and the ones that did not were routed cleanly to a human.
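
As an illustration of making the implicit explicit, the lookup fix can be pictured as an allowlist: anything not named as safe is escalated rather than shared. This is a sketch under assumptions, not the team's skill script. The field names in `SHAREABLE_FIELDS` and the `answer_field` helper are invented for the example.

```python
# Hypothetical allowlist; the real list would come from the reviewed skill file.
SHAREABLE_FIELDS = frozenset({"name", "plan", "next_appointment"})


def answer_field(record: dict, field: str) -> dict:
    """Share a field only if it is explicitly allowed; otherwise escalate."""
    if field not in SHAREABLE_FIELDS:
        return {"action": "escalate", "reason": f"'{field}' is not on the allowlist"}
    if field not in record:
        return {"action": "escalate", "reason": f"'{field}' is missing from the record"}
    return {"action": "share", "field": field, "value": record[field]}
```

An allowlist fails closed: a field added to the account system later is unshared by default until someone reviews it, which is the opposite of the first build's behavior.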

A subtle lesson emerged from this phase: the agent was at its most dangerous not when it failed loudly but when it succeeded plausibly. An offered time slot that did not exist looked like a normal, helpful response right up until the customer tried to use it. The team learned to design every skill so that its actions were verifiable against a source of truth before being confirmed — the booking skill checked real availability, the lookup skill checked field-level permissions — rather than trusting the model's fluent output. Plausibility is not correctness, and the skills that lasted were the ones that closed that gap explicitly.
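
A sketch of what "verifiable against a source of truth" might look like for rescheduling: the slot the model proposes is treated as a suggestion and checked against real availability before anything is confirmed. The `confirm_reschedule` function and its return shape are hypothetical, and `available` stands in for whatever the booking system actually reports.

```python
from datetime import datetime


def confirm_reschedule(proposed: datetime, available: set[datetime]) -> dict:
    """Confirm a slot only if the booking system says it is open."""
    if not available:
        # The case the first build missed: nothing is open, so stop and escalate.
        return {"action": "escalate", "reason": "no open slots"}
    if proposed not in available:
        # The model's suggestion does not exist; offer real slots instead.
        return {
            "action": "offer_alternatives",
            "alternatives": sorted(available)[:3],
        }
    return {"action": "confirm", "slot": proposed}
```

The check is deliberately outside the model: however fluent the proposed time sounds, it reaches the customer only if it appears in the availability set.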

Evals, shadow mode, and the canary

Before the agent touched a real customer, it ran in shadow mode against historical requests. For each past ticket, the agent proposed what it would do, and the team compared that to what actually happened. This surfaced disagreements cheaply — every place the agent and the human diverged was either a bug in the skill or a case the agent should not own. The eval set grew out of these disagreements and became the gate every future skill change had to pass.
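
The shadow-mode comparison can be sketched in a few lines. This assumes each historical ticket is a dictionary with an `id` and the `actual_action` a human took, and that `propose` is whatever function asks the agent what it would have done. All three names are hypothetical.

```python
from typing import Callable


def shadow_run(tickets: list[dict], propose: Callable[[dict], str]) -> tuple[float, list[dict]]:
    """Replay past tickets and collect every place the agent and the human diverged."""
    disagreements = []
    for ticket in tickets:
        proposed = propose(ticket)  # nothing is executed; the agent only proposes
        if proposed != ticket["actual_action"]:
            disagreements.append({
                "ticket_id": ticket["id"],
                "proposed": proposed,
                "actual": ticket["actual_action"],
            })
    agreement = 1 - len(disagreements) / len(tickets) if tickets else 0.0
    return agreement, disagreements
```

The agreement rate is the less interesting output. The disagreement list is what seeds the eval set, since each entry is either a skill bug to fix or a case to route to a human.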

Only then did the agent go live, and even then on a small slice: a fraction of inbound volume, tight action limits, and a kill switch on every skill. The team watched the first live actions one by one. The confidence to widen the slice came not from the demo but from watching real resolutions land correctly, day after day, with the audit log showing exactly what happened each time.
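
One common way to implement a small live slice with per-skill kill switches is deterministic hash bucketing, sketched below. The post does not say how the team actually selected its slice, so treat this as an assumed technique; `in_canary`, `should_automate`, and the `killed` set are hypothetical names.

```python
import hashlib


def in_canary(request_id: str, percent: int) -> bool:
    """Place a request in a stable bucket from 0 to 99 and compare to the slice size."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def should_automate(request_id: str, skill_name: str, percent: int, killed: set[str]) -> bool:
    """A killed skill never runs; otherwise only the canary slice is automated."""
    if skill_name in killed:
        return False
    return in_canary(request_id, percent)
```

Hashing the request ID keeps the decision repeatable, so the audit log can explain after the fact why a given request was or was not automated. Widening the slice is a one-number change, and a kill switch overrides it immediately.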

What "shipped" actually meant

Shipped did not mean the agent handled everything. It meant the routine, reversible request types were resolved automatically end to end, the genuinely tricky cases arrived at a human pre-summarized and pre-triaged, and the whole thing was observable and reversible. The team that used to drown in routine now spent its time on the exceptions — which is exactly where human judgment is worth paying for.

The lasting lesson from the project was that the hard part was never the model. The model was capable from day one. The work was the narrowing of scope, the patient encoding of procedures into skills, the eval set built from real disagreements, and the staged rollout that let the unknown cases surface safely. Get those right and the agent ships. Skip them and you get an impressive demo that nobody trusts in production.

It is worth dwelling on how the team's relationship to the agent changed over the project. Early on they treated it as a system to be impressed by; by the end they treated it as a junior teammate to be supervised, corrected, and gradually trusted with more. That shift in posture — from spectacle to colleague — was itself a deliverable. It is what let them widen scope responsibly instead of either over-trusting the agent into an incident or under-trusting it into uselessness. The agents that ship are the ones whose teams learn to supervise them well, and that learning is as much a part of the project as any line of skill instruction.

Frequently asked questions

How long did a project like this take?

Realistically, a few weeks to a first live slice and a couple of months to broad rollout. Most of that time went into specifying edge cases and building the eval set rather than into the initial build, which came together quickly.

Why split the work into multiple skills instead of one agent prompt?

Separate skills are independently reviewable, testable, and reversible. A domain expert can verify the rescheduling procedure without touching anything else, and a bad change to one skill cannot silently affect the others. One giant prompt is harder to reason about and far harder to roll back.

What made the rollout safe?

Shadow mode against historical data, a small live canary slice with tight action limits, per-skill kill switches, and a full audit log. Each stage answered a different question before the agent earned more scope.

What should the agent never have done?

Anything irreversible or high-judgment without a human. The agent recognized and routed those cases rather than attempting them, which kept the blast radius small and kept human attention on the work that actually needed it.

Bringing agentic AI to your phone lines

CallSphere takes this same problem-to-shipped path for voice and chat — agents that recognize routine requests, run scoped procedures, and book work while escalating the hard cases cleanly. See a live build at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
