Skip to content
Agentic AI
Agentic AI7 min read0 views

Skills in action: shipping a refund agent with Claude

An end-to-end walkthrough of building a Claude refund agent with Skills, MCP, subagents, and projects — from messy problem to safely shipped automation.

Most explanations of Agent Skills stay abstract: skills teach Claude how to use tools, MCP connects the tools, projects hold the context, subagents divide the work. All true, and all hard to feel until you watch the pieces solve one concrete problem from start to finish. So let's build something real and trace every decision: an automated refund-handling agent for a mid-size e-commerce support team that is drowning in repetitive tickets. We'll go from the messy starting state to a shipped, monitored system, and we'll see exactly where each capability earns its place.

The point isn't the refund domain specifically — it's the shape of the work. The same arc applies to onboarding agents, deploy assistants, or claims triage. What matters is the sequence of decisions and how Skills, Projects, MCP, and subagents fit together rather than compete.

The problem we're actually solving

The support team handles a few hundred refund requests a week. Most are boringly similar: a customer asks for a refund, an agent checks order status, verifies it falls within policy, issues the refund, and replies. But a meaningful minority are not boring — partial refunds, items outside the return window, suspected fraud, duplicate charges. The team's real pain is that the simple cases steal the time and attention that the tricky cases deserve. Human agents burn out on copy-paste work and then make mistakes on the cases that need judgment.

The goal, stated precisely, is to let Claude handle the clearly-in-policy refunds end to end, escalate everything ambiguous to a human with a clean summary, and never issue a refund that violates policy. That precision matters: a vague goal like "automate refunds" produces a vague, unsafe agent. A sharp goal tells us exactly where the autonomy boundary sits.

Designing the system: who does what

We start by mapping each capability to a job. The Project is the persistent workspace: it holds the refund policy document, the tone-of-voice guide, and the running context for this body of work, so Claude doesn't relearn the basics every session. MCP servers give Claude scoped access to the order system (read-only) and the payment system (a narrow, audited refund endpoint — not full payment access). The Skill encodes the actual procedure: how to read a request, what policy checks to run in what order, when to escalate, and how to write the reply. And a subagent takes on the fraud-signal check as its own focused task, because that reasoning benefits from a dedicated context window and a different framing than the main flow.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Refund request arrives"] --> B["Skill loads policy from Project"]
  B --> C["Read order via read-only MCP"]
  C --> D{"Clearly in policy?"}
  D -->|No| E["Escalate to human with summary"]
  D -->|Yes| F["Subagent runs fraud-signal check"]
  F -->|Suspicious| E
  F -->|Clean| G["Call scoped refund endpoint"]
  G --> H["Draft & send reply in brand voice"]
  H --> I["Log full trace for audit"]

Notice what this design refuses to do. The payment MCP connection can issue a refund through one audited endpoint and nothing else — it can't move money anywhere it likes. The order connection is read-only. The skill's escalation rule is biased toward caution: when in doubt, hand to a human. We've turned the precise goal into structural constraints, so even a confused agent stays inside safe bounds.

Writing the skill that drives it

The skill is where the procedure lives, and it reads like a runbook a careful new hire could follow. It specifies the order of operations explicitly: first read the order, then check the return window against the order date, then check whether the amount matches the original charge, then check for duplicate-refund history, and only then proceed. Each check has a clear escalate-on-doubt branch. The skill also tells Claude when not to act — if any required field is missing, if the order can't be found, if the amount is partial — because the cheapest failures to prevent are the ones the procedure explicitly forbids.

Crucially, the skill delegates the fraud check rather than inlining it. The main flow says, in effect, "hand the order and customer history to the fraud-signal subagent and wait for a clean/suspicious verdict." That keeps the main procedure readable and lets the fraud reasoning have its own focused context, which is exactly the kind of judgment task that degrades when crammed alongside everything else.

Rolling it out without scaring anyone

We don't flip this to full autonomy on day one. The first week runs in shadow mode: Claude processes every refund request and produces a full recommendation — refund or escalate, with reasoning and a drafted reply — but a human executes the actual action. This does two things. It builds a clean comparison between the agent's judgment and the team's, and it lets the support reps build trust by watching the agent get the easy cases right repeatedly.

Once shadow mode shows the agent agreeing with humans on the clear cases and escalating the genuinely ambiguous ones, we move to human-in-the-loop: the agent acts on the obvious refunds but still pauses at the refund call for a one-click human confirmation. Only after that proves reliable do we let the clearly-in-policy refunds run fully autonomously, while every ambiguous case keeps routing to a person. The escalation path never gets automated away — that's the whole safety story.

The shipped outcome and what made it work

The result is that the simple refunds stop consuming human attention, and the support team's time concentrates on the cases that need judgment — the partials, the disputes, the fraud signals. The agent isn't smarter than the reps; it's tireless and consistent on the part of the job that rewarded neither. That redistribution of attention is the actual win.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

What made it work was not any single capability but the division of labor. The Project held durable context. MCP gave scoped, auditable reach into real systems. The Skill turned policy into an explicit, testable procedure. The subagent isolated the one task that needed dedicated reasoning. And the staged rollout converted a scary all-or-nothing launch into a series of small, reversible steps. Skip any one of those and the system gets either unsafe or unconvincing.

Frequently asked questions

Why use a subagent for the fraud check instead of one big skill?

Because the fraud judgment benefits from its own focused context and framing. Inlining it would bloat the main procedure and dilute the reasoning. Delegating to a subagent keeps the main flow readable and gives the hard sub-problem the attention it needs, at the cost of some extra tokens — a trade worth making for a judgment-heavy step.

How did MCP keep the money side safe?

The payment MCP connection exposed a single audited refund endpoint, not general payment access, and the order connection was read-only. So even if the agent misjudged a case, the worst it could do was issue one refund through a logged, constrained path — never move money arbitrarily. Scoping the tools capped the blast radius structurally.

Couldn't a single clever prompt do this?

For a one-off, maybe. But a prompt is ephemeral and personal, while this procedure runs hundreds of times a week and touches money. Encoding it as a Skill makes it reusable, reviewable, and versioned, and pairing it with scoped MCP tools and staged rollout gives it the safety a raw prompt can't. The durability is the point.

What's the most important design decision in this build?

Biasing escalation toward caution. Because the skill hands every ambiguous case to a human and only automates the clearly-in-policy refunds, the system can be aggressive about saving time on easy cases while staying conservative where it counts. A sharp autonomy boundary, baked into the procedure, is what makes the whole thing shippable.

From refund tickets to ringing phones

CallSphere assembles these same pieces — scoped tools, encoded procedures, and delegated subagents — for voice and chat, so an assistant can verify, decide, and book work mid-conversation. Watch the end-to-end version answer real calls at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.