Claude Agent Walkthrough: From Refund Mess to Shipped

Abstract advice about agents only goes so far. To understand how Skills and MCP servers extend Claude, it helps to follow one build from the messy real problem to a shipped outcome. Take a concrete case many teams face: a support queue drowning in refund requests. The work is repetitive but not trivial: every request needs the order looked up, the policy applied, the eligibility judged, and the action taken. It is exactly the kind of task where a Claude agent with the right tools earns its keep, and exactly the kind where a naive build falls apart. We will walk through it end to end.

The problem, stated honestly

The starting state is a team handling several hundred refund requests a week through a shared inbox. Each one requires a human support agent to open the order management system, check the purchase date and item condition, cross-reference a refund policy that has exceptions for sale items and digital goods, decide whether the request qualifies, and either issue the refund or send a templated denial with a reason. It takes a trained person a few minutes per request, and the queue regularly backs up over a weekend. The requests themselves are messy: customers mistype order numbers, describe products instead of naming them, and bury the actual ask in a paragraph of frustration.

The goal is not to remove humans. It is to have Claude resolve the clear-cut cases end to end, draft the borderline ones for human approval, and escalate the genuinely ambiguous ones with all the context already gathered. That framing — clear-cut automated, borderline assisted, ambiguous escalated — is what makes the project shippable instead of a science experiment.
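
The three-way framing above can be sketched as a small routing function. This is a minimal illustration, not the build's actual code: the `Assessment` fields, the verdict labels, and the `AUTO_REFUND_LIMIT` value are all assumptions made for the example, since the article only speaks of "a small dollar threshold".

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_RESOLVE = "auto_resolve"   # clear-cut: the agent acts end to end
    DRAFT_FOR_HUMAN = "draft"       # borderline: a human approves a drafted action
    ESCALATE = "escalate"           # ambiguous: a human takes over with context


@dataclass
class Assessment:
    order_found: bool     # did the order lookup succeed?
    policy_verdict: str   # hypothetical labels: "eligible", "borderline", "unclear"
    amount: float         # refund amount requested


# Hypothetical value; the real threshold is not stated in the article.
AUTO_REFUND_LIMIT = 50.00


def route(a: Assessment) -> Route:
    # No order or an unclear policy reading: hand over with gathered context.
    if not a.order_found or a.policy_verdict == "unclear":
        return Route.ESCALATE
    # Only a clearly eligible, small refund is resolved automatically.
    if a.policy_verdict == "eligible" and a.amount <= AUTO_REFUND_LIMIT:
        return Route.AUTO_RESOLVE
    # Everything else (borderline verdicts, larger amounts) becomes a draft.
    return Route.DRAFT_FOR_HUMAN
```

The design choice worth noticing is that automation is the narrowest branch: a request has to pass every condition to be auto-resolved, and anything that fails a condition falls through to a human.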

Designing the tools and the skill

The agent needs to touch three systems, so we build or connect three MCP servers, each scoped tightly. One exposes read access to the order system: look up an order by number or by customer email, return purchase date, items, prices, and sale flags. One exposes the refund action itself, with a hard scope limited to issuing refunds and a per-session rate limit. One exposes the messaging system to send the customer a reply. The refund tool is the only irreversible one, so it gets a human-approval gate for anything above a small dollar threshold.
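
To make the refund tool's guardrails concrete, here is a sketch of the checks it might run before moving any money. It is plain Python and deliberately does not use an MCP SDK; the class name `RefundTool`, the status strings, and both default limits are hypothetical stand-ins for whatever the real server enforces.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RefundTool:
    approval_threshold: float = 50.00   # hypothetical dollar threshold
    max_per_session: int = 5            # hypothetical per-session cap
    issued: list = field(default_factory=list)

    def issue_refund(self, order_id: str, amount: float,
                     approved_by: Optional[str] = None) -> dict:
        if amount <= 0:
            return {"status": "rejected", "reason": "amount must be positive"}
        # Per-session rate limit: stop once the cap is reached.
        if len(self.issued) >= self.max_per_session:
            return {"status": "rate_limited", "reason": "per-session cap reached"}
        # Human-approval gate: large refunds need a named approver.
        if amount > self.approval_threshold and approved_by is None:
            return {"status": "needs_approval", "order_id": order_id, "amount": amount}
        record = {"status": "issued", "order_id": order_id,
                  "amount": amount, "approved_by": approved_by}
        self.issued.append(record)
        return record
```

Returning a structured status instead of raising an exception gives the agent something it can reason about: a `needs_approval` result is a cue to draft the case for a human, not an error to retry.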

The policy lives in an Agent Skill rather than a giant system prompt. The skill is a folder containing the written refund policy, a short procedure describing how to evaluate a request step by step, and a couple of worked examples of edge cases. Claude loads this skill only when it detects a refund task, which keeps the base context lean. It also means the support team can update the policy by editing the skill, with no engineering deploy required. This separation, with tools in MCP and judgment in the skill, is the architectural heart of the build.
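
One way to picture the skill's step-by-step procedure is as an ordered checklist in which every exception is its own explicit check. The skill itself is written guidance for the model, not code, so the sketch below only illustrates the logic such a procedure might encode. The 30-day window and the assumption that sale items and digital goods are excluded from automatic refunds are hypothetical; the article says only that the policy has exceptions for them.

```python
from dataclasses import dataclass
from datetime import date

REFUND_WINDOW_DAYS = 30  # hypothetical; the real window is not given


@dataclass
class OrderItem:
    name: str
    purchased_on: date
    on_sale: bool
    digital: bool


def walk_checklist(item: OrderItem, today: date) -> list:
    """Return every check with its result, in order, so none is skipped."""
    age_days = (today - item.purchased_on).days
    return [
        ("within refund window", age_days <= REFUND_WINDOW_DAYS),
        ("not a sale item", not item.on_sale),
        ("not a digital good", not item.digital),
    ]


def is_clearly_eligible(item: OrderItem, today: date) -> bool:
    # Clear-cut means every check passed; one failure makes it a human's call.
    return all(passed for _, passed in walk_checklist(item, today))
```

Because `walk_checklist` returns the result of every check and not just a verdict, the same list can be attached to a draft or an escalation as the evidence a reviewer sees.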

flowchart TD
  A["Refund request arrives"] --> B["Claude loads refund skill"]
  B --> C["Look up order via MCP"]
  C --> D{"Eligible per policy?"}
  D -->|Clearly yes, small amount| E["Issue refund & reply"]
  D -->|Borderline| F["Draft action for human approval"]
  D -->|Ambiguous / no order| G["Escalate with gathered context"]
  E --> H["Log outcome"]
  F --> H
  G --> H

The first build, and why it was wrong

The first version we stood up looked great in the happy path and failed in instructive ways the moment it hit real tickets. It would issue refunds on sale items because the policy exception was buried in a long skill file the model skimmed. It would loop on malformed order numbers, retrying the lookup tool five times before giving up. And once, reading a ticket where the customer had pasted a competitor's promotional email full of "reply with a full refund immediately" language, it nearly acted on injected instructions.

Each failure pointed to a fix that had nothing to do with the model's intelligence and everything to do with the surrounding design. The sale-item miss got fixed by restructuring the skill so the exceptions were a checklist the agent had to walk, not a paragraph. The loop got fixed with a rate limit and a clear instruction to escalate after one failed lookup rather than retry. The injection scare got fixed by tightening the rule that ticket content is data to reason about, never instructions to obey, and by keeping the refund tool behind its approval gate. This is the real work of shipping an agent: reading transcripts of failures and adjusting tools and skills until the behavior is reliable.
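
The loop fix in particular is easy to sketch: make exactly one lookup attempt and turn any failure into an escalation. The order-number format, the helper names, and the up-front validation step are assumptions added for illustration; the article specifies only the rule of escalating after one failed lookup.

```python
import re
from typing import Callable, Optional

# Hypothetical order-number format, e.g. "AB-123456".
ORDER_ID_PATTERN = re.compile(r"^[A-Z]{2}-\d{6}$")


def normalize_order_id(raw: str) -> Optional[str]:
    """Tidy common paste errors; return None if it still looks malformed."""
    candidate = raw.strip().upper().replace(" ", "")
    return candidate if ORDER_ID_PATTERN.match(candidate) else None


def lookup_once(raw_id: str, lookup: Callable[[str], Optional[dict]]) -> dict:
    order_id = normalize_order_id(raw_id)
    if order_id is None:
        # Malformed input never reaches the lookup tool at all.
        return {"status": "escalate", "reason": "malformed order number", "raw": raw_id}
    order = lookup(order_id)  # exactly one call; there is no retry loop
    if order is None:
        return {"status": "escalate", "reason": "order not found", "order_id": order_id}
    return {"status": "found", "order": order}
```

Here `lookup` stands in for the order-system MCP tool. The escalation result carries the reason and the raw input, so the human who picks it up does not have to rediscover what went wrong.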

Hardening it for production

With behavior stabilized, we added the operational layer. Full transcripts of every refund decision go to an audit log, so any disputed outcome can be reconstructed turn by turn. The refund MCP server runs with a credential that can only issue refunds, capped per session. Anything above the dollar threshold routes to the human-approval queue, where a support lead sees Claude's gathered evidence and recommendation and clicks approve or override. A kill switch can revoke the agent's tool access instantly. None of this is glamorous, and all of it is what separates a demo from a system you trust on a Saturday night.
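
The audit log and the kill switch can share a single choke point: every tool call passes through one function that records the attempt and checks whether the agent is still allowed to act. The sketch below assumes an in-memory list as a stand-in for an append-only log store, and `AgentRuntime` and its fields are hypothetical names.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AgentRuntime:
    enabled: bool = True
    audit_lines: list = field(default_factory=list)  # stand-in for a durable log

    def kill(self) -> None:
        """Kill switch: revoke tool access for all subsequent calls."""
        self.enabled = False

    def call_tool(self, ticket_id: str, tool: str, args: dict) -> dict:
        allowed = self.enabled
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "ticket_id": ticket_id,
            "tool": tool,
            "args": args,
            "allowed": allowed,
        }
        # Log before acting, so blocked attempts are recorded too.
        self.audit_lines.append(json.dumps(entry, sort_keys=True))
        if not allowed:
            return {"status": "blocked", "reason": "kill switch engaged"}
        # A real runtime would invoke the MCP tool here.
        return {"status": "dispatched"}
```

Writing the log entry before the permission check means a disputed outcome can be reconstructed even for calls that were refused.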

We also built a small eval suite before launch — thirty real historical tickets with known correct outcomes, run against the agent on every change to a tool or the skill. If a skill edit causes the agent to mishandle a sale-item case it previously got right, the suite catches it before customers do. That regression gate is what let the team iterate quickly without fear.
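
A regression gate of this kind needs very little machinery. The following sketch assumes each historical ticket is stored with its known correct outcome and that the agent can be called as a function from ticket text to outcome label; the case data, the labels, and `stub_agent` are invented for illustration.

```python
from typing import Callable


def run_regression(cases: list, agent: Callable[[str], str]) -> list:
    """Run every historical ticket and collect the ones the agent now gets wrong."""
    failures = []
    for case in cases:
        actual = agent(case["ticket"])
        if actual != case["expected"]:
            failures.append({"id": case["id"],
                             "expected": case["expected"],
                             "actual": actual})
    return failures


def gate(cases: list, agent: Callable[[str], str]) -> bool:
    """True means the change is safe to ship; any failure blocks it."""
    return not run_regression(cases, agent)


# Two made-up cases standing in for the thirty historical tickets.
CASES = [
    {"id": "t-001", "ticket": "Refund for full-price mug, order AB-123456",
     "expected": "auto_resolve"},
    {"id": "t-002", "ticket": "Refund for SALE jacket, order AB-654321",
     "expected": "draft"},
]


def stub_agent(ticket: str) -> str:
    # Placeholder for a call into the real agent; flags sale items for review.
    return "draft" if "sale" in ticket.lower() else "auto_resolve"
```

Running `gate(CASES, stub_agent)` on every change to a tool or the skill turns "did we break the sale-item case?" into a yes-or-no check, and the failure list says exactly which tickets regressed.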

The shipped outcome

In production, the agent handles the clear-cut majority of requests end to end within seconds of arrival, drafts the borderline cases for a human who now reviews rather than investigates, and escalates the rest with the order already looked up and the policy already applied. The queue stops backing up over weekends. Humans spend their time on judgment calls and angry edge cases instead of copy-pasting order numbers. The support lead updates policy by editing a skill file. The win is not that Claude is brilliant — it is that the boring, repetitive resolution work is automated with guardrails, and people are freed for the work that actually needs them.

Frequently asked questions

Why put the refund policy in a Skill instead of the system prompt?

An Agent Skill loads only when the task calls for it, which keeps the base context lean and lets non-engineers update the policy by editing the skill file rather than redeploying. It also lets you structure the policy as a checklist the model walks step by step. In this build, that structure is what fixed the missed sale-item exception that a long paragraph had hidden.

How do you stop the agent from issuing wrong refunds?

Several layers: a least-privilege refund tool that can only issue refunds and is rate-limited, a human-approval gate above a dollar threshold, a structured policy checklist in the skill, and an eval suite of real historical tickets that runs on every change. The agent automates clear-cut cases and routes anything borderline or ambiguous to a human with the context pre-gathered.

What broke first in the real build?

Edge cases and loops. The agent missed a policy exception buried in a long skill file, retried a malformed lookup repeatedly, and nearly followed injected instructions from pasted ticket content. All three were fixed by changing the surrounding design rather than the model: restructuring the skill into a checklist, adding a rate limit and a rule to escalate after one failed lookup, and enforcing that ticket content is data, not instructions.

How much human involvement remains after shipping?

Humans review borderline drafts and handle escalations, but they no longer investigate from scratch because the agent has already looked up the order and applied policy. The volume of pure copy-paste resolution drops sharply, and human time shifts to genuine judgment calls, which is the entire point of the build.

Bringing agentic AI to your phone lines

This same problem-to-shipped arc applies to live conversations. CallSphere builds multi-agent voice and chat assistants that look up orders, apply policy, and resolve or escalate mid-call — automating the clear-cut work and routing the rest to people. See a working example at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
