From problem to shipped: an agentic Claude Code walkthrough

Most writing about agentic AI stops at the demo. Someone shows a prompt, an agent does something impressive, and the curtain falls before you see what it took to make the result real. This post does the opposite. It follows one team at a Built-with-Opus hackathon all the way from a vague problem statement to a working, verified feature deployed behind a flag — including the messy middle that demos skip.

The problem was ordinary, which is why it is useful: a small SaaS team had a support inbox full of refund requests, and triaging them by hand ate hours a day. They wanted an agent that could read a request, decide whether it met the refund policy, and either draft an approval or escalate. By the end of the day they had it shipped. Here is exactly how, step by step.

Step one: turning a vague ask into a spec

The team's first move was not to prompt anything. It was to write a one-page spec. They listed the inputs (the email body and the customer's order history), the decision rule (the refund policy, written out as explicit conditions), and the outputs (approve-and-draft, deny-and-draft, or escalate-to-human). Crucially, they wrote three example cases with the correct answer for each — one clear approval, one clear denial, one genuine edge case.

Those examples became the acceptance test. Before a single line of agent code, they had a definition of done they could check against. This is the step most teams skip, and skipping it is why so many agent projects feel like they almost work forever. The spec took twenty minutes and saved the entire afternoon.
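To make "examples as acceptance test" concrete, here is a minimal sketch of how the three cases could be written down as data and checked. The names (`Decision`, `ExampleCase`, `run_acceptance`) and the case contents are hypothetical placeholders, not the team's actual spec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Decision(Enum):
    # The three outputs listed in the spec.
    APPROVE = "approve-and-draft"
    DENY = "deny-and-draft"
    ESCALATE = "escalate-to-human"


@dataclass(frozen=True)
class ExampleCase:
    name: str
    email_body: str
    order_history: tuple  # hypothetical shape: a tuple of order records
    expected: Decision


# Placeholder cases: one clear approval, one clear denial, one edge case.
CASES = [
    ExampleCase("clear approval", "Please refund my order from last week.",
                ({"order_id": "A-1"},), Decision.APPROVE),
    ExampleCase("clear denial", "I want a refund for something I bought long ago.",
                ({"order_id": "B-2"},), Decision.DENY),
    ExampleCase("edge case", "Refund my order, please.",
                ({"order_id": "C-3"}, {"order_id": "C-4"}), Decision.ESCALATE),
]


def run_acceptance(agent: Callable[[str, tuple], Decision], cases=CASES) -> list:
    """Return the names of the cases the agent got wrong; empty means done."""
    return [c.name for c in cases
            if agent(c.email_body, c.order_history) != c.expected]
```

Any callable that takes an email body and an order history and returns a `Decision` can be passed to `run_acceptance`, so the same check works for a stub, a prototype, and the real agent.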

Step two: wiring the agent's tools

With the spec set, they connected Claude Code to the data it needed. The order history lived in their database, so they exposed it through an MCP server with a read-only query tool. The refund policy went into an Agent Skill — a folder with the policy written as clear instructions plus a couple of worked examples, which Claude loads when the task is relevant. The drafting and escalation actions became two more tools, each with tight argument schemas.
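As a rough illustration of what "tight argument schemas" might look like, the sketch below describes the two action tools as JSON Schema style dictionaries and checks arguments against them. The tool names, field names, and the `check_args` helper are assumptions made for this example; the checker handles only the string fields used here and is not a full JSON Schema validator.

```python
# Hypothetical schema for the drafting tool: it can approve or deny, nothing else.
DRAFT_REPLY_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "decision": {"type": "string", "enum": ["approve", "deny"]},
        "body": {"type": "string"},
    },
    "required": ["order_id", "decision", "body"],
    "additionalProperties": False,
}

# Hypothetical schema for the escalation tool.
ESCALATE_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "reason": {"type": "string"},
    },
    "required": ["order_id", "reason"],
    "additionalProperties": False,
}


def check_args(schema: dict, args: dict) -> list:
    """Return a list of problems with args; an empty list means they are valid."""
    errors = []
    props = schema["properties"]
    for key in schema["required"]:
        if key not in args:
            errors.append(f"missing: {key}")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected: {key}")
        elif not isinstance(value, str):
            errors.append(f"not a string: {key}")
        elif "enum" in props[key] and value not in props[key]["enum"]:
            errors.append(f"not allowed: {key}={value}")
    return errors
```

The narrowness is the point: with a closed `enum` and no extra properties allowed, there is simply no argument the agent can pass that means "send this now" or "refund this amount".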

flowchart TD
  A["Refund email arrives"] --> B["Agent reads request & order history via MCP"]
  B --> C["Load refund-policy Skill"]
  C --> D{"Meets policy?"}
  D -->|Clear yes| E["Draft approval"]
  D -->|Clear no| F["Draft denial with reason"]
  D -->|Ambiguous| G["Escalate to human queue"]
  E --> H["Run against test cases"]
  F --> H
  G --> H
  H -->|Pass| I["Ship behind feature flag"]

Notice the shape: the agent reads through a read-only tool, reasons using a skill that encodes the policy, and acts through narrow tools that can only draft or escalate — never send. That last constraint was deliberate. In the first version, the agent could only produce a draft for a human to send. Keeping the irreversible action (actually emailing the customer) behind a person meant the worst early bug was a bad draft, not a wrongly refunded order.

Step three: the first run and the inevitable failures

The first end-to-end run failed the edge case, which is exactly what a good test is for. The agent approved a refund that the policy excluded: the purchase fell outside the refund window. Reading the transcript showed why. The agent had the order date but had not computed the elapsed time correctly. This is the kind of bug you catch only by reading the agent's reasoning, not just its output.

The fix was not a prompt tweak. It was giving the agent a tool. They added a tiny date-difference function so the agent did not have to do date arithmetic in its head, then re-ran the test cases, and all three passed. The lesson generalizes: when an agent reliably fails at a mechanical sub-task, do not coax it with better wording. Give it a deterministic tool that does the thing correctly.
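A date-difference tool of this kind can be very small. The sketch below is one plausible shape for it, using Python's standard `datetime` module; the function names and the `window_days` parameter are assumptions, since the post does not give the team's code or the length of the refund window.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Whole days from start to end; negative if end is before start."""
    return (end - start).days


def within_refund_window(purchased: date, requested: date, window_days: int) -> bool:
    """True if the request falls on or after the purchase date and within the window."""
    elapsed = days_between(purchased, requested)
    return 0 <= elapsed <= window_days
```

For example, `days_between(date(2024, 1, 1), date(2024, 2, 15))` returns `45`, so with a hypothetical 30-day window `within_refund_window` returns `False` for that pair. The agent calls the tool and reads the answer instead of estimating it.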

Step four: hardening against the messy real world

Passing three handcrafted cases is not the same as surviving real inputs. So the team pulled twenty real (anonymized) past requests and ran the agent over all of them, comparing its decisions to what humans had actually decided. It matched on seventeen, escalated two correctly, and got one wrong — a case where a customer's phrasing was ambiguous about which order they meant.
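The comparison against past human decisions can be tallied with a few lines. This is a sketch under two assumptions: decisions are plain strings, and an agent escalation that differs from the human's final decision is counted in its own bucket rather than as an error. The function name `score` and the bucket names are illustrative.

```python
from collections import Counter


def score(pairs) -> Counter:
    """Tally (agent_decision, human_decision) pairs into three buckets."""
    tally = Counter()
    for agent, human in pairs:
        if agent == human:
            tally["match"] += 1
        elif agent == "escalate":
            # The agent deferred to a human instead of deciding.
            tally["escalated"] += 1
        else:
            tally["wrong"] += 1
    return tally


# Synthetic pairs that reproduce the counts reported above (17 / 2 / 1).
pairs = (
    [("approve", "approve")] * 17
    + [("escalate", "deny")] * 2
    + [("approve", "deny")]
)
result = score(pairs)
```

Whether those escalations were correct is still a human judgment. The tally only separates "the agent deferred" from "the agent decided wrongly", which are very different failure modes.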

Rather than chase that single case with more prompt engineering, they made a judgment call that defines good agentic design: when the agent is not confident which order the customer means, escalate. They added that rule to the policy skill. The point is that the right answer to an ambiguous case is often "hand it to a human," and building that exit ramp is more valuable than squeezing out one more automated decision.

Step five: shipping behind a flag with a feedback loop

They shipped it behind a flag, enabled for ten percent of incoming refund emails, with every agent decision logged alongside the human's eventual action. This gave them a live, growing eval set. Over the following week, the agreement rate between agent and human was the metric they watched; when it held steady above their threshold, they widened the rollout. When a disagreement appeared, it became a new test case.
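One common way to build a percentage rollout like this is deterministic hash bucketing, sketched below together with the agreement metric. This is an assumed implementation, not the team's: the function names, the `email_id` key, and the log record shape (`agent` and `human` fields, with `human` set to `None` until a person has acted) are all hypothetical.

```python
import hashlib


def in_rollout(email_id: str, percent: int) -> bool:
    """Deterministically place an email in one of 100 buckets by hashing its ID."""
    digest = hashlib.sha256(email_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def agreement_rate(log: list):
    """Share of resolved records where agent and human agree, or None if none are resolved."""
    resolved = [r for r in log if r["human"] is not None]
    if not resolved:
        return None
    return sum(r["agent"] == r["human"] for r in resolved) / len(resolved)


def new_test_cases(log: list) -> list:
    """Resolved records where the agent and the human disagreed."""
    return [r for r in log if r["human"] is not None and r["agent"] != r["human"]]
```

Because the bucket depends only on the ID, an email that is in the ten percent group stays in the group when the percentage is raised, which keeps the logged decisions comparable as the rollout widens.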

This is the whole arc that demos hide. The agent was never "done" in one shot. It went from spec, to tools, to a failing test, to a real fix, to hardening on real data, to a gated rollout with a feedback loop that keeps improving it. The agent did enormous amounts of the work — reading, reasoning, drafting — but the humans owned the spec, the tests, and the boundary of what the agent was allowed to do irreversibly.

What made this work in a single day

Three things compressed weeks of work into a single day. The spec-and-examples discipline meant the team always knew whether they were done. The tool-and-skill architecture meant the agent had clean access to data and policy instead of guessing. And the irreversible-action boundary meant they could move fast without fear, because the costliest mistakes were structurally impossible. None of these are exotic; all of them are repeatable.

Frequently asked questions

Why write a spec before prompting the agent?

Because without a definition of done, you cannot tell whether the agent's output is correct, and you end up endlessly tweaking prompts. A one-page spec with a few worked examples becomes your acceptance test and saves far more time than it costs.

When should I give the agent a tool instead of a better prompt?

Whenever it reliably fails at a mechanical sub-task like date math, lookups, or precise calculation. A deterministic tool does those correctly every time; coaxing the model into doing them in its head is fragile. Reserve prompting for judgment, not arithmetic.

How do I keep an agent safe before I trust it?

Keep the irreversible action — sending, charging, deleting — behind a human. Let the agent read, reason, and draft freely, but require a person to commit the consequential step until the agreement rate on real data earns more autonomy.

How do I know when to widen the rollout?

Ship behind a flag and track agreement between the agent's decisions and the human ground truth on live traffic. When that rate holds above your threshold over enough volume, widen it; every disagreement becomes a new test case that keeps the system honest.

From inbox to phone line

CallSphere takes this same problem-to-shipped agentic workflow to voice and chat — agents that read context, reason over policy, call tools mid-conversation, and escalate to humans when it matters. See a working example at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
