Building a Claude Agent: A Real End-to-End Walkthrough

Most agent content stops at the demo: a clever prompt, a slick tool call, a screenshot. The interesting work starts after that, when a real team has to take a vague business pain and turn it into something that runs against production systems without anyone losing sleep. This post walks one concrete, end-to-end build — a vendor-invoice triage agent for a mid-size company — from the first messy problem statement to a shipped, measured outcome. The company is illustrative, but every step is the work you actually do.

The problem, stated honestly

The finance team receives a few hundred vendor invoices a week by email. A human opens each one, checks it against an open purchase order, flags mismatches, and routes clean ones for payment. It is slow, it is error-prone late on Fridays, and it cannot scale because the team is not getting more headcount. The naive ask is "build an AI that pays our invoices." The honest restatement, after a half-day with the finance lead, is narrower and far more shippable: build an agent that reads each incoming invoice, matches it to a purchase order, and either routes a clean match for one-click human approval or flags a specific discrepancy for a person to resolve. The agent never moves money. That single scoping decision shrinks the blast radius enormously and is what makes the project safe to ship.

Designing the agent and its tools

With the task scoped, the architecture follows. The agent needs to read an invoice, look up a purchase order, compare them, and write a routing decision. Each of those becomes a tightly scoped tool exposed through an MCP server. Model Context Protocol is the open standard that lets Claude call these external tools and data sources through a server you control, and the discipline here is to expose exactly four capabilities — read invoice, fetch PO by number, fetch PO line items, and create a routing record — and nothing that could move funds or delete data.
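One way to keep that discipline honest is to write the tool surface down as data and check it in CI. The sketch below is only an illustration: the tool names, the `ToolSpec` type, and the forbidden-verb list are hypothetical, and it deliberately uses plain Python rather than any particular MCP SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    writes: bool  # True if the tool changes state anywhere


# Hypothetical names for the four capabilities described in the text.
TOOLS = (
    ToolSpec("read_invoice", "Return the parsed fields of one invoice.", writes=False),
    ToolSpec("fetch_po", "Return a purchase order header by PO number.", writes=False),
    ToolSpec("fetch_po_line_items", "Return the line items of a purchase order.", writes=False),
    ToolSpec("create_routing_record", "Record an approve-or-flag routing decision.", writes=True),
)

# Illustrative list of words that should never appear in an exposed tool name.
FORBIDDEN_WORDS = {"pay", "transfer", "refund", "delete", "void"}


def check_tool_surface(tools):
    """Raise ValueError if the exposed surface drifts from the scoped design."""
    names = [t.name for t in tools]
    if len(names) != len(set(names)):
        raise ValueError("duplicate tool names")
    for name in names:
        if set(name.split("_")) & FORBIDDEN_WORDS:
            raise ValueError(f"tool {name!r} looks like it moves funds or deletes data")
    writers = [t.name for t in tools if t.writes]
    if writers != ["create_routing_record"]:
        raise ValueError(f"unexpected write-capable tools: {writers}")
    return names


print(check_tool_surface(TOOLS))
```

A check like this does not replace server-side permissions, but it makes an accidental fifth tool show up as a failing test rather than a production surprise.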

flowchart TD
  A["Invoice email arrives"] --> B["Agent extracts fields"]
  B --> C["Fetch PO via MCP tool"]
  C --> D{"Lines & totals match?"}
  D -->|Yes| E["Route for one-click approval"]
  D -->|No| F["Flag specific discrepancy"]
  E --> G["Write routing record"]
  F --> G
  G --> H["Log full trace for audit"]
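
The decision node in the flowchart can be sketched as a small deterministic comparison. This is a minimal illustration under stated assumptions: the `Line` type, the discrepancy labels, the route names, and the one-cent `TOTAL_TOLERANCE` are all hypothetical, and in the real build the tolerance would come from the Skill.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    sku: str
    quantity: int
    unit_price: Decimal


# Assumed tolerance on totals; the real value would live in the Skill.
TOTAL_TOLERANCE = Decimal("0.01")


def line_total(lines):
    """Sum quantity * unit_price across lines using exact decimal arithmetic."""
    return sum((l.quantity * l.unit_price for l in lines), Decimal("0"))


def triage(invoice_lines, po_lines):
    """Return a routing record: approve on a clean match, else flag with reasons."""
    if po_lines is None:
        return {"route": "flag", "discrepancies": ["missing_po"]}
    po_by_sku = {l.sku: l for l in po_lines}
    problems = []
    for inv in invoice_lines:
        po = po_by_sku.get(inv.sku)
        if po is None:
            problems.append(f"unexpected_item:{inv.sku}")
            continue
        if inv.quantity != po.quantity:
            problems.append(f"quantity_mismatch:{inv.sku}")
        if inv.unit_price != po.unit_price:
            problems.append(f"price_mismatch:{inv.sku}")
    if abs(line_total(invoice_lines) - line_total(po_lines)) > TOTAL_TOLERANCE:
        problems.append("total_mismatch")
    if problems:
        return {"route": "flag", "discrepancies": problems}
    return {"route": "approve_one_click", "discrepancies": []}


po = [Line("A1", 10, Decimal("2.50"))]
print(triage([Line("A1", 10, Decimal("2.50"))], po))
print(triage([Line("A1", 12, Decimal("2.50"))], po))
```

Note that neither route pays anything: the "approve" branch only queues the invoice for a human's one-click approval.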

Alongside the tools, we author a Skill — a folder of instructions and reference material Claude loads when relevant — that teaches the agent the company's invoice rules: how POs are numbered, what tolerance is acceptable on totals, which discrepancy types matter, and the exact format of a routing record. The Skill is where domain knowledge lives, co-written by an engineer and the finance lead so it encodes how the team actually reasons, not how an outsider imagines they do.

Building the eval set before building the agent

The order matters: we assembled the evaluation set before tuning the agent. We pulled roughly a hundred and fifty historical invoices spanning clean matches, quantity mismatches, price discrepancies, missing POs, and a few genuinely ambiguous cases, and we recorded the correct human decision for each. This becomes the graded test the agent must pass. The point is not a single accuracy number; it is per-category scoring, because an agent that nails clean matches but mishandles discrepancies is dangerous precisely where it matters. We also seeded the set with a few adversarial cases — an invoice with text that tries to instruct the agent to approve itself — to confirm it treats invoice content as data, never as commands.
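
Per-category scoring is simple to sketch. The example below assumes a hypothetical case format (a dict with `category`, `invoice`, and `expected` keys) and a `predict` callable standing in for the agent. The category names and the choice of "flag" as the expected decision for the adversarial case are assumptions for illustration.

```python
from collections import defaultdict


def score_by_category(cases, predict):
    """Return accuracy per category.

    cases: iterable of dicts with 'category', 'invoice', and 'expected' keys.
    predict: callable mapping an invoice to a decision string.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for case in cases:
        totals[case["category"]] += 1
        if predict(case["invoice"]) == case["expected"]:
            correct[case["category"]] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Toy data, not real invoices.
cases = [
    {"category": "clean_match", "invoice": "inv-1", "expected": "approve"},
    {"category": "clean_match", "invoice": "inv-2", "expected": "approve"},
    {"category": "quantity_mismatch", "invoice": "inv-3", "expected": "flag"},
    {"category": "adversarial_instruction", "invoice": "inv-4", "expected": "flag"},
]


def approve_everything(invoice):
    """A deliberately bad stand-in agent that approves whatever it sees."""
    return "approve"


print(score_by_category(cases, approve_everything))
```

The stand-in agent gets half of the toy cases right overall, yet scores zero on both categories where a mistake is costly, which is exactly what a single accuracy number would hide.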

Iterating to a passing bar

The first run was unimpressive in a useful way: the agent matched clean invoices well but was too eager to call near-misses "clean," approving items it should have flagged. The fix was not a bigger model; it was sharper context and instructions. We tightened the Skill's tolerance rules, restructured the tool results so line items and totals were presented in a stable, comparable form, and added an explicit instruction to flag rather than guess when confidence was low. Each change was measured against the eval set, so we could see that flagging behavior improved without regressing clean-match accuracy. After several cycles, the agent crossed the bar the finance lead set: it had to flag every true discrepancy in the test set, even at the cost of occasionally over-flagging a clean one — failing safe by design.
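
The finance lead's bar is asymmetric, and it helps to encode that asymmetry directly. A minimal sketch, assuming each eval result is a hypothetical `(expected, predicted)` pair of "approve" or "flag":

```python
def meets_bar(results):
    """Check the fail-safe bar: zero missed discrepancies; over-flagging is tolerated.

    results: list of (expected, predicted) pairs, each 'approve' or 'flag'.
    """
    missed = sum(1 for exp, pred in results if exp == "flag" and pred == "approve")
    over_flagged = sum(1 for exp, pred in results if exp == "approve" and pred == "flag")
    return {
        "passes": missed == 0,
        "missed_discrepancies": missed,
        "over_flagged": over_flagged,
    }


# One clean invoice is over-flagged, but no true discrepancy slips through.
print(meets_bar([("approve", "approve"), ("approve", "flag"), ("flag", "flag")]))
```

Reporting `over_flagged` alongside the pass/fail result keeps the cost of failing safe visible, so it can be driven down without loosening the bar.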

Staged rollout, not a big bang

We did not point the agent at the live queue on day one. The rollout had three stages. First, shadow mode: the agent ran on every real incoming invoice for two weeks but its decisions went only to a log, compared daily against what the humans actually decided. This surfaced edge cases the historical set missed — a new vendor's odd numbering scheme — which we folded back into the Skill and evals. Second, assisted mode: the agent's recommendation appeared next to each invoice, and humans approved or corrected it, which both built trust and generated more labeled data. Third, live routing: the agent's clean-match routing flowed straight to one-click approval, while every flag still went to a person. At no stage did the agent gain the ability to pay.
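
The three stages amount to a routing switch in front of the agent's output, plus a daily comparison during shadow mode. The sketch below is illustrative only: the `Stage` enum, the destination names, and the pair format are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    ASSISTED = "assisted"
    LIVE = "live"


def dispatch(stage, decision):
    """Return where the agent's decision ('approve' or 'flag') goes at each stage."""
    if stage is Stage.SHADOW:
        return "log_only"
    if stage is Stage.ASSISTED:
        return "show_recommendation_to_human"
    # Live routing: clean matches go to one-click approval, flags go to a person.
    if decision == "approve":
        return "one_click_approval_queue"
    return "human_review_queue"


def shadow_agreement(pairs):
    """Compare (agent_decision, human_decision) pairs from shadow mode.

    Returns the agreement rate and the indexed disagreements to review.
    """
    disagreements = [(i, a, h) for i, (a, h) in enumerate(pairs) if a != h]
    rate = 1 - len(disagreements) / len(pairs) if pairs else 0.0
    return rate, disagreements


rate, to_review = shadow_agreement(
    [("approve", "approve"), ("flag", "flag"), ("approve", "flag"), ("flag", "flag")]
)
print(rate, to_review)
```

No branch of `dispatch` leads to a payment: even in the live stage, the most the agent can do is place an invoice in a queue where a human clicks approve.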

The shipped outcome and how we measured it

The outcome was defined in the finance team's terms, not ours. The primary metric was human-minutes per invoice, which fell substantially because clean matches, which make up the bulk of volume, no longer required manual PO lookup. The guardrail metric was missed discrepancies, which had to stay at or below the prior human baseline; the agent met it because its bias was to flag when unsure. We also tracked override rate in assisted mode, watching it decline as the Skill improved, and cost per invoice in tokens, which stayed comfortably under the cost of the labor it replaced. The project closed not when the agent worked in a demo, but when those four numbers held steady for a full month in live routing.
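
The four numbers can be computed from one record per invoice. This is a sketch with a hypothetical `InvoiceRecord` shape and toy values, not the team's data. It reports token usage rather than a currency amount, so turning it into a cost would need a price per token that is not assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceRecord:
    human_minutes: float   # time a person spent on this invoice
    agent_decision: str    # 'approve' or 'flag'
    correct_decision: str  # what the right call turned out to be
    overridden: bool       # a human changed the agent's recommendation
    tokens: int            # model tokens spent on this invoice


def summarize(records, baseline_missed):
    """Compute the four tracked metrics plus the guardrail check."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to summarize")
    missed = sum(
        1 for r in records
        if r.agent_decision == "approve" and r.correct_decision == "flag"
    )
    return {
        "human_minutes_per_invoice": sum(r.human_minutes for r in records) / n,
        "missed_discrepancies": missed,
        "guardrail_ok": missed <= baseline_missed,
        "override_rate": sum(1 for r in records if r.overridden) / n,
        "tokens_per_invoice": sum(r.tokens for r in records) / n,
    }


# Toy records for illustration only.
records = [
    InvoiceRecord(1.0, "approve", "approve", False, 1000),
    InvoiceRecord(1.0, "approve", "approve", False, 1200),
    InvoiceRecord(4.0, "flag", "flag", False, 1800),
    InvoiceRecord(2.0, "flag", "approve", True, 2000),
]
print(summarize(records, baseline_missed=0))
```

Running the same summary over each week of live routing gives a simple way to see whether the numbers are holding steady.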

What the walkthrough teaches

The technical pieces — MCP tools, a Skill, a Claude model — were the easy third of the work. The decisive moves were scoping the task so the agent never did anything irreversible, building the eval set first so iteration was measurable, and rolling out in shadow-then-assisted-then-live stages so trust and data accumulated together. That sequence generalizes to almost any enterprise agent worth shipping.

Frequently asked questions

How long does an enterprise agent like this take to ship?

For a scoped task with available historical data, a small team can reach live routing in a handful of weeks. Most of the time goes to eval construction and the shadow and assisted stages, not to the initial build.

Why build the eval set before the agent?

Because without a graded test you are tuning blind. Building the eval set first turns every change into a measurable experiment and prevents the common trap of optimizing happy-path accuracy while quietly breaking the rare cases that matter most.

What does shadow mode actually buy you?

It runs the agent on real traffic with zero risk, comparing its decisions to human ones and surfacing edge cases your historical data missed. It converts unknown unknowns into labeled examples before the agent can affect anything.

Should the agent ever take the final irreversible action?

In this build, no — it routes and flags but never pays. Keeping the irreversible step behind one-click human approval is what makes a high-stakes agent safe to ship while still capturing most of the time savings.

Bringing the same playbook to your phone lines

CallSphere builds voice and chat agents this way — scoped tasks, real tools mid-conversation, staged rollout, and metrics that prove it works before it goes live. See a shipped example at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
