Claude Agent Walkthrough: Invoice Triage End-to-End
A realistic end-to-end Claude deployment for invoice triage — scoping, MCP tools, the eval harness, and shadow-mode rollout from problem to shipped outcome.
Most write-ups about building with Claude stop at a toy demo. This one follows a single, ordinary enterprise problem all the way from "this is painful" to "this is in production and we trust it." The problem is invoice triage — the unglamorous, high-volume work of receiving supplier invoices, matching them to purchase orders, flagging discrepancies, and routing the clean ones for payment. It is a perfect first agentic deployment: high volume, clear rules with messy edge cases, real money at stake, and a well-defined success metric.
We will walk the whole arc: scoping the outcome, designing the agent and its tools, building the eval harness, shipping behind a shadow mode, and expanding scope once the numbers earn it. The specifics are invoice triage, but the shape applies to any document-heavy workflow you are eyeing for Claude.
Key takeaways
- Start from a measurable outcome ("reduce manual touches per invoice"), not from "let's use AI."
- Decompose the workflow into discrete tool-backed steps — extract, match, check, route — so each is testable and scoped.
- Use MCP tools for the data and actions; keep the agent's reasoning in the prompt and skills, not hard-coded glue.
- The eval harness comes before launch: golden invoices with known-correct outcomes gate every release.
- Ship in shadow mode first — the agent decides, a human approves — and expand autonomy as accuracy proves out.
Step one: scope the outcome and the blast radius
Before any building, we wrote down two things. The outcome: cut the share of invoices that need a human touch from roughly all of them to an agreed target, while never auto-approving an invoice that does not match its purchase order. The blast radius: the agent may route and flag freely, but it may not release payment; that stays human until the numbers earn more trust. This single decision shaped everything downstream. With payment release out of the agent's hands at launch, the worst-case outcome of any error was a misrouted or mis-flagged invoice, which a human catches in the approval queue, not a wrong payment.
Scoping the outcome this tightly also gave the eval owner a concrete target. "Handle invoices well" is untestable. "Correctly match invoice to PO, flag any discrepancy over a threshold, and never route a mismatch to auto-approve" is a spec you can write tests against.
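To make the point concrete, here is a minimal sketch of how that spec might be written down as a function you can test. The names `Route`, `TOLERANCE`, and `decide_route` are hypothetical, and the tolerance value is an assumption for illustration; the real threshold would come from finance policy.

```python
from decimal import Decimal
from enum import Enum
from typing import Optional


class Route(Enum):
    APPROVAL_QUEUE = "approval_queue"
    HUMAN_REVIEW = "human_review"


# Hypothetical tolerance, chosen only for illustration.
TOLERANCE = Decimal("0.50")


def decide_route(invoice_total: Decimal, po_total: Optional[Decimal]) -> Route:
    """Send an invoice to the approval queue only when a PO exists
    and the totals agree within the tolerance."""
    if po_total is None:
        # No matching purchase order: never auto-approve.
        return Route.HUMAN_REVIEW
    if abs(invoice_total - po_total) > TOLERANCE:
        # Discrepancy over the threshold: flag for a human.
        return Route.HUMAN_REVIEW
    return Route.APPROVAL_QUEUE


# The spec, expressed as checks a harness can run on every change.
assert decide_route(Decimal("100.00"), Decimal("100.00")) is Route.APPROVAL_QUEUE
assert decide_route(Decimal("100.01"), Decimal("100.00")) is Route.APPROVAL_QUEUE
assert decide_route(Decimal("120.00"), Decimal("100.00")) is Route.HUMAN_REVIEW
assert decide_route(Decimal("100.00"), None) is Route.HUMAN_REVIEW
```

Using `Decimal` rather than floats keeps off-by-a-cent cases exact, which matters once rounding discrepancies show up in the test set.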
Step two: decompose into tool-backed steps
The agent does not do invoice triage as one monolithic act of reasoning. It runs a sequence of steps, each backed by a tool and each independently checkable. The diagram shows the full path from inbound invoice to a routed outcome.
flowchart TD
A["Invoice arrives"] --> B["Claude extracts fields via doc tool"]
B --> C["Match to PO via MCP lookup"]
C --> D{"Amounts & items match?"}
D -->|Yes| E["Route to approval queue"]
D -->|No| F["Flag discrepancy with reason"]
F --> G["Route to human reviewer"]
E --> H["Audit log + eval sample"]
G --> H
Each box is a place we can test and contain. Extraction is backed by a document-parsing tool; matching is an MCP server that looks up the purchase order in the ERP; the match decision is the agent's reasoning over both. By splitting the work this way, a failure in extraction shows up as an extraction error in the logs, distinct from a matching error or a routing error. That separation is what makes the system debuggable instead of a black box. It also lets the eval owner test each step against its own golden cases rather than only judging the end-to-end result. That is the difference between knowing the agent works and knowing exactly where it breaks when it does.
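One way to get that per-step separation is to label every failure with the step that produced it. The sketch below is an illustration under assumptions, not the deployment's actual code: `StepError`, `run_step`, `triage`, and the three stub functions are hypothetical stand-ins for the real document tool, MCP lookup, and match check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class StepError(Exception):
    """A failure tagged with the pipeline step that raised it."""

    def __init__(self, step: str, message: str):
        super().__init__(f"[{step}] {message}")
        self.step = step


@dataclass
class Outcome:
    route: str
    reason: Optional[str] = None


def run_step(step: str, fn: Callable, *args):
    """Run one step; re-raise any failure labelled with the step name."""
    try:
        return fn(*args)
    except StepError:
        raise
    except Exception as exc:
        raise StepError(step, str(exc)) from exc


def triage(raw_invoice: dict, extract, lookup_po, check) -> Outcome:
    fields = run_step("extract", extract, raw_invoice)
    po = run_step("match", lookup_po, fields)
    reason = run_step("check", check, fields, po)
    if reason is None:
        return Outcome(route="approval_queue")
    return Outcome(route="human_review", reason=reason)


# Stub steps standing in for the real tools (hypothetical).
def extract(raw: dict) -> dict:
    return {"po_number": raw["po"], "total": raw["total"]}


def lookup_po(fields: dict) -> dict:
    return {"po_number": fields["po_number"], "total": 100}


def check(fields: dict, po: dict) -> Optional[str]:
    return None if fields["total"] == po["total"] else "total mismatch"
```

With this shape, an invoice missing its PO field fails with `step == "extract"` rather than surfacing later as a confusing matching error, and each stub can be swapped for a test double when a single step is evaluated on its own.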
Step three: build the MCP tools with tight scope
We built two MCP servers. One exposed a read-only purchase-order lookup keyed by PO number and supplier — no general queries, no writes. The other exposed a routing action with a fixed set of destinations (approval queue, human review, discrepancy hold) and nothing else. The agent could not release payment because no tool existed for it. That is the cleanest possible containment: the dangerous capability is simply absent.
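The containment idea can be sketched in plain Python. This is not wired to any MCP SDK, and every name here (`Destination`, `lookup_purchase_order`, `route_invoice`, `TOOLS`, the sample PO record) is a hypothetical illustration of the tool contracts described above.

```python
from enum import Enum
from types import MappingProxyType
from typing import Optional


class Destination(Enum):
    APPROVAL_QUEUE = "approval_queue"
    HUMAN_REVIEW = "human_review"
    DISCREPANCY_HOLD = "discrepancy_hold"


# Stand-in for the ERP, wrapped in a read-only view (sample data is made up).
_PURCHASE_ORDERS = MappingProxyType({
    ("PO-1001", "ACME"): {"total": "250.00", "currency": "USD"},
})


def lookup_purchase_order(po_number: str, supplier: str) -> Optional[dict]:
    """Read-only lookup keyed by exact PO number and supplier.
    No free-form queries and no writes."""
    record = _PURCHASE_ORDERS.get((po_number, supplier))
    return dict(record) if record is not None else None


def route_invoice(invoice_id: str, destination: str) -> dict:
    """Route to one of the fixed destinations; anything else is rejected."""
    try:
        dest = Destination(destination)
    except ValueError:
        raise ValueError(f"unknown destination: {destination!r}") from None
    return {"invoice_id": invoice_id, "destination": dest.value}


# The complete tool surface. There is deliberately no payment entry.
TOOLS = {
    "lookup_purchase_order": lookup_purchase_order,
    "route_invoice": route_invoice,
}
```

A call such as `route_invoice("INV-1", "release_payment")` raises `ValueError`, and nothing in `TOOLS` can move money, so the limit is enforced by the surface itself rather than by instructions in the prompt.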
The agent's logic lived in its prompt and an Agent Skill that encoded the company's matching rules — tolerance thresholds, which discrepancy types are auto-flag versus auto-hold, how to handle partial deliveries. Keeping those rules in a skill rather than in code meant the finance team could read and revise them without a deploy, and the agent loaded them only when working on invoices, keeping the context window lean.
Step four: build the eval harness before launch
This is where the deployment earned its trust. The eval owner assembled a set of golden invoices — real historical cases with known-correct outcomes, deliberately including the gnarly ones: partial deliveries, currency mismatches, duplicate submissions, off-by-a-cent rounding. Each golden case had an expected decision and an expected discrepancy flag.
The harness ran the agent against every golden case on every change and scored two things: did it reach the correct routing decision, and did it correctly catch (or correctly not raise) a discrepancy. The bar to ship was high on the metric that mattered most — never route a real mismatch to auto-approve — even if it meant over-flagging borderline cases early. We would rather send a clean invoice to a human than auto-approve a bad one.
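A harness along these lines might look like the sketch below. The structure (`GoldenCase`, `Report`, `evaluate`) and the zero-tolerance ship gate are assumptions made for illustration; the two stub agents exist only to show how the gate behaves.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    invoice: dict
    expected_route: str          # "approval_queue" or "human_review"
    expected_discrepancy: bool


@dataclass
class Report:
    total: int
    route_correct: int
    discrepancy_correct: int
    missed_mismatches: List[str]  # real mismatches sent to approval

    @property
    def ship(self) -> bool:
        # Assumed gate: a single missed mismatch blocks the release,
        # while over-flagging only lowers the accuracy counts.
        return not self.missed_mismatches


def evaluate(agent: Callable[[dict], Tuple[str, bool]],
             cases: List[GoldenCase]) -> Report:
    route_ok = disc_ok = 0
    missed: List[str] = []
    for case in cases:
        route, flagged = agent(case.invoice)
        route_ok += int(route == case.expected_route)
        disc_ok += int(flagged == case.expected_discrepancy)
        if case.expected_discrepancy and route == "approval_queue":
            missed.append(case.case_id)
    return Report(len(cases), route_ok, disc_ok, missed)


GOLDEN = [
    GoldenCase("clean", {"total": 100, "po_total": 100}, "approval_queue", False),
    GoldenCase("mismatch", {"total": 120, "po_total": 100}, "human_review", True),
]


def over_flagger(invoice: dict) -> Tuple[str, bool]:
    return ("human_review", True)       # cautious: flags everything


def approve_all(invoice: dict) -> Tuple[str, bool]:
    return ("approval_queue", False)    # reckless: approves everything
```

Both stubs get one of the two routing decisions right, but only `over_flagger` passes the gate: `evaluate(over_flagger, GOLDEN).ship` is `True` and `evaluate(approve_all, GOLDEN).ship` is `False`. That asymmetry is the preference stated above, expressed as a release rule.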
Step five: ship in shadow mode, then expand
The agent went live in shadow mode: it made its decision on every real invoice, but a human approved the routing before it took effect. For two weeks we collected the delta — where the agent and the human disagreed — and fed every disagreement back into the golden set. This did two things: it caught real-world cases the eval set had missed, and it built the finance team's confidence by showing them the agent's reasoning on actual invoices.
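The shadow-mode feedback loop can be sketched as follows. `ShadowRecord` and the helper names are hypothetical, and treating the human's decision as the expected outcome is a simplifying assumption; in practice each disagreement would likely be reviewed before it joins the golden set.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShadowRecord:
    invoice_id: str
    agent_route: str
    human_route: str


def disagreements(records: List[ShadowRecord]) -> List[ShadowRecord]:
    """Invoices where the agent's decision differed from the human's."""
    return [r for r in records if r.agent_route != r.human_route]


def agreement_rate(records: List[ShadowRecord]) -> float:
    if not records:
        raise ValueError("no shadow records to score")
    return 1 - len(disagreements(records)) / len(records)


def to_golden_cases(records: List[ShadowRecord]) -> List[dict]:
    """Turn each disagreement into a new golden case.
    Assumption: the human decision is taken as correct."""
    return [
        {"case_id": r.invoice_id, "expected_route": r.human_route}
        for r in disagreements(records)
    ]


shadow_log = [
    ShadowRecord("INV-1", "approval_queue", "approval_queue"),
    ShadowRecord("INV-2", "human_review", "human_review"),
    ShadowRecord("INV-3", "approval_queue", "human_review"),
    ShadowRecord("INV-4", "human_review", "human_review"),
]

print(agreement_rate(shadow_log))   # 0.75
print(to_golden_cases(shadow_log))  # one new case, for INV-3
```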
Once the agreement rate held steady and the eval suite was green, we let the agent route clean, well-matched invoices directly to the approval queue without per-invoice human confirmation, while still holding discrepancies for review. Payment release stayed human. The result was the outcome we scoped: a large share of invoices flowing through with no manual touch, every discrepancy still caught, and a full audit trail behind each decision.
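The expansion decision can also be written down as an explicit check instead of a judgment call. The sketch below is an assumption about how "held steady" might be defined: the function name, the ten-day window, and the 0.98 floor are all hypothetical values, not figures from this deployment.

```python
from typing import Sequence


def ready_to_expand(
    daily_agreement: Sequence[float],
    eval_green: bool,
    min_days: int = 10,       # hypothetical window length
    min_rate: float = 0.98,   # hypothetical agreement floor
) -> bool:
    """Expand autonomy only if the eval suite is green and every day
    in the most recent window met the agreement floor."""
    if not eval_green or len(daily_agreement) < min_days:
        return False
    window = daily_agreement[-min_days:]
    return min(window) >= min_rate
```

Requiring every day in the window to clear the floor, rather than the average, means one bad day resets the clock, which fits the cautious posture of the rest of the rollout.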
Before and after, at a glance
| Stage | Before agent | After agent (shipped) |
|---|---|---|
| Field extraction | Manual data entry | Tool-backed, agent-verified |
| PO matching | Human lookup per invoice | MCP lookup + agent reasoning |
| Discrepancy catch | Inconsistent, fatigue-prone | Rule-driven, eval-gated |
| Clean-invoice routing | Manual | Autonomous with audit log |
| Payment release | Human | Human (by design) |
Common pitfalls
- Starting with "let's use AI" instead of an outcome. Without a measurable target and a scoped blast radius, the project drifts and never ships.
- One giant prompt for the whole workflow. Monolithic reasoning is untestable and undebuggable. Decompose into tool-backed steps.
- Exposing broad tools. A general database query or a payment tool the agent does not yet need is a liability. Build the narrow tools the workflow requires and no more.
- Skipping shadow mode. Going straight to full autonomy forfeits the cheapest source of real eval cases — the disagreements between agent and human.
- Freezing the eval set. Production surfaces cases your golden set missed. Feed every disagreement back in so the harness keeps getting stronger.
Frequently asked questions
How long does a deployment like this take?
Building the agent and eval harness for a scoped workflow like invoice triage typically takes a few weeks, followed by a shadow-mode period of one to two weeks before expanding autonomy. The eval harness and shadow data, not the prompt, are what take the time, and they are what make the result trustworthy.
Why keep payment release human even after the agent is accurate?
Because the cost of a wrong payment is far higher than the cost of a human approving clean invoices. Keeping the highest-blast-radius action human at launch lets you ship value quickly while capping worst-case harm. You can revisit it later with strong evidence, not at launch on hope.
What did the Agent Skill actually contain?
The company's matching rules: tolerance thresholds, which discrepancies auto-flag versus auto-hold, and how to handle partial deliveries and currency differences. Putting these in a skill let the finance team revise the logic without a deploy and kept the agent's context window focused only on the rules relevant to invoices.
How do you know when to expand autonomy?
When the eval suite stays green on the metric that matters most and the agent's live decisions agree with human reviewers at a steady, high rate. Expand one capability at a time, watch the signals, and keep the highest-risk action gated longest.
Bringing agentic AI to your phone lines
The same problem-to-production arc powers CallSphere on voice and chat — agents that extract intent, call tools mid-conversation, and route or resolve every request, shipped behind real evals. See a working deployment at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.