End-to-End Claude Agent Orchestration: A Real Walkthrough
A realistic Claude agent orchestration build from messy problem to shipped outcome: decomposition, MCP tools, evals, and staged rollout.
Abstract advice about orchestration only goes so far. To make the patterns concrete, this post walks a single realistic project from the moment someone names the pain to the moment the system is shipped and trusted. The scenario is one most companies recognize: a support operations team drowning in inbound vendor invoices that arrive as PDFs in a shared inbox, need to be validated against purchase orders, and then entered into a finance system. It is repetitive, error-prone, and exactly the kind of work an orchestrated Claude system handles well — once you build it correctly.
We will follow the project through decomposition, tool wiring, the first ugly prototype, the eval suite that earned trust, and a staged rollout. The point is not the invoices; it is the shape of the journey, which repeats across nearly every orchestration build.
Step 1: Frame the problem as a flow, not a feature
The team's first instinct was "build an invoice bot." That framing hides the real structure. The actual work is a sequence: read the PDF, extract the fields, match against an open purchase order, flag discrepancies, and either submit or escalate. Writing it out as a flow immediately reveals where judgment lives (discrepancy handling) and where the work is mechanical (extraction). It also reveals the irreversible step — submitting to finance — which tells you exactly where a human gate belongs.
This reframing matters because orchestration design is fundamentally about decomposition. A good decomposition gives each Claude subagent a task narrow enough to do reliably, with a clean handoff to the next. A bad one asks a single agent to do everything in one giant context, which works in the demo and falls over on the hundredth weird invoice.
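The flow described above can be sketched as a plain function, which makes the mechanical steps, the judgment step, and the human gate visible in one place. This is a minimal sketch, not the team's code: `process_invoice` and its five callable parameters are hypothetical names introduced here for illustration.

```python
from typing import Callable, Optional

def process_invoice(
    pdf_bytes: bytes,
    extract: Callable[[bytes], dict],
    match: Callable[[dict], list],
    review: Callable[[dict, list], str],
    approve: Callable[[dict, Optional[str]], bool],
    submit: Callable[[dict], None],
) -> str:
    """Run one invoice through the flow; all step functions are hypothetical."""
    fields = extract(pdf_bytes)            # mechanical: read the PDF, pull fields
    discrepancies = match(fields)          # compare against the open purchase order
    explanation = None
    if discrepancies:
        # Judgment lives here: explain the mismatch for a human reader.
        explanation = review(fields, discrepancies)
    # Human gate sits directly in front of the one irreversible step.
    if not approve(fields, explanation):
        return "escalated"
    submit(fields)                         # irreversible: write to finance
    return "submitted"
```

Passing the steps in as callables keeps each one replaceable, so a step can later be backed by its own subagent without changing the shape of the flow.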
Step 2: Decompose into subagents with clear handoffs
The team mapped the flow to an orchestrator and three focused subagents, plus a human approval gate before the one irreversible action. The diagram captures the design they shipped.
flowchart TD
A["New invoice PDF"] --> B["Orchestrator"]
B --> C["Extractor subagent: fields"]
C --> D["Matcher subagent: PO lookup via MCP"]
D --> E{"Discrepancy found?"}
E -->|No| F["Human approves submission"]
E -->|Yes| G["Reviewer subagent: explain & route"]
G --> F
F --> H["Submit to finance, log audit"]
Each subagent has exactly the tools it needs and nothing more. The extractor only reads the PDF. The matcher reaches the finance system through a Model Context Protocol server that exposes a read-only purchase-order lookup. The reviewer composes a human-readable explanation of any mismatch. The orchestrator holds the thread together but takes no irreversible action itself. This least-privilege structure means a confused extractor cannot accidentally write to finance — the capability simply is not in its context.
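One way to picture the least-privilege structure is a per-agent allowlist that decides which tool definitions are handed to each subagent at all. The sketch below is illustrative only: the tool names, descriptions, and the `tools_for` helper are assumptions, not the names the team used.

```python
# Hypothetical tool registry; names and descriptions are illustrative only.
TOOL_REGISTRY = {
    "read_pdf": "Read the text and layout of an invoice PDF",
    "po_lookup": "Read-only purchase-order lookup (served over MCP)",
    "submit_invoice": "Submit an approved invoice to finance",
}

# Each subagent is given only the tools its step needs.
# No agent lists submit_invoice: in this sketch it runs only after human approval.
ALLOWED_TOOLS = {
    "extractor": ["read_pdf"],
    "matcher": ["po_lookup"],
    "reviewer": [],       # composes explanations; needs no tools here
    "orchestrator": [],   # coordinates; takes no irreversible action
}

def tools_for(agent: str) -> dict:
    """Return the tool definitions to expose to one subagent."""
    return {name: TOOL_REGISTRY[name] for name in ALLOWED_TOOLS[agent]}
```

Because the extractor's tool set is built from its allowlist, the finance write never appears among the tools it can see, which is a stronger guarantee than instructing the agent not to use it.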
Step 3: Wire the real tools through MCP and Skills
The tools came in two layers. MCP servers connected Claude to the purchase-order database and the finance submission endpoint — the standard way to give Claude live, structured access to external systems. Agent Skills taught Claude how to use those tools well: a folder of instructions describing the company's invoice conventions, the fields that matter, the tolerance rules for quantity and price mismatches, and worked examples of tricky cases. MCP is the wiring; Skills are the training. The team learned quickly that a great Skill turned a mediocre extractor into a reliable one without touching a single line of orchestration code.
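The tolerance rules mentioned above are the kind of convention that benefits from being written down precisely. As a hedged illustration, here is what one such rule could look like as a deterministic check; the threshold values, field names, and the `find_discrepancies` function are assumptions made for this sketch, since the post does not state the real rules.

```python
from decimal import Decimal

# Assumed tolerances for illustration; the real values are not given in the post.
PRICE_TOLERANCE = Decimal("0.02")  # allow up to 2% relative difference on unit price
QTY_TOLERANCE = 0                  # quantities must match exactly

def find_discrepancies(invoice_line: dict, po_line: dict) -> list:
    """Return the names of fields that differ beyond tolerance."""
    issues = []
    if abs(invoice_line["quantity"] - po_line["quantity"]) > QTY_TOLERANCE:
        issues.append("quantity")
    po_price = po_line["unit_price"]
    price_diff = abs(invoice_line["unit_price"] - po_price)
    if price_diff > po_price * PRICE_TOLERANCE:
        issues.append("unit_price")
    return issues

po = {"quantity": 10, "unit_price": Decimal("10.00")}
within = {"quantity": 10, "unit_price": Decimal("10.10")}   # 1% over: tolerated
outside = {"quantity": 9, "unit_price": Decimal("10.50")}   # both fields off
print(find_discrepancies(within, po))   # []
print(find_discrepancies(outside, po))  # ['quantity', 'unit_price']
```

Whether a rule like this lives in a Skill's written instructions or in a small helper the agent calls is a design choice; a deterministic version is also handy for checking the agent's answers in an eval.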
Step 4: The ugly first prototype
The first working version was, predictably, rough. It extracted fields well on clean invoices and fell apart on scanned ones with rotated pages and merged line items. Rather than over-engineering, the team did the right thing: they ran fifty real historical invoices through it and read every transcript. The failures clustered, and most traced back to a handful of vendor-specific quirks. They encoded those quirks into the extractor's Skill as examples, and accuracy jumped. This read-the-transcripts loop is the single most productive activity in any orchestration build, and it is where an initial four-to-six-week spike earns its keep.
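Seeing that failures cluster is easier with a simple tally. The sketch below assumes each failed transcript has been hand-labeled with a vendor and a failure type while reading; the vendors, labels, and the `cluster_failures` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical notes taken while reading transcripts; all values are invented.
failures = [
    {"vendor": "Acme", "failure": "rotated_page"},
    {"vendor": "Acme", "failure": "rotated_page"},
    {"vendor": "Globex", "failure": "merged_line_items"},
    {"vendor": "Acme", "failure": "rotated_page"},
    {"vendor": "Globex", "failure": "merged_line_items"},
    {"vendor": "Initech", "failure": "missing_po_number"},
]

def cluster_failures(notes: list) -> list:
    """Count failures per (vendor, failure type), most frequent first."""
    counts = Counter((n["vendor"], n["failure"]) for n in notes)
    return counts.most_common()

for (vendor, failure), count in cluster_failures(failures):
    print(f"{count:>3}  {vendor}: {failure}")
```

The top few rows of a tally like this are natural candidates for worked examples in the extractor's Skill.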
Step 5: The eval suite that earned trust
Nobody let this system touch finance on the strength of a good demo. The team built an eval suite: a fixed set of labeled tasks with known-correct outcomes, used to grade an agent's quality and gate releases. Theirs held about a hundred labeled invoices spanning the easy, the weird, and the genuinely ambiguous. Every change to a prompt or Skill ran against the suite, and the release gate had two parts: a target accuracy on extraction and a near-zero false-submit rate on the irreversible step. When a Skill edit improved clean invoices but regressed scanned ones, the eval caught it the same day. Without it, the team would have been flying blind on every change.
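A release gate of this kind reduces to two numbers and two thresholds. The following is a minimal sketch under stated assumptions: the `EvalResult` fields, the threshold values, and the definition of false-submit rate (submissions proposed on cases labeled "escalate", divided by all such cases) are choices made here, not details from the team's suite.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    fields_correct: bool   # extracted fields matched the label
    expected_action: str   # labeled outcome: "submit" or "escalate"
    actual_action: str     # what the system proposed

# Assumed thresholds; the post says only "a target accuracy" and "near-zero".
MIN_EXTRACTION_ACCURACY = 0.95
MAX_FALSE_SUBMIT_RATE = 0.0

def score(results: list) -> tuple:
    """Return (extraction accuracy, false-submit rate) for a non-empty run."""
    accuracy = sum(r.fields_correct for r in results) / len(results)
    should_escalate = [r for r in results if r.expected_action == "escalate"]
    false_submits = sum(r.actual_action == "submit" for r in should_escalate)
    rate = false_submits / len(should_escalate) if should_escalate else 0.0
    return accuracy, rate

def passes_gate(results: list) -> bool:
    """A change ships only if both metrics clear their thresholds."""
    accuracy, false_submit_rate = score(results)
    return (accuracy >= MIN_EXTRACTION_ACCURACY
            and false_submit_rate <= MAX_FALSE_SUBMIT_RATE)
```

Keeping the two metrics separate matters: a change can raise extraction accuracy while introducing a single false submit, and the gate should still block it.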
Step 6: Staged rollout from shadow to autonomy
Rollout was deliberately gradual. First, shadow mode: the system processed real invoices but submitted nothing, and a human compared its proposals to what was actually done. Then suggestion mode: it pre-filled submissions for human approval, which is where it lived for a few weeks while trust accumulated. Only the clean, high-confidence, no-discrepancy invoices eventually graduated to auto-submission, with every other case still routing to a human. The blast radius stayed small the entire way because autonomy was earned, invoice category by invoice category, against the eval numbers — never granted on optimism.
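The staged rollout can be expressed as a small routing decision keyed on invoice category. This is a sketch under assumptions: the `Mode` names mirror the stages described above, but the category names, the confidence threshold, and the `route` function are hypothetical.

```python
from enum import Enum

class Mode(Enum):
    SHADOW = "shadow"    # process, submit nothing, compare with what humans did
    SUGGEST = "suggest"  # pre-fill the submission for human approval
    AUTO = "auto"        # eligible for auto-submission when the case is clean

# Assumed threshold; the post says only "high-confidence".
AUTO_SUBMIT_MIN_CONFIDENCE = 0.98

def route(category: str, confidence: float, has_discrepancy: bool,
          rollout: dict) -> str:
    """Decide what happens to one invoice given its category's rollout stage."""
    mode = rollout.get(category, Mode.SHADOW)  # unknown categories start in shadow
    if mode is Mode.SHADOW:
        return "log_only"
    clean = not has_discrepancy and confidence >= AUTO_SUBMIT_MIN_CONFIDENCE
    if mode is Mode.AUTO and clean:
        return "auto_submit"
    return "human_approval"  # every other case still goes to a human

# Hypothetical per-category stages, promoted one category at a time.
rollout = {"office_supplies": Mode.AUTO, "contract_services": Mode.SUGGEST}
```

Defaulting unknown categories to shadow mode keeps the blast radius small: a new kind of invoice gains autonomy only after someone deliberately promotes it.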
What shipped, and what it cost
The shipped system handled the bulk of routine invoices end to end, escalated the genuinely ambiguous ones with a clear explanation, and kept a human firmly in front of the one irreversible action. The multi-agent design cost more tokens than a single prompt would have, but the isolation it bought — a reviewer that did not share the extractor's assumptions, tools scoped per agent — was worth it for a workflow touching money. The lesson generalizes: decompose into narrow tasks, wire real tools through MCP and Skills, gate on evals, and earn autonomy in stages.
Frequently asked questions
Why use multiple subagents instead of one big prompt?
Narrow tasks are more reliable, and least-privilege tooling means a confused agent literally cannot perform actions outside its scope. The independence also enables verification: a fresh reviewer subagent catches mistakes a single context would carry forward.
When in the project should the eval suite be built?
As early as you have a few dozen real examples. The suite is what lets you change prompts and Skills with confidence; building it late means every improvement before then was an unmeasured guess.
How long does a build like this take?
A focused team reaches a trustworthy system in a handful of weeks, with most of that time spent reading transcripts and tightening the eval suite rather than writing orchestration code. The plumbing is fast; earning trust is the slow part.
What stops the system from submitting a bad invoice?
The irreversible submission step is gated by a human until a category of invoices proves itself against a near-zero false-submit eval metric. Autonomy expands only where the data shows it is safe.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.