---
title: "A Claude Managed Agent Walkthrough, Problem to Shipped"
description: "Follow one real problem from messy spec to a shipped Claude multi-agent system — orchestration, evals, shadow-running, and the actual outcome."
canonical: https://callsphere.ai/blog/a-claude-managed-agent-walkthrough-problem-to-shipped
category: "Agentic AI"
tags: ["agentic ai", "claude", "managed agents", "use case", "multi-agent", "orchestration", "automation"]
author: "CallSphere Team"
published: 2026-04-05T17:46:22.000Z
updated: 2026-06-07T01:28:23.112Z
---

# A Claude Managed Agent Walkthrough, Problem to Shipped

> Follow one real problem from messy spec to a shipped Claude multi-agent system — orchestration, evals, shadow-running, and the actual outcome.

Abstract advice about agents is cheap. What teams actually want is to watch one go from a vague Slack complaint to a thing running in production, with every real decision visible. So this post does exactly that. We take one ordinary operational pain — invoice exceptions piling up faster than the finance team can clear them — and walk it all the way to a shipped Claude Managed Agent that handles the routine cases and escalates the rest. No fictional metrics; just the moves you'd actually make.

## Key takeaways

- A real agent project starts with a **boring, well-bounded, high-volume task** — not the most impressive one.
- The work splits cleanly into **classify, gather, decide, act** — and only some of those steps should be autonomous at first.
- You earn autonomy by **shadow-running** the agent against decisions humans already made and comparing.
- Orchestration here is modest: one classifier subagent, one enrichment subagent, one orchestrator — **not a swarm**.
- The shipped win isn't "AI does finance" — it's **routine exceptions cleared automatically, hard ones escalated with context attached**.

## Step 1 — Pin down the actual problem

The complaint was "invoice exceptions are killing us." Useless as a spec. So we sat with the finance analyst for an hour and watched. An exception is any incoming vendor invoice that doesn't auto-match the purchase order: wrong amount, missing PO number, duplicate, tax mismatch, or a vendor not in the system. About 70% of exceptions, it turned out, were three boring categories — small amount variances under a tolerance, missing PO numbers that were actually present in the email body, and obvious duplicates. The other 30% genuinely needed a human.

That observation is the whole project. The agent's goal isn't "handle invoices." It's "clear the 70% that follow known rules, and hand the 30% to a human with the analysis already done." A tight, measurable outcome on a high-volume, low-variance task is the ideal first agent.
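
One way to make that hour of observation repeatable is to tally a sample of past exceptions by category and check how much volume the rule-bound ones cover. The sketch below assumes each logged exception already carries a hand-assigned category; the `ExceptionRecord` shape, the label strings, and both function names are illustrative, not part of any accounting system's API.

```typescript
// Hypothetical labels for the exception categories described above.
type Category =
  | "small_variance"
  | "po_in_email"
  | "duplicate"
  | "over_tolerance"
  | "new_vendor"
  | "tax_mismatch";

// Hypothetical shape for one logged exception.
interface ExceptionRecord {
  invoiceId: string;
  category: Category;
}

// Count how many exceptions fall into each category.
function countByCategory(records: ExceptionRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    counts[r.category] = (counts[r.category] ?? 0) + 1;
  }
  return counts;
}

// Fraction of all exceptions covered by the categories treated as routine.
function routineShare(records: ExceptionRecord[], routine: Category[]): number {
  if (records.length === 0) return 0;
  const covered = records.filter((r) => routine.indexOf(r.category) !== -1).length;
  return covered / records.length;
}
```

If the share covered by a handful of rule-bound categories is large, you have a candidate first agent; if volume is spread thinly across many categories, the task is probably a poor starting point.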

## Step 2 — Decompose into a small agent topology

The temptation is to build one giant agent or, worse, a ten-agent swarm. The right answer here is three roles. An **orchestrator** receives the exception and owns the outcome. A **classifier subagent** reads the invoice and PO data and labels the exception type with a confidence score. An **enrichment subagent** goes and fetches whatever context a human would gather — the matching PO, prior invoices from this vendor, the email thread. Each is a Claude Managed Agent with scoped, read-mostly tools via MCP connectors to the accounting system and mailbox.

```mermaid
flowchart TD
  A["Invoice exception arrives"] --> B["Orchestrator agent"]
  B --> C["Classifier subagent: label + confidence"]
  C --> D{"Confidence high & routine type?"}
  D -->|Yes| E["Enrichment subagent gathers context"]
  E --> F{"Resolution within auto-rules?"}
  F -->|Yes| G["Propose fix, apply after guardrail"]
  F -->|No| H["Escalate with analysis attached"]
  D -->|No| H
```

This is deliberately not a swarm. Multi-agent runs cost several times the tokens of a single agent and add coordination risk, so we split the work only where the problem demands it: classification and enrichment are distinct jobs with different tools, and that's the only reason to separate them.
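
The flowchart reduces to a small routing function. Here is a minimal sketch, assuming the classifier returns a type plus a confidence between 0 and 1 and the enrichment step reports whether the gathered evidence satisfies the auto-resolve rule. Every name here (`route`, `Classification`, `MIN_CONFIDENCE`, the type labels) is invented for illustration and is not a Claude or MCP API. Real subagent calls would be asynchronous; the callback is synchronous only to keep the sketch short.

```typescript
// Illustrative types; none of these names come from a real SDK.
type ExceptionType =
  | "small_variance"
  | "po_in_email"
  | "duplicate"
  | "over_tolerance"
  | "new_vendor";

interface Classification {
  type: ExceptionType;
  confidence: number; // 0..1, as reported by the classifier subagent
}

interface Context {
  withinAutoRules: boolean; // did the gathered evidence satisfy the auto-resolve rule?
  notes: string[]; // what the enrichment subagent found
}

type Outcome =
  | { kind: "propose_fix"; type: ExceptionType; notes: string[] }
  | { kind: "escalate"; reason: string; classification: Classification; notes: string[] };

const ROUTINE: ExceptionType[] = ["small_variance", "po_in_email", "duplicate"];
const MIN_CONFIDENCE = 0.9; // assumed starting value, to be tuned against evals

function route(c: Classification, enrich: () => Context): Outcome {
  const routine = ROUTINE.indexOf(c.type) !== -1;
  if (!routine || c.confidence < MIN_CONFIDENCE) {
    // Escalate immediately; enrichment is never invoked on this path.
    return {
      kind: "escalate",
      reason: "low confidence or non-routine type",
      classification: c,
      notes: [],
    };
  }
  const ctx = enrich(); // runs only for confident, routine cases
  if (!ctx.withinAutoRules) {
    return {
      kind: "escalate",
      reason: "outside auto-resolve rules",
      classification: c,
      notes: ctx.notes,
    };
  }
  // The proposal still has to pass the apply guardrail downstream.
  return { kind: "propose_fix", type: c.type, notes: ctx.notes };
}
```

Passing enrichment in as a callback makes the token saving visible in the code: on the early-escalation branch it is simply never called.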

## Step 3 — Write the evals before trusting anything

Finance is a domain where being confidently wrong is expensive, so the agent earns trust empirically. We pulled six months of resolved exceptions — inputs and the resolution a human chose — into a fixture set. The eval grades the classifier against the human's category and the proposed resolution against the human's action.

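```typescript
import { isDeepStrictEqual } from "node:util";
// loadResolvedExceptions and runManagedAgent are this project's own helpers (not shown).

const cases = loadResolvedExceptions("fixtures/2025-H2/*.json");
let autoCount = 0, correct = 0, overEscalated = 0;
for (const c of cases) {
  const out = await runManagedAgent("invoice-orchestrator", { input: c.invoice });
  if (out.action.type === "escalate") {
    // never penalize a safe escalation, but track the ones a human resolved directly
    if (c.humanAction.type !== "escalate") overEscalated++;
    continue;
  }
  // for auto-resolved cases, require an exact match to the human action
  autoCount++;
  if (isDeepStrictEqual(out.action, c.humanAction)) correct++;
}
// autoAccuracy is null when the agent never acted on its own
console.log({
  autoAccuracy: autoCount > 0 ? correct / autoCount : null,
  overEscalation: overEscalated,
});
```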

The two numbers that matter: when the agent *acts*, how often does it match what the human did (must be very high), and how often does it escalate something it could have handled (the cost of caution). We tune the confidence threshold until auto-resolutions are essentially always right, accepting more escalation than strictly necessary at launch. Over-caution is recoverable; a wrong auto-applied fix to a ledger is not.
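
Tuning the threshold can be done offline against the same fixtures. The sketch below assumes each graded case records the classifier's confidence and whether the agent's proposed action would have matched the human's; `GradedCase`, `sweep`, and `pickThreshold` are hypothetical names, and the target accuracy is whatever bar your team sets.

```typescript
interface GradedCase {
  confidence: number; // classifier confidence for this case
  matchesHuman: boolean; // would the proposed action equal the human's?
}

interface SweepRow {
  threshold: number;
  acted: number; // cases at or above the threshold
  autoAccuracy: number | null; // null when nothing was acted on
  overEscalated: number; // held back even though the proposal matched
}

// Evaluate each candidate threshold against the graded cases.
function sweep(cases: GradedCase[], thresholds: number[]): SweepRow[] {
  return thresholds.map((threshold) => {
    const acted = cases.filter((c) => c.confidence >= threshold);
    const held = cases.filter((c) => c.confidence < threshold);
    const correct = acted.filter((c) => c.matchesHuman).length;
    return {
      threshold,
      acted: acted.length,
      autoAccuracy: acted.length > 0 ? correct / acted.length : null,
      overEscalated: held.filter((c) => c.matchesHuman).length,
    };
  });
}

// Lowest threshold whose auto-accuracy meets the target, or null if none does.
function pickThreshold(rows: SweepRow[], target: number): number | null {
  const ok = rows
    .filter((r) => r.autoAccuracy !== null && r.autoAccuracy >= target)
    .sort((a, b) => a.threshold - b.threshold);
  return ok.length > 0 ? ok[0].threshold : null;
}
```

Choosing the lowest threshold that still meets the accuracy target keeps over-escalation as small as the safety bar allows. A `null` result means no candidate is safe enough yet, which is a signal to fix the classifier rather than lower the bar.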

## Step 4 — Shadow-run, then grant narrow autonomy

Before the agent touches anything, it runs in shadow mode for two weeks: it produces a proposed action for every live exception, but a human still resolves them, and we diff agent-vs-human nightly. When the agent's proposals on routine types matched humans consistently, we flipped on autonomy *only* for those types, behind the propose-then-apply guardrail — small variances and recoverable missing-PO fixes apply automatically; anything touching amounts above tolerance or new vendors still escalates.
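
The nightly diff can be as simple as an agreement rate per exception type. This is a sketch under assumptions: actions are compared as pre-serialized strings, and the minimum case count and agreement bar are values you would choose yourself. `ShadowRow`, `agreementByType`, and `eligibleTypes` are illustrative names.

```typescript
interface ShadowRow {
  type: string; // exception type assigned by the classifier
  agentAction: string; // serialized action the agent proposed
  humanAction: string; // serialized action the analyst actually took
}

interface Agreement {
  total: number;
  matched: number;
  rate: number; // matched / total
}

// Group shadow-mode rows by type and compute how often agent and human agreed.
function agreementByType(rows: ShadowRow[]): Record<string, Agreement> {
  const out: Record<string, Agreement> = {};
  for (const r of rows) {
    const a = out[r.type] ?? { total: 0, matched: 0, rate: 0 };
    a.total += 1;
    if (r.agentAction === r.humanAction) a.matched += 1;
    a.rate = a.matched / a.total;
    out[r.type] = a;
  }
  return out;
}

// Types with enough volume and agreement at or above the bar.
function eligibleTypes(
  stats: Record<string, Agreement>,
  minCases: number,
  minRate: number
): string[] {
  return Object.keys(stats)
    .filter((t) => stats[t].total >= minCases && stats[t].rate >= minRate)
    .sort();
}
```

The minimum case count matters as much as the rate: a type seen only a handful of times in two weeks has not given you enough evidence to automate, however perfect the agreement looks.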

## Step 5 — The shipped outcome

What went live is unspectacular and exactly right. Routine exceptions in the three known categories clear automatically, with an audit record of what the agent saw and did. The remaining cases land in the analyst's queue with the classification, the matching PO, vendor history, and the relevant email already attached — so the human spends their time deciding, not gathering. The analyst stopped drowning in the boring 70% and got their attention back for the 30% that needs judgment. That's the whole win, and it's a real one.

A few things about the production version are worth naming because they're easy to skip and expensive to omit. Every auto-resolution writes a structured record — the invoice, the classification and its confidence, the rule that fired, and the exact action taken — so any decision can be replayed and reversed. The orchestrator carries strict harness-level caps on tokens and tool calls per exception, so a malformed invoice can't send a subagent into a retry spiral. And the autonomy boundary is a single config value, not logic buried in a prompt: when the team gains confidence and wants to fold a fourth routine category into the auto-resolve set, they add it to the allowlist and re-run the eval suite rather than reopening the codebase. That last property is what turns a one-off win into a system that keeps absorbing more of the toil over time.
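
Those three properties fit in very little code. The sketch below is one possible shape, not the production implementation: `AutonomyConfig`, `checkGate`, `makeAuditRecord`, and every field name are assumptions made for illustration, and a real harness would enforce the caps during the run rather than checking them afterwards.

```typescript
// All names below are illustrative; they are not a real SDK's types.
interface AutonomyConfig {
  autoResolveTypes: string[]; // the allowlist: the one place autonomy is widened
  maxToolCalls: number; // per-exception cap
  maxTokens: number; // per-exception cap
}

interface Usage {
  toolCalls: number;
  tokens: number;
}

interface AuditRecord {
  invoiceId: string;
  type: string;
  confidence: number;
  rule: string; // which auto-resolve rule fired
  action: string; // the exact action taken, serialized
  at: string; // ISO 8601 timestamp
}

type Gate = { allowed: true } | { allowed: false; reason: string };

// Allow an automatic action only for allowlisted types that stayed within budget.
function checkGate(cfg: AutonomyConfig, type: string, used: Usage): Gate {
  if (cfg.autoResolveTypes.indexOf(type) === -1) {
    return { allowed: false, reason: "type not on allowlist" };
  }
  if (used.toolCalls > cfg.maxToolCalls || used.tokens > cfg.maxTokens) {
    return { allowed: false, reason: "budget exceeded" };
  }
  return { allowed: true };
}

// Build the structured record written for every auto-resolution.
function makeAuditRecord(
  invoiceId: string,
  type: string,
  confidence: number,
  rule: string,
  action: string,
  now: Date
): AuditRecord {
  return { invoiceId, type, confidence, rule, action, at: now.toISOString() };
}
```

Folding in a fourth category then means changing data, for example `autoResolveTypes: cfg.autoResolveTypes.concat("tax_rounding")` (a made-up category name), and re-running the evals before the new config ships.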

## What to autonomize versus escalate

| Exception type | Disposition | Why |
| --- | --- | --- |
| Amount variance under tolerance | Auto-resolve | Deterministic rule, recoverable |
| PO number present in email body | Auto-resolve | Extraction, verifiable against PO |
| Obvious duplicate | Auto-flag, hold | Safe to hold, cheap to confirm |
| Amount over tolerance / new vendor | Escalate with context | Judgment + financial risk |
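
The first two rows are labeled deterministic and verifiable, and that is worth taking literally: both can be plain functions that the agent's proposal is checked against. The sketch below is an assumption-heavy illustration. The PO format (`PO-` followed by 4 to 8 digits), the inclusive tolerance comparison, and the function names are all made up for the example.

```typescript
// Amounts are integer cents to avoid floating-point drift in money math.
// Treating a variance exactly equal to the tolerance as acceptable is an assumption.
function withinTolerance(
  invoiceCents: number,
  poCents: number,
  toleranceCents: number
): boolean {
  return Math.abs(invoiceCents - poCents) <= toleranceCents;
}

// Assumed PO format for illustration: "PO-" followed by 4 to 8 digits.
const PO_PATTERN = /\bPO-\d{4,8}\b/g;

// Return a PO number only if exactly one distinct candidate in the email
// matches a known open PO; anything ambiguous returns null.
function findPoInEmail(body: string, openPos: string[]): string | null {
  const candidates: string[] = body.match(PO_PATTERN) ?? [];
  const distinct = candidates.filter((c, i) => candidates.indexOf(c) === i);
  const verified = distinct.filter((c) => openPos.indexOf(c) !== -1);
  return verified.length === 1 ? verified[0] : null;
}
```

Returning `null` on ambiguity keeps the rule on the safe side of the table: two plausible PO numbers in one thread is a judgment call, so it goes to a human.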

## Common pitfalls in an end-to-end build

- **Picking the flashiest task first.** Start with high-volume and low-variance. You want a clean win that builds trust, not a moonshot that stalls.
- **Over-orchestrating.** Three focused agents beat a ten-agent swarm here. Add another agent only where the work is genuinely independent and tool-distinct.
- **Granting autonomy before shadow-running.** Without comparing the agent to decisions humans already made, you're guessing at its reliability.
- **Optimizing away escalations too early.** At launch, over-caution is cheap and a wrong auto-action is expensive. Tighten the threshold gradually as data accrues.
- **Escalating with no context.** If the human still has to gather everything, you saved nothing. The agent's job on hard cases is to do the legwork, then hand off.

## Replicate this in seven steps

1. Sit with the operator and find the high-volume, rule-bound subset of the task.
2. Define the outcome as "auto-clear the routine, escalate the rest with analysis."
3. Decompose into the smallest agent topology that fits — often orchestrator plus one or two subagents.
4. Build an eval set from real resolved cases; measure auto-accuracy and over-escalation.
5. Shadow-run for a couple of weeks and diff against humans daily.
6. Grant autonomy only to high-confidence routine types, behind a propose-then-apply guardrail.
7. Ship, log a replayable audit trail, and tighten thresholds as data grows.

## Frequently asked questions

### How long does a build like this take?

The agent logic is days of work; the shadow-run and trust-building are the long pole, usually a couple of weeks. The eval set is the artifact that lets you compress everything else safely.

### Why not let one agent do classification and enrichment?

You could, but separating them keeps each agent's tools and prompts focused, makes the classifier independently testable, and lets enrichment run only when classification clears the bar — saving tokens on cases that escalate immediately.

### What if the agent is wrong on an auto-resolved case?

The audit trail makes it reversible and diagnosable, and the guardrail keeps auto-actions to recoverable, sub-tolerance changes. You set the autonomy boundary exactly where mistakes are cheap to undo.

### Does this generalize beyond finance?

Yes — support triage, lead qualification, claims intake, and content moderation all share the shape: a high-volume stream, a routine majority that follows rules, and a minority needing human judgment. The walkthrough transfers almost directly.

## The same pattern, on the phone

CallSphere runs this exact classify-gather-decide-act loop over **voice and chat** — agents that handle the routine calls automatically and escalate the rest with full context, around the clock. See it working at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*
