
Build a Claude Managed Agent: Step-by-Step Walkthrough (Managed Agents Orchestration)

A hands-on walkthrough: build a Claude Managed Agent with subagents, tools, success criteria, budgets, and idempotent writes.

Reading about outcome-driven agents is one thing; standing one up that survives contact with real input is another. This is a build log, not a brochure. We are going to take a single, concrete goal — "reconcile a vendor invoice against our purchase records and flag discrepancies" — and turn it into a working Claude Managed Agent with subagents, tools, success criteria, and a budget. By the end you will have a shape you can copy and bend to your own outcome.

I picked invoice reconciliation on purpose: it is separable (fetch records, fetch invoice, compare, summarize), it has a clear pass/fail outcome, and it punishes hand-waving. If your criteria are mushy, the agent will happily report "no discrepancies found" while missing a duplicated line item. So we will be specific.

Key takeaways

  • Start from a written outcome contract and testable success criteria before you touch any code.
  • Define tools as narrow, well-described functions; the description is part of the prompt the model reads.
  • Decompose into subagents only where the work is genuinely independent — here, "fetch invoice" and "fetch ledger" run in parallel, "compare" does not.
  • Set budgets up front so a confused run fails loudly instead of spiraling.
  • Read the trace after the first run and tighten the rubric where the verifier accepted weak output.

Step 1: write the outcome contract

Before any configuration, write down what "done" means in language a skeptical reviewer could grade. For our example: "Produce a reconciliation report listing every invoice line item, matched to its purchase-order line where one exists, with a discrepancy flag and reason for each mismatch. Pass only if every line is accounted for and totals are recomputed independently." That last clause — recompute totals independently — is what stops the agent from trusting the invoice's own arithmetic.

This contract becomes two things downstream: the orchestrator's north star and the verifier's rubric. Spend real time here; in my experience, time spent on the contract is repaid several times over in debugging.
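One way to keep the contract gradable is to write it down as a function a verifier could run. The sketch below is only an illustration of that idea, not the managed-agent runtime's own verifier: the `grade_report` helper and the field names (`line_id`, `qty`, `unit_price`, `status`, `reason`) are assumptions made for this example.

```python
from decimal import Decimal

def grade_report(invoice_lines, report_lines, report_total):
    """Return a list of rubric failures; an empty list means the report passes.

    Assumed shapes (hypothetical, for illustration only):
      invoice_lines: [{"line_id": str, "qty": int, "unit_price": str}, ...]
      report_lines:  [{"line_id": str, "status": "matched" or "flagged", "reason": str}, ...]
      report_total:  the total the report claims, as a decimal string
    """
    failures = []

    # Criterion 1: every invoice line is accounted for in the report.
    reported = {r["line_id"] for r in report_lines}
    for line in invoice_lines:
        if line["line_id"] not in reported:
            failures.append(f"line {line['line_id']} missing from report")

    # Criterion 2: every flagged line carries a reason.
    for r in report_lines:
        if r["status"] == "flagged" and not r.get("reason"):
            failures.append(f"line {r['line_id']} flagged without a reason")

    # Criterion 3: recompute the total from the lines rather than trusting
    # any total that was supplied. Decimal avoids float rounding on currency.
    recomputed = sum(
        (Decimal(l["unit_price"]) * l["qty"] for l in invoice_lines),
        Decimal("0"),
    )
    if recomputed != Decimal(report_total):
        failures.append(f"report total {report_total} != recomputed {recomputed}")

    return failures
```

Returning a list of failures rather than a bare boolean gives the orchestrator something specific to act on when the check does not pass.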

Step 2: define the tools

Tools are how the agent touches the world. Keep each one narrow and describe it like you are onboarding a new engineer. Below is a tool definition in the JSON shape Claude expects — note how the description tells the model when to use it, not just what it does.

{
  "name": "get_purchase_orders",
  "description": "Fetch purchase-order line items for a vendor in a date range. Use this to build the ground-truth ledger BEFORE comparing against an invoice. Returns one row per PO line.",
  "input_schema": {
    "type": "object",
    "properties": {
      "vendor_id": { "type": "string" },
      "start_date": { "type": "string", "format": "date" },
      "end_date":   { "type": "string", "format": "date" }
    },
    "required": ["vendor_id", "start_date", "end_date"]
  }
}

You will define a matching get_invoice_lines tool and a write_report tool in the same shape. Resist the urge to make one mega-tool that "does reconciliation": that hides the work from the model and from the trace. Small tools keep the agent legible.
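As a rough sketch of what those two companion tools might look like, here they are as Python dictionaries in the same name / description / input_schema shape as get_purchase_orders. Every parameter name (`invoice_id`, `period`, `report`) and the `lint_tool` helper are assumptions for illustration, not a schema taken from a real system.

```python
# Hypothetical companion tools; parameter names are illustrative assumptions.
GET_INVOICE_LINES = {
    "name": "get_invoice_lines",
    "description": (
        "Fetch the line items of a single vendor invoice. Use this to load "
        "the document under review. Read-only. Returns one row per invoice line."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"invoice_id": {"type": "string"}},
        "required": ["invoice_id"],
    },
}

WRITE_REPORT = {
    "name": "write_report",
    "description": (
        "Store the finished reconciliation report. Call this only after "
        "every invoice line is matched or flagged."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor_id": {"type": "string"},
            "period": {"type": "string"},
            "report": {"type": "object"},
        },
        "required": ["vendor_id", "period", "report"],
    },
}

def lint_tool(tool):
    """Cheap pre-flight check: list required fields that are not declared."""
    schema = tool["input_schema"]
    return [k for k in schema["required"] if k not in schema["properties"]]
```

Running `lint_tool` over each definition before registering it catches the easy mistake of requiring a field the schema never declares.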

Step 3: map the execution graph

Now decide what runs in parallel and what is sequential. Fetching the invoice and fetching the ledger are independent, so they fan out into two subagents. Comparison depends on both, so it waits. The diagram below is the graph the orchestrator will materialize.

flowchart TD
  A["Outcome: reconcile invoice"] --> B["Orchestrator plans graph"]
  B --> C["Subagent: fetch PO ledger"]
  B --> D["Subagent: fetch invoice lines"]
  C --> E["Subagent: match & diff lines"]
  D --> E
  E --> F["Recompute totals independently"]
  F --> G{"All lines accounted for?"}
  G -->|No| B
  G -->|Yes| H["Write reconciliation report"]

The back-edge from the verifier check (the "All lines accounted for?" node) to the orchestrator matters: if a line is unaccounted for, control returns to the orchestrator, which can re-query with a wider date range rather than failing outright. That self-correction is the whole point of an outcome-driven loop.
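The fan-out, fan-in, and back-edge can be sketched with plain asyncio. This is a toy model of the control flow only: `fetch_ledger`, `fetch_invoice`, and the doubling of the date window are stand-ins invented for this example, and a real orchestrator would delegate to subagents rather than local coroutines.

```python
import asyncio

async def fetch_ledger(vendor_id, days):
    # Stand-in for the ledger subagent (would call get_purchase_orders).
    await asyncio.sleep(0)
    # Pretend PO-2 only appears once the date window is wide enough.
    return {"PO-1", "PO-2"} if days >= 60 else {"PO-1"}

async def fetch_invoice(invoice_id):
    # Stand-in for the invoice subagent (would call get_invoice_lines).
    await asyncio.sleep(0)
    return {"PO-1", "PO-2"}

async def reconcile(vendor_id, invoice_id, max_rounds=3):
    days = 30
    unmatched = set()
    for _ in range(max_rounds):
        # Fan out: the two fetches are independent, so run them concurrently.
        ledger, invoice = await asyncio.gather(
            fetch_ledger(vendor_id, days), fetch_invoice(invoice_id)
        )
        # Fan in: comparison starts only after both results exist.
        unmatched = invoice - ledger
        if not unmatched:
            return {"status": "pass", "window_days": days}
        # Back-edge: widen the date range and try another round.
        days *= 2
    return {"status": "fail", "unmatched": sorted(unmatched)}

result = asyncio.run(reconcile("vendor-1", "inv-1"))
```

Capping the loop with `max_rounds` keeps the back-edge from turning into an unbounded retry.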

Step 4: configure the managed agent

With tools and graph in hand, the configuration is mostly declarative. You give the runtime a system instruction for the orchestrator, the subagent menu, the tool scopes, the success criteria, and the budget. A trimmed configuration looks like this:

agent:
  goal: "Reconcile vendor invoice against purchase orders."
  success_criteria:
    - "Every invoice line is matched or explicitly flagged."
    - "Totals recomputed independently match within $0.00."
  subagents:
    - name: ledger_fetcher
      tools: [get_purchase_orders]
    - name: invoice_fetcher
      tools: [get_invoice_lines]
    - name: comparator
      tools: []          # reasons over results, no external calls
  budget:
    max_tokens: 120000
    max_subagents: 4
    max_seconds: 90

The comparator deliberately has no tools — it only reasons over the structured results the fetchers produced. Restricting its tool scope removes a class of mistakes where a confused agent re-fetches data mid-comparison and double-counts.
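If you were enforcing tool scopes yourself, the check could be as small as the sketch below. The `dispatch` function, `ToolScopeError`, and the `handlers` mapping are hypothetical names for this illustration; the scopes simply mirror the subagent list in the configuration above.

```python
# Scopes mirror the subagent list in the configuration above.
TOOL_SCOPES = {
    "ledger_fetcher": {"get_purchase_orders"},
    "invoice_fetcher": {"get_invoice_lines"},
    "comparator": set(),  # reasons over results, no external calls
}

class ToolScopeError(Exception):
    """Raised when a subagent requests a tool outside its scope."""

def dispatch(subagent, tool_name, handlers, **kwargs):
    """Run a tool only if it is in the calling subagent's scope."""
    if tool_name not in TOOL_SCOPES.get(subagent, set()):
        raise ToolScopeError(f"{subagent} may not call {tool_name}")
    return handlers[tool_name](**kwargs)
```

An unknown subagent name gets an empty scope by default, so anything not explicitly listed is denied.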

Pay attention to the budget numbers, because they are doing real work. The token ceiling stops an under-specified run from grinding through your account; the subagent cap prevents the orchestrator from fanning out a dozen redundant workers when it gets confused; the time ceiling protects any synchronous caller waiting on the result. Pick numbers from a successful run plus a margin, not from a guess. On this workload a clean reconciliation finishes well inside 120k tokens, so that ceiling is a circuit breaker, not a target. If a run ever bumps the ceiling, that is a signal to inspect — usually the criteria were too loose and the agent looped trying to satisfy a check it could never pass.
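To make the circuit-breaker idea concrete, here is a minimal sketch of a budget tracker that raises as soon as any ceiling is crossed. The `Budget` class and its method names are assumptions for this example; a managed runtime would enforce its limits internally.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a run crosses any configured ceiling."""

class Budget:
    def __init__(self, max_tokens, max_subagents, max_seconds, clock=time.monotonic):
        self.max_tokens = max_tokens
        self.max_subagents = max_subagents
        self.max_seconds = max_seconds
        self.tokens_used = 0
        self.subagents_spawned = 0
        self._clock = clock          # injectable so tests need not sleep
        self._started = clock()

    def charge_tokens(self, n):
        self.tokens_used += n
        self.check()

    def spawn_subagent(self):
        self.subagents_spawned += 1
        self.check()

    def check(self):
        """Fail loudly on the first ceiling that has been crossed."""
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"tokens {self.tokens_used} > {self.max_tokens}")
        if self.subagents_spawned > self.max_subagents:
            raise BudgetExceeded(
                f"subagents {self.subagents_spawned} > {self.max_subagents}"
            )
        elapsed = self._clock() - self._started
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"elapsed {elapsed:.1f}s > {self.max_seconds}s")

# Same ceilings as the configuration above.
budget = Budget(max_tokens=120_000, max_subagents=4, max_seconds=90)
```

Raising an exception, rather than returning a flag, is what makes a confused run fail loudly: the caller cannot ignore it by accident.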

Step 5: run, read the trace, tighten

First runs are diagnostic, not final. Trigger the agent on a known invoice where you already know the answer — ideally one with a planted discrepancy — and read the trace top to bottom. You are checking three things: did the orchestrator's plan match your graph, did each subagent stay in its lane, and did the verifier actually grade against your criteria or wave it through.

On my first invoice run the verifier passed a report that silently dropped a zero-quantity line. The fix was not in code; it was in the criteria — I added "lines with zero quantity must still appear, flagged as informational." Re-run, and the verifier now catches the omission. This tighten-and-rerun loop is the actual work of building reliable agents.
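A planted-discrepancy input can be generated from a clean one. The sketch below plants a duplicated line and shows a simple detector for it; the `plant_duplicate` and `find_duplicates` helpers and the `sku` / `qty` / `unit_price` fields are hypothetical, chosen only to illustrate the testing idea.

```python
import copy

def plant_duplicate(invoice_lines, index=0):
    """Return a copy of the invoice with one line duplicated (the planted error)."""
    planted = copy.deepcopy(invoice_lines)
    planted.append(copy.deepcopy(planted[index]))
    return planted

def find_duplicates(invoice_lines):
    """Return indexes of lines whose (sku, qty, unit_price) repeats an earlier line."""
    seen, dupes = set(), []
    for i, line in enumerate(invoice_lines):
        key = (line["sku"], line["qty"], line["unit_price"])
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

clean = [
    {"sku": "A", "qty": 1, "unit_price": "9.99"},
    {"sku": "B", "qty": 0, "unit_price": "5.00"},  # zero-quantity line stays in
]
planted = plant_duplicate(clean)
```

Because you know exactly which line was planted, you can compare the agent's report against that index instead of eyeballing it.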

A second thing the trace exposes is where the orchestrator wasted effort. On run two I noticed it had spawned the comparator before the invoice fetcher returned, then idled waiting. The graph was right but the dependency was implicit; making the comparator's brief explicitly require both fetch results as inputs fixed the ordering. None of this is visible from the final report — only the trace shows you the path the agent took to get there, which is exactly why reading it is non-negotiable before you trust a run.

Step 6: add idempotency and guardrails

Before production, make the run safe to retry. The write_report tool should be idempotent — keyed on (vendor, period) so a retried run overwrites rather than duplicates. Fetch tools should be read-only. And put a human approval gate on anything that mutates a financial system; a reconciliation agent should propose a credit memo, never issue one unattended.
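A minimal sketch of the idempotent write, assuming an in-memory store standing in for wherever reports actually live. The `ReportStore` class and its return shape are invented for this example.

```python
class ReportStore:
    """In-memory stand-in for the system that write_report persists to."""

    def __init__(self):
        self._reports = {}

    def write_report(self, vendor_id, period, report):
        # The stable identity of a report is (vendor, period), so a retried
        # run lands on the same key and overwrites instead of duplicating.
        key = (vendor_id, period)
        created = key not in self._reports
        self._reports[key] = report
        return {"key": key, "created": created}

    def count(self):
        return len(self._reports)
```

Returning `created` lets the trace show whether a write was a first write or a retry, which is useful when you are debugging a run that restarted.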

| Step | Output | Skip it and… |
| --- | --- | --- |
| Contract | Gradable criteria | Verifier rubber-stamps junk |
| Tools | Narrow functions | Opaque, untraceable runs |
| Graph | Parallel vs serial map | Wasted tokens or race conditions |
| Budget | Hard ceilings | Runaway loops |
| Trace review | Tightened rubric | Silent failures in prod |

Common pitfalls

  • Configuring before contracting. If you can't write the success criteria, you can't build the agent. Write them first.
  • Mega-tools. A single "do reconciliation" tool hides logic from the trace. Split into small, described tools.
  • Parallelizing sequential work. Comparison can't start before fetching finishes; forcing it parallel just adds coordination cost.
  • No planted-discrepancy test. If your test input has no errors, you never learn whether the agent can catch one.
  • Non-idempotent writes. Retries duplicate artifacts. Key writes on a stable identity.

Ship your first managed agent in 5 steps

  1. Write the outcome contract as gradable success criteria before any config.
  2. Define small, well-described tools — never one mega-tool.
  3. Map the execution graph: which subtasks run in parallel, which wait.
  4. Configure budgets and tool scopes, then run on a planted-discrepancy test.
  5. Read the trace, tighten the rubric, and make every write idempotent before production.

Frequently asked questions

How long should the first build take?

The contract and tools are an afternoon if the data access already exists. The real time goes into the trace-review loop — expect two or three rounds of tightening criteria before the verifier is trustworthy. That iteration is the build, not overhead on top of it.

Can I start with one agent and add subagents later?

Yes, and you should. Build the single-agent version first, confirm the outcome check works, then split out subagents only where you see genuinely parallel work in the trace. Premature fan-out multiplies cost for no speed gain.

Where do I put human approval?

On any tool that mutates an external system of record. Keep fetch tools read-only and unattended; gate writes — credit memos, payments, ticket closures — behind an explicit approval step so the agent proposes and a human commits.
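The propose-then-commit split can be sketched as a small gate object. This is an assumption-laden illustration: `ApprovalGate`, its method names, and the `execute` callback are hypothetical, and a production gate would also need persistence, authentication, and an audit log.

```python
class ApprovalGate:
    """Holds proposed mutations until a human explicitly approves them."""

    def __init__(self):
        self._pending = {}
        self._next_id = 1

    def propose(self, action, payload):
        """Agent side: record a proposed mutation without executing it."""
        proposal_id = self._next_id
        self._next_id += 1
        self._pending[proposal_id] = (action, payload)
        return proposal_id

    def approve(self, proposal_id, approver, execute):
        """Human side: run the mutation only after explicit approval."""
        # pop() means a proposal can be committed at most once.
        action, payload = self._pending.pop(proposal_id)
        return execute(action, payload, approver)

    def pending(self):
        return dict(self._pending)
```

The agent only ever holds a reference to `propose`; the `approve` path is wired to whatever interface your reviewers use.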

Bringing the same build to live conversations

This is exactly how CallSphere assembles its voice and chat agents: a clear outcome, narrow tools, scoped subagents, and a verifier that confirms the caller's goal was met. Want to see an outcome-driven agent answer a real call? It is live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
