---
title: "Build a Claude Managed Agent: Step-by-Step Walkthrough"
description: "End-to-end walkthrough to ship a Claude Managed Agent: scope, system prompt, tools, the run loop, model routing, evals, and deployment you can defend."
canonical: https://callsphere.ai/blog/build-a-claude-managed-agent-step-by-step-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "claude managed agents", "tutorial", "agent sdk", "anthropic", "production"]
author: "CallSphere Team"
published: 2026-03-25T08:23:11.000Z
updated: 2026-06-06T21:47:44.428Z
---

# Build a Claude Managed Agent: Step-by-Step Walkthrough

> End-to-end walkthrough to ship a Claude Managed Agent: scope, system prompt, tools, the run loop, model routing, evals, and deployment you can defend.

Reading about agent architecture is one thing; standing up an agent that survives real traffic is another. This walkthrough is the path an engineer can actually follow on a Tuesday afternoon to take a Claude Managed Agent from an empty repo to something running against live tools. We'll build a single concrete example (an agent that triages incoming support tickets, looks up account data, and either resolves or escalates) and use it to ground every step.

The goal is not a toy. By the end you'll have the same skeleton teams use to reach production fast: a tightly scoped task, a clean system prompt, two real tools, a managed run loop, a small eval suite, and a deployment posture you can defend in a review.

## Step 1: Scope the task before you write any code

The single biggest predictor of whether an agent ships is how narrowly you scoped it on day one. Write down, in one paragraph, exactly what the agent does and — just as important — what it refuses to do. For our triage agent: *read an incoming ticket, classify urgency, look up the customer's plan and recent orders, then either draft a resolution or escalate with a reason. It does not issue refunds, change account settings, or email anyone directly.*

That refusal list is not bureaucracy; it's your guardrail spec. Every capability you don't grant is a class of failure you never have to debug. Resist the urge to make the first version do everything. A narrow agent that works beats a broad agent that's flaky, and you can always widen scope once the loop is proven.
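One way to keep the scope paragraph honest is to write it down as data that code can check. The sketch below is a minimal illustration, not part of any Claude runtime: the `AgentScope` class and the action labels are hypothetical names derived from the triage scope above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    """Written-down scope: what the agent may do and what it must refuse."""
    allowed: frozenset
    refused: frozenset

    def permits(self, action: str) -> bool:
        # Deny by default: anything not explicitly allowed is out of scope.
        return action in self.allowed and action not in self.refused


# Hypothetical action labels taken from the scope paragraph for the triage agent.
TRIAGE_SCOPE = AgentScope(
    allowed=frozenset({"classify_urgency", "get_account", "draft_resolution", "escalate"}),
    refused=frozenset({"issue_refund", "change_account_settings", "send_email"}),
)

print(TRIAGE_SCOPE.permits("escalate"))      # an allowed action
print(TRIAGE_SCOPE.permits("issue_refund"))  # an explicitly refused action
```

Because the check denies by default, an action nobody thought about (say, `delete_account`) is refused without anyone having to list it.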

## Step 2: Write the system prompt as a job description

Treat the system prompt like onboarding a sharp new hire who is fast but has no context about your company. State the role, the workflow in order, the tools available and when to use each, the output format, and the hard boundaries. Be concrete about the success criteria: "A good resolution cites the specific order or policy it relied on." Vague prompts produce vague agents.

Keep the prompt focused on behavior and decision-making, not on data that changes per ticket — that comes in through context and tools. A prompt that hardcodes today's promotions will be wrong next week; a prompt that says "check the promotions tool before quoting a discount" stays correct. The prompt is your stable policy; the tools supply the moving facts.
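As a rough sketch of the job-description structure, the prompt can be assembled from named sections so that none of them gets forgotten. The helper `build_system_prompt`, the section titles, and the output-format wording are all assumptions made for illustration; only the workflow, tools, and boundaries come from the triage scope in step 1.

```python
def build_system_prompt(role, workflow, tools, output_format, boundaries):
    """Assemble a system prompt from the sections a job description would have."""
    sections = [
        ("Role", role),
        ("Workflow", "\n".join(f"{i}. {step}" for i, step in enumerate(workflow, 1))),
        ("Tools", "\n".join(f"- {name}: {when}" for name, when in tools.items())),
        ("Output format", output_format),
        ("Hard boundaries", "\n".join(f"- {rule}" for rule in boundaries)),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


SYSTEM_PROMPT = build_system_prompt(
    role="You triage incoming support tickets for the support team.",
    workflow=[
        "Read the incoming ticket.",
        "Classify its urgency.",
        "Look up the customer's plan and recent orders.",
        "Draft a resolution, or escalate with a reason.",
    ],
    tools={
        "get_account": "call before drafting any resolution.",
        "escalate": "call when you cannot resolve the ticket within your boundaries.",
    },
    # Assumed format; the success criterion is quoted from the text above.
    output_format="A good resolution cites the specific order or policy it relied on.",
    boundaries=[
        "Do not issue refunds.",
        "Do not change account settings.",
        "Do not email anyone directly.",
    ],
)

print(SYSTEM_PROMPT)
```

Nothing ticket-specific appears in the prompt; anything that changes per ticket arrives through context and tool results.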

## Step 3: Define the tools the agent can call

Our agent needs two tools to start: `get_account(customer_id)` returning plan and order history, and `escalate(ticket_id, reason, priority)` creating a handoff. Each tool gets a JSON schema with a tight description, required fields, and enums where the values are fixed. Write the descriptions for the model, not for yourself: "Use this to fetch the customer's current plan and last five orders before drafting any resolution" tells Claude exactly when to reach for it.
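The two tools might be declared roughly as follows. This sketch uses the `name` / `description` / `input_schema` shape that Anthropic's Messages API uses for tool definitions; whether your managed runtime expects exactly this registration format is an assumption to verify. The field types, the `escalate` description, and the `priority` values other than `"high"` are also illustrative assumptions.

```python
GET_ACCOUNT_TOOL = {
    "name": "get_account",
    "description": (
        "Use this to fetch the customer's current plan and last five orders "
        "before drafting any resolution."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            # Assumed to be a string identifier.
            "customer_id": {"type": "string", "description": "ID of the customer on the ticket."},
        },
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}

ESCALATE_TOOL = {
    "name": "escalate",
    "description": (
        "Use this to hand the ticket to a human when you cannot resolve it "
        "within your boundaries. Always give a specific reason."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string"},
            "reason": {"type": "string", "description": "Why a human needs to take over."},
            # Enum keeps the value set fixed; the exact levels are an assumption.
            "priority": {"type": "string", "enum": ["low", "normal", "high"]},
        },
        "required": ["ticket_id", "reason", "priority"],
        "additionalProperties": False,
    },
}

TOOLS = [GET_ACCOUNT_TOOL, ESCALATE_TOOL]
```

The enum on `priority` means a malformed value fails schema validation instead of reaching your handler.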

```mermaid
flowchart TD
  A["Define scope & refusals"] --> B["Write system prompt"]
  B --> C["Declare tools with schemas"]
  C --> D["Wire managed run loop"]
  D --> E{"Agent action?"}
  E -->|Tool call| F["Execute tool, return result"] --> E
  E -->|Done| G["Run eval suite"]
  G -->|Pass| H["Deploy & monitor"]
  G -->|Fail| B
```

Notice the diagram's failing-eval arrow points back to the prompt, not forward to deploy. That feedback edge is the discipline that separates teams who ship reliable agents from teams who ship and then firefight. We'll close that loop in step 6.

## Step 4: Wire the managed run loop

With a managed agent, you don't hand-roll the observe-act-incorporate cycle. You register your system prompt and tools with the runtime, point it at your model, and hand it a task. The loop assembles context, lets Claude decide the next action, executes any tool call against your handlers, feeds the result back, and repeats until the task is done or a budget limit is hit.
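To make the cycle concrete, here is a toy version of what the runtime does on your behalf. It is an illustration only, not the managed runtime's API: `run_loop`, the action dictionaries, and `scripted_decide` (a stand-in for the model) are all hypothetical.

```python
def run_loop(decide, handlers, task, max_steps=12):
    """Illustrative observe-act-incorporate loop; the managed runtime owns the real one."""
    history = [{"role": "task", "content": task}]
    for step in range(1, max_steps + 1):
        action = decide(history)  # stand-in for the model choosing the next action
        if action["type"] == "done":
            return {"status": "done", "steps": step, "result": action["result"]}
        handler = handlers.get(action["tool"])
        if handler is None:
            result = {"error": f"unknown tool: {action['tool']}"}
        else:
            result = handler(**action["args"])  # execute the tool call
        # Feed the result back so the next decision can use it.
        history.append({"role": "tool", "tool": action["tool"], "content": result})
    return {"status": "budget_exceeded", "steps": max_steps, "result": None}


def scripted_decide(history):
    # Fake policy for demonstration: look up the account once, then finish.
    if len(history) == 1:
        return {"type": "tool", "tool": "get_account", "args": {"customer_id": "c_1"}}
    return {"type": "done", "result": "draft resolution"}


outcome = run_loop(
    scripted_decide,
    {"get_account": lambda customer_id: {"plan": "pro"}},
    "ticket text",
)
print(outcome)
```

The loop ends in only two ways: the decider says it is done, or the step budget runs out.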

Your real work here is the tool handlers — the functions that actually run when the agent calls `get_account`. Make them idempotent and defensive: validate the inputs, return a clear structured error rather than throwing, and never trust the agent to have passed a perfect argument. The runtime will surface your structured error back to the model, which can then correct itself. A handler that returns `{"error": "customer_id not found"}` teaches the agent to recover; a handler that crashes ends the run.
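A defensive handler along these lines might look like the sketch below. The in-memory `_ACCOUNTS` dictionary is a hypothetical stand-in for a real account service, and the returned field names are assumptions; the point is that every path returns a structured value rather than raising.

```python
# Hypothetical in-memory store standing in for a real account service.
_ACCOUNTS = {"c_1": {"plan": "pro", "orders": ["o_9", "o_8"]}}


def get_account(customer_id=None, **_ignored):
    """Return plan and recent orders, or a structured error; never raise."""
    # Validate the input instead of trusting the agent's arguments.
    if not isinstance(customer_id, str) or not customer_id.strip():
        return {"error": "customer_id must be a non-empty string"}
    try:
        account = _ACCOUNTS.get(customer_id.strip())
    except Exception as exc:  # a real backend call could fail at this point
        return {"error": f"lookup failed: {exc}"}
    if account is None:
        return {"error": "customer_id not found"}
    # Read-only lookup, so repeating the call is safe (idempotent).
    return {"plan": account["plan"], "orders": account["orders"][:5]}


print(get_account("c_1"))
print(get_account("does_not_exist"))
```

Extra keyword arguments are swallowed by `**_ignored`, so an over-eager tool call with an unexpected field still gets a usable answer instead of a `TypeError`.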

## Step 5: Choose models and set budgets

For a triage agent, route most steps through Sonnet 4.6 — it's the workhorse for competent tool-driven reasoning. Reserve Opus 4.8 for the genuinely ambiguous escalations where judgment matters, and consider Haiku 4.5 for the cheap first-pass classification of urgency. Mixing models across the loop is normal and keeps cost sane without dumbing the agent down on the hard steps.
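Routing can be as simple as a lookup table keyed by the kind of step. In this sketch the step kinds are hypothetical labels, and the tier names are placeholders rather than real model identifiers; map them to whatever identifiers your account exposes.

```python
# Placeholder tier labels, not real model IDs.
MODEL_BY_STEP = {
    "classify_urgency": "haiku",      # cheap first-pass classification
    "ambiguous_escalation": "opus",   # judgment-heavy edge cases
}
DEFAULT_MODEL = "sonnet"              # workhorse for everything else


def pick_model(step_kind):
    """Choose a model tier for a step, falling back to the default tier."""
    return MODEL_BY_STEP.get(step_kind, DEFAULT_MODEL)


for kind in ("classify_urgency", "draft_resolution", "ambiguous_escalation"):
    print(kind, "->", pick_model(kind))
```

Keeping the table in one place makes it easy to change a routing decision and re-run your evals against it.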

Then set hard budgets in the control plane: a maximum number of steps and a token ceiling per task. A triage that hasn't resolved in, say, a dozen steps is almost certainly stuck, and you want it to stop and escalate rather than spin. Budgets aren't just cost control; they're a correctness signal. An agent that hits its step limit is telling you the task was underspecified or a tool is misbehaving.
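The control plane enforces these limits for you; the sketch below only illustrates what a step-and-token budget means and how a budget stop can turn into an escalation rather than a silent failure. `Budget`, `BudgetExceeded`, `run_with_budget`, and the 50,000-token ceiling are assumptions chosen for illustration.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a task runs past its step or token limit."""


@dataclass
class Budget:
    max_steps: int = 12
    max_tokens: int = 50_000  # assumed ceiling; tune per task
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens_used):
        """Record one step and its token cost, raising once a limit is passed."""
        self.steps += 1
        self.tokens += tokens_used
        if self.steps > self.max_steps:
            raise BudgetExceeded("step limit reached")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token ceiling reached")


def run_with_budget(step_costs, budget):
    """Consume steps until done; a budget stop becomes an escalation."""
    try:
        for cost in step_costs:
            budget.charge(cost)
    except BudgetExceeded as exc:
        return {"action": "escalate", "reason": f"budget: {exc}"}
    return {"action": "resolved"}


print(run_with_budget([100] * 3, Budget()))
print(run_with_budget([100] * 13, Budget()))
```

Recording the reason for the stop is what turns the budget into a correctness signal: a cluster of step-limit escalations points at an underspecified task or a misbehaving tool.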

## Step 6: Build a small eval suite before you ship

Collect ten to twenty real tickets with known-good outcomes and turn them into test cases: given this ticket, the agent should escalate with priority "high," or should draft a resolution citing order #1234. Run the agent against them and score the results. You don't need a fancy framework to start — a script that runs each case and checks the final action against the expected one catches most regressions.
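A starting-point eval script can be this small. The case format, the `run_evals` helper, and `stub_agent` (a stand-in for a call into your real agent) are hypothetical; the check itself is the one described above, comparing the final action against the expected one.

```python
def run_evals(agent, cases):
    """Run each case through the agent and compare the final action to the expected one."""
    failures = []
    for case in cases:
        actual = agent(case["ticket"])
        # Compare only the fields the case specifies, so cases can be as strict as needed.
        mismatched = {
            key: (want, actual.get(key))
            for key, want in case["expected"].items()
            if actual.get(key) != want
        }
        if mismatched:
            failures.append({"id": case["id"], "mismatched": mismatched})
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}


def stub_agent(ticket):
    # Stand-in for the real agent; replace with a call into your runtime.
    if "outage" in ticket.lower():
        return {"action": "escalate", "priority": "high"}
    return {"action": "resolve", "cites": "order #1234"}


CASES = [
    {"id": "t1", "ticket": "Full outage since 9am",
     "expected": {"action": "escalate", "priority": "high"}},
    {"id": "t2", "ticket": "Where is my order?",
     "expected": {"action": "resolve", "cites": "order #1234"}},
]

report = run_evals(stub_agent, CASES)
print(report)
```

Each failure records the expected and actual value side by side, which is usually enough to tell whether the prompt, a tool description, or the test case itself needs to change.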

This is the failing-eval edge from the diagram in step 3. When a change to the prompt fixes one ticket but breaks two others, the eval suite tells you immediately, before your customers do. Add a new case every time you find a failure in production. Over a few weeks the suite becomes the asset that lets you change the agent confidently. It is what makes iteration faster in practice, because you stop being afraid of your own edits.

## Step 7: Deploy, log, and watch the first runs

Ship it behind the control plane with full step logging on. For the first day, read the traces of real runs end to end — every tool call, every decision, every result. You will find surprises: a tool description the model misread, an edge case in account data, an escalation reason that's too terse. Each surprise becomes either a prompt tweak or a new eval case. After the first batch of fixes, the agent stabilizes and you shift from reading every trace to sampling and alerting on anomalies.
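Once you move from reading every trace to sampling, a few cheap checks can decide which runs still deserve a human look. The trace shape below (a list of steps with `tool`, `args`, and `result`) is a hypothetical format, not the control plane's actual log schema, and the three checks are only examples of the anomalies described above.

```python
def flag_anomalies(trace, max_steps=12):
    """Return reasons a run deserves a human look; an empty list means it looks routine."""
    flags = []
    steps = trace.get("steps", [])
    if len(steps) >= max_steps:
        flags.append("hit step limit")
    # A structured error from a handler is worth a look even if the run recovered.
    if any(isinstance(s.get("result"), dict) and "error" in s["result"] for s in steps):
        flags.append("tool returned error")
    # The same call with the same arguments twice often means the agent is stuck.
    calls = [(s.get("tool"), repr(s.get("args"))) for s in steps]
    if len(calls) != len(set(calls)):
        flags.append("repeated identical tool call")
    return flags


failed_lookup = {
    "tool": "get_account",
    "args": {"customer_id": "c_1"},
    "result": {"error": "customer_id not found"},
}
print(flag_anomalies({"steps": [failed_lookup, dict(failed_lookup)]}))
```

Every flagged run is a candidate for a new eval case, which keeps the suite from step 6 growing from real traffic.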

## Frequently asked questions

### How long does it really take to build a first managed agent?

A scoped single-purpose agent like this triage example is a day or two of work to a running prototype, then a week or so of eval-driven hardening before you trust it with live traffic. The managed runtime removes the orchestration build, so most of your time goes into tool handlers, the prompt, and the eval suite rather than the loop itself.

### Do I need MCP servers from day one?

Not necessarily. You can start with directly registered tool handlers to prove the loop, then move integrations behind MCP servers as you add data sources or want to reuse tools across agents. Starting simple gets you to a working agent faster; MCP becomes valuable once you're sharing tools or swapping backends.

### What's the most common reason a first agent fails in production?

Over-broad scope and weak tool descriptions. An agent asked to do too much, with vague guidance on when to call which tool, makes confident wrong choices. Narrow the job, write tool descriptions that say exactly when to use each tool, and the failure rate drops sharply.

### How do I know when the agent is ready to ship?

When it passes your eval suite consistently and you've read enough real traces that the runs stop surprising you. If a fresh ticket still produces behavior you didn't anticipate, you're not done — add the case, fix it, and re-run. Ship when surprises become rare, not when the demo looks good.

## Bringing agentic AI to your phone lines

This same step-by-step discipline — scope tightly, declare tools, run a managed loop, gate on evals — is how CallSphere builds **voice and chat** agents that answer every call, look up data mid-conversation, and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/build-a-claude-managed-agent-step-by-step-walkthrough
