Claude Managed Agent Walkthrough: Problem to Shipped
A realistic end-to-end Claude Managed Agent build on a self-hosted sandbox and MCP tunnel — from a support problem to a shipped, measured, gated outcome.
Abstract advice about agents only goes so far. The way to learn whether self-hosted Claude Managed Agents, sandboxes, and MCP tunnels are worth the effort is to follow one build all the way from a concrete problem to something running in production and earning its keep. That is what this post does: we take a single, ordinary, painful problem and build the agent that solves it, step by step, with the real decisions and tradeoffs called out as they come up.
The problem: a mid-size SaaS company drowns in "where is my data export?" support tickets. Customers request a CSV export of their account, it runs on a backend job queue, and when it is slow or fails, they open a ticket. Support staff spend hours each day checking job status, re-triggering exports, and pasting the result back. The work is repetitive, follows clear rules, and touches internal systems, which makes it a perfect first agent.
Key takeaways
- A good first agent automates a repetitive, rule-bound task that touches internal systems — not an open-ended creative job.
- The build splits cleanly into tunnel (MCP tools), sandbox (execution), instructions, and evals — in that order.
- Scope the MCP server to exactly the operations the task needs: check status, re-trigger, fetch the download link. Nothing more.
- Ship behind a human approval gate on the customer-facing reply first, then lift it for the clear-cut cases once evals and live corrections show the agent is reliable.
- Measure success by tickets auto-resolved and time-to-resolution, not by how clever the agent sounds.
Step 1 — Frame the problem as tools, not as a chatbot
The temptation is to build a "support chatbot." Resist it. The real task is a short, bounded procedure: given a ticket about a missing export, look up the customer's most recent export job, decide whether it is queued, failed, or done, and take the appropriate action. Framing it as a procedure tells you exactly which tools the MCP tunnel must expose and nothing more.
We end up with three tools: get_latest_export_job (read-only, returns status and timestamps), retry_export_job (re-triggers a failed or stalled job, idempotent), and get_export_download_url (returns a signed link for a completed job). That is the entire surface the agent needs to reach internal systems. Crucially, there is no generic "run query" or "send email" tool — the dangerous capabilities simply do not exist on this tunnel.
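One way to picture "the dangerous capabilities simply do not exist" is a dispatcher that only knows the three tool names. The sketch below is illustrative rather than the actual server: the handler bodies, the `dispatch` function, and `UnknownToolError` are hypothetical stand-ins, and a real MCP server would register its tools through whatever SDK it uses.

```python
from typing import Any, Callable, Dict


# Hypothetical stand-ins for the real handlers; each would call an internal system.
def get_latest_export_job(customer_id: str) -> Dict[str, Any]:
    return {"customer_id": customer_id, "status": "failed"}


def retry_export_job(customer_id: str) -> Dict[str, Any]:
    return {"customer_id": customer_id, "status": "queued"}


def get_export_download_url(customer_id: str) -> Dict[str, Any]:
    return {"customer_id": customer_id, "url": "https://example.invalid/signed"}


# The complete tool surface. Anything not listed here cannot be called.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_latest_export_job": get_latest_export_job,
    "retry_export_job": retry_export_job,
    "get_export_download_url": get_export_download_url,
}


class UnknownToolError(Exception):
    """Raised when a caller asks for a tool that is not on the allowlist."""


def dispatch(tool_name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Route a tool call to its handler, refusing anything off the allowlist."""
    handler = TOOLS.get(tool_name)
    if handler is None:
        raise UnknownToolError(tool_name)
    return handler(**arguments)
```

A request for something like `run_query` fails at the dispatcher, before any internal system is touched, because there is no handler to route it to.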
For the record: a Claude Managed Agent here is a Claude-driven worker that executes in a sandbox your team operates and acts on your systems only through a narrow MCP tunnel you define — which is what makes a customer-facing automation auditable and safe to ship.
Step 2 — Build and scope the MCP tunnel
The MCP server is where safety is decided. Each tool gets the minimum database privilege it needs and validates its inputs. Here is the read-only status tool, the one the agent calls most:
{
  "name": "get_latest_export_job",
  "description": "Return the most recent export job for a customer.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": { "type": "string", "pattern": "^cus_[A-Za-z0-9]{12}$" }
    },
    "required": ["customer_id"],
    "additionalProperties": false
  }
}
The customer_id pattern rejects anything that is not a well-formed customer ID before it reaches a query, and the tool runs against a connection that can only read the jobs table. retry_export_job uses a separate credential that is allowed to enqueue a job and nothing else, and it is idempotent, so a double call cannot create duplicate exports. This per-tool scoping is the difference between a contained agent and a liability.
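A minimal sketch of those two properties, server-side input validation and an idempotent retry, might look like the following. The post does not say how idempotency is implemented; keying the retry on the failed job's ID is one common approach and is an assumption here, as are the `validate_input` helper and the in-memory `ExportQueue` class.

```python
import re
from typing import Any, Dict

# Same pattern as the tool's input schema.
CUSTOMER_ID_RE = re.compile(r"^cus_[A-Za-z0-9]{12}$")


def validate_input(arguments: Dict[str, Any]) -> str:
    """Enforce the schema on the server: one required field, no extras."""
    # Mirrors "required": ["customer_id"] and "additionalProperties": false.
    if set(arguments) != {"customer_id"}:
        raise ValueError("expected exactly one field: customer_id")
    customer_id = arguments["customer_id"]
    # fullmatch is used because `$` alone would tolerate a trailing newline.
    if not isinstance(customer_id, str) or CUSTOMER_ID_RE.fullmatch(customer_id) is None:
        raise ValueError("customer_id does not match the expected pattern")
    return customer_id


class ExportQueue:
    """In-memory stand-in for the job queue (hypothetical)."""

    def __init__(self) -> None:
        self._retries: Dict[str, str] = {}  # failed job id -> retry job id
        self._counter = 0

    def retry(self, failed_job_id: str) -> str:
        """Enqueue a retry once per failed job; repeat calls return the same job."""
        if failed_job_id in self._retries:
            return self._retries[failed_job_id]
        self._counter += 1
        new_job_id = f"job_retry_{self._counter}"
        self._retries[failed_job_id] = new_job_id
        return new_job_id
```

Calling `retry` twice with the same failed job ID returns the same retry job, so a model that repeats a tool call does not create a second export.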
Step 3 — The flow, end to end
With the tunnel defined, the agent's actual run is short and deterministic in shape, even though the model decides the path.
flowchart TD
    A["New export ticket"] --> B["Agent reads ticket & customer_id"]
    B --> C["Call get_latest_export_job"]
    C --> D{"Job status?"}
    D -->|Completed| E["get_export_download_url, draft reply"]
    D -->|Failed/stalled| F["retry_export_job, draft reply"]
    D -->|Still running| G["Draft 'in progress' reply with ETA"]
    E --> H{"Approval gate"}
    F --> H
    G --> H
    H -->|Approved| I["Reply sent, ticket resolved"]
The agent runs inside a sandbox container with a 60-second timeout, a small tool-call cap, and egress locked to the MCP server only. If anything goes sideways — an unparseable ticket, a tool error — the run aborts and the ticket falls back to a human. Failure is boring, which is exactly what you want.
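The abort-and-fall-back behavior can be sketched as a small run budget. This is an in-process illustration only: in the setup described above the timeout is enforced by the sandbox container, and the cap of five tool calls, the `RunBudget` class, and the `run_ticket` function are assumptions (the post says only that the cap is small).

```python
import time
from typing import Callable, List, Sequence, Tuple


class RunAborted(Exception):
    """Raised when a run exceeds its time or tool-call budget."""


class RunBudget:
    def __init__(self, timeout_s: float = 60.0, max_tool_calls: int = 5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._deadline = clock() + timeout_s
        self._calls_left = max_tool_calls

    def charge_tool_call(self) -> None:
        """Call before every tool invocation; raises once a limit is hit."""
        if self._clock() > self._deadline:
            raise RunAborted("timeout")
        if self._calls_left <= 0:
            raise RunAborted("tool-call cap reached")
        self._calls_left -= 1


def run_ticket(steps: Sequence[Callable[[], str]],
               budget: RunBudget) -> Tuple[str, List[str]]:
    """Run tool steps under a budget; any failure hands the ticket to a human."""
    results: List[str] = []
    try:
        for step in steps:
            budget.charge_tool_call()
            results.append(step())
    except Exception as exc:  # budget exhausted or a tool raised
        return ("fallback_to_human", [str(exc)])
    return ("drafted", results)
```

Every failure mode collapses into the same outcome, `fallback_to_human`, which is what keeps failure boring: there is no partial state for the agent to improvise around.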
Step 4 — Write instructions that encode the policy
The model needs the company's actual policy, not generic helpfulness. The instruction makes the decision rules explicit: completed jobs get a download link; failed jobs get retried and the customer is told a new export is running; jobs older than a threshold without completion get escalated to a human rather than retried indefinitely. The instruction also forbids the agent from promising timelines it cannot verify through a tool.
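It can help to write the decision rules out as a plain function as well, not to replace the instruction but to serve as an unambiguous reference when reviewing it with support leads or grading runs. The sketch below is one reading of the rules above. The status strings, the six-hour threshold, and the choice to check staleness before retrying are all assumptions; the real values belong to whoever owns the export process.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    SEND_DOWNLOAD_LINK = "send_download_link"
    RETRY_AND_NOTIFY = "retry_and_notify"
    REPLY_IN_PROGRESS = "reply_in_progress"
    ESCALATE = "escalate"


# Assumed threshold: the post only says "a threshold".
STALE_AFTER = timedelta(hours=6)


def decide_action(status: str, created_at: datetime, now: datetime) -> Action:
    """Map a job's status and age to the action the policy calls for."""
    if status == "completed":
        return Action.SEND_DOWNLOAD_LINK
    # Old and still not complete: hand to a human instead of retrying again.
    if now - created_at > STALE_AFTER:
        return Action.ESCALATE
    if status == "failed":
        return Action.RETRY_AND_NOTIFY
    if status in ("queued", "running"):
        return Action.REPLY_IN_PROGRESS
    # Unknown status: out of policy, so escalate.
    return Action.ESCALATE
```

Exceptions such as very large enterprise exports or rate-limited trial accounts would show up here as extra inputs and branches, which makes it obvious when the written instruction has not covered a case.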
This is where domain knowledge lives, and it is worth iterating on with the support leads who know the edge cases — the enterprise customer whose exports are genuinely huge and slow, the trial account that should be rate-limited. Encoding those exceptions up front prevents the agent from confidently doing the wrong sensible-looking thing.
Step 5 — Evals before, approval gate during, autonomy after
Before this touches a real customer, we build an eval set from historical tickets: thirty real export tickets with known correct outcomes (this one should have been retried, this one was already done, this one needed escalation). Each agent change runs against the set and is graded. A change that drops the correct-action rate does not ship.
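The grading loop itself can be very small. The sketch below assumes each historical ticket has been labeled with a single expected action and that the agent can be wrapped as a function from ticket text to the action it took. `EvalCase`, `correct_action_rate`, and `may_ship` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    ticket_id: str
    ticket_text: str
    expected_action: str  # e.g. "retry", "send_link", "escalate"


def correct_action_rate(agent: Callable[[str], str],
                        cases: Iterable[EvalCase]) -> float:
    """Fraction of cases where the agent took the labeled correct action."""
    case_list = list(cases)
    if not case_list:
        raise ValueError("eval set is empty")
    correct = sum(1 for case in case_list
                  if agent(case.ticket_text) == case.expected_action)
    return correct / len(case_list)


def may_ship(candidate_rate: float, baseline_rate: float) -> bool:
    """A change ships only if it does not drop the correct-action rate."""
    return candidate_rate >= baseline_rate
```

With only thirty cases, a single ticket moves the rate by more than three percentage points, so it is worth reading the individual failures rather than relying on the number alone.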
We launch with the approval gate on the customer-facing reply: the agent does all the lookups and drafts the response, a human clicks send. For the first weeks, humans approve nearly everything, which both protects customers and generates more eval data. Once the auto-resolution rate is high and the corrections are rare, the gate comes off for the clear-cut cases (completed and in-progress), leaving only ambiguous escalations for human eyes.
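The decision to lift the gate can be made per path from the approval data the gate itself produces. The following is a sketch under stated assumptions: the path names, the minimum of 200 samples, and the 2% correction ceiling are illustrative numbers, not figures from this build.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class PathStats:
    approved_unchanged: int  # drafts a human sent as-is
    corrected: int           # drafts a human edited or rejected

    @property
    def total(self) -> int:
        return self.approved_unchanged + self.corrected

    @property
    def correction_rate(self) -> float:
        # With no data, treat the path as unproven rather than perfect.
        return self.corrected / self.total if self.total else 1.0


# Only the clear-cut paths are ever eligible; everything else stays gated.
ELIGIBLE_PATHS = frozenset({"completed", "in_progress"})


def autonomous_paths(stats: Dict[str, PathStats],
                     min_samples: int = 200,
                     max_correction_rate: float = 0.02) -> Set[str]:
    """Paths with enough gated history and few enough human corrections."""
    return {
        path for path, s in stats.items()
        if path in ELIGIBLE_PATHS
        and s.total >= min_samples
        and s.correction_rate <= max_correction_rate
    }


def needs_approval(path: str, autonomous: Set[str]) -> bool:
    """Any path that has not earned autonomy still goes to a human."""
    return path not in autonomous
```

Because the default is "needs approval", a new or unrecognized path is gated automatically until someone decides it is eligible and the data backs that up.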
Common pitfalls in a first agent build
- Building a general chatbot instead of a bounded procedure. Scope creep at the design stage produces a tunnel with too many tools and an agent you cannot reason about. Solve one procedure well.
- One shared database credential for all tools. Give read tools read-only connections and the retry tool an enqueue-only connection. Shared admin access erases your containment.
- Shipping without an eval set. If you cannot grade the agent against real past tickets, you cannot tell whether your next prompt tweak helped or hurt. Build the set before launch.
- Removing the approval gate too early. The gate is your safety net and your data source. Keep it until the correction rate is genuinely low, then lift it only for the unambiguous cases.
- Letting the agent invent commitments. Without explicit instruction, a helpful model will promise an ETA it cannot back with a tool. Forbid claims it cannot verify.
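The shared-credential pitfall above is one that can be checked mechanically, for example in CI. The sketch below assumes the tunnel's configuration can be expressed as a mapping from tool name to credential. The credential names and grant strings are hypothetical, and the post does not describe the credential used by get_export_download_url, so that entry is a guess.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ToolCredential:
    name: str
    grants: FrozenSet[str]


# Hypothetical credential names and grant strings, for illustration only.
TOOL_CREDENTIALS: Dict[str, ToolCredential] = {
    "get_latest_export_job": ToolCredential("jobs_reader", frozenset({"jobs:read"})),
    "retry_export_job": ToolCredential("jobs_enqueuer", frozenset({"jobs:enqueue"})),
    "get_export_download_url": ToolCredential("export_link_signer", frozenset({"exports:sign_url"})),
}

# The most each tool is allowed to hold.
ALLOWED_GRANTS: Dict[str, FrozenSet[str]] = {
    "get_latest_export_job": frozenset({"jobs:read"}),
    "retry_export_job": frozenset({"jobs:enqueue"}),
    "get_export_download_url": frozenset({"exports:sign_url"}),
}


def scoping_problems(credentials: Dict[str, ToolCredential],
                     allowed: Dict[str, FrozenSet[str]]) -> List[str]:
    """List tools whose credential is over-privileged or shared with another tool."""
    problems: List[str] = []
    first_user: Dict[str, str] = {}  # credential name -> first tool using it
    for tool, cred in sorted(credentials.items()):
        extra = cred.grants - allowed.get(tool, frozenset())
        if extra:
            problems.append(f"{tool}: excess grants {sorted(extra)}")
        if cred.name in first_user:
            problems.append(f"{tool}: shares credential {cred.name!r} with {first_user[cred.name]}")
        else:
            first_user[cred.name] = tool
    return problems
```

An empty list means every tool holds its own credential with no more than its allowed grants; anything else is a reason to fail the build before the tunnel is deployed.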
Ship your first agent in five steps
- Pick a repetitive, rule-bound task that touches internal systems and has a clear correct outcome.
- Define the minimal MCP tools for that task, each with its own least-privilege credential and validated inputs.
- Run it in a sandbox with timeouts, tool-call caps, and egress locked to the MCP server.
- Encode the real policy and exceptions in the instructions with help from the people who own the process.
- Gate the customer-facing action behind human approval, build an eval set, and only grant autonomy once the numbers earn it.
Frequently asked questions
How do I choose a good first task for a Claude agent?
Pick something repetitive, governed by clear rules, that touches internal systems and has an outcome you can grade. Data lookups, status checks, and rule-based responses are ideal. Avoid open-ended judgment calls and anything irreversible for the very first build.
How long does a build like this take?
The model and instructions are the fast part. The real time goes into scoping the MCP tunnel correctly, wiring per-tool credentials, and building the eval set from historical data. A focused team can ship a gated version in a couple of weeks and reach autonomy over the following weeks.
When is it safe to remove the human approval gate?
When your eval set and live corrections show the agent takes the right action on the clear-cut cases consistently. Lift the gate only for those unambiguous paths first; keep humans on the ambiguous escalations indefinitely.
What if the agent hits a ticket it cannot handle?
It should abort cleanly and hand off to a human, not improvise. Design the run so that any unparseable input, tool error, or out-of-policy situation falls back to the existing manual process. Boring, safe failure is the goal.
From export tickets to phone lines
The same problem-to-shipped path drives CallSphere's voice and chat agents: a bounded task, narrow tools, a sandboxed run, and evals that earn autonomy — answering every call and booking work 24/7. See a working version at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.