Build a Production MCP Agent with Claude: A Walkthrough
A hands-on walkthrough: define tools, build an MCP server, wire the Agent SDK loop, add idempotency, eval, and go live in stages with Claude.
Plenty of writing explains why MCP agents matter. Far less shows you, step by step, how to stand one up against a system that has real consequences. This walkthrough does that. We will build a support agent that looks up and updates customer subscriptions through an MCP server backed by a production-shaped API, using Claude as the reasoning engine. By the end you will have a working mental blueprint you can adapt: an MCP server, a typed tool surface, an agent loop, guardrails, and a careful path to going live.
We will keep the example concrete but provider-honest. The model is Claude Sonnet 4.6 for the loop and Opus 4.8 reserved for ambiguous escalations. The transport is MCP, the open standard Anthropic introduced in November 2024 for connecting agents to external tools and data. The harness is the Claude Agent SDK. Let us build.
Step 1: Define the tool surface before writing code
Resist the urge to expose your whole API. The agent's tool surface is a product decision, and a narrow one ages well. For our subscription agent, three tools are enough: get_subscription(customer_id), pause_subscription(subscription_id, until), and cancel_subscription(subscription_id, reason). Each gets a precise JSON schema with required fields, types, and tight descriptions. The description is not documentation for humans; it is the only thing Claude reads to decide when and how to call the tool, so write it like an instruction.
Mark the dangerous tools. get_subscription is read-only and safe to call freely. pause and cancel mutate state and deserve confirmation. Encoding this distinction now — in metadata, not in prose — lets the harness enforce it later without parsing intent. A good rule: every tool declares whether it reads or writes, and writes are gated by default.
Step 2: Stand up the MCP server
The MCP server is a small process that exposes those three tools and translates each call into a real API request. It holds the credentials — an API token scoped to subscription operations only — so the model never sees them. Inside each handler you do three things: validate inputs again (never trust that schema validation upstream caught everything), call the backing system, and return a compact structured result the model can reason over. Return errors as data, not exceptions: { "ok": false, "reason": "subscription_already_canceled" } teaches the model to recover, whereas a stack trace teaches it nothing.
flowchart TD
A["Define narrow tool surface"] --> B["Build MCP server with scoped creds"]
B --> C["Wire server into Agent SDK harness"]
C --> D["Add write-gate + idempotency keys"]
D --> E{"Eval suite green?"}
E -->|No| F["Fix prompt/tools, re-run"]
F --> E
E -->|Yes| G["Shadow mode against prod"]
G --> H["Enable writes for cohort, monitor"]
Make every write idempotent. Generate an idempotency key per logical action in the harness and pass it through so a retried cancel_subscription never cancels twice. This single discipline removes an entire class of production incidents that agents are prone to, because agents retry, loop, and occasionally re-issue calls when a turn is ambiguous.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3: Wire the harness and the agent loop
Now connect the server to a Claude Agent SDK harness. The harness registers the server, pulls its tool schemas at startup, and runs the loop: send the system prompt plus tools plus the user message, receive a tool call, dispatch it, append the result, repeat until Claude produces a final reply. The SDK handles the mechanics; your job is the policy around them.
Three policies earn their keep immediately. First, a turn limit — cap the loop at, say, twelve iterations so a confused agent fails fast instead of spinning. Second, a write gate — before dispatching any tool marked as a write, check whether confirmation is required and, if so, pause and surface the proposed action to a human or a confirmation step. Third, structured logging of every tool call and result, so when something goes wrong you can replay exactly what the model saw and did.
Step 4: Write the system prompt as an operating manual
The system prompt is where you set the agent's job, its boundaries, and its escalation rules. Be explicit about what it must never do: "Never cancel a subscription without an explicit cancellation request from the customer. If the customer is merely frustrated, propose a pause and ask." Tell it how to handle missing data: "If get_subscription returns not_found, ask the customer to confirm the account email rather than guessing." Concrete rules beat vague exhortations to "be careful."
Keep the prompt focused on behavior and let the tool schemas carry the mechanics. A common failure is duplicating tool documentation in the prompt, which bloats context and drifts out of sync with the actual schemas. The model already sees the schemas; the prompt should add judgment, not repeat the interface.
Step 5: Evaluate before you trust
Do not let real customers be your test suite. Build a set of evals: scripted scenarios with known-correct outcomes — a clean pause, a cancel that should be refused, a not-found path, a customer who changes their mind mid-conversation. Run the agent against them and assert on the tool calls it makes and the final state, not just the prose. An agent that says the right thing but calls the wrong tool is still broken.
Treat the eval suite as a release gate. A new system prompt, a new tool, or a model upgrade all re-run the suite before shipping. This is how you upgrade from Sonnet to a newer model with confidence: the evals tell you whether behavior held. Without them, every change to a production agent is a leap of faith.
Step 6: Go live in stages
Ship in shadow mode first. Let the agent run against real conversations and propose actions, but execute nothing — log what it would have done and review the diffs against what humans actually did. When the proposed actions match reality consistently, enable real writes for a small cohort, keep the write gate on for the riskiest tool, and watch your logs and metrics closely. Only after the cohort behaves do you widen.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
This staged path turns going live from a single scary switch into a sequence of small, reversible steps. Each stage answers one question — does it reason correctly, does it act correctly, does it act correctly at scale — and you only advance when the answer is yes. That is how an agent earns the right to touch production.
Frequently asked questions
How many tools should my first agent expose?
As few as possible — often three to five. A narrow surface is easier for Claude to use correctly, easier to evaluate, and easier to secure. Add tools only when a real task demands them, and keep dangerous writes separated from safe reads so you can gate them differently.
Why return errors as structured data instead of throwing?
Because the model recovers from data, not from exceptions. A result like { "ok": false, "reason": "already_canceled" } lets Claude adjust its plan and explain the situation to the user. A raw stack trace gives it nothing actionable and often triggers an unhelpful retry loop.
What does shadow mode actually buy me?
It lets the agent reason and propose against real traffic while executing nothing, so you can compare its intended actions to what humans did without risk. It surfaces real-world failure modes that scripted evals miss, and it builds the evidence you need before enabling live writes.
Where do idempotency keys come from?
Generate one per logical action in the harness and pass it through the tool call to the MCP server, which forwards it to the backing system. If the agent retries the same action, the same key prevents a duplicate effect — no double cancellation, no double charge.
From this build to your phone lines
CallSphere takes this exact build-and-go-live discipline to voice and chat: agents that look up accounts, act through scoped tools, and confirm risky changes — all evaluated and rolled out in stages. See a production-grade agent answering live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.