Build a Production Claude Agent: Step-by-Step Guide
A follow-along walkthrough to build a Claude agent: scaffold the loop, wire tools, add an MCP server, and ship safely with evals and guardrails.
Plenty of articles tell you what a Claude agent is. Far fewer hand you the actual sequence of steps to build one that survives contact with production. This is that sequence. We will start from an empty project and end with a Claude agent that takes a support ticket, looks up an order in a real system, drafts a resolution, and stops cleanly when it is done — with the loop, error handling, and guardrails an engineer can ship. Follow the steps in order; each one builds on the last, and skipping the boring ones is how demos turn into outages.
Step 1: Pin the task contract before any code
Before you call the model once, write down the contract in plain language: what input the agent receives, what it is allowed to do, and what a correct output looks like. For our example: input is a ticket with a customer email and a question; allowed actions are looking up orders and reading the knowledge base; output is a JSON object with a drafted reply and a confidence flag. This contract becomes your system prompt, your tool list, and your success check all at once. Teams that skip this step end up with agents that do plausible-looking work nobody can verify.
Write the contract tightly. "Resolve the ticket" is not a contract; "produce a reply that cites at least one order fact or knowledge-base article, or set needs_human to true" is. The narrower the contract, the easier every later step becomes.
Step 2: Scaffold the agent loop
The core of any Claude agent is a loop that calls the model, checks for a tool request, runs it, and feeds the result back. Use the Claude Agent SDK, which gives you these primitives on top of Claude Code, or hand-roll the loop against the Messages API. Either way the shape is the same and you should write it explicitly so you control termination.
flowchart TD
A["Receive ticket"] --> B["Build initial context + tool schemas"]
B --> C["Call Claude"]
C --> D{"Tool requested?"}
D -->|No| G["Validate output against contract"]
D -->|Yes| E["Run tool with try/catch"]
E --> F["Append result or error to context"]
F --> H{"Turn limit reached?"}
H -->|No| C
H -->|Yes| G
G --> I["Return reply or escalate"]
In code, the loop reads almost exactly like the diagram. Maintain a messages array, append the model's response and each tool result to it, and break the loop on a final text answer or a hard turn limit. The two non-negotiables here are the turn cap and wrapping every tool execution in error handling so a single bad call never kills the run.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Step 3: Write your first tool and its schema
Define the order-lookup tool with a precise JSON schema: a name, a one-sentence description Claude will read to decide when to use it, and typed parameters. Claude chooses tools largely from the description, so write it for a reader who cannot see your code — "Look up an order by its ID and return status, items, and ship date" beats "order tool." Validate the model's arguments against the schema before you execute anything; never trust the arguments blindly just because they came from the model.
Make the tool's return value explicit and structured. Return a small JSON object with the fields the agent needs, not a raw database row dump. If the order is not found, return {"found": false} rather than throwing, so the model can reason about the miss instead of seeing an opaque stack trace. The quality of your tool's output schema directly shapes the quality of the model's next decision.
Step 4: Add an MCP server for the systems you do not own
For systems you would rather not hand-wrap — a database, a ticketing platform, an internal API — connect an MCP server instead of writing bespoke tool code. Model Context Protocol lets Claude talk to these systems through a standard interface, so you register the server once and its tools become available to the agent automatically. In Claude Code or the Agent SDK, you add the server to configuration; the runtime handles discovery of its tools and forwards calls.
When you wire an MCP server, handle three things explicitly: authentication (pass credentials through environment or a secret store, never in the prompt), schema review (read the tools the server exposes and decide which ones this agent should actually be allowed to call), and a timeout so a slow server cannot hang your loop. Treat the MCP server as an untrusted dependency — its tools are powerful, and you are responsible for scoping what your agent can reach.
Step 5: Assemble context deliberately
Now compose what the model sees each turn. Put the task contract and constraints in the system prompt, where it stays stable enough to benefit from prompt caching. Add the current ticket. Advertise only the tools this agent needs — three sharp tools beat fifteen vague ones. As the loop runs and tool results accumulate, keep the context focused: once an order is looked up, you do not need to re-list every tool's full schema in your own reasoning, and you can summarize long knowledge-base passages down to the relevant paragraph. Lean context keeps both attention and cost under control.
If your agent grows several capabilities, reach for Agent Skills. A skill is a folder of instructions and scripts Claude loads only when a task signals it is relevant, so you can add a refund-handling skill without bloating the baseline context of every ticket the agent touches.
Step 6: Add termination, verification, and escalation
An agent that cannot stop is a liability. Enforce a maximum turn count, a token budget, and a deadline. When the loop produces a final answer, validate it against the contract from Step 1: does the reply cite a fact, or is needs_human set? If neither, do not ship the reply — escalate. Build the unhappy path now, because in production the unhappy path is most of the traffic. An agent that confidently escalates the 10% it cannot handle is far more valuable than one that fabricates answers to all 100%.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 7: Instrument, eval, then ship
Before this agent touches a customer, capture a structured trace of every run: the assembled context, the model's action, each tool call and result, and the final decision. Collect a few dozen real tickets, run the agent over them, and read the traces. You will find the same surprises every team finds — a tool description the model misread, a context value that went stale, an edge case the contract missed. Fix those, turn the labeled tickets into a small eval set, and gate future changes on it. When a new model like Opus 4.8 or a prompt tweak lands, you replay the evals and ship on green rather than on vibes.
Frequently asked questions
Do I need the Claude Agent SDK or can I use the API directly?
You can do either. The Agent SDK gives you the loop, tool handling, and MCP wiring out of the box on top of Claude Code primitives, which saves real time. Rolling your own against the Messages API gives you maximum control. Many teams start with the SDK and drop down to the raw API only where they need custom behavior.
How many tools should my first agent have?
As few as the contract allows — often two or three. Each extra tool adds schema tokens and a chance for the model to pick the wrong action. Add tools only when a real task needs them, and write each description as if the model has never seen your code.
What is the single most common mistake in step one builds?
No termination condition. New agents that loop until the model stops asking for tools will, on some inputs, never stop. Add a turn cap and a deadline from the very first version.
How do I know the agent is ready for production?
When it passes a real eval set built from actual tasks, escalates cleanly on the cases it cannot handle, and produces a readable trace for every run. Readiness is about verifiable behavior on real inputs, not a clean demo.
Putting agentic AI on the phone
This same build sequence — contract, loop, tools, MCP, context, guardrails, evals — is how CallSphere ships voice and chat agents that answer every call, pull live data mid-conversation, and book work without a human in the loop. Watch a fully built agent in action at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.