Build a Claude Agent Step by Step: A Real Walkthrough
A hands-on walkthrough: define tools, build the loop, handle failures and idempotency, and ship a Claude agent behind an eval gate.
Most agent tutorials stop at a single tool call and a happy-path demo. That is not an agent; it is a function call with extra steps. This walkthrough takes you all the way to something you could actually deploy: a Claude-powered support triage agent that reads an incoming ticket, looks up the customer, checks recent orders, decides on an action, and either drafts a response or escalates. We build it incrementally so each layer's purpose is obvious, and we stop along the way to fix the failures that always appear in real life.
Step 1: Define the goal and the action space
Before writing code, write down what the agent is allowed to do. Our triage agent's action space is small and deliberate: lookup_customer, get_recent_orders, search_help_articles, draft_reply, and escalate_to_human. Each is a tool with a typed input and output. A tight action space is a feature, not a limitation — it bounds what can go wrong and makes the agent's behavior predictable. Resist the urge to add a generic "run_sql" tool; specific tools with narrow contracts produce far more reliable agents than one powerful, ambiguous one.
Write the system prompt next, and keep it about the job, not about Claude. Describe the agent's role, the order of operations you expect ("always identify the customer before drafting"), the escalation criteria, and the tone. Crucially, state what the agent must not do — never promise refunds over a threshold, never invent order numbers. These constraints become the spine the rest of the build hangs on.
Step 2: Wire one tool and a single turn
Start with one tool and a single model turn so the plumbing is proven before complexity arrives. You declare the tool with a JSON schema for its inputs, send the ticket text plus the tool definition to Claude, and inspect the response. Claude either answers directly or returns a tool-use block naming lookup_customer with arguments it inferred from the ticket. Your code executes the real lookup, then sends the result back in a follow-up message. That round trip — model proposes, you execute, you return — is the atom every agent is built from.
flowchart TD
A["Incoming ticket"] --> B["Build prompt + tool defs"]
B --> C["Claude turn"]
C --> D{"Tool use or final?"}
D -->|Tool use| E["Execute tool, validate result"]
E --> F["Append result to messages"]
F --> C
D -->|Final| G{"Draft or escalate?"}
G -->|Draft| H["Queue reply for review"]
G -->|Escalate| I["Create human ticket"]Step 3: Turn it into a loop
A single turn isn't enough because the agent needs several tools in sequence. Wrap the turn in a loop: as long as Claude returns tool-use blocks, execute them, append the results, and call again. The loop exits when Claude returns a final text answer with no tool calls. Two safeguards go in immediately. A hard maximum on iterations — eight is plenty for triage — prevents runaway loops. And a guard that detects the same tool being called with identical arguments twice in a row, which means the model is stuck and should be nudged or stopped.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Now you handle multiple tool calls in one turn, because Claude will often request lookup_customer and get_recent_orders together. Execute them, ideally in parallel since they're independent, and return all results in a single follow-up. This is where the agent starts to feel intelligent: it plans a small batch of reads, then reasons over the combined results to choose an action.
Step 4: Make failures first-class
This is the step that separates a demo from a product. Real tools fail. The customer lookup 404s because the email was mistyped. The orders service times out. When that happens, do not throw an exception that kills the loop. Instead, return a structured error result to Claude — { "error": "customer_not_found", "hint": "no record for this email" } — and let the model adapt. A well-prompted agent will try searching by a different field, or correctly decide it has insufficient information and escalate. The model handling tool failures gracefully is one of the genuine superpowers of the agentic approach; lean into it.
Add idempotency to every write tool. draft_reply and escalate_to_human both create records, and the loop might retry after a transient network blip. Pass a stable idempotency key derived from the ticket ID and action so a retried call returns the existing record instead of creating a duplicate. Nothing erodes trust in an agent faster than three identical escalation tickets from one event.
Step 5: Add an evaluation gate before you ship
You cannot ship an agent on vibes. Build a small evaluation set of real, anonymized tickets with known-good outcomes: which tools should fire, whether the agent should draft or escalate, and what facts the reply must contain. Run the agent against this set on every change. Score it on task success, not just "did it respond." Did it identify the right customer? Did it escalate the angry refund demand instead of auto-drafting? Did it avoid promising anything it shouldn't?
Treat the eval as a release gate. A prompt tweak that improves one case but regresses three is a net loss you'd never catch by hand. Start with a dozen cases and grow the set every time production surprises you — each incident becomes a permanent regression test. Over a few weeks this eval becomes the most valuable asset in the project, because it lets you change the agent confidently instead of fearfully.
Step 6: Deploy with a human in the loop
For the first release, route every drafted reply through human review before it sends. This is not a failure of nerve; it is how you gather the labeled data that tells you when the agent is trustworthy enough to act on its own. Track agreement rate between the agent's draft and what the human actually sends. When that rate is consistently high for a category of ticket, graduate that category to auto-send while keeping review on the rest. You earn autonomy with evidence.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Frequently asked questions
How many tools should my first agent have?
As few as possible — three to five narrowly-scoped tools. Each tool you add expands the action space and the surface for confusion. Start minimal, ship, and add tools only when an eval failure proves one is missing.
Should I use the Claude Agent SDK or call the API directly?
The SDK gives you the loop, tool execution, and MCP wiring for free, so you spend your time on tools and prompts rather than plumbing. Hand-rolling against the API is worth it only when you need control the SDK doesn't expose. For a first agent, the SDK gets you to a working build far faster.
How do I stop the agent from looping forever?
Two guards: a hard cap on iterations, and detection of repeated identical tool calls. Together they catch nearly every runaway. Log the full trace when either fires so you can see why the model got stuck.
When is the agent ready to act without human review?
When its eval scores are high and its agreement rate with human reviewers is consistently strong for a specific category of task. Graduate categories one at a time rather than flipping autonomy on globally.
Bringing this loop to voice and chat
CallSphere ships agents built on exactly this pattern — tight action spaces, structured error handling, idempotent writes, and eval gates — but for phone and chat, where the agent answers live, looks things up mid-call, and books real work. Hear one in action at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.