Build a Claude Agent: A Step-by-Step Walkthrough (Building Effective AI Agents)
A linear, copy-paste walkthrough to build a working Claude agent — the loop, tool schemas, robust dispatch, budgets, and tracing, with code and a diagram.
Reading about agent architecture is one thing; getting a blinking cursor to turn into a working agent is another. This is the walkthrough I wish I had the first time — a linear path from an empty Python file to a Claude agent that plans, calls tools, recovers from failures, and stops when it should. No hand-waving, no "and then add error handling later." We build it in the order you would actually build it.
The running example is a small "research assistant" agent that can search a knowledge base and fetch a URL, then synthesize an answer. It is deliberately modest so the scaffolding is visible. Everything here calls the Claude Messages API through the anthropic Python client, following the Agent SDK conventions current in 2026.
Key takeaways
- Start with the dumbest possible loop and a single tool — get one full perceive-act-observe cycle working before anything else.
- Define tools as explicit JSON schemas; the description field does more for reliability than any prompt tweak.
- Add a turn cap and token budget before you add a second tool, not after.
- Normalize every tool result — success and failure — into the same structured shape Claude can reason over.
- Trace every turn from day one; you cannot debug an agent you cannot replay.
Step 1 — the bare loop
Begin with the smallest thing that exercises the full cycle: send a prompt, let Claude ask for a tool, run it, feed the result back. Resist adding features. The goal of this step is to see the model request a tool and your code execute it.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "kb_search",
    "description": "Search the internal knowledge base. Use for any factual question about our product or docs. Returns up to 5 snippets with source IDs.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"]
    }
}]

def kb_search(query):
    # lookup() is your own knowledge-base search function (not defined here).
    return {"snippets": lookup(query)[:5]}
That description field is not decoration. "Use for any factual question about our product or docs" tells Claude exactly when to reach for this tool versus answering from its own knowledge. Vague descriptions are the number one cause of agents that either never call tools or call them constantly.
Step 2 — drive the loop
Now wrap the call in a loop that handles the tool_use stop reason. This is the engine. Every later feature hangs off this skeleton, so get the control flow exactly right before decorating it.
import json

def run(goal, max_turns=8):
    messages = [{"role": "user", "content": goal}]
    for turn in range(max_turns):
        # SYSTEM is the system prompt string written in step 5.
        resp = client.messages.create(
            model="claude-opus-4-8", system=SYSTEM,
            messages=messages, tools=TOOLS, max_tokens=2048)
        # Always append the assistant turn, tool_use blocks included.
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return resp
        results = []
        for b in resp.content:
            if b.type == "tool_use":
                out = dispatch(b.name, b.input)
                results.append({"type": "tool_result",
                                "tool_use_id": b.id, "content": json.dumps(out)})
        messages.append({"role": "user", "content": results})
    return resp  # hit the turn cap
The max_turns guard is in place from the first real version. This is intentional. An agent without a turn cap is a billing incident waiting to happen, and you will forget to add it later.
How a single request flows
Before adding more tools, picture what one user goal does as it moves through the code you just wrote. The diagram below maps the exact control flow of the run function, including the two ways the loop can terminate.
flowchart TD
A["Goal in"] --> B["messages.create with tools"]
B --> C{"stop_reason == tool_use?"}
C -->|No| D["Return final answer"]
C -->|Yes| E["dispatch each tool_use block"]
E --> F["Append tool_result to messages"]
F --> G{"turn < max_turns?"}
G -->|Yes| B
G -->|No| D
Two terminal edges both land on "Return final answer": one when Claude is genuinely done, one when the turn cap fires. Make sure your code returns something sensible in the cap case — a partial answer plus a flag is far better than an exception that loses the work.
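One way to return "a partial answer plus a flag" is to wrap the last response in a small record. The sketch below is illustrative, not part of the original loop: RunResult, final_text, and finish are hypothetical names, and the fake response object only imitates the stop_reason and content fields used above so the example runs without an API call.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class RunResult:
    """Hypothetical wrapper for what the loop hands back to the caller."""
    text: str            # whatever text the model produced in its last turn
    hit_turn_cap: bool   # True when the loop stopped because of max_turns
    turns_used: int


def final_text(content_blocks):
    """Join the text blocks of a response, skipping tool_use blocks."""
    return "".join(b.text for b in content_blocks if b.type == "text")


def finish(resp, turns_used, max_turns):
    """Build the caller-facing result from the last response."""
    # If the model still wanted a tool on the last allowed turn, the cap fired.
    capped = resp.stop_reason == "tool_use" and turns_used >= max_turns
    return RunResult(final_text(resp.content), capped, turns_used)


# Stand-in for an SDK response (assumption: only these fields are needed).
fake = SimpleNamespace(
    stop_reason="tool_use",
    content=[
        SimpleNamespace(type="text", text="Found two sources so far."),
        SimpleNamespace(type="tool_use", id="t1", name="kb_search",
                        input={"query": "pricing"}),
    ],
)
result = finish(fake, turns_used=8, max_turns=8)
```

In the run function, both return statements could then become `return finish(resp, turn + 1, max_turns)`, so the caller can tell a finished answer from a truncated one without catching an exception.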
Step 3 — robust dispatch
The dispatch function is where reliability lives. It must turn every outcome, including exceptions and timeouts, into a structured result the model can read and act on. A tool that throws and crashes the loop is useless; a tool that returns a clean error lets Claude apologize, retry, or try a different approach.
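# REGISTRY maps tool names to executors, e.g. {"kb_search": kb_search}.
def dispatch(name, args):
    # Look up the tool outside the try block so a KeyError raised inside
    # a tool is not misreported as "unknown tool".
    fn = REGISTRY.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool {name}"}
    try:
        return {"ok": True, "data": fn(**args)}
    except Exception as e:
        return {"ok": False, "error": str(e), "retryable": True}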
Returning retryable: true is a small signal that pays off enormously. Claude reads it and decides whether to try again or change course, instead of you encoding that policy in brittle runtime code. The same idea extends to richer signals: a requires_human flag tells the agent to escalate rather than thrash, and an empty: true on a zero-result search tells it the query was fine but found nothing — a distinction the model genuinely uses when deciding whether to broaden its search or give up gracefully.
A subtle point about dispatch: keep it pure plumbing. The temptation is to sneak business logic in here — "if the search returned nothing, automatically broaden it." Don't. That decision belongs to the model. The dispatcher's only job is to run the tool and report what happened in a shape the model can read. The moment you start making decisions in dispatch, you have quietly turned your agent back into a hard-coded workflow.
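The richer signals mentioned above (requires_human, empty) can be attached without putting any decisions into the dispatcher: it only reports what happened. The following is a sketch under stated assumptions. NeedsHuman is a hypothetical exception, fetch_url is a stand-in for the URL tool from the running example, and the "snippets" check assumes the kb_search result shape from step 1.

```python
import json


class NeedsHuman(Exception):
    """Hypothetical exception a tool raises when only a person can proceed."""


def kb_search(query):
    # Stand-in tool: pretend the knowledge base has no match.
    return {"snippets": []}


def fetch_url(url):
    # Stand-in tool: pretend the page sits behind a login.
    raise NeedsHuman(f"{url} requires a login")


REGISTRY = {"kb_search": kb_search, "fetch_url": fetch_url}


def dispatch(name, args):
    """Run one tool and report the outcome. No retries, no query rewriting."""
    fn = REGISTRY.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool {name}", "retryable": False}
    try:
        data = fn(**args)
    except NeedsHuman as e:
        return {"ok": False, "error": str(e),
                "retryable": False, "requires_human": True}
    except Exception as e:
        return {"ok": False, "error": str(e), "retryable": True}
    result = {"ok": True, "data": data}
    # A successful search that found nothing is flagged, not treated as an error.
    if isinstance(data, dict) and data.get("snippets") == []:
        result["empty"] = True
    return result


empty_hit = dispatch("kb_search", {"query": "quantum billing"})
escalate = dispatch("fetch_url", {"url": "https://example.com/private"})
payload = json.dumps(empty_hit)  # the string placed in the tool_result content
```

Note that the dispatcher never broadens the query or retries on its own. It labels the outcome and leaves the next move to the model.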
Step 4 — budgets and tracing
Before you register a second or third tool, add a token budget that accumulates resp.usage (input plus output tokens) each turn and halts the loop when the total crosses a ceiling. Then add a trace: log the full messages array and usage after every turn to a file or table keyed by a run ID. When an agent does something baffling, you replay the trace turn by turn, and the cause is usually easy to spot.
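A budget and a trace of the kind described here might look like the sketch below. Budget and trace_turn are hypothetical helpers, the JSON Lines file in the temp directory is an assumed storage choice, and the usage objects are stand-ins carrying the input_tokens and output_tokens fields that a Messages API response reports.

```python
import json
import os
import tempfile
import uuid
from types import SimpleNamespace


class Budget:
    """Accumulates token usage across turns and reports when the ceiling is hit."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used = 0

    def add(self, usage):
        # Every call is billed for its own input and output tokens, so sum both.
        self.used += usage.input_tokens + usage.output_tokens
        return self.used

    @property
    def exceeded(self):
        return self.used >= self.max_tokens


def trace_turn(path, run_id, turn, messages, usage):
    """Append one JSON line per turn so the run can be replayed later."""
    record = {
        "run_id": run_id,
        "turn": turn,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "messages": messages,
    }
    with open(path, "a", encoding="utf-8") as f:
        # default=str is a fallback for objects that are not plain JSON types.
        f.write(json.dumps(record, default=str) + "\n")


# Simulated two-turn run with stand-in usage numbers (no API call).
run_id = uuid.uuid4().hex
path = os.path.join(tempfile.gettempdir(), f"trace-{run_id}.jsonl")
budget = Budget(max_tokens=1000)
messages = [{"role": "user", "content": "What is our refund policy?"}]
for turn, (tokens_in, tokens_out) in enumerate([(300, 200), (400, 250)]):
    usage = SimpleNamespace(input_tokens=tokens_in, output_tokens=tokens_out)
    budget.add(usage)
    trace_turn(path, run_id, turn, messages, usage)
    if budget.exceeded:
        break  # in the real loop, return a partial result here
```

Inside the run function, the same two calls would sit right after messages.create, with resp.usage in place of the stand-in. Writing the trace before checking the budget means the turn that blew the budget is still on record.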
Step 5 — add a system prompt with a stop rule
The last piece is the system prompt, and the part people skip is the stop rule. Tell the agent, in plain language, what "done" looks like for this task and what to do when it is stuck. Something as simple as "When you have an answer supported by at least one source, write it and stop. If two searches return nothing useful, say so and stop" prevents the most common failure mode — an agent that keeps searching variations of the same query forever. Pair this prompt-level stop with the runtime turn cap from step 2 and you have two independent guarantees that the loop terminates, which is exactly what you want for anything running unattended.
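As an illustration, the stop rule quoted above can be kept as the final paragraph of the SYSTEM string that the loop passes to messages.create. The wording and the build_system_prompt helper are assumptions made for this sketch, including the extra "do not repeat a query" line, so tune them to your own task.

```python
def build_system_prompt(min_sources=1, max_empty_searches=2):
    """Compose a system prompt whose last paragraph is an explicit stop rule."""
    return (
        "You are a research assistant. Use kb_search for factual questions "
        "about our product or docs.\n\n"
        f"Stop rule: when you have an answer supported by at least {min_sources} "
        "source(s), write it and stop. "
        f"If {max_empty_searches} searches return nothing useful, say so and stop. "
        "Do not repeat a query you have already tried."
    )


SYSTEM = build_system_prompt()
```

Keeping the thresholds as parameters makes it easy to try a stricter or looser stop rule while replaying the same traced goals.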
Common pitfalls
- Adding tools before the loop is solid. Each new tool multiplies the ways the loop can misbehave. Lock the engine first.
- Forgetting to append the assistant message. If you execute the tool but never add the assistant's tool_use block to messages, the next API call is malformed. Append both sides every turn.
- Stringifying errors as plain text. Return JSON with an ok flag so the model can branch on it reliably.
- No run ID. Without a correlation ID you cannot tie logs, traces, and costs together when investigating.
- Overlong tool descriptions. One or two precise sentences beat a paragraph; Claude reads the whole catalog every turn, so verbosity costs tokens and clarity.
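The second pitfall is easy to check mechanically. The helper below is a hypothetical debugging aid, not an SDK function: it assumes the messages have already been converted to plain dicts (for example, loaded from a saved trace) and reports any tool_use id that has no matching tool_result in the very next user message.

```python
def unanswered_tool_calls(messages):
    """Return tool_use ids that lack a tool_result in the following user message."""
    missing = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or isinstance(msg["content"], str):
            continue
        asked = [b["id"] for b in msg["content"] if b.get("type") == "tool_use"]
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        answered = set()
        if nxt and nxt["role"] == "user" and not isinstance(nxt["content"], str):
            answered = {b["tool_use_id"] for b in nxt["content"]
                        if b.get("type") == "tool_result"}
        missing += [tool_id for tool_id in asked if tool_id not in answered]
    return missing


good = [
    {"role": "user", "content": "What is our refund window?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "kb_search",
         "input": {"query": "refund window"}}]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "{\"ok\": true}"}]},
]
# Same conversation, but the tool result was never sent back.
bad = good[:2] + [{"role": "user", "content": "any luck?"}]
```

Calling it on a conversation just before the next API request gives an empty list when both sides were appended correctly.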
Ship it in five steps
- Write one tool with a sharp description and a hard-coded executor.
- Build the loop with a max_turns cap and correct message appending.
- Wrap dispatch so every result is structured JSON, including errors.
- Add a token budget and a per-turn trace keyed by run ID.
- Register remaining tools one at a time, testing each in isolation before combining.
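For the last step, one possible pattern is to register each tool's schema and executor together, then exercise the executor directly before the model ever sees it. The tool decorator is a hypothetical helper, and this fetch_url has a placeholder body and an assumed description; a real version would make an HTTP request.

```python
REGISTRY = {}   # tool name -> executor, used by dispatch
TOOLS = []      # schemas passed to messages.create


def tool(name, description, input_schema):
    """Register an executor and its schema together so they cannot drift apart."""
    def wrap(fn):
        REGISTRY[name] = fn
        TOOLS.append({"name": name, "description": description,
                      "input_schema": input_schema})
        return fn
    return wrap


@tool(
    name="fetch_url",
    description=("Fetch a web page by URL and return its text. Use after "
                 "kb_search when a snippet links to a page you need to read."),
    input_schema={"type": "object",
                  "properties": {"url": {"type": "string"}},
                  "required": ["url"]},
)
def fetch_url(url):
    # Placeholder body: validate the input and return a fixed shape.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    return {"url": url, "text": "(page text would go here)"}


# Isolation check: call the executor directly with good and bad input.
ok = fetch_url("https://example.com/docs")
try:
    fetch_url("ftp://example.com")
    rejected = False
except ValueError:
    rejected = True
```

Once the direct calls behave, the tool is already in REGISTRY and TOOLS, so the next run of the loop picks it up with no further wiring.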
One-shot call vs. agent loop
| Capability | One-shot call | Agent loop |
|---|---|---|
| Multi-step tasks | No | Yes |
| Recovers from tool errors | No | Yes |
| Cost predictability | High | Needs a budget |
| Best for | Extraction, classification | Research, automation |
Frequently asked questions
How many tools should my first agent have?
One. Get a single tool working through the full loop, then add the second only once the first is reliable. Most early agent bugs come from too many tools with overlapping descriptions, which forces Claude to guess.
What should max_turns be?
Start low — six to eight — and raise it only if you observe legitimate tasks getting cut off. A low cap surfaces looping bugs early instead of letting them hide behind a generous ceiling.
Should I stream responses during development?
Not at first. Streaming adds parsing complexity that obscures the loop logic. Build and debug with non-streaming calls, then add streaming for the user-facing layer once the agent behaves correctly.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.