Build a Claude Agent SDK Agent: Step-by-Step Guide
Step-by-step: build a working agent with the Claude Agent SDK — setup, system prompt, tools, the run loop, permissions, and production guardrails.
Reading about agent architecture is useful, but nothing replaces typing the code and watching the loop turn. This walkthrough takes you from an empty project to a working agent that can read a codebase, answer a question that requires running a command, and stop cleanly when it is done. I will keep every step concrete and call out the decision you are actually making at each line, because the defaults you pick here shape how the agent behaves in production.
The goal for our example is modest on purpose: an agent that, given a repository, can answer "what does this project's test command do and is it currently passing?" That single task exercises tool selection, shell execution, a multi-turn loop, and a clean stop condition — the same machinery a much larger agent uses.
Step 1: Install and authenticate
Start a fresh project, install the Claude Agent SDK package for your language, and set your Anthropic API key as an environment variable rather than hardcoding it. Pin the SDK version in your manifest so a future update does not silently change loop behavior under you. At this point you have the harness available but no agent yet — the SDK is a library, not a running process, until you configure and start a loop.
Pick your model deliberately. For an agent doing real reasoning over tools, a mid or high-capability model like Sonnet or Opus is appropriate; reserve the smallest model for narrow, high-volume classification steps. The model is a parameter, not a religion — you will tune it after you see the agent's behavior.
Step 2: Write the system prompt as a job description
The system prompt is where you tell the agent who it is, what it is allowed to touch, and — critically — when it is finished. A vague prompt produces a loop that wanders. Write it like a job description: the role, the operating rules, and the definition of done. For our example: "You are a repository analyst. You may read files and run read-only shell commands. Determine the test command and run it once. Report the command and whether it passed. Do not modify files. When you have the answer, state it and stop."
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
That last sentence is load-bearing. The agent loop terminates when the model stops requesting tools, so the prompt must give the model a clear finish line. Without it, agents have a habit of "helpfully" doing more work than asked, burning turns and tokens.
Step 3: Register tools with tight schemas
Our agent needs two capabilities: read a file and run a shell command. Register each as a tool with a name, a description the model will read, and a strict input schema. The description is prompt real estate — "Run a read-only shell command in the repo root; never use it to write or delete" steers the model better than a bare "run command." The schema constrains the shape of the input so malformed calls fail fast rather than reaching your handler.
flowchart TD
A["Init SDK + system prompt"] --> B["Register read_file & run_shell tools"]
B --> C["Send task to agent loop"]
C --> D{"Model requests a tool?"}
D -->|read_file| E["Return file contents"]
D -->|run_shell| F["Permission check, then exec"]
E --> C
F --> C
D -->|No, final answer| G["Print result & stop"]
In your shell handler, enforce the read-only promise in code, not just in the prompt. Reject anything that is not on an allowlist of safe commands, or run inside a sandbox with no write mounts. The model's intent and your enforcement are two separate layers — trust the layer you control.
Step 4: Add a permission callback
Before the runtime executes any tool, route it through a permission callback. For a fully autonomous run you might auto-approve reads and shell commands that match your allowlist, and deny everything else. For an interactive run, you can pause and ask a human. The point is that this callback is the chokepoint for every side effect, so it is where your safety policy lives. Even in our read-only example, wiring it now means you do not have to retrofit safety when the agent grows write capabilities.
Step 5: Start the loop and stream events
Now start the run with the task as the first user message. The SDK drives the loop: model turn, tool request, your handler, result back, repeat. Subscribe to the event stream so you can see each tool call and result in real time. This visibility is not a nice-to-have — it is how you will debug. When the agent picks the wrong test command, the stream shows you the exact moment it guessed, and you fix the prompt or tool description accordingly.
For our task, a healthy run looks like: the agent reads the project manifest to find the test script, runs it once with the shell tool, observes the exit status, and reports back. Three or four turns, then a clean stop. If you see ten turns, your stop condition or tool descriptions need tightening.
Step 6: Add guardrails before you ship
Three guardrails turn a demo into something you can run unattended. First, a turn limit so a confused agent cannot loop indefinitely. Second, a token or cost budget per run, enforced by the SDK, so a runaway task fails loudly instead of expensively. Third, structured logging of every tool call and result, keyed by a run ID, so you can reconstruct any session after the fact. With those three in place, you can run the agent in CI or behind an endpoint without watching it.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
From here, scaling up is additive: connect an MCP server to give the agent real tools like a ticket tracker, add a subagent to parallelize research, or expand the system prompt with domain rules. The loop, the tools, the permission gate, and the guardrails you built in this walkthrough stay exactly the same.
Frequently asked questions
How many tools should my first agent have?
As few as the task needs — often two or three. The model selects more accurately from a small, clearly described catalog. Add tools only when a real task fails for lack of one, and keep each description specific about what the tool does and what it must never do.
Why does my agent keep working after it has the answer?
Almost always because the system prompt never defines "done." State the finish condition explicitly — "once you have X, report it and stop" — and the model will return a final answer instead of requesting more tools.
Should I enforce safety in the prompt or in code?
Both, but trust the code. The prompt steers the model's intent; your tool handler and permission callback enforce the actual boundary. A read-only agent should reject write commands in the handler even if the prompt already forbade them.
How do I debug a run that went wrong?
Stream and log every tool call and result with a run ID. Replay the sequence to find the exact turn where the agent chose wrong, then fix the upstream cause — usually a fuzzy tool description, a missing stop condition, or a tool that returned ambiguous output.
Bringing agentic AI to your phone lines
The same build-loop you just walked — define the role, register tools, gate side effects, set budgets — is how CallSphere builds voice and chat agents that answer every call, fetch live data mid-conversation, and book work 24/7. See it in action at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.