Build a Claude Coding Agent: Step-by-Step Walkthrough

Reading about agent architecture is one thing; building a working coding agent is another. This is a hands-on walkthrough. By the end you'll have the skeleton of an agent that reads a repository, edits files, runs tests, and iterates until they pass, which is the same basic loop behind many of the systems that top coding benchmarks. We'll use the Claude API with tool use, but the pattern transfers to the Claude Agent SDK if you'd rather not hand-roll the loop.

I'll keep the code real and minimal. Every snippet here is something you can paste and adapt, not pseudocode. The goal is to demystify the loop so you can extend it for your own stack.

Key takeaways

A coding agent is roughly 150 lines of glue: a tool registry, a turn loop, and a verifier.
Define tools with strict JSON schemas so Claude returns structured, executable calls.
Run every tool inside a sandbox; never let the model touch your real shell unguarded.
Feed tool results back as tool_result blocks so the model sees what happened.
Gate completion on a real test run, not on the model saying "done."

Step 1: define the tools Claude can call

Start by declaring the actions your agent is allowed to take. Keep the set small and orthogonal. For a coding agent, four tools cover most tasks: read a file, write a file, run a shell command, and run the test suite. The listing below defines three of them; a general-purpose shell tool follows the same pattern. Each tool gets a JSON schema so Claude knows exactly what arguments to produce.

tools = [
  {
    "name": "read_file",
    "description": "Read a file from the repo. Returns its full contents.",
    "input_schema": {
      "type": "object",
      "properties": {"path": {"type": "string"}},
      "required": ["path"]
    }
  },
  {
    "name": "write_file",
    "description": "Overwrite a file with new contents.",
    "input_schema": {
      "type": "object",
      "properties": {
        "path": {"type": "string"},
        "content": {"type": "string"}
      },
      "required": ["path", "content"]
    }
  },
  {
    "name": "run_tests",
    "description": "Run the project's test suite. Returns pass/fail and output.",
    "input_schema": {"type": "object", "properties": {}}
  }
]

The descriptions matter as much as the schemas — they're how Claude decides which tool fits the current sub-goal. Write them like API docs for a junior engineer.
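To make "strict" concrete, the handler side can refuse any call that doesn't match the declared schema before it executes. Here's a minimal sketch of that check. It is not a full JSON Schema validator: it only covers what the schemas in this walkthrough use (required keys, unknown keys, string-typed properties). Both `validate_input` and the `run_shell` definition are hypothetical names I've added for illustration.

```python
# Hypothetical fourth tool, written in the same shape as the definitions above.
run_shell_tool = {
    "name": "run_shell",
    "description": "Run a shell command inside the sandbox. Returns stdout, stderr, and exit code.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}


def validate_input(schema, args):
    """Return a list of problems; an empty list means the call looks well-formed.

    Deliberately tiny: checks required keys, unknown keys, and
    string-typed properties only.
    """
    problems = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
        elif props[key].get("type") == "string" and not isinstance(value, str):
            problems.append(f"argument {key} must be a string")
    return problems
```

If the list comes back non-empty, one option is to return it to the model as the tool result so it can correct the call on the next turn.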

Step 2: wire the agent loop

The loop is the engine. You send the conversation plus the tool definitions to Claude; if it returns a tool call, you execute it and append the result; if it returns a final text answer with no tool call, you're done. Here's the shape of that loop, and a diagram of the control flow.

flowchart TD
  A["Send messages + tools to Claude"] --> B{"stop_reason?"}
  B -->|tool_use| C["Dispatch to tool handler"]
  C --> D["Execute in sandbox"]
  D --> E["Append tool_result block"]
  E --> A
  B -->|end_turn| F{"Tests green?"}
  F -->|No| G["Inject failure, continue"] --> A
  F -->|Yes| H["Return final diff"]

In Python the core is short. Note how tool results are appended as a user message containing tool_result blocks — that's how Claude perceives the outcome of its own action.

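import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Placeholder task; in practice this is the issue text you want fixed.
messages = [{"role": "user", "content": "Fix the failing tests in this repository."}]

# `tools` comes from Step 1 and `dispatch` from Step 3.
while True:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        tools=tools,
        messages=messages,
    )
    # Record the assistant turn first so the final answer is kept in history too.
    messages.append({"role": "assistant", "content": resp.content})
    if resp.stop_reason != "tool_use":
        break
    # Execute every requested tool call and collect the results.
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            out = dispatch(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": out,
            })
    # Tool results go back to Claude as a single user message.
    messages.append({"role": "user", "content": results})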
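One gap in the loop above: if `dispatch` raises (say, the model asks for a file that doesn't exist), the whole loop crashes instead of telling the model what went wrong. A possible fix is to wrap each call so a failure still produces a `tool_result` block, flagged with the `is_error` field that the Messages API accepts on tool results. `safe_tool_result` is a hypothetical helper name, and catching every exception is a simplifying assumption.

```python
def safe_tool_result(tool_use_id, handler, name, args):
    """Run one tool call and always return a tool_result block.

    handler is any callable shaped like dispatch(name, args).
    """
    try:
        out = handler(name, args)
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": str(out),
        }
    except Exception as exc:  # report any handler failure back to the model
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,
        }
```

Inside the loop, `results.append(safe_tool_result(block.id, dispatch, block.name, block.input))` would replace the direct `dispatch` call.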

Step 3: implement the tool handlers in a sandbox

The dispatch function maps a tool name to a Python function. The critical rule: every handler runs against a sandboxed checkout, never your real working tree. Use a temporary git clone or a container so a bad edit can't damage anything. Read and write are trivial; the shell handler should enforce a timeout and capture both stdout and stderr.

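import os
import subprocess

SANDBOX = "/tmp/agent-sandbox"  # example path; point this at your disposable checkout

def dispatch(name, args):
    if name == "read_file":
        with open(os.path.join(SANDBOX, args["path"])) as f:
            return f.read()
    if name == "write_file":
        path = os.path.join(SANDBOX, args["path"])
        with open(path, "w") as f:
            f.write(args["content"])
        return "written"
    if name == "run_tests":
        r = subprocess.run(["pytest", "-q"], cwd=SANDBOX,
                           capture_output=True, text=True, timeout=300)
        # Keep the tail of stdout plus stderr, then the exit code.
        return (r.stdout + r.stderr)[-4000:] + "\nEXIT=" + str(r.returncode)
    # Tell the model about unknown tools instead of silently returning None.
    return "unknown tool: " + name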

Truncating the test output to the last few thousand characters is deliberate — pytest can emit thousands of lines, and you don't want to flood the context window. Keep the tail, which holds the failures.
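A sandbox directory only helps if paths can't climb out of it. The handler joins the model-supplied path with `os.path.join`, which happily accepts `../` segments and absolute paths. A small guard like the one below is one way to close that hole; `resolve_in_sandbox` is a hypothetical helper, and it assumes the sandbox is a plain directory on the local filesystem.

```python
import os


def resolve_in_sandbox(sandbox, rel_path):
    """Map a model-supplied path to an absolute path inside the sandbox.

    Raises ValueError if the resolved path would land outside it.
    """
    root = os.path.realpath(sandbox)
    # realpath collapses ".." segments and follows symlinks.
    target = os.path.realpath(os.path.join(root, rel_path))
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"path escapes sandbox: {rel_path}")
    return target
```

The read and write handlers would then call `resolve_in_sandbox(SANDBOX, args["path"])` instead of joining paths directly.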

Step 4: add the verification gate

This is the step that turns a code generator into a coding agent. Don't trust the model when it says the task is finished. After it stops calling tools, run the tests yourself one more time. If they fail, push the failure back into the conversation and let the loop continue. Only return a diff when the suite is genuinely green.

This gate is why benchmark-grade agents are reliable: correctness is decided by the test runner, not by the model's confidence. Wire it as an outer check around the inner loop, exactly as the diagram shows.
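Here's one way that outer check could look. It's a sketch under a few assumptions of mine: `run_inner_loop` and `run_tests` are hypothetical callables you supply (the first wraps the tool loop from Step 2, the second runs the suite independently), and `max_rounds` is a cap I've added so the gate can't spin forever.

```python
def run_with_gate(run_inner_loop, run_tests, messages, max_rounds=3):
    """Outer verification check around the inner tool loop.

    run_inner_loop(messages): runs the tool loop until the model stops
        calling tools, appending to messages in place.
    run_tests(): returns (passed, output) from an independent test run.
    Returns True only when an independent run is green.
    """
    for _ in range(max_rounds):
        run_inner_loop(messages)
        passed, output = run_tests()
        if passed:
            return True
        # Push the failure back in as a fresh instruction and go around again.
        messages.append({
            "role": "user",
            "content": "The test suite still fails. Fix the failures below.\n\n" + output,
        })
    return False
```

The important design choice is that the return value depends only on `run_tests`, never on anything the model said.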

One refinement pays off immediately: when you push a failure back, include a structured summary rather than the raw log, and frame it as a fresh instruction, such as "these two tests still fail; here are the assertions; fix them." This keeps the agent oriented on the gap between current and desired state instead of re-reading thousands of lines. It also prevents a common stall where the agent, seeing a wall of output, decides to read files it already understands rather than making the next edit.
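As a rough illustration of that summary step, the function below pulls just the failing test lines out of a pytest run. It assumes the output includes pytest's short test summary, where each failing test appears on a line starting with `FAILED `; `summarize_failures` and its `limit` parameter are names I've made up for this sketch.

```python
def summarize_failures(pytest_output, limit=10):
    """Turn raw pytest output into a short instruction for the model."""
    failed = [
        line[len("FAILED "):].strip()
        for line in pytest_output.splitlines()
        if line.startswith("FAILED ")
    ]
    if not failed:
        # Nothing recognizable: fall back to a short tail of the log.
        return "The test run failed. Tail of output:\n" + pytest_output[-1000:]
    lines = [f"Still failing ({len(failed)}):"]
    lines += [f"- {entry}" for entry in failed[:limit]]
    if len(failed) > limit:
        lines.append(f"- ... and {len(failed) - limit} more")
    lines.append("Fix these before finishing.")
    return "\n".join(lines)
```

The returned string is what goes into the next user message, in place of the full log.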

Hardening: handle the long tail (timeouts, flakes, and partial edits)

A skeleton that works on a clean task still needs hardening for the real world. Three failure modes show up fast. Test commands hang, so every shell call needs a timeout and the agent needs to treat a timeout as a distinct, retryable signal rather than a failed assertion. Tests flake, so a single red run shouldn't always reset the agent's plan — re-running once before reacting filters out noise. And edits go partial: the model writes half a change, the tests fail in a new way, and the agent must read the current file state rather than assume its last write landed cleanly. Building these in early is far cheaper than debugging a confused agent later.
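The first two of those failure modes fit in a small wrapper. This is a sketch, not a prescription: `run_once` and `run_with_rerun` are hypothetical helpers, the three status strings are my own labels, and a single rerun is just the default suggested by the paragraph above.

```python
import subprocess


def run_once(cmd, cwd=None, timeout_s=300):
    """Run a command and classify the outcome as passed, failed, or timeout."""
    try:
        r = subprocess.run(cmd, cwd=cwd, capture_output=True,
                           text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hang is reported separately from a failed assertion.
        return {"status": "timeout", "output": f"command exceeded {timeout_s}s"}
    status = "passed" if r.returncode == 0 else "failed"
    return {"status": status, "output": (r.stdout + r.stderr)[-4000:]}


def run_with_rerun(cmd, cwd=None, timeout_s=300, reruns=1):
    """Re-run a non-passing command up to `reruns` times before reporting it."""
    result = run_once(cmd, cwd, timeout_s)
    attempts = 1
    while result["status"] != "passed" and attempts <= reruns:
        result = run_once(cmd, cwd, timeout_s)
        attempts += 1
    result["attempts"] = attempts
    return result
```

For the test tool this would be called as `run_with_rerun(["pytest", "-q"], cwd=SANDBOX)`. The third failure mode, partial edits, is handled by the agent itself re-reading the file after a surprising failure, so it doesn't need wrapper code.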

Step 5: seed the system prompt and run

Give the agent a focused system prompt: who it is, the repo layout, the definition of done ("all tests pass"), and a reminder to explore before editing. Then kick it off with the issue text as the first user message. The table below shows what to expect as you scale this skeleton up.

Build stage	What you have	Typical gap
Loop only	Agent that edits and runs tests	Wastes turns guessing file paths
+ search tool	Navigates the repo efficiently	Context fills with stale output
+ context trimming	Stays coherent over many steps	Occasional unsafe shell command
+ sandbox & limits	Production-ready coding agent	Tuning prompts for your stack
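For the system prompt itself, something along these lines is a reasonable starting point. The wording of the prompt is only an example, and `build_system_prompt` and `build_request` are hypothetical helpers; the one API detail relied on is that the Messages API takes the system prompt as a top-level `system` parameter rather than as a message.

```python
def build_system_prompt(repo_layout, test_command="pytest -q"):
    """Assemble the four ingredients: role, layout, definition of done, explore-first."""
    return "\n\n".join([
        "You are a coding agent working in a sandboxed checkout of a repository.",
        "Repository layout:\n" + repo_layout,
        f"Definition of done: all tests pass when `{test_command}` is run.",
        "Explore before editing: read the relevant files and tests before you change anything.",
    ])


def build_request(issue_text, repo_layout, tools, model):
    """Keyword arguments for the first client.messages.create call."""
    return {
        "model": model,
        "max_tokens": 4096,
        "system": build_system_prompt(repo_layout),
        "tools": tools,
        # The issue text is the first user message.
        "messages": [{"role": "user", "content": issue_text}],
    }
```

The first call then becomes `client.messages.create(**build_request(issue_text, layout, tools, model))`, with later turns reusing the same `system` and `tools` values.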

Common pitfalls

Skipping the sandbox. Running tool calls against your real repo will eventually corrupt it. Always operate on a disposable checkout.
Forgetting tool_result blocks. If you don't append results in the exact block format, Claude can't see what its action did and will loop blindly.
Dumping full logs into context. Truncate and summarize tool output, or the window fills with noise and quality collapses.
Trusting "done." Without an independent test run after the loop, the agent can declare success on code that fails its tests or doesn't even compile.
No turn cap. Add a max-turns limit; a confused agent can loop indefinitely and burn tokens.
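The turn cap from the last pitfall is a one-line change: swap `while True` for a bounded `for`. The sketch below shows it with a fake model so it runs offline. `run_agent` and `call_model` are hypothetical names; `call_model` stands in for a wrapper around `client.messages.create`, and the default of 25 turns is an arbitrary choice.

```python
from types import SimpleNamespace


def run_agent(call_model, dispatch, messages, max_turns=25):
    """Tool loop with a hard cap on model calls.

    Returns "finished" if the model stopped on its own, "turn_limit" otherwise.
    """
    for _ in range(max_turns):
        resp = call_model(messages)
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            return "finished"
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": dispatch(b.name, b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    return "turn_limit"


# Offline demo: a fake model that never stops asking for the same file.
def stuck_model(messages):
    call = SimpleNamespace(type="tool_use", id="t1",
                           name="read_file", input={"path": "a.py"})
    return SimpleNamespace(stop_reason="tool_use", content=[call])


outcome = run_agent(stuck_model, lambda name, args: "contents", [], max_turns=5)
print(outcome)  # turn_limit
```

Returning a distinct `"turn_limit"` value lets the caller tell a capped run apart from a finished one, for example to log it for review.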

A coding agent's reliability comes not from a smarter model but from a disciplined loop: structured tools, a sandbox, trimmed context, and a verification gate that has the final word.

Frequently asked questions

Do I need the Agent SDK or can I hand-roll this?

You can hand-roll it with the raw Messages API exactly as shown — it's about 150 lines. The Claude Agent SDK saves you from rebuilding the loop, sandboxing, and context management yourself, so reach for it once your prototype works and you want production hardening.

Which model should the agent use?

Use Sonnet 4.6 for most agentic coding — it's fast and strong at tool calling. Switch to Opus 4.8 for the hardest reasoning-heavy tasks where the extra capability pays for the cost. Many teams route by difficulty.

How do I stop runaway token usage?

Cap turns, truncate tool output, and evict stale observations from the message list. The agent loop should never carry the full history of every file it has ever read.
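Eviction is the least obvious of the three, so here's a sketch. It blanks out the content of older `tool_result` blocks while keeping the blocks themselves, on the assumption that every `tool_use` in the history should still have a matching result. `evict_stale_results`, `keep_last`, and the placeholder text are all hypothetical choices of mine.

```python
def evict_stale_results(messages, keep_last=3,
                        placeholder="[older tool output removed]"):
    """Return a copy of messages with old tool_result contents replaced.

    Only the bulky content is dropped; message order and tool_use_id
    pairing are left intact. The input list is not modified.
    """
    # Locate every tool_result block as (message index, block index).
    positions = [
        (i, j)
        for i, msg in enumerate(messages)
        if msg["role"] == "user" and isinstance(msg["content"], list)
        for j, block in enumerate(msg["content"])
        if isinstance(block, dict) and block.get("type") == "tool_result"
    ]
    stale = set(positions[:-keep_last]) if keep_last > 0 else set(positions)

    trimmed = []
    for i, msg in enumerate(messages):
        if isinstance(msg["content"], list):
            content = [
                {**block, "content": placeholder} if (i, j) in stale else block
                for j, block in enumerate(msg["content"])
            ]
            trimmed.append({**msg, "content": content})
        else:
            trimmed.append(msg)
    return trimmed
```

Calling `messages = evict_stale_results(messages)` before each model call keeps only the most recent observations in full, and the agent can always re-read a file if it needs the contents again.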

Putting agentic loops on the phone

CallSphere runs this same build-a-loop discipline for voice and chat — agents that take an action, observe the result, and keep going until the caller's request is genuinely handled, not just acknowledged. See it working at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
