Build a Claude Computer Use Agent: Step-by-Step Guide

Plenty of posts explain that Claude can control a computer. Very few hand you the actual loop you need to type into a terminal and watch work. This is the build guide I wish I had on day one: a concrete, end-to-end implementation walkthrough that takes you from an empty directory to a running agent that opens an app, fills a form, and reports back. We will use Python and the Anthropic SDK, but the structure ports to any language with an HTTP client.

Key takeaways

You need four moving parts: a display, an action executor, a screenshot capture, and the agent loop tying them together.
The agent loop is a while loop that runs until stop_reason is no longer tool_use; everything else is plumbing.
Declare the computer tool with the exact pixel dimensions you screenshot at.
Always append a tool_result for every tool_use block, in order, or the API rejects the turn.
Add a step cap and timeout before you run anything real; an unbounded loop burns tokens fast.

Step 1 — Stand up an isolated display

Computer use needs a screen to look at. The cleanest approach for development is a Docker container running a virtual framebuffer so nothing touches your real desktop. Start Xvfb on a display, run a lightweight window manager, and launch the apps you want Claude to drive.

Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
fluxbox &
xterm &

That gives you a 1280×800 virtual screen. Pin those numbers — you will repeat them in the tool declaration and in the screenshot scaling. Mismatched dimensions are the number-one cause of mis-clicks, so decide on the resolution now and never deviate.
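One cheap way to enforce that rule is to read the size straight out of each captured PNG and fail fast when it drifts. The sketch below assumes your screenshots are PNG files; the names png_size, assert_expected_size, and EXPECTED_SIZE are mine, not part of any SDK.

```python
import struct

# Assumption: this must match the Xvfb geometry and the tool declaration.
EXPECTED_SIZE = (1280, 800)

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are two big-endian 32-bit integers at bytes 16..23.
    return struct.unpack(">II", data[16:24])


def assert_expected_size(data: bytes, expected=EXPECTED_SIZE) -> None:
    """Raise if a captured frame does not match the pinned resolution."""
    actual = png_size(data)
    if actual != tuple(expected):
        raise RuntimeError(f"screenshot is {actual}, expected {tuple(expected)}")
```

Call assert_expected_size on the raw bytes before you base64-encode them, so a wrong capture stops the run instead of producing a mis-click three steps later.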

Step 2 — Write the action executor and screenshots

The executor turns Claude's requested actions into real input events. On Linux, xdotool handles mouse and keyboard cleanly, and ImageMagick's import or scrot captures the screen. Each branch maps one action type to one shell call.

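import base64, os, subprocess

SHOT_PATH = "/tmp/s.png"

def screenshot():
    # Some scrot versions will not overwrite an existing file, so drop the old capture first.
    if os.path.exists(SHOT_PATH):
        os.remove(SHOT_PATH)
    subprocess.run(["scrot", SHOT_PATH], check=True)
    with open(SHOT_PATH, "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

def act(action, **kw):
    # Each branch maps one action type to one xdotool call; check=True surfaces failures.
    if action == "screenshot":
        return screenshot()
    if action == "left_click":
        x, y = kw["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"], check=True)
    elif action == "type":
        subprocess.run(["xdotool", "type", "--delay", "40", kw["text"]], check=True)
    elif action == "key":
        subprocess.run(["xdotool", "key", kw["text"]], check=True)
    # Any other action falls through untouched; the fresh screenshot shows nothing changed.
    return screenshot()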

Every state-changing action ends by returning a fresh screenshot. That is deliberate: Claude only perceives the result of an action through the image you send back, so the screenshot is the feedback signal that keeps the agent grounded.
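If you want to unit-test the mapping without a display, one option is to split "decide the command" from "run the command". The sketch below is my own variation on act(), not a required structure: xdotool_args and CLICK_ARGS are hypothetical names, and the extra action names (right_click, middle_click, double_click, mouse_move) are assumed to follow the computer tool's action vocabulary.

```python
# Button arguments appended after "mousemove X Y" for each click-style action.
CLICK_ARGS = {
    "left_click": ["click", "1"],
    "middle_click": ["click", "2"],
    "right_click": ["click", "3"],
    "double_click": ["click", "--repeat", "2", "1"],
}


def xdotool_args(action: str, **kw) -> list:
    """Translate one action into an xdotool argument list without running it."""
    if action in CLICK_ARGS:
        x, y = kw["coordinate"]
        return ["xdotool", "mousemove", str(x), str(y), *CLICK_ARGS[action]]
    if action == "mouse_move":
        x, y = kw["coordinate"]
        return ["xdotool", "mousemove", str(x), str(y)]
    if action == "type":
        return ["xdotool", "type", "--delay", "40", kw["text"]]
    if action == "key":
        return ["xdotool", "key", kw["text"]]
    # Fail loudly so an unhandled action is noticed instead of silently skipped.
    raise ValueError(f"unsupported action: {action}")
```

Inside act() you would pass the returned list to subprocess.run. Raising on unknown actions is a design choice: it trades a crashed step for an early signal that the executor is missing a branch.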

Step 3 — The agent loop, visualized

The loop is the spine of the whole thing. It calls the Messages API, inspects the response for tool-use blocks, executes them, packages the results, and goes again. Here is the control flow before we write it.

flowchart TD
  A["Append user goal"] --> B["Call Messages API"]
  B --> C{"stop_reason == tool_use?"}
  C -->|No| H["Print final text, exit"]
  C -->|Yes| D["For each tool_use block"]
  D --> E["Run act() on the action"]
  E --> F["Build tool_result with screenshot"]
  F --> G["Append results to messages"]
  G --> I{"step < max_steps?"}
  I -->|Yes| B
  I -->|No| H

The two exit doors are the model deciding it is finished (stop_reason changes) and your guard tripping the step cap. Never ship without the second one.

Step 4 — Declare the tool and call the API

Now wire the SDK call. The tools array contains the computer tool with the same 1280×800 you used for Xvfb. The betas argument sets the beta header that enables the computer-use tool family.

from anthropic import Anthropic
client = Anthropic()

tools = [{
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
    "display_number": 99,  # the X11 display Xvfb was started on in Step 1
}]

messages = [{"role": "user",
    "content": "Open the text editor and type 'hello from Claude'."}]

resp = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    betas=["computer-use-2025-01-24"],
    messages=messages,
)

Use Sonnet for most computer-use work — it balances speed and capability for the long, multi-step visual loops this produces. Reserve Opus for tasks where reasoning over an ambiguous UI is the bottleneck.

Step 5 — Dispatch tool calls and feed results back

When the response stops with tool_use, iterate the content blocks, run each action, and build a matching tool_result. The result for a screenshot carries an image block; the API requires one result per tool-use block, in the same order.

while resp.stop_reason == "tool_use":
    # Echo the assistant turn back so every tool_use id stays in the history.
    messages.append({"role": "assistant", "content": resp.content})
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            # block.input already holds "action"; unpack it once. Passing the
            # action positionally as well would raise a TypeError.
            img = act(**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image",
                    "source": {"type": "base64",
                        "media_type": "image/png", "data": img}}],
            })
    # One user turn carries the results for every tool_use block above.
    messages.append({"role": "user", "content": results})
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        tools=tools, betas=["computer-use-2025-01-24"], messages=messages)

That snippet is the entire engine. Add a counter around the while, break at your max step count, and you have a safe, runnable computer-use agent.
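Here is one way the guarded version could look. It is a sketch, not the only shape: run_agent, call_model, and run_tools are hypothetical names, with call_model standing in for the client.beta.messages.create call and run_tools for the block-dispatch code from the loop. The default limits are arbitrary.

```python
import time


def run_agent(call_model, run_tools, messages, max_steps=25, timeout_s=300.0):
    """Drive the loop until the model stops asking for tools or a guard trips.

    call_model(messages) -> response exposing .stop_reason and .content
    run_tools(content)   -> list of tool_result dicts, one per tool_use block
    Both callables are assumptions of this sketch, not SDK functions.
    Returns (last_response, reason), reason in {"done", "step_cap", "timeout"}.
    """
    deadline = time.monotonic() + timeout_s
    resp = call_model(messages)
    steps = 0
    while resp.stop_reason == "tool_use":
        if steps >= max_steps:
            return resp, "step_cap"
        if time.monotonic() > deadline:
            return resp, "timeout"
        # Same two appends as the plain loop: assistant turn, then results.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": run_tools(resp.content)})
        resp = call_model(messages)
        steps += 1
    return resp, "done"
```

Returning the reason alongside the response lets the caller tell a finished task apart from a tripped guard, which matters when you log or retry runs.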

Choosing where to run it

Where the display lives shapes everything else about deployment.

Environment	Setup effort	Best for
Docker + Xvfb	Low	Local dev, CI, reproducible runs
Cloud VM with desktop	Medium	Real apps, persistent sessions
Hosted sandbox service	Low (paid)	Scaling many parallel agents

Start with Docker + Xvfb. It is reproducible, disposable, and keeps a misbehaving agent away from anything that matters.

Common pitfalls

Forgetting a tool_result. Every tool_use block needs a matching result in the next user turn, or the API errors. Loop over all blocks.
Resolution drift. If Xvfb is 1280×800 but you screenshot the full host screen, coordinates miss. Capture only the virtual display.
No typing delay. Some apps drop characters from instant input. A small --delay on xdotool type fixes flaky text entry.
Running without a step cap. A confused agent loops forever. Wrap the loop in a counter from the start.
Plain-text secrets on screen. Anything visible can be read and typed by the agent. Keep credentials out of the sandbox.

Frequently asked questions

Which model should I use for computer use?

Sonnet 4.6 is the workhorse — fast enough for many-step visual loops with strong UI reasoning. Switch to Opus only when a task demands deeper reasoning over a confusing or novel interface.

Do I have to use Docker?

No, but you want isolation. Docker plus Xvfb is the simplest reproducible option. A cloud VM works when you need real apps or persistent sessions.

Why does the agent stop early sometimes?

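If stop_reason is end_turn, Claude believes the task is done or is blocked. Read its final text: it usually explains, and a clearer goal in the prompt often fixes premature stops. If stop_reason is max_tokens, the reply was cut off before Claude could request its next action, so raise max_tokens.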
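A small helper makes that final message easy to inspect. This is a sketch under my own naming (describe_stop and STOP_HINTS are hypothetical); it assumes response content blocks expose .type and text blocks expose .text, as the loop in Step 5 already relies on. It also covers max_tokens, which means the reply was truncated rather than finished.

```python
# Short, human-readable hints for the common stop reasons.
STOP_HINTS = {
    "end_turn": "model finished or is blocked; read its final text",
    "max_tokens": "reply was truncated; consider raising max_tokens",
    "stop_sequence": "a configured stop sequence was hit",
}


def describe_stop(resp) -> str:
    """Summarize why the loop ended, followed by the model's final text."""
    text = "\n".join(
        block.text for block in resp.content
        if getattr(block, "type", None) == "text"
    ).strip()
    hint = STOP_HINTS.get(resp.stop_reason, "unrecognized stop reason")
    return f"[{resp.stop_reason}] {hint}\n{text}"
```

Print describe_stop(resp) right after the loop exits and the early stops usually explain themselves.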

How do I debug a wrong click?

Log the screenshot Claude saw and the coordinate it returned. Overlay the point on the image. Almost always the cause is a dimension mismatch between the tool declaration and the captured image.
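The logging half of that advice can be sketched with the standard library alone. The helper below is hypothetical (log_step and its file layout are my own choices): it saves the frame, appends the requested input to a JSON Lines file, and flags coordinates that fall outside the declared display. Pass it the screenshot that was sent to Claude before the action, not the one captured after.

```python
import base64
import json
import pathlib


def log_step(log_dir, step, tool_input, screenshot_b64, size=(1280, 800)):
    """Save the frame the model saw and append its requested action to a log."""
    out = pathlib.Path(log_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One PNG per step, named so the files sort in execution order.
    (out / f"step_{step:03d}.png").write_bytes(base64.b64decode(screenshot_b64))
    record = {"step": step, "input": tool_input}
    coord = tool_input.get("coordinate")
    if coord is not None:
        x, y = coord
        # A point outside the declared display is a strong hint of a size mismatch.
        record["in_bounds"] = 0 <= x < size[0] and 0 <= y < size[1]
    with open(out / "actions.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

With the frames and coordinates on disk, overlaying the point in any image viewer or editor is a quick manual check.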

Same loop, on the phone

This perceive-act-verify loop is exactly how CallSphere drives voice and chat agents — they take a turn, call a tool, check the result, and continue until the job is booked. Watch one handle a live call at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
