Build a Claude Computer Use Agent: Step-by-Step Guide
Step-by-step Claude computer use build: set up the display, declare the tool, write the agent loop, dispatch actions, and run a real task end to end.
Plenty of posts explain that Claude can control a computer. Very few hand you the actual loop you need to type into a terminal and watch work. This is the build guide I wish I had on day one: a concrete, end-to-end implementation walkthrough that takes you from an empty directory to a running agent that opens an app, fills a form, and reports back. We will use Python and the Anthropic SDK, but the structure ports to any language with an HTTP client.
Key takeaways
- You need four moving parts: a display, an action executor, a screenshot capture, and the agent loop tying them together.
- The agent loop is a while that runs until stop_reason is no longer tool_use; everything else is plumbing.
- Declare the computer tool with the exact pixel dimensions you screenshot at.
- Always append a tool_result for every tool_use block, in order, or the API rejects the turn.
- Add a step cap and timeout before you run anything real; an unbounded loop burns tokens fast.
Step 1 — Stand up an isolated display
Computer use needs a screen to look at. The cleanest approach for development is a Docker container running a virtual framebuffer so nothing touches your real desktop. Start Xvfb on a display, run a lightweight window manager, and launch the apps you want Claude to drive.
Xvfb :99 -screen 0 1280x800x24 &
export DISPLAY=:99
fluxbox &
xterm &
That gives you a 1280×800 virtual screen. Pin those numbers — you will repeat them in the tool declaration and in the screenshot scaling. Mismatched dimensions are the number-one cause of mis-clicks, so decide on the resolution now and never deviate.
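One way to make "never deviate" hard to violate is to define the geometry once and derive both the Xvfb command and the tool declaration from it. This is a minimal sketch, not part of any SDK: the names DISPLAY_NUM, xvfb_command, and computer_tool are hypothetical, and the dictionary fields simply mirror the tool declaration used later in this guide.

```python
# Single source of truth for the virtual display geometry.
# All names below are illustrative helpers, not library APIs.
DISPLAY_NUM = 99
WIDTH, HEIGHT, DEPTH = 1280, 800, 24


def xvfb_command():
    """Argv for starting Xvfb with the pinned geometry."""
    return ["Xvfb", f":{DISPLAY_NUM}", "-screen", "0",
            f"{WIDTH}x{HEIGHT}x{DEPTH}"]


def computer_tool(tool_type="computer_20250124"):
    """Tool declaration built from the same constants as the display."""
    return {
        "type": tool_type,
        "name": "computer",
        "display_width_px": WIDTH,
        "display_height_px": HEIGHT,
        "display_number": DISPLAY_NUM,
    }
```

With this in place, changing the resolution means editing one line, and the display and the declaration cannot drift apart.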
Step 2 — Write the action executor and screenshots
The executor turns Claude's requested actions into real input events. On Linux, xdotool handles mouse and keyboard cleanly, and ImageMagick's import or scrot captures the screen. Each branch maps one action type to one shell call.
import base64, os, subprocess

SHOT = "/tmp/s.png"

def screenshot():
    # Recent scrot versions do not overwrite an existing file by default,
    # so remove any stale capture first.
    if os.path.exists(SHOT):
        os.remove(SHOT)
    subprocess.run(["scrot", SHOT], check=True)
    with open(SHOT, "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

def act(action, **kw):
    if action == "screenshot":
        return screenshot()
    if action == "left_click":
        x, y = kw["coordinate"]
        subprocess.run(["xdotool", "mousemove", str(x), str(y), "click", "1"])
    elif action == "type":
        subprocess.run(["xdotool", "type", "--delay", "40", kw["text"]])
    elif action == "key":
        subprocess.run(["xdotool", "key", kw["text"]])
    # Every state-changing action reports back with a fresh screenshot.
    return screenshot()
Every state-changing action ends by returning a fresh screenshot. That is deliberate: Claude only perceives the result of an action through the image you send back, so the screenshot is the feedback signal that keeps the agent grounded.
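One gap worth knowing about: an executor shaped like the one above quietly returns a screenshot for any action it does not recognize, so the model may believe a click happened when nothing did. A possible refinement, sketched here under assumptions, is to separate "translate the action into an xdotool command" from "run it", and to fail loudly on unknown actions. The function name xdotool_argv is hypothetical, and the extra action names (right_click, double_click, mouse_move) are assumed to match the tool version you declare.

```python
def xdotool_argv(action, **kw):
    """Translate one computer-tool action into an xdotool argv list.

    Returns None for "screenshot", which needs no input event.
    Raises ValueError for anything this sketch does not handle.
    """
    if action == "screenshot":
        return None
    if action in ("left_click", "right_click", "double_click", "mouse_move"):
        x, y = kw["coordinate"]
        argv = ["xdotool", "mousemove", str(x), str(y)]
        if action == "left_click":
            argv += ["click", "1"]
        elif action == "right_click":
            argv += ["click", "3"]
        elif action == "double_click":
            argv += ["click", "--repeat", "2", "1"]
        return argv  # mouse_move stops after the move
    if action == "type":
        return ["xdotool", "type", "--delay", "40", kw["text"]]
    if action == "key":
        return ["xdotool", "key", kw["text"]]
    raise ValueError(f"unsupported action: {action!r}")
```

Because the function only builds a list, you can unit-test the mapping without a display, then hand the result to subprocess.run in the real executor.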
Step 3 — The agent loop, visualized
The loop is the spine of the whole thing. It calls the Messages API, inspects the response for tool-use blocks, executes them, packages the results, and goes again. Here is the control flow before we write it.
flowchart TD
A["Append user goal"] --> B["Call Messages API"]
B --> C{"stop_reason == tool_use?"}
C -->|No| H["Print final text, exit"]
C -->|Yes| D["For each tool_use block"]
D --> E["Run act() on the action"]
E --> F["Build tool_result with screenshot"]
F --> G["Append results to messages"]
G --> I{"step < max_steps?"}
I -->|Yes| B
I -->|No| H
The two exit doors are the model deciding it is finished (stop_reason changes) and your guard tripping the step cap. Never ship without the second one.
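The flowchart can be sketched as a small driver that knows nothing about the API, which makes both exit doors easy to test with a stubbed model. This is an illustration under assumptions: run_loop, call_model, and handle_tools are hypothetical names, the default limits are arbitrary, and a wall-clock timeout is included alongside the step cap.

```python
import time
from types import SimpleNamespace


def run_loop(call_model, handle_tools, max_steps=25, timeout_s=300.0):
    """Loop until the model stops asking for tools or a guard trips.

    call_model()       -> object with a .stop_reason attribute
    handle_tools(resp) -> executes the requested actions, records results
    Returns (exit_reason, steps_taken).
    """
    deadline = time.monotonic() + timeout_s
    resp = call_model()
    steps = 0
    while resp.stop_reason == "tool_use":
        if steps >= max_steps:
            return "step_cap", steps
        if time.monotonic() >= deadline:
            return "timeout", steps
        handle_tools(resp)
        steps += 1
        resp = call_model()
    return "model_finished", steps


# Stubbed model: asks for a tool twice, then finishes.
_replies = iter([
    SimpleNamespace(stop_reason="tool_use"),
    SimpleNamespace(stop_reason="tool_use"),
    SimpleNamespace(stop_reason="end_turn"),
])
outcome = run_loop(lambda: next(_replies), lambda resp: None)
```

Returning the exit reason, rather than just stopping, lets the caller tell a finished task apart from one that was cut off by a guard.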
Step 4 — Declare the tool and call the API
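Now wire the SDK call. The tools array contains the computer tool with the same 1280×800 you used for Xvfb. The betas argument enables the computer-use tool family. The tool type and beta flag are versioned and tied to specific model generations, so confirm the pair shown here against Anthropic's current documentation for the model you pick.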
from anthropic import Anthropic

client = Anthropic()

tools = [{
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1280,
    "display_height_px": 800,
    "display_number": 99,  # matches the Xvfb display :99 from Step 1
}]

messages = [{"role": "user",
             "content": "Open the text editor and type 'hello from Claude'."}]

resp = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    betas=["computer-use-2025-01-24"],
    messages=messages,
)
Use Sonnet for most computer-use work — it balances speed and capability for the long, multi-step visual loops this produces. Reserve Opus for tasks where reasoning over an ambiguous UI is the bottleneck.
Step 5 — Dispatch tool calls and feed results back
When the response stops with tool_use, iterate the content blocks, run each action, and build a matching tool_result. The result for a screenshot carries an image block; the API requires one result per tool-use block, in the same order.
while resp.stop_reason == "tool_use":
    # Echo the assistant turn back so the tool_use ids stay in the history.
    messages.append({"role": "assistant", "content": resp.content})
    results = []
    for block in resp.content:
        if block.type == "tool_use":
            # block.input already contains "action", so unpack it once;
            # passing it positionally as well would raise a TypeError.
            img = act(**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image",
                             "source": {"type": "base64",
                                        "media_type": "image/png", "data": img}}],
            })
    messages.append({"role": "user", "content": results})
    resp = client.beta.messages.create(
        model="claude-sonnet-4-6", max_tokens=1024,
        tools=tools, betas=["computer-use-2025-01-24"], messages=messages)
That snippet is the entire engine. Add a counter around the while, break at your max step count, and you have a safe, runnable computer-use agent.
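The one-result-per-block rule is easier to keep if result building lives in its own helper, and if a failed action still produces a result instead of an exception that leaves the turn unanswered. The sketch below assumes a hypothetical build_tool_results helper and an execute callable with the same shape as act(); the is_error field on tool_result is how a failed tool call is reported to the model.

```python
from types import SimpleNamespace


def build_tool_results(content, execute):
    """Return one tool_result per tool_use block, in order of appearance.

    execute(**block.input) is expected to return a base64 PNG string.
    A failing action becomes an is_error result rather than a missing one.
    """
    results = []
    for block in content:
        if block.type != "tool_use":
            continue  # text blocks need no reply
        try:
            img = execute(**block.input)
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": [{"type": "image",
                             "source": {"type": "base64",
                                        "media_type": "image/png",
                                        "data": img}}],
            }
        except Exception as exc:  # report the failure to the model
            result = {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": f"{type(exc).__name__}: {exc}",
                "is_error": True,
            }
        results.append(result)
    return results
```

Inside the loop, the whole for-block then collapses to a single call such as build_tool_results(resp.content, act), and the model gets a chance to react to a failed action instead of the script crashing mid-turn.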
Choosing where to run it
Where the display lives shapes everything else about deployment.
| Environment | Setup effort | Best for |
|---|---|---|
| Docker + Xvfb | Low | Local dev, CI, reproducible runs |
| Cloud VM with desktop | Medium | Real apps, persistent sessions |
| Hosted sandbox service | Low (paid) | Scaling many parallel agents |
Start with Docker + Xvfb. It is reproducible, disposable, and keeps a misbehaving agent away from anything that matters.
Common pitfalls
- Forgetting a tool_result. Every tool_use block needs a matching result in the next user turn, or the API errors. Loop over all blocks.
- Resolution drift. If Xvfb is 1280×800 but you screenshot the full host screen, coordinates miss. Capture only the virtual display.
- No typing delay. Some apps drop characters from instant input. A small --delay on xdotool type fixes flaky text entry.
- Running without a step cap. A confused agent loops forever. Wrap the loop in a counter from the start.
- Plain-text secrets on screen. Anything visible can be read and typed by the agent. Keep credentials out of the sandbox.
Frequently asked questions
Which model should I use for computer use?
Sonnet 4.6 is the workhorse — fast enough for many-step visual loops with strong UI reasoning. Switch to Opus only when a task demands deeper reasoning over a confusing or novel interface.
Do I have to use Docker?
No, but you want isolation. Docker plus Xvfb is the simplest reproducible option. A cloud VM works when you need real apps or persistent sessions.
Why does the agent stop early sometimes?
If stop_reason is not tool_use, Claude believes the task is done or is blocked. Read its final text — it usually explains, and a clearer goal in the prompt often fixes premature stops.
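A small helper can make the reason for a stop visible in your logs. This is a sketch with hypothetical names (final_text, describe_stop); it assumes the response exposes stop_reason and a list of content blocks, as in the loop shown earlier. One case worth separating out is max_tokens: with a small limit, a reply can be truncated before the model ever emits its next tool call, which looks like an early stop.

```python
from types import SimpleNamespace


def final_text(content):
    """Join the text blocks of a response, ignoring tool_use blocks."""
    return "\n".join(b.text for b in content if b.type == "text")


def describe_stop(resp):
    """Summarize why the loop would exit on this response."""
    reason = resp.stop_reason
    if reason == "tool_use":
        return "model wants another action; keep looping"
    if reason == "max_tokens":
        return "reply was cut off by max_tokens; raise the limit and retry"
    text = final_text(resp.content) or "(no text returned)"
    if reason == "end_turn":
        return "model considers the turn complete: " + text
    return f"stopped with {reason!r}: " + text


# Example with a stand-in response object.
_resp = SimpleNamespace(
    stop_reason="end_turn",
    content=[SimpleNamespace(type="text", text="I cannot find a text editor.")],
)
summary = describe_stop(_resp)
```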
How do I debug a wrong click?
Log the screenshot Claude saw and the coordinate it returned. Overlay the point on the image. Almost always the cause is a dimension mismatch between the tool declaration and the captured image.
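Before reaching for an image overlay, the dimension check itself can be automated. The sketch below reads the width and height straight from the PNG header using only the standard library, then compares them with the declared display size and checks that the coordinate is in bounds. The function names png_size and check_click are hypothetical, and the 1280×800 default is just the resolution used throughout this guide.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(png_bytes):
    """Read (width, height) from a PNG's IHDR chunk without decoding pixels."""
    if png_bytes[:8] != PNG_SIGNATURE or png_bytes[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR data starts at byte 16: 4-byte width, then 4-byte height, big-endian.
    return struct.unpack(">II", png_bytes[16:24])


def check_click(png_b64, coordinate, declared=(1280, 800)):
    """Return a list of problems; an empty list means the click is plausible."""
    problems = []
    width, height = png_size(base64.b64decode(png_b64))
    if (width, height) != tuple(declared):
        problems.append(
            f"screenshot is {width}x{height}, "
            f"tool declares {declared[0]}x{declared[1]}")
    x, y = coordinate
    if not (0 <= x < declared[0] and 0 <= y < declared[1]):
        problems.append(f"coordinate ({x}, {y}) is outside the declared display")
    return problems
```

Calling check_click on each screenshot and click pair and logging any non-empty result flags a mismatch on the step where it first appears, which is usually far earlier than the step where the task visibly fails.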
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.