How Claude Computer Use Works: Architecture Internals
Inside Claude computer use: the screenshot-action loop, the computer tool, coordinate grounding, and the harness architecture that makes it reliable.
The first time you watch Claude move a mouse pointer across a desktop, click a menu, and type into a text field, it feels like magic. It is not. Underneath, computer use is a disciplined loop of screenshots, coordinate predictions, and tool results that flow back into the model's context. If you want to build something reliable on top of it — rather than a flashy demo that falls apart on the third click — you need to understand exactly how the pieces fit together. This post walks the whole architecture, from the moment a prompt arrives to the moment Claude decides the task is done.
Key takeaways
- Computer use is an agentic loop: Claude requests an action, your harness executes it on a real machine, and the result (usually a screenshot) returns as a tool result.
- The model never touches the OS directly — your code is the actuator, which is also your security boundary.
- Coordinates are predicted by the model from pixels; image resolution and scaling are the single biggest source of mis-clicks.
- State lives in the conversation history, not the machine; the screenshot after each action is how Claude perceives the world.
- A robust harness needs a display layer, an action executor, a screenshot pipeline, and a stop condition — get all four right and the rest is prompt design.
What computer use actually is
Computer use is a capability where Claude controls a graphical computer the way a person would — by looking at the screen and issuing mouse and keyboard actions — rather than calling a structured API. In practical terms, Anthropic ships a computer tool definition with a fixed schema; you implement the other side. Claude emits actions like screenshot, mouse_move, left_click, type, and key, each with parameters, and your harness turns those into real input events on a display.
The crucial mental model: Claude is the planner and the eyes; your harness is the hands. The model decides "click the blue Submit button near the bottom" and translates that intent into an (x, y) coordinate. Your code executes the click on a live X server, virtual framebuffer, or remote desktop and sends back proof of what happened. There is no hidden channel — everything Claude knows about the screen comes from images you pass in.
The screenshot-action loop
At the heart of every computer-use session is a single repeating cycle. The model takes a turn, asks for one or more actions, your harness runs them, captures the new screen state, and feeds it back. The loop continues until Claude stops requesting tool calls or your harness trips a guard. This is the architecture you are really building.
flowchart TD
A["User goal + system prompt"] --> B["Claude plans next action"]
B --> C{"Action type?"}
C -->|screenshot| D["Capture display"]
C -->|click / type / key| E["Inject input event on OS"]
E --> D
D --> F["Return image as tool_result"]
F --> G{"Goal met or guard tripped?"}
G -->|No| B
G -->|Yes| H["Return final text answer"]
Notice that nearly every action ends with a screenshot. Claude is effectively blind between turns, so the screenshot is its only feedback signal. If you skip the screenshot to save tokens, the model is acting on stale perception and will drift. The discipline of "act, then look" is what keeps long sessions on track.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
The computer tool and coordinate grounding
The computer tool is declared with a display width, height, and an optional display number. Those dimensions matter enormously. Claude predicts coordinates in the pixel space of the image it receives, so if you tell the tool the display is 1280×800 but actually send a 1920×1080 screenshot, every coordinate will be off by the scaling ratio and clicks will land in the wrong place.
{
"type": "computer_20250124",
"name": "computer",
"display_width_px": 1280,
"display_height_px": 800,
"display_number": 1
}
The reliable pattern is to standardize on a target resolution well within the model's comfortable range, render the real desktop, then downscale screenshots to exactly the declared dimensions before sending. When Claude returns a click at (640, 410), your harness maps it back to native coordinates if you upscaled. Keeping the declared size and the image size identical removes an entire class of bugs.
The orchestration layer your code owns
Around the model loop sits the infrastructure you are responsible for. A production harness has four parts: a display layer (a virtual framebuffer like Xvfb, a container with a desktop, or a remote VM); an action executor that translates tool calls into OS events (xdotool, pyautogui, or a platform API); a screenshot pipeline that captures, scales, and encodes images; and an agent loop that calls the Messages API, dispatches tool calls, appends tool results, and enforces stop conditions.
The agent loop is where most of the engineering lives. It maintains the message list, detects stop_reason: "tool_use", executes each requested action in order, builds the corresponding tool_result blocks, and re-invokes the model. It also enforces a maximum step count, a wall-clock timeout, and an action allowlist. Treat that loop as the kernel of your system — everything else plugs into it.
Why state lives in the transcript, not the machine
One subtle architectural fact trips up newcomers: Claude has no memory of the machine between API calls beyond what is in the conversation. The desktop is stateful — files move, windows open — but the model only perceives that state through the screenshots accumulated in the transcript. This has two consequences. First, the context window fills with images quickly; a 30-step task can carry dozens of screenshots, and you often want to prune older ones to control cost. Second, recovery is purely visual: if an action fails, the next screenshot shows the unchanged screen, and Claude can notice the failure and retry — but only if you actually send that screenshot back.
Comparing computer use to structured tool calls
It helps to see where this architecture is the right choice versus a normal API tool. Computer use is general but slow and probabilistic; structured tools are fast and exact but require an interface to exist.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
| Dimension | Computer use | Structured tool / API |
|---|---|---|
| Interface needed | Just a screen | A defined endpoint |
| Reliability | Probabilistic, can mis-click | Deterministic |
| Speed per step | Seconds (screenshot round-trips) | Milliseconds |
| Best for | Legacy GUIs, no API exists | Anything with an API |
The honest guidance: reach for an MCP server or REST tool whenever the underlying system exposes one, and reserve computer use for the cases where the only interface a human (or Claude) gets is the pixels on the screen.
Common pitfalls
- Mismatched dimensions. Declaring one display size and sending screenshots at another scrambles every coordinate. Pin them to the same number.
- Skipping post-action screenshots. Without fresh visual feedback, Claude acts blind and compounds errors. Always return a screenshot after a state-changing action.
- Unbounded loops. A model stuck on a confusing UI can loop forever. Always cap step count and wall-clock time in the harness.
- Letting context explode. Dozens of full-resolution screenshots blow past budget. Prune or summarize older images and keep only the recent ones.
- No security boundary. Your executor runs real input on a real machine. Sandbox it; never point it at a production desktop with credentials in plain sight.
Ship a working harness in 6 steps
- Stand up an isolated display — a container running Xvfb plus a window manager works well.
- Pick a fixed target resolution and declare the
computertool with those exact dimensions. - Write the action executor mapping each tool action to an OS input call.
- Build the screenshot pipeline: capture, downscale to target size, base64-encode as PNG.
- Implement the agent loop: call the Messages API, dispatch tool calls, append tool results, repeat.
- Add guards — max steps, timeout, action allowlist — then test on a real task end to end.
Frequently asked questions
Does Claude run code on my machine directly during computer use?
No. Claude only emits tool-call requests. Your harness executes them, which means you control exactly what is allowed and you own the security boundary. Keep the executor sandboxed.
How does Claude know where to click?
It predicts pixel coordinates from the screenshot you send. Accuracy depends heavily on the image matching the declared display dimensions, so keep those identical and avoid odd aspect ratios.
Why is every step so slow compared to an API call?
Each step involves a full screenshot round-trip through the model, which processes a sizable image. That is inherent to visual control; prefer structured tools whenever a real API exists.
Can I reduce token cost on long sessions?
Yes — prune older screenshots from the transcript, downscale images, and keep only the most recent visual state plus a short text summary of progress.
From desktops to dial tones
The screenshot-act-observe loop that powers computer use is the same agentic pattern CallSphere runs on voice and chat — assistants that perceive a conversation, take tool-backed actions mid-call, and confirm the result before moving on. See it answering real calls at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.