---
title: "Claude Computer Use for Agents: Automating Desktop and Browser Tasks"
description: "Learn how to build agents that use Claude's computer use capability to analyze screenshots, map coordinates, execute mouse and keyboard actions, and verify results on desktop and browser interfaces."
canonical: https://callsphere.ai/blog/claude-computer-use-agents-automating-desktop-browser
category: "Learn Agentic AI"
tags: ["Claude", "Computer Use", "Browser Automation", "Desktop Automation", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T13:46:52.463Z
---

# Claude Computer Use for Agents: Automating Desktop and Browser Tasks

> Learn how to build agents that use Claude's computer use capability to analyze screenshots, map coordinates, execute mouse and keyboard actions, and verify results on desktop and browser interfaces.

## What is Claude Computer Use

Claude computer use allows an AI agent to interact with a computer the way a human does — by looking at screenshots and performing mouse clicks, keyboard input, and scrolling. Instead of calling APIs or parsing HTML, the agent sees the screen as an image and decides what actions to take based on visual understanding.

This capability is useful for automating legacy applications that lack APIs, testing web applications, filling out forms across multiple websites, and any workflow where a human would normally sit at a computer clicking through screens.

## How Computer Use Works

The workflow follows a perception-action loop:

```mermaid
flowchart LR
    GOAL(["High level goal"])
    PLAN["Planner LLM"]
    SCREEN["Screen capture<br/>every step"]
    VLM["Vision LLM<br/>reads UI state"]
    ACT{"Action type"}
    CLICK["Click coordinate"]
    TYPE["Type text"]
    KEY["Keyboard shortcut"]
    GUARD["Safety filter<br/>allow lists"]
    OS[("OS sandbox<br/>ephemeral VM")]
    DONE(["Goal verified"])
    GOAL --> PLAN --> SCREEN --> VLM --> ACT
    ACT --> CLICK --> GUARD
    ACT --> TYPE --> GUARD
    ACT --> KEY --> GUARD
    GUARD --> OS --> SCREEN
    OS --> DONE
    style PLAN fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style DONE fill:#059669,stroke:#047857,color:#fff
```

1. Your code takes a screenshot of the screen
2. The screenshot is sent to Claude as an image
3. Claude analyzes the screenshot and decides what action to take
4. Your code executes that action (click, type, scroll)
5. A new screenshot is taken and the loop repeats

Claude uses a special `computer_20250124` tool that defines the available actions. The tool specification tells Claude the screen dimensions so it can map visual elements to pixel coordinates. At the time of writing, this tool is gated behind the `computer-use-2025-01-24` beta flag, which must be sent with each API request.

## Setting Up the Computer Use Tool

```python
import anthropic
import base64
import subprocess
import json

client = anthropic.Anthropic()

# Define the computer use tool with your screen dimensions
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 0,
}

def take_screenshot() -> str:
    """Capture the screen and return base64-encoded PNG."""
    subprocess.run(["scrot", "/tmp/screenshot.png", "-o"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()
```

The `display_width_px` and `display_height_px` must match your actual screen resolution. Claude uses these dimensions to calculate pixel coordinates for clicks.
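If the screenshots you send are smaller than the live screen (for example, downscaled to save tokens), the coordinates Claude returns refer to the declared display size and must be mapped back to real pixels before clicking. A minimal sketch, assuming an illustrative 1280x800 declared size on a 1920x1080 screen:

```python
# Declared size sent in the tool spec vs. the real screen resolution.
# Both values here are illustrative -- substitute your own.
DECLARED = (1280, 800)
ACTUAL = (1920, 1080)

def scale_coordinate(x: int, y: int) -> tuple[int, int]:
    """Map a coordinate from the declared display space to real pixels."""
    return (
        round(x * ACTUAL[0] / DECLARED[0]),
        round(y * ACTUAL[1] / DECLARED[1]),
    )
```

With this in place, you would call `scale_coordinate(*action["coordinate"])` before each click or move. If the declared and actual sizes match, no scaling is needed.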

## Building the Computer Use Agent Loop

The agent loop sends screenshots to Claude and executes the returned actions:

```python
def run_computer_agent(task: str, max_steps: int = 30):
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        # Computer use is gated behind a beta flag at the time of writing
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[computer_tool],
            betas=["computer-use-2025-01-24"],
            messages=messages,
        )

        # Check if Claude is done
        if response.stop_reason == "end_turn":
            final_text = [b.text for b in response.content if b.type == "text"]
            print(f"Agent completed: {''.join(final_text)}")
            return

        # Record Claude's turn, then execute each requested action.
        # Every tool_use block must be answered with a tool_result, and
        # attaching a fresh screenshot to it is how Claude sees the effect
        # of its action on the next turn.
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_computer_action(block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        messages.append({"role": "user", "content": tool_results})

        print(f"Step {step + 1} completed")

    print("Reached max_steps without completing the task")
```

Note that the first request contains no screenshot: Claude's usual first move is a `screenshot` action, which the loop answers with a fresh capture in the tool result.

## Executing Computer Actions

Claude returns structured action commands that map to system-level input:

```python
import pyautogui
import time

def execute_computer_action(action: dict):
    """Execute a computer use action."""
    action_type = action.get("action")

    if action_type == "mouse_move":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)

    elif action_type == "left_click":
        x, y = action["coordinate"]
        pyautogui.click(x, y)

    elif action_type == "left_click_drag":
        start = action["start_coordinate"]
        end = action["coordinate"]
        pyautogui.moveTo(start[0], start[1])
        pyautogui.drag(end[0] - start[0], end[1] - start[1])

    elif action_type == "double_click":
        x, y = action["coordinate"]
        pyautogui.doubleClick(x, y)

    elif action_type == "right_click":
        x, y = action["coordinate"]
        pyautogui.rightClick(x, y)

    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)

    elif action_type == "key":
        # Claude emits xdotool-style key names (e.g. "Return", "ctrl+s");
        # map the common ones to pyautogui's names before dispatching
        key_map = {"Return": "enter", "Escape": "esc", "BackSpace": "backspace"}
        keys = [key_map.get(k, k.lower()) for k in action["text"].split("+")]
        pyautogui.hotkey(*keys)

    elif action_type == "screenshot":
        pass  # Will be handled by the next loop iteration

    elif action_type == "scroll":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        scroll_val = amount if direction == "up" else -amount
        pyautogui.scroll(scroll_val)

    # Brief pause to let the UI update
    time.sleep(0.5)
```

## Verification Strategies

Reliable computer use agents verify that their actions worked. After each action, the next screenshot shows the result. Add verification prompts to your system message:

```python
system_prompt = """You are a computer use agent. After every action:
1. Wait for the screen to update
2. Verify the action had the expected effect
3. If something unexpected happened, try an alternative approach
4. Never assume an action succeeded without visual confirmation

If you encounter an error dialog or unexpected state, describe what
you see and attempt to recover before continuing."""
```

You can also add automated verification by checking for specific visual elements:

```python
def verify_element_present(screenshot_b64: str, description: str) -> bool:
    """Ask Claude to verify an element is visible on screen."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}
                },
                {"type": "text", "text": f"Is the following element visible? Answer YES or NO: {description}"}
            ]
        }]
    )
    return "YES" in response.content[0].text.upper()
```

## Safety Considerations

Computer use agents can interact with real systems, so safety is critical. Run agents in sandboxed environments like Docker containers or virtual machines. Never give a computer use agent access to sensitive credentials or production systems without human oversight.
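One way to add that oversight in code is an action filter with allow lists, mirroring the safety-filter stage in the diagram above. A minimal sketch; the action names match the handler earlier in this post, while the blocked-hotkey set is illustrative:

```python
# Actions the agent may perform; anything else is rejected outright.
ALLOWED_ACTIONS = {
    "mouse_move", "left_click", "left_click_drag", "double_click",
    "right_click", "type", "key", "scroll", "screenshot",
}

# Keyboard shortcuts that are never allowed (illustrative examples).
BLOCKED_HOTKEYS = {"ctrl+alt+delete", "alt+f4", "cmd+q"}

def is_action_allowed(action: dict) -> bool:
    """Return True only if the action passes the allow lists."""
    action_type = action.get("action")
    if action_type not in ALLOWED_ACTIONS:
        return False
    if action_type == "key" and action.get("text", "").lower() in BLOCKED_HOTKEYS:
        return False
    return True
```

Calling `is_action_allowed` before `execute_computer_action` lets the loop skip and log disallowed actions instead of executing them.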

## FAQ

### What screen resolution should I use for computer use?

Anthropic recommends 1024x768 for optimal performance. Lower resolutions mean smaller screenshots (fewer tokens and lower cost) while still being clear enough for Claude to identify UI elements. Higher resolutions work but increase token usage and cost.
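If you capture at native resolution, you can still downscale before encoding to get the same token savings. A sketch assuming Pillow is installed; the function name is ours, and whatever size you send should match the `display_width_px`/`display_height_px` values in the tool spec (or be scaled back to real pixels before clicking):

```python
import base64
import io

from PIL import Image

def downscale_screenshot(path: str, size: tuple[int, int] = (1024, 768)) -> str:
    """Resize a captured PNG to the declared display size and base64-encode it."""
    img = Image.open(path).convert("RGB").resize(size, Image.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode()
```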

### Can computer use work with web browsers specifically?

Yes, and browsers are one of the most common use cases. Claude can navigate websites, fill forms, click buttons, and read page content from screenshots. For browser-specific automation, consider running the agent inside a headless browser environment with virtual display (Xvfb) for consistent rendering.

### How reliable is coordinate-based clicking?

Claude is surprisingly accurate at mapping visual elements to coordinates, but dynamic content, pop-ups, and animations can cause misclicks. Build retry logic into your agent — if a click does not produce the expected result, Claude can analyze the new screenshot and try again. Using lower resolutions and waiting for page loads both improve reliability.
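That retry logic can be factored into a small wrapper that re-runs an action until a verification callback passes. A sketch; both callables are supplied by the caller, for example a click plus a screenshot check like `verify_element_present`:

```python
from typing import Callable

def with_retries(act: Callable[[], None],
                 verify: Callable[[], bool],
                 retries: int = 2) -> bool:
    """Run `act`, then `verify`; repeat up to `retries` extra times on failure."""
    for _ in range(retries + 1):
        act()
        if verify():
            return True
    return False
```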

---

#Claude #ComputerUse #BrowserAutomation #DesktopAutomation #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/claude-computer-use-agents-automating-desktop-browser
