
Claude Computer Use for Agents: Automating Desktop and Browser Tasks

Learn how to build agents that use Claude's computer use capability to analyze screenshots, map coordinates, execute mouse and keyboard actions, and verify results on desktop and browser interfaces.

What is Claude Computer Use

Claude computer use allows an AI agent to interact with a computer the way a human does — by looking at screenshots and performing mouse clicks, keyboard input, and scrolling. Instead of calling APIs or parsing HTML, the agent sees the screen as an image and decides what actions to take based on visual understanding.

This capability is useful for automating legacy applications that lack APIs, testing web applications, filling out forms across multiple websites, and any workflow where a human would normally sit at a computer clicking through screens.

How Computer Use Works

The workflow follows a perception-action loop:

  1. Your code takes a screenshot of the screen
  2. The screenshot is sent to Claude as an image
  3. Claude analyzes the screenshot and decides what action to take
  4. Your code executes that action (click, type, scroll)
  5. A new screenshot is taken and the loop repeats

Claude uses a special computer_20250124 tool that defines the available actions. The tool specification tells Claude the screen dimensions so it can map visual elements to pixel coordinates.

Setting Up the Computer Use Tool

import anthropic
import base64
import subprocess
import json

client = anthropic.Anthropic()

# Define the computer use tool with your screen dimensions
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 0,
}

def take_screenshot() -> str:
    """Capture the screen and return base64-encoded PNG."""
    subprocess.run(["scrot", "/tmp/screenshot.png", "-o"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

The display_width_px and display_height_px must match your actual screen resolution. Claude uses these dimensions to calculate pixel coordinates for clicks.
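If you downscale screenshots before sending them (Anthropic recommends roughly 1024x768, as the FAQ below notes), Claude's coordinates come back in screenshot space and must be mapped to real screen space before clicking. A minimal sketch, assuming a fixed 1920x1080 display and 1024x768 screenshots; the constant names here are illustrative, not part of the API:

```python
# Illustrative constants: the resolution of the downscaled screenshots sent
# to Claude vs. the real display declared in the tool spec
MODEL_W, MODEL_H = 1024, 768
SCREEN_W, SCREEN_H = 1920, 1080

def scale_coordinate(x: int, y: int) -> tuple[int, int]:
    """Map a coordinate from screenshot space to physical screen space."""
    return round(x * SCREEN_W / MODEL_W), round(y * SCREEN_H / MODEL_H)
```

The center of the screenshot (512, 384) then maps to the center of the display (960, 540). If you send full-resolution screenshots, as the code above does, no scaling is needed.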

Building the Computer Use Agent Loop

The agent loop sends screenshots to Claude and executes the returned actions:

def run_computer_agent(task: str, max_steps: int = 30):
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        # The computer_20250124 tool requires the computer-use beta
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[computer_tool],
            betas=["computer-use-2025-01-24"],
            messages=messages,
        )

        # Claude signals completion by ending its turn without a tool call
        if response.stop_reason == "end_turn":
            final_text = [b.text for b in response.content if b.type == "text"]
            print(f"Agent completed: {''.join(final_text)}")
            return

        # Record the assistant turn, then execute each requested action
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_computer_action(block.input)
                # Every tool_use block must be answered with a matching
                # tool_result; return a fresh screenshot so Claude can see
                # the effect of its action
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})

        print(f"Step {step + 1} completed")

Executing Computer Actions

Claude returns structured action commands that map to system-level input:

import pyautogui
import time

def execute_computer_action(action: dict):
    """Execute a computer use action."""
    action_type = action.get("action")

    if action_type == "mouse_move":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)

    elif action_type == "left_click":
        x, y = action["coordinate"]
        pyautogui.click(x, y)

    elif action_type == "left_click_drag":
        start = action["start_coordinate"]
        end = action["coordinate"]
        pyautogui.moveTo(start[0], start[1])
        pyautogui.drag(end[0] - start[0], end[1] - start[1])

    elif action_type == "double_click":
        x, y = action["coordinate"]
        pyautogui.doubleClick(x, y)

    elif action_type == "right_click":
        x, y = action["coordinate"]
        pyautogui.rightClick(x, y)

    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)

    elif action_type == "key":
        pyautogui.hotkey(*action["text"].split("+"))

    elif action_type == "screenshot":
        pass  # Will be handled by the next loop iteration

    elif action_type == "scroll":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        scroll_val = amount if direction == "up" else -amount
        pyautogui.scroll(scroll_val)

    # Brief pause to let the UI update
    time.sleep(0.5)
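Because block.input arrives as a plain dict, a defensive check before dispatching catches malformed actions early. A small sketch based on the fields the dispatcher above reads; the REQUIRED_FIELDS table is my own bookkeeping, not part of the API:

```python
# Fields that execute_computer_action reads for each action type
REQUIRED_FIELDS = {
    "mouse_move": {"coordinate"},
    "left_click": {"coordinate"},
    "left_click_drag": {"start_coordinate", "coordinate"},
    "double_click": {"coordinate"},
    "right_click": {"coordinate"},
    "type": {"text"},
    "key": {"text"},
    "screenshot": set(),
    "scroll": {"coordinate"},
}

def validate_action(action: dict) -> bool:
    """Return True if the action dict has everything the dispatcher needs."""
    action_type = action.get("action")
    if action_type not in REQUIRED_FIELDS:
        return False
    return REQUIRED_FIELDS[action_type] <= action.keys()
```

Calling validate_action before execute_computer_action turns a cryptic KeyError mid-run into an early, explicit failure you can log and return to Claude.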

Verification Strategies

Reliable computer use agents verify that their actions worked. After each action, the next screenshot shows the result. Add verification prompts to your system message:

system_prompt = """You are a computer use agent. After every action:
1. Wait for the screen to update
2. Verify the action had the expected effect
3. If something unexpected happened, try an alternative approach
4. Never assume an action succeeded without visual confirmation

If you encounter an error dialog or unexpected state, describe what
you see and attempt to recover before continuing."""

You can also add automated verification by checking for specific visual elements:

def verify_element_present(screenshot_b64: str, description: str) -> bool:
    """Ask Claude to verify an element is visible on screen."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}
                },
                {"type": "text", "text": f"Is the following element visible? Answer YES or NO: {description}"}
            ]
        }]
    )
    return "YES" in response.content[0].text.upper()
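A retry wrapper can combine the two: act, verify, and try again on failure. A sketch with caller-supplied callables; act_fn and verify_fn are placeholders for your own action and check:

```python
import time

def act_with_verification(act_fn, verify_fn,
                          attempts: int = 3, delay: float = 1.0) -> bool:
    """Run act_fn, then verify_fn; retry up to `attempts` times."""
    for _ in range(attempts):
        act_fn()
        time.sleep(delay)  # give the UI time to settle before checking
        if verify_fn():
            return True
    return False
```

For example, act_with_verification(lambda: execute_computer_action(action), lambda: verify_element_present(take_screenshot(), "the Save dialog")) retries a click until the dialog actually appears.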

Safety Considerations

Computer use agents can interact with real systems, so safety is critical. Run agents in sandboxed environments like Docker containers or virtual machines. Never give a computer use agent access to sensitive credentials or production systems without human oversight.
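One lightweight guardrail is to screen typed text for destructive-looking commands and pause for human approval before executing them. A crude heuristic sketch; the keyword list is illustrative only and nowhere near exhaustive, so treat it as a starting point rather than a security boundary:

```python
# Illustrative denylist -- real deployments need far stricter policies
DESTRUCTIVE_KEYWORDS = ("rm -rf", "sudo", "format", "drop table")

def requires_human_approval(action: dict) -> bool:
    """Flag 'type' actions whose text looks destructive."""
    if action.get("action") != "type":
        return False
    text = action.get("text", "").lower()
    return any(keyword in text for keyword in DESTRUCTIVE_KEYWORDS)
```

In the agent loop, a flagged action would block until an operator confirms, instead of being executed automatically.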

FAQ

What screen resolution should I use for computer use?

Anthropic recommends 1024x768 for optimal performance. Lower resolutions mean smaller screenshots (fewer tokens and lower cost) while still being clear enough for Claude to identify UI elements. Higher resolutions work but increase token usage and cost.
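To see why resolution matters for cost, Anthropic's vision documentation gives an approximation of about (width x height) / 750 tokens per image; exact counts vary, and the API may resize very large images before counting:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Rough per-screenshot token cost: (w * h) / 750, per Anthropic's docs."""
    return round(width * height / 750)
```

By this estimate a 1024x768 screenshot costs roughly 1,049 tokens, versus about 2,765 at 1920x1080, nearly triple the cost on every step of the loop.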

Can computer use work with web browsers specifically?

Yes, and browsers are one of the most common use cases. Claude can navigate websites, fill forms, click buttons, and read page content from screenshots. For browser-specific automation, consider running a full (headed) browser inside a virtual display such as Xvfb, which gives consistent rendering without requiring a physical monitor.

How reliable is coordinate-based clicking?

Claude is surprisingly accurate at mapping visual elements to coordinates, but dynamic content, pop-ups, and animations can cause misclicks. Build retry logic into your agent — if a click does not produce the expected result, Claude can analyze the new screenshot and try again. Using lower resolutions and waiting for page loads both improve reliability.


#Claude #ComputerUse #BrowserAutomation #DesktopAutomation #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
