
Claude Computer Use for Agents: Automating Desktop and Browser Tasks

Learn how to build agents that use Claude's computer use capability to analyze screenshots, map coordinates, execute mouse and keyboard actions, and verify results on desktop and browser interfaces.

What is Claude Computer Use

Claude computer use allows an AI agent to interact with a computer the way a human does — by looking at screenshots and performing mouse clicks, keyboard input, and scrolling. Instead of calling APIs or parsing HTML, the agent sees the screen as an image and decides what actions to take based on visual understanding.

This capability is useful for automating legacy applications that lack APIs, testing web applications, filling out forms across multiple websites, and any workflow where a human would normally sit at a computer clicking through screens.

How Computer Use Works

The workflow follows a perception-action loop:

  1. Your code takes a screenshot of the screen
  2. The screenshot is sent to Claude as an image
  3. Claude analyzes the screenshot and decides what action to take
  4. Your code executes that action (click, type, scroll)
  5. A new screenshot is taken and the loop repeats

Claude uses a special computer_20250124 tool that defines the available actions. The tool specification tells Claude the screen dimensions so it can map visual elements to pixel coordinates.

Setting Up the Computer Use Tool

import anthropic
import base64
import subprocess
import json

client = anthropic.Anthropic()

# Define the computer use tool with your screen dimensions
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 0,
}

def take_screenshot() -> str:
    """Capture the screen and return base64-encoded PNG."""
    subprocess.run(["scrot", "/tmp/screenshot.png", "-o"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

The display_width_px and display_height_px must match your actual screen resolution. Claude uses these dimensions to calculate pixel coordinates for clicks.
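If you downscale screenshots before sending them (Anthropic recommends roughly 1024x768, as the FAQ below notes), Claude's coordinates come back in screenshot space and must be mapped to real screen space before clicking. A minimal sketch, assuming a fixed 1920x1080 display and 1024x768 screenshots; the constant names here are illustrative, not part of the API:

```python
# Illustrative constants: the resolution of the downscaled screenshots sent
# to Claude vs. the real display declared in the tool spec
MODEL_W, MODEL_H = 1024, 768
SCREEN_W, SCREEN_H = 1920, 1080

def scale_coordinate(x: int, y: int) -> tuple[int, int]:
    """Map a coordinate from screenshot space to physical screen space."""
    return round(x * SCREEN_W / MODEL_W), round(y * SCREEN_H / MODEL_H)
```

The center of the screenshot (512, 384) then maps to the center of the display (960, 540). If you send full-resolution screenshots, as the code above does, no scaling is needed.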

Building the Computer Use Agent Loop

The agent loop sends screenshots to Claude and executes the returned actions:

def run_computer_agent(task: str, max_steps: int = 30):
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        # The computer_20250124 tool requires the computer-use beta
        response = client.beta.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[computer_tool],
            betas=["computer-use-2025-01-24"],
            messages=messages,
        )

        # Claude signals completion by ending its turn without a tool call
        if response.stop_reason == "end_turn":
            final_text = [b.text for b in response.content if b.type == "text"]
            print(f"Agent completed: {''.join(final_text)}")
            return

        # Record the assistant turn, then execute each requested action
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                execute_computer_action(block.input)
                # Every tool_use block must be answered with a matching
                # tool_result; return a fresh screenshot so Claude can see
                # the effect of its action
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": [{
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": take_screenshot(),
                        },
                    }],
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})

        print(f"Step {step + 1} completed")

Executing Computer Actions

Claude returns structured action commands that map to system-level input:

import pyautogui
import time

def execute_computer_action(action: dict):
    """Execute a computer use action."""
    action_type = action.get("action")

    if action_type == "mouse_move":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)

    elif action_type == "left_click":
        x, y = action["coordinate"]
        pyautogui.click(x, y)

    elif action_type == "left_click_drag":
        start = action["start_coordinate"]
        end = action["coordinate"]
        pyautogui.moveTo(start[0], start[1])
        pyautogui.drag(end[0] - start[0], end[1] - start[1])

    elif action_type == "double_click":
        x, y = action["coordinate"]
        pyautogui.doubleClick(x, y)

    elif action_type == "right_click":
        x, y = action["coordinate"]
        pyautogui.rightClick(x, y)

    elif action_type == "type":
        pyautogui.typewrite(action["text"], interval=0.02)

    elif action_type == "key":
        pyautogui.hotkey(*action["text"].split("+"))

    elif action_type == "screenshot":
        pass  # Will be handled by the next loop iteration

    elif action_type == "scroll":
        x, y = action["coordinate"]
        pyautogui.moveTo(x, y)
        direction = action.get("direction", "down")
        amount = action.get("amount", 3)
        scroll_val = amount if direction == "up" else -amount
        pyautogui.scroll(scroll_val)

    # Brief pause to let the UI update
    time.sleep(0.5)
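Because block.input arrives as a plain dict, a defensive check before dispatching catches malformed actions early. A small sketch based on the fields the dispatcher above reads; the REQUIRED_FIELDS table is my own bookkeeping, not part of the API:

```python
# Fields that execute_computer_action reads for each action type
REQUIRED_FIELDS = {
    "mouse_move": {"coordinate"},
    "left_click": {"coordinate"},
    "left_click_drag": {"start_coordinate", "coordinate"},
    "double_click": {"coordinate"},
    "right_click": {"coordinate"},
    "type": {"text"},
    "key": {"text"},
    "screenshot": set(),
    "scroll": {"coordinate"},
}

def validate_action(action: dict) -> bool:
    """Return True if the action dict has everything the dispatcher needs."""
    action_type = action.get("action")
    if action_type not in REQUIRED_FIELDS:
        return False
    return REQUIRED_FIELDS[action_type] <= action.keys()
```

Calling validate_action before execute_computer_action turns a cryptic KeyError mid-run into an early, explicit failure you can log and return to Claude.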

Verification Strategies

Reliable computer use agents verify that their actions worked. After each action, the next screenshot shows the result. Add verification prompts to your system message:

system_prompt = """You are a computer use agent. After every action:
1. Wait for the screen to update
2. Verify the action had the expected effect
3. If something unexpected happened, try an alternative approach
4. Never assume an action succeeded without visual confirmation

If you encounter an error dialog or unexpected state, describe what
you see and attempt to recover before continuing."""

You can also add automated verification by checking for specific visual elements:

def verify_element_present(screenshot_b64: str, description: str) -> bool:
    """Ask Claude to verify an element is visible on screen."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": screenshot_b64}
                },
                {"type": "text", "text": f"Is the following element visible? Answer YES or NO: {description}"}
            ]
        }]
    )
    return "YES" in response.content[0].text.upper()
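A retry wrapper can combine the two: act, verify, and try again on failure. A sketch with caller-supplied callables; act_fn and verify_fn are placeholders for your own action and check:

```python
import time

def act_with_verification(act_fn, verify_fn,
                          attempts: int = 3, delay: float = 1.0) -> bool:
    """Run act_fn, then verify_fn; retry up to `attempts` times."""
    for _ in range(attempts):
        act_fn()
        time.sleep(delay)  # give the UI time to settle before checking
        if verify_fn():
            return True
    return False
```

For example, act_with_verification(lambda: execute_computer_action(action), lambda: verify_element_present(take_screenshot(), "the Save dialog")) retries a click until the dialog actually appears.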

Safety Considerations

Computer use agents can interact with real systems, so safety is critical. Run agents in sandboxed environments like Docker containers or virtual machines. Never give a computer use agent access to sensitive credentials or production systems without human oversight.
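One lightweight guardrail is to screen typed text for destructive-looking commands and pause for human approval before executing them. A crude heuristic sketch; the keyword list is illustrative only and nowhere near exhaustive, so treat it as a starting point rather than a security boundary:

```python
# Illustrative denylist -- real deployments need far stricter policies
DESTRUCTIVE_KEYWORDS = ("rm -rf", "sudo", "format", "drop table")

def requires_human_approval(action: dict) -> bool:
    """Flag 'type' actions whose text looks destructive."""
    if action.get("action") != "type":
        return False
    text = action.get("text", "").lower()
    return any(keyword in text for keyword in DESTRUCTIVE_KEYWORDS)
```

In the agent loop, a flagged action would block until an operator confirms, instead of being executed automatically.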

FAQ

What screen resolution should I use for computer use?

Anthropic recommends 1024x768 for optimal performance. Lower resolutions mean smaller screenshots (fewer tokens and lower cost) while still being clear enough for Claude to identify UI elements. Higher resolutions work but increase token usage and cost.
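To see why resolution matters for cost, Anthropic's vision documentation gives an approximation of about (width x height) / 750 tokens per image; exact counts vary, and the API may resize very large images before counting:

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Rough per-screenshot token cost: (w * h) / 750, per Anthropic's docs."""
    return round(width * height / 750)
```

By this estimate a 1024x768 screenshot costs roughly 1,049 tokens, versus about 2,765 at 1920x1080, nearly triple the cost on every step of the loop.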

Can computer use work with web browsers specifically?

Yes, and browsers are one of the most common use cases. Claude can navigate websites, fill forms, click buttons, and read page content from screenshots. For browser-specific automation, consider running a full (headed) browser inside a virtual display such as Xvfb, which gives consistent rendering without requiring a physical monitor.

How reliable is coordinate-based clicking?

Claude is surprisingly accurate at mapping visual elements to coordinates, but dynamic content, pop-ups, and animations can cause misclicks. Build retry logic into your agent — if a click does not produce the expected result, Claude can analyze the new screenshot and try again. Using lower resolutions and waiting for page loads both improve reliability.


#Claude #ComputerUse #BrowserAutomation #DesktopAutomation #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
