
Building a Claude Browser Agent: Automated Web Navigation with Anthropic SDK

Step-by-step guide to building a browser automation agent with Claude Computer Use — from SDK setup and screenshot capture to executing click, type, and scroll actions for real web navigation tasks.

Setting Up the Environment

Building a Claude browser agent requires three components: the Anthropic Python SDK, a browser that can be controlled programmatically for screenshot capture, and an input simulation layer. We will use Playwright for browser management (to launch and screenshot) while letting Claude drive all the navigation decisions.

Start by installing the dependencies:

# requirements.txt
anthropic>=0.39.0
playwright>=1.40.0
Pillow>=10.0.0

Initialize the project:

pip install -r requirements.txt
playwright install chromium

Architecture of the Browser Agent

The agent architecture has three layers:

  1. Browser Manager — Launches a headless or headed Chromium instance, navigates to a starting URL, captures screenshots, and executes low-level browser actions
  2. Action Executor — Translates Claude's computer use tool calls into Playwright mouse and keyboard commands
  3. Agent Loop — Orchestrates the screenshot-action cycle and manages the conversation history with Claude
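Before the full implementation, the three layers can be sketched as a minimal loop with stubbed components. All names here (`StubBrowser`, `agent_loop`, `decide`) are illustrative stand-ins, not part of the code below:

```python
class StubBrowser:
    """Stand-in for the Browser Manager layer."""
    def screenshot(self) -> str:
        return "<base64-png>"

    def click(self, x: int, y: int) -> None:
        print(f"click at ({x}, {y})")

def execute(browser: StubBrowser, action: dict) -> None:
    """Stand-in for the Action Executor layer."""
    if action["action"] == "left_click":
        x, y = action["coordinate"]
        browser.click(x, y)

def agent_loop(browser: StubBrowser, decide, max_steps: int = 5) -> str:
    """Stand-in for the Agent Loop layer: screenshot -> decide -> act."""
    for _ in range(max_steps):
        shot = browser.screenshot()
        action = decide(shot)  # in the real agent, this is a call to Claude
        if action["action"] == "done":
            return "finished"
        execute(browser, action)
    return "max steps reached"

# A fake "model" that clicks once, then declares the task done.
actions = iter([
    {"action": "left_click", "coordinate": (100, 200)},
    {"action": "done"},
])
result = agent_loop(StubBrowser(), lambda shot: next(actions))
print(result)  # finished
```

The real agent below follows exactly this shape; only the `decide` step is replaced by an API call to Claude.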

Here is the complete browser manager:

import asyncio
import base64

from playwright.async_api import async_playwright, Page, Browser

class BrowserManager:
    def __init__(self, width: int = 1280, height: int = 800):
        self.width = width
        self.height = height
        self.browser: Browser | None = None
        self.page: Page | None = None
        self._playwright = None

    async def start(self, url: str = "about:blank"):
        self._playwright = await async_playwright().start()
        self.browser = await self._playwright.chromium.launch(headless=False)
        context = await self.browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self.page = await context.new_page()
        await self.page.goto(url)

    async def screenshot(self) -> str:
        """Capture the current viewport as a base64-encoded PNG."""
        img_bytes = await self.page.screenshot(full_page=False)
        return base64.standard_b64encode(img_bytes).decode()

    async def click(self, x: int, y: int, button: str = "left"):
        await self.page.mouse.click(x, y, button=button)

    async def type_text(self, text: str):
        await self.page.keyboard.type(text, delay=50)

    async def press_key(self, key: str):
        # Claude sends xdotool-style key names; map the common ones
        # to Playwright's equivalents.
        key_map = {"Return": "Enter", "Page_Down": "PageDown", "Page_Up": "PageUp"}
        await self.page.keyboard.press(key_map.get(key, key))

    async def scroll(self, x: int, y: int, direction: str):
        await self.page.mouse.move(x, y)
        if direction in ("up", "down"):
            delta = 300 if direction == "down" else -300
            await self.page.mouse.wheel(0, delta)
        else:  # "left" or "right"
            delta = 300 if direction == "right" else -300
            await self.page.mouse.wheel(delta, 0)

    async def close(self):
        if self.browser:
            await self.browser.close()
        if self._playwright:
            await self._playwright.stop()
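The base64 string that `screenshot()` returns is exactly what the Anthropic API expects inside an image content block. Here is a stdlib-only sketch of that packaging; the bytes below are dummy placeholders standing in for real `page.screenshot()` output:

```python
import base64

# Dummy bytes standing in for page.screenshot() output.
img_bytes = b"\x89PNG fake screenshot bytes"
b64 = base64.standard_b64encode(img_bytes).decode()

# The content-block shape the Messages API expects for an image.
image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": b64,
    },
}

# Decoding the payload recovers the original bytes exactly.
assert base64.b64decode(image_block["source"]["data"]) == img_bytes
```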

The Agent Loop

The agent loop ties everything together. It sends screenshots to Claude, processes tool calls, executes actions, and repeats until the task is done:


import anthropic

class ClaudeBrowserAgent:
    def __init__(self, browser: BrowserManager):
        self.browser = browser
        self.client = anthropic.Anthropic()
        self.messages = []
        self.model = "claude-sonnet-4-20250514"

    async def run(self, task: str, max_steps: int = 30):
        # Send the task and an initial screenshot in a single user message.
        self.messages = [{
            "role": "user",
            "content": [
                {"type": "text", "text": task},
                self._image_block(await self.browser.screenshot()),
            ],
        }]

        for step in range(max_steps):
            # Claude Sonnet 4 pairs with the computer_20250124 tool version,
            # which requires the matching beta flag.
            response = self.client.beta.messages.create(
                model=self.model,
                max_tokens=1024,
                tools=[{
                    "type": "computer_20250124",
                    "name": "computer",
                    "display_width_px": self.browser.width,
                    "display_height_px": self.browser.height,
                }],
                betas=["computer-use-2025-01-24"],
                messages=self.messages,
            )
            self.messages.append({"role": "assistant", "content": response.content})

            if response.stop_reason == "end_turn":
                final_text = next(
                    (b.text for b in response.content if hasattr(b, "text")),
                    "Task complete",
                )
                print(f"Done: {final_text}")
                return final_text

            # Execute every tool call, then reply with tool_results carrying
            # fresh screenshots, so user/assistant roles keep alternating.
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    await self._execute(block.input)
                    await asyncio.sleep(1)  # Wait for the page to render
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": [self._image_block(await self.browser.screenshot())],
                    })
            self.messages.append({"role": "user", "content": tool_results})

        return "Max steps reached"

    @staticmethod
    def _image_block(screenshot_b64: str) -> dict:
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_b64,
            },
        }

    async def _execute(self, action: dict):
        action_type = action.get("action")
        if action_type == "left_click":
            x, y = action["coordinate"]
            await self.browser.click(x, y)
        elif action_type == "type":
            await self.browser.type_text(action["text"])
        elif action_type == "key":
            await self.browser.press_key(action["text"])
        elif action_type == "scroll":
            x, y = action["coordinate"]
            await self.browser.scroll(x, y, action.get("scroll_direction", "down"))
        # "screenshot" requests need no handling here: a fresh screenshot is
        # sent back to Claude after every action anyway.

Running the Agent

Here is how to use the agent for a real web navigation task:

async def main():
    browser = BrowserManager(width=1280, height=800)
    await browser.start("https://news.ycombinator.com")

    agent = ClaudeBrowserAgent(browser)
    result = await agent.run(
        "Find the top story on Hacker News and click on the comments link. "
        "Then tell me how many comments the story has."
    )
    print(result)
    await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

The agent will take a screenshot of the Hacker News homepage, identify the top story, locate the comments link, click it, take another screenshot of the comments page, and report the comment count back to you.

Optimizing Conversation History

A critical performance consideration is managing the message history. Each screenshot consumes a significant number of tokens. If your task requires 20 steps, you are sending 20 high-resolution images in the conversation. This gets expensive and eventually hits context limits.

A practical optimization is to maintain a sliding window of recent screenshots while summarizing older interactions as text:

def _has_image(message: dict) -> bool:
    """Return True if the message carries an image content block."""
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(block, dict) and block.get("type") == "image"
        for block in content
    )

def trim_history(messages: list, keep_last: int = 5) -> list:
    """Keep only the last N screenshot exchanges."""
    trimmed = [messages[0]]  # Keep the original task
    image_exchanges = [m for m in messages[1:] if _has_image(m)]

    if len(image_exchanges) > keep_last:
        trimmed.append({
            "role": "user",
            "content": f"[Previous {len(image_exchanges) - keep_last} "
                       f"steps completed successfully]"
        })

    # Keep the last N exchanges intact (roughly 3 messages per step)
    start_idx = max(1, len(messages) - keep_last * 3)
    trimmed.extend(messages[start_idx:])
    return trimmed
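To make the effect concrete, here is a standalone run of the same sliding-window idea over a synthetic ten-step history (the trim function is repeated inside the snippet so it runs on its own, and the message shapes are simplified stand-ins):

```python
# Sliding-window trim, repeated here so the demo runs standalone.
def has_image(message: dict) -> bool:
    content = message.get("content")
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == "image" for b in content
    )

def trim_history(messages: list, keep_last: int = 5) -> list:
    trimmed = [messages[0]]  # keep the original task
    image_exchanges = [m for m in messages[1:] if has_image(m)]
    if len(image_exchanges) > keep_last:
        trimmed.append({
            "role": "user",
            "content": f"[Previous {len(image_exchanges) - keep_last} "
                       f"steps completed successfully]",
        })
    start_idx = max(1, len(messages) - keep_last * 3)  # ~3 messages per step
    trimmed.extend(messages[start_idx:])
    return trimmed

# Build a synthetic 10-step history: each step adds a screenshot message,
# an assistant reply, and a tool-result message.
def shot(i):
    return {"role": "user", "content": [
        {"type": "image", "source": {"type": "base64",
                                     "media_type": "image/png",
                                     "data": f"<screenshot {i}>"}}]}

history = [{"role": "user", "content": "Find the top story on Hacker News"}]
for i in range(10):
    history += [shot(i),
                {"role": "assistant", "content": f"<tool call {i}>"},
                {"role": "user", "content": f"<tool result {i}>"}]

trimmed = trim_history(history, keep_last=3)
print(len(history), "->", len(trimmed))  # 31 -> 11
```

Ten steps shrink from 31 messages to 11: the task, a one-line summary of the first seven steps, and the last three steps intact.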

FAQ

Can I use a headless browser with Claude Computer Use?

Yes, and it is recommended for server-side deployments. Playwright supports headless mode, and the screenshots are identical to what you would see in a headed browser. Set headless=True when launching the browser.

How do I handle pages that take time to load?

Add a short delay (1-2 seconds) after executing each action before capturing the next screenshot. For pages with dynamic content, you can also use Playwright's wait_for_load_state("networkidle") before taking the screenshot.
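If you want something sturdier than a fixed sleep without reaching for Playwright-specific waits, a generic async polling helper does the job. This is an illustrative utility (`wait_until` is not a Playwright API), shown here with a background task standing in for a page that finishes rendering:

```python
import asyncio
import time

async def wait_until(predicate, timeout: float = 5.0, interval: float = 0.25) -> bool:
    """Poll predicate() until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        await asyncio.sleep(interval)
    return False

# Demo: a background task flips a flag after 0.3s, standing in for a page
# that finishes rendering; wait_until detects it well before the timeout.
state = {"loaded": False}

async def _flip_after(delay: float):
    await asyncio.sleep(delay)
    state["loaded"] = True

async def demo() -> bool:
    flipper = asyncio.create_task(_flip_after(0.3))
    ok = await wait_until(lambda: state["loaded"], timeout=2.0)
    await flipper
    return ok

print(asyncio.run(demo()))  # True
```

In the agent, the predicate could check for a selector's presence or a stable screenshot hash before the next capture.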

What is the cost per step of the agent loop?

Each step involves sending a screenshot image plus the conversation history to Claude. A 1280x800 screenshot typically costs around 1,000-1,500 input tokens. With the conversation context, expect roughly 2,000-5,000 tokens per step. At Claude Sonnet pricing, a 20-step task costs approximately $0.15-$0.40 depending on conversation length.
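That estimate is easy to reproduce with back-of-the-envelope arithmetic. The per-token price below is an assumption based on published Claude Sonnet input pricing of roughly $3 per million input tokens (check current pricing); output tokens add a little on top, which is why the article's range runs slightly higher:

```python
# Rough cost model for a multi-step browser task (input tokens only).
steps = 20
tokens_per_step_low, tokens_per_step_high = 2_000, 5_000
price_per_million_input = 3.00  # USD, assumed Claude Sonnet input rate

low = steps * tokens_per_step_low / 1_000_000 * price_per_million_input
high = steps * tokens_per_step_high / 1_000_000 * price_per_million_input
print(f"~${low:.2f} to ~${high:.2f} in input tokens")  # ~$0.12 to ~$0.30
```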


#ClaudeBrowserAgent #WebAutomation #AnthropicSDK #ComputerUse #AIBrowserAgent #PythonAutomation #AgenticAI


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
