Computer Use Tool: Building Browser Automation Agents

Learn how to build browser automation agents with the OpenAI Agents SDK ComputerTool, implementing the AsyncComputer interface for screenshot capture, mouse clicks, and keyboard input.

Why Computer Use Matters for AI Agents

Most AI agent tools operate on structured APIs — calling functions, querying databases, making HTTP requests. But a massive amount of real-world work happens inside graphical user interfaces: web browsers, desktop applications, legacy systems with no API at all. The Computer Use tool bridges this gap by giving your agent the ability to see a screen, move a mouse, click buttons, and type text — exactly the way a human operator would.

The OpenAI Agents SDK provides the ComputerTool class and a well-defined AsyncComputer interface that lets you plug in any screen environment — a headless browser, a virtual desktop, or a cloud-hosted VM. This post walks through the architecture, the interface contract, and a complete working implementation.

The AsyncComputer Interface

At the heart of computer use is the AsyncComputer interface. It is an abstract base class that you subclass to connect the agent to a specific screen environment. The SDK does not ship with a built-in browser or VM; you provide the backend, and the agent controls it through this contract.

flowchart LR
    INPUT(["User input"])
    AGENT["Agent<br/>name plus instructions"]
    HAND{"Handoff to<br/>another agent?"}
    SUB["Sub-agent<br/>specialist"]
    GUARD{"Guardrail<br/>passed?"}
    TOOL["Tool call"]
    SDK[("Tracing<br/>OpenAI dashboard")]
    OUT(["Final output"])
    INPUT --> AGENT --> HAND
    HAND -->|Yes| SUB --> GUARD
    HAND -->|No| GUARD
    GUARD -->|Yes| TOOL --> AGENT
    GUARD -->|Block| OUT
    AGENT --> OUT
    AGENT --> SDK
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

Here is the interface, shown as a subclass that stubs every member the base class requires. Names and signatures follow the SDK's agents.computer module; confirm them against your installed version:

from agents import AsyncComputer, Button, Environment


class MyComputer(AsyncComputer):
    """Custom computer environment for the agent."""

    @property
    def environment(self) -> Environment:
        """Kind of environment: "browser", "mac", "windows", or "ubuntu"."""
        return "browser"

    @property
    def dimensions(self) -> tuple[int, int]:
        """Return the screen size as (width, height) in pixels."""
        return (1920, 1080)

    async def screenshot(self) -> str:
        """Capture the current screen as a base64-encoded PNG string."""
        ...

    async def click(self, x: int, y: int, button: Button = "left") -> None:
        """Click at the given coordinates; button is a string such as "left"."""
        ...

    async def double_click(self, x: int, y: int) -> None:
        """Double-click at the specified screen coordinates."""
        ...

    async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
        """Scroll by (scroll_x, scroll_y) pixels with the pointer at (x, y)."""
        ...

    async def type(self, text: str) -> None:
        """Type the given text string."""
        ...

    async def wait(self) -> None:
        """Pause briefly so the screen can settle."""
        ...

    async def move(self, x: int, y: int) -> None:
        """Move the pointer to the given coordinates."""
        ...

    async def keypress(self, keys: list[str]) -> None:
        """Press a key combination, e.g. ["ctrl", "c"] or ["enter"]."""
        ...

    async def drag(self, path: list[tuple[int, int]]) -> None:
        """Drag along a path of (x, y) points, first to last."""
        ...

Every method is async because screen operations often involve network calls to remote environments. The screenshot() method is the most critical — the agent uses the returned image to understand the current state of the screen and decide its next action.

Implementing a Playwright-Based Computer

For browser automation, Playwright is an excellent backend. Here is a complete implementation that connects Playwright to the AsyncComputer interface:

import asyncio
import base64

from playwright.async_api import async_playwright, Page
from agents import AsyncComputer, Button, Environment

# Translate model-style key names into the names Playwright expects.
# Keys not listed here are passed through unchanged; extend as needed.
KEY_MAP = {
    "ctrl": "Control", "cmd": "Meta", "alt": "Alt", "shift": "Shift",
    "enter": "Enter", "esc": "Escape", "tab": "Tab", "space": " ",
    "backspace": "Backspace", "delete": "Delete",
}


class PlaywrightComputer(AsyncComputer):
    def __init__(self, width: int = 1280, height: int = 720):
        self.width = width
        self.height = height
        self._page: Page | None = None
        self._playwright = None
        self._browser = None

    async def start(self, url: str = "https://example.com") -> None:
        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch(headless=True)
        context = await self._browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self._page = await context.new_page()
        await self._page.goto(url)

    @property
    def page(self) -> Page:
        # Fail with a clear message instead of an AttributeError on None.
        if self._page is None:
            raise RuntimeError("Browser not started; call start() first")
        return self._page

    @property
    def environment(self) -> Environment:
        return "browser"

    @property
    def dimensions(self) -> tuple[int, int]:
        return (self.width, self.height)

    async def screenshot(self) -> str:
        # Viewport-only PNG, returned as base64 text.
        png_bytes = await self.page.screenshot(type="png")
        return base64.b64encode(png_bytes).decode("utf-8")

    async def click(self, x: int, y: int, button: Button = "left") -> None:
        # Fall back to a left click for buttons this sketch does not handle.
        btn = button if button in ("left", "right") else "left"
        await self.page.mouse.click(x, y, button=btn)

    async def double_click(self, x: int, y: int) -> None:
        await self.page.mouse.dblclick(x, y)

    async def scroll(self, x: int, y: int, scroll_x: int, scroll_y: int) -> None:
        await self.page.mouse.move(x, y)
        await self.page.evaluate(f"window.scrollBy({scroll_x}, {scroll_y})")

    async def type(self, text: str) -> None:
        await self.page.keyboard.type(text)

    async def wait(self) -> None:
        await asyncio.sleep(1)

    async def move(self, x: int, y: int) -> None:
        await self.page.mouse.move(x, y)

    async def keypress(self, keys: list[str]) -> None:
        # Hold the keys in order, then release in reverse, so that
        # ["ctrl", "c"] is sent as a chord rather than two separate presses.
        mapped = [KEY_MAP.get(key.lower(), key) for key in keys]
        for key in mapped:
            await self.page.keyboard.down(key)
        for key in reversed(mapped):
            await self.page.keyboard.up(key)

    async def drag(self, path: list[tuple[int, int]]) -> None:
        if not path:
            return
        await self.page.mouse.move(*path[0])
        await self.page.mouse.down()
        for px, py in path[1:]:
            await self.page.mouse.move(px, py)
        await self.page.mouse.up()

    async def close(self) -> None:
        if self._browser:
            await self._browser.close()
        if self._playwright:
            await self._playwright.stop()

The key design decision here is that screenshot() returns the PNG as a base64-encoded string, which is the form in which the image is passed to the model. The model analyzes the screenshot, identifies UI elements, and decides what coordinates to click or what text to type next.
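
The encoding step can be illustrated with the standard library alone. The sketch below assumes base64 is the transport encoding for the image; encode_screenshot, decode_screenshot, and looks_like_png are hypothetical helper names for illustration, not SDK functions. Checking the PNG signature before encoding is an optional guard that catches a backend returning the wrong image type.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """Return True if the bytes begin with the PNG file signature."""
    return data.startswith(PNG_SIGNATURE)


def encode_screenshot(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as base64 text, rejecting non-PNG input."""
    if not looks_like_png(png_bytes):
        raise ValueError("screenshot is not PNG data")
    return base64.b64encode(png_bytes).decode("ascii")


def decode_screenshot(encoded: str) -> bytes:
    """Reverse encode_screenshot, e.g. to save a screenshot for debugging."""
    return base64.b64decode(encoded)


# Stand-in for real screenshot bytes: the signature plus a dummy payload.
fake_png = PNG_SIGNATURE + b"payload"
encoded = encode_screenshot(fake_png)
assert decode_screenshot(encoded) == fake_png
```

Decoding is mostly useful when debugging: writing the decoded bytes to a file shows exactly what the model was shown at a given step.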

Wiring Up the ComputerTool

Once your AsyncComputer implementation is ready, you connect it to an agent using the ComputerTool class:

import asyncio

from agents import Agent, ComputerTool, ModelSettings, Runner


async def main():
    computer = PlaywrightComputer(width=1280, height=720)
    await computer.start("https://news.ycombinator.com")

    try:
        agent = Agent(
            name="BrowserAgent",
            instructions="""You are a browser automation agent. You can see
            the screen via screenshots and interact using click, type, and
            scroll actions. Complete the user's task step by step.
            Always take a screenshot first to understand the current state.""",
            tools=[
                ComputerTool(computer=computer),
            ],
            # The computer tool is served by a dedicated computer-use model
            # ("computer-use-preview" in the SDK's own example) and expects
            # truncation="auto"; check the current model list before relying
            # on this name.
            model="computer-use-preview",
            model_settings=ModelSettings(truncation="auto"),
        )

        result = await Runner.run(
            agent,
            input="Find the top story on Hacker News and tell me its title and URL",
        )
        print(result.final_output)
    finally:
        # Close the browser even if the run raises.
        await computer.close()


asyncio.run(main())

The ComputerTool wraps your computer implementation and exposes it as a standard agent tool. The model will call screenshot to see the screen, then issue click, type, or scroll commands based on what it observes.

The Agent Loop in Computer Use

Understanding the execution loop is essential for debugging. When the agent runs with a ComputerTool, the loop looks like this:

  1. The agent calls screenshot() to observe the current screen
  2. The model analyzes the image and decides on an action (click, type, scroll)
  3. The agent executes the action via the corresponding AsyncComputer method
  4. The agent calls screenshot() again to verify the result
  5. The model evaluates whether the task is complete or another action is needed
  6. Repeat until the task is done or the agent determines it cannot proceed

This observe-act-verify loop means each step involves at least one vision model call. For complex multi-step workflows, this can generate significant token usage and latency. Plan your agent's instructions to minimize unnecessary exploration.
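
The loop above can be sketched without a browser or a model. This is a toy illustration of the observe-act-verify cycle, not how the SDK implements it: FakeComputer, scripted_model, and run_loop are hypothetical names, the "screenshot" is a plain string, and the "model" is a hard-coded rule that clicks until three clicks have registered.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeComputer:
    """In-memory stand-in for a screen: a click counter plus an action log."""
    clicks: int = 0
    log: list = field(default_factory=list)

    async def screenshot(self) -> str:
        self.log.append("screenshot")
        return f"screen:clicks={self.clicks}"

    async def click(self, x: int, y: int) -> None:
        self.log.append(f"click({x},{y})")
        self.clicks += 1


def scripted_model(observation: str):
    """Hypothetical policy: keep clicking until the screen shows three clicks."""
    if observation == "screen:clicks=3":
        return None  # task complete
    return ("click", 10, 20)


async def run_loop(computer, decide, max_steps: int = 10) -> int:
    """Observe, decide, act; return the number of actions taken."""
    for step in range(max_steps):
        observation = await computer.screenshot()  # observe
        action = decide(observation)               # model picks an action
        if action is None:
            return step
        _, x, y = action
        await computer.click(x, y)                 # act; next pass verifies
    raise RuntimeError("step budget exhausted")


computer = FakeComputer()
steps = asyncio.run(run_loop(computer, scripted_model))
```

Three actions cost four screenshots here, which is the pattern behind the token and latency cost: every action is bracketed by an observation.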

Best Practices for Production Computer Use

Set explicit screen sizes. The model performs better with consistent, known viewport dimensions. Use 1280x720 or 1024x768 rather than variable sizes.
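
A fixed viewport also makes it cheap to sanity-check coordinates before acting on them. The helper below is a minimal sketch under the assumption that coordinates are pixel offsets from the top-left corner; clamp_to_viewport is a hypothetical name, and whether to clamp or to reject out-of-range points is a design choice left to the implementer.

```python
def clamp_to_viewport(
    x: int, y: int, width: int = 1280, height: int = 720
) -> tuple[int, int]:
    """Clamp a point to the valid pixel range of a width x height viewport."""
    if width <= 0 or height <= 0:
        raise ValueError("viewport dimensions must be positive")
    clamped_x = min(max(x, 0), width - 1)
    clamped_y = min(max(y, 0), height - 1)
    return (clamped_x, clamped_y)
```

Calling this at the top of click-style methods keeps a stray coordinate from landing outside the page.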

Add wait logic after actions. After clicking a link or submitting a form, the page needs time to load. Build delays or wait-for-selector logic into your click and type methods to avoid screenshots of partially loaded pages.
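
With Playwright, the usual tools for this are its built-in waits such as page.wait_for_load_state() and page.wait_for_selector(). For backends without an equivalent, a generic polling helper is one option. The sketch below uses only the standard library; wait_until and FakePage are hypothetical names, and FakePage simply reports "loaded" on its third poll so the example runs without a browser.

```python
import asyncio
import time


async def wait_until(predicate, timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll an async predicate until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if await predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)


class FakePage:
    """Stand-in for a page that finishes loading after a few polls."""

    def __init__(self, ready_after: int):
        self.polls = 0
        self.ready_after = ready_after

    async def is_loaded(self) -> bool:
        self.polls += 1
        return self.polls >= self.ready_after


page = FakePage(ready_after=3)
loaded = asyncio.run(wait_until(page.is_loaded, timeout=1.0, interval=0.01))
```

Returning False on timeout, rather than raising, lets the caller decide whether to take the screenshot anyway or report the stall.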

Limit the action loop. Set a maximum number of steps in your runner to prevent the agent from getting stuck in infinite retry loops:

result = await Runner.run(
    agent,
    input="Complete the checkout form",
    max_turns=20,
)

Use structured instructions. Tell the agent exactly what success looks like. Instead of "fill out the form," say "fill out the form with name John Doe, email john@example.com, then click Submit and confirm you see a success message."

Handle errors gracefully. If a screenshot shows an error dialog or unexpected state, the agent should report the issue rather than retrying blindly. Include this in your system instructions.
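
Instructions can be backed by a code-side guard. One possible approach, sketched here as an assumption rather than an SDK feature, is to hash each screenshot and stop when the screen has not changed for several consecutive observations. StuckDetector is a hypothetical name; the comparison is byte-exact, so pages with clocks or animations would need a looser similarity test.

```python
import hashlib


class StuckDetector:
    """Flag a run as stuck when the same screen is seen several times in a row."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._last_digest = None
        self._repeats = 0

    def observe(self, screenshot: bytes) -> bool:
        """Record a screenshot; return True once it has repeated max_repeats times."""
        digest = hashlib.sha256(screenshot).hexdigest()
        if digest == self._last_digest:
            self._repeats += 1
        else:
            self._last_digest = digest
            self._repeats = 1
        return self._repeats >= self.max_repeats
```

A backend could call observe() inside its screenshot method and raise once it returns True, so the run ends with a clear error instead of spending the rest of its turn budget.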

Computer use opens up automation for any interface a human can see and interact with — legacy systems, complex web applications, and workflows that span multiple tools. The AsyncComputer interface keeps your implementation cleanly separated from the agent logic, making it straightforward to swap between Playwright, Selenium, VNC, or any other screen backend.
