
Computer Use Tool: Building Browser Automation Agents

Learn how to build browser automation agents with the OpenAI Agents SDK ComputerTool, implementing the AsyncComputer interface for screenshot capture, mouse clicks, and keyboard input.

Why Computer Use Matters for AI Agents

Most AI agent tools operate on structured APIs — calling functions, querying databases, making HTTP requests. But a massive amount of real-world work happens inside graphical user interfaces: web browsers, desktop applications, legacy systems with no API at all. The Computer Use tool bridges this gap by giving your agent the ability to see a screen, move a mouse, click buttons, and type text — exactly the way a human operator would.

The OpenAI Agents SDK provides the ComputerTool class and a well-defined AsyncComputer interface that lets you plug in any screen environment — a headless browser, a virtual desktop, or a cloud-hosted VM. This post walks through the architecture, the interface contract, and a complete working implementation.

The AsyncComputer Interface

At the heart of computer use is the AsyncComputer protocol. This is an abstract interface you implement to connect the agent to a specific screen environment. The SDK does not ship with a built-in browser or VM — you provide the backend, and the agent controls it through this contract.


Here is the interface definition:

from agents.computers import AsyncComputer, Button, ScreenSize

class MyComputer(AsyncComputer):
    """Custom computer environment for the agent."""

    async def screenshot(self) -> bytes:
        """Capture the current screen state as a PNG image."""
        # Return raw PNG bytes
        ...

    async def click(self, x: int, y: int, button: Button = Button.LEFT) -> None:
        """Click at the specified screen coordinates."""
        ...

    async def double_click(self, x: int, y: int) -> None:
        """Double-click at the specified screen coordinates."""
        ...

    async def type(self, text: str) -> None:
        """Type the given text string."""
        ...

    async def key(self, key_combo: str) -> None:
        """Press a key combination like 'ctrl+c' or 'Enter'."""
        ...

    async def scroll(self, x: int, y: int, direction: str, amount: int) -> None:
        """Scroll at the given position in the specified direction."""
        ...

    async def drag(self, start_x: int, start_y: int, end_x: int, end_y: int) -> None:
        """Drag from one point to another."""
        ...

    async def get_screen_size(self) -> ScreenSize:
        """Return the current screen dimensions."""
        return ScreenSize(width=1920, height=1080)

Every method is async because screen operations often involve network calls to remote environments. The screenshot() method is the most critical — the agent uses the returned image to understand the current state of the screen and decide its next action.
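Because the contract is just a set of async methods, you can stub it for testing before wiring in a real backend. The sketch below is a hypothetical fake that records actions instead of performing them — it deliberately avoids importing the SDK to stay self-contained, and it simplifies `button` to a plain string:

```python
import asyncio

class FakeComputer:
    """Stub computer that records actions instead of performing them."""

    def __init__(self):
        self.actions: list[tuple] = []

    async def screenshot(self) -> bytes:
        # Placeholder bytes; a real stub would return actual PNG data.
        return b"fake-png"

    async def click(self, x: int, y: int, button: str = "left") -> None:
        self.actions.append(("click", x, y, button))

    async def type(self, text: str) -> None:
        self.actions.append(("type", text))

async def demo() -> list[tuple]:
    fake = FakeComputer()
    await fake.click(10, 20)
    await fake.type("hello")
    return fake.actions

actions = asyncio.run(demo())
print(actions)  # [('click', 10, 20, 'left'), ('type', 'hello')]
```

A stub like this lets you unit-test your agent plumbing and action dispatch without launching a browser or VM.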

Implementing a Playwright-Based Computer

For browser automation, Playwright is an excellent backend. Here is a complete implementation that connects Playwright to the AsyncComputer interface:

from playwright.async_api import async_playwright, Page
from agents.computers import AsyncComputer, Button, ScreenSize

class PlaywrightComputer(AsyncComputer):
    def __init__(self, width: int = 1280, height: int = 720):
        self.width = width
        self.height = height
        self._page: Page | None = None
        self._playwright = None
        self._browser = None

    async def start(self, url: str = "https://example.com"):
        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch(headless=True)
        context = await self._browser.new_context(
            viewport={"width": self.width, "height": self.height}
        )
        self._page = await context.new_page()
        await self._page.goto(url)

    async def screenshot(self) -> bytes:
        if not self._page:
            raise RuntimeError("Browser not started")
        return await self._page.screenshot(type="png")

    async def click(self, x: int, y: int, button: Button = Button.LEFT) -> None:
        btn = "left" if button == Button.LEFT else "right"
        await self._page.mouse.click(x, y, button=btn)

    async def double_click(self, x: int, y: int) -> None:
        await self._page.mouse.dblclick(x, y)

    async def type(self, text: str) -> None:
        await self._page.keyboard.type(text)

    async def key(self, key_combo: str) -> None:
        await self._page.keyboard.press(key_combo)

    async def scroll(self, x: int, y: int, direction: str, amount: int) -> None:
        delta_y = -amount * 100 if direction == "up" else amount * 100
        await self._page.mouse.move(x, y)
        # mouse.wheel dispatches a wheel event at the current pointer position
        await self._page.mouse.wheel(0, delta_y)

    async def drag(
        self, start_x: int, start_y: int, end_x: int, end_y: int
    ) -> None:
        await self._page.mouse.move(start_x, start_y)
        await self._page.mouse.down()
        await self._page.mouse.move(end_x, end_y)
        await self._page.mouse.up()

    async def get_screen_size(self) -> ScreenSize:
        return ScreenSize(width=self.width, height=self.height)

    async def close(self):
        if self._browser:
            await self._browser.close()
        if self._playwright:
            await self._playwright.stop()

The key design decision here is that screenshot() returns raw PNG bytes. The SDK handles encoding and passing the image to the model. The model analyzes the screenshot, identifies UI elements, and decides what coordinates to click or what text to type next.
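Raw bytes are also convenient for debugging: you can tee every frame to disk and replay a failed run step by step. A small hypothetical helper (the naming scheme and directory layout are assumptions, not part of the SDK):

```python
import pathlib
import tempfile

class ScreenshotRecorder:
    """Write each frame to disk as step_000.png, step_001.png, ..."""

    def __init__(self, save_dir: str):
        self.dir = pathlib.Path(save_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def record(self, png_bytes: bytes) -> pathlib.Path:
        path = self.dir / f"step_{self.step:03d}.png"
        path.write_bytes(png_bytes)
        self.step += 1
        return path

recorder = ScreenshotRecorder(tempfile.mkdtemp())
first = recorder.record(b"frame-0")
second = recorder.record(b"frame-1")
print(first.name, second.name)  # step_000.png step_001.png
```

Calling `recorder.record(...)` on the bytes inside your `screenshot()` method gives you an ordered trace of exactly what the model saw at each turn.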

Wiring Up the ComputerTool

Once your AsyncComputer implementation is ready, you connect it to an agent using the ComputerTool class:

from agents import Agent, Runner
from agents.computers import ComputerTool
import asyncio

async def main():
    computer = PlaywrightComputer(width=1280, height=720)
    await computer.start("https://news.ycombinator.com")

    agent = Agent(
        name="BrowserAgent",
        instructions="""You are a browser automation agent. You can see
        the screen via screenshots and interact using click, type, and
        scroll actions. Complete the user's task step by step.
        Always take a screenshot first to understand the current state.""",
        tools=[
            ComputerTool(computer=computer),
        ],
        model="computer-use-preview",  # computer use requires this model
    )

    result = await Runner.run(
        agent,
        input="Find the top story on Hacker News and tell me its title and URL",
    )
    print(result.final_output)
    await computer.close()

asyncio.run(main())

The ComputerTool wraps your computer implementation and exposes it as a standard agent tool. The model will call screenshot to see the screen, then issue click, type, or scroll commands based on what it observes.

The Agent Loop in Computer Use

Understanding the execution loop is essential for debugging. When the agent runs with a ComputerTool, the loop looks like this:

  1. The agent calls screenshot() to observe the current screen
  2. The model analyzes the image and decides on an action (click, type, scroll)
  3. The agent executes the action via the corresponding AsyncComputer method
  4. The agent calls screenshot() again to verify the result
  5. The model evaluates whether the task is complete or another action is needed
  6. Repeat until the task is done or the agent determines it cannot proceed

This observe-act-verify loop means each step involves at least one vision model call. For complex multi-step workflows, this can generate significant token usage and latency. Plan your agent's instructions to minimize unnecessary exploration.
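The observe-act-verify loop above can be sketched in plain Python. The model call here is a stub returning canned actions — in the real SDK, the Runner drives this loop for you:

```python
import asyncio

async def fake_screenshot() -> bytes:
    return b"png-bytes"

def fake_model_decide(screenshot: bytes, step: int):
    """Stand-in for the vision model: returns an action, or None when done."""
    plan = [("click", 100, 200), ("type", "hello")]
    return plan[step] if step < len(plan) else None

async def agent_loop(max_turns: int = 20) -> list:
    executed = []
    for step in range(max_turns):
        shot = await fake_screenshot()          # 1. observe the screen
        action = fake_model_decide(shot, step)  # 2. decide on an action
        if action is None:                      # 5. task judged complete
            break
        executed.append(action)                 # 3. execute the action
        # 4. the next iteration's screenshot verifies the result
    return executed

actions = asyncio.run(agent_loop())
print(actions)  # [('click', 100, 200), ('type', 'hello')]
```

Counting iterations in this sketch makes the cost model concrete: every turn is one screenshot plus one vision model call, which is why capping `max_turns` matters.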

Best Practices for Production Computer Use

Set explicit screen sizes. The model performs better with consistent, known viewport dimensions. Use 1280x720 or 1024x768 rather than variable sizes.

Add wait logic after actions. After clicking a link or submitting a form, the page needs time to load. Build delays or wait-for-selector logic into your click and type methods to avoid screenshots of partially loaded pages.
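One way to bake this in without touching every call site is a small decorator that pauses after each action. The sketch below uses a fixed `asyncio.sleep`; with Playwright you would typically replace the sleep with `page.wait_for_load_state()` for smarter waiting:

```python
import asyncio
import time

def settles(delay: float):
    """Decorator: wait `delay` seconds after the wrapped action completes."""
    def wrap(fn):
        async def inner(*args, **kwargs):
            result = await fn(*args, **kwargs)
            # Let the page settle before the next screenshot is taken
            await asyncio.sleep(delay)
            return result
        return inner
    return wrap

class Demo:
    def __init__(self):
        self.clicks = 0

    @settles(0.05)
    async def click(self, x: int, y: int) -> None:
        self.clicks += 1

d = Demo()
start = time.monotonic()
asyncio.run(d.click(1, 2))
elapsed = time.monotonic() - start
print(d.clicks, elapsed >= 0.05)
```

The same wrapper can be applied to `type`, `scroll`, and `drag`, so the post-action delay lives in one place instead of being scattered across the implementation.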

Limit the action loop. Set a maximum number of steps in your runner to prevent the agent from getting stuck in infinite retry loops:

result = await Runner.run(
    agent,
    input="Complete the checkout form",
    max_turns=20,
)

Use structured instructions. Tell the agent exactly what success looks like. Instead of "fill out the form," say "fill out the form with name John Doe, email [email protected], then click Submit and confirm you see a success message."

Handle errors gracefully. If a screenshot shows an error dialog or unexpected state, the agent should report the issue rather than retrying blindly. Include this in your system instructions.

Computer use opens up automation for any interface a human can see and interact with — legacy systems, complex web applications, and workflows that span multiple tools. The AsyncComputer interface keeps your implementation cleanly separated from the agent logic, making it straightforward to swap between Playwright, Selenium, VNC, or any other screen backend.

Written by CallSphere Team