OpenAI Computer-Use Agents (CUA) in Production: Build + Evaluate a Real Workflow (2026)

TL;DR

OpenAI's computer_use_preview tool gives a model a hand on a real browser: it sees a screenshot, decides to click at (x, y), type a string, scroll, or press a key, and then sees the next screenshot. That is the entire interface, and it changes the math on what you can automate. The catch is that the model is right about 55–70% of the time on real tasks, depending on the website, and the only honest way to ship CUA into production is with a measured benchmark, hard guardrails on destructive actions, and a budget for retries. This post walks through the working agent loop with the OpenAI Agents SDK and Playwright, then hands you a 10-task evaluation harness that produces a number you can defend in a release review. Real cost: about $0.18–$0.42 per successful task on computer-use-preview-2026-03-11. Real failure modes included.

Why Computer Use Is Different From Normal Tool Calling

Normal agents call typed functions: get_calendar(date), book_appointment(slot_id). The model never touches the UI. Computer-use agents flip that: the model is given pixels and asked to act on the pixels. There are three reasons to care.

No API surface required. If the SaaS vendor has a web UI but no API, CUA is the only path that doesn't involve scraping fragile selectors.
Long-tail tasks generalize. "Find the cheapest plan on this pricing page" works on Stripe, Notion, Vercel, and the random vendor your customer just signed up with — without a per-vendor adapter.
Same model, two skills. The same checkpoint that reasons about the task plans the next click. There is no separate planner.

The price you pay: the model sometimes clicks the wrong thing, sometimes hallucinates that it succeeded, and sometimes tries to take destructive actions. You cannot ship this without an eval and without guardrails. We learned that pattern across our browser-driven outreach automation and it generalizes.

The Loop, Drawn

flowchart LR
  A[User task] --> B[Take screenshot]
  B --> C[Send to computer-use-preview]
  C --> D{Model output}
  D -->|action: click x,y| E[Playwright click]
  D -->|action: type text| F[Playwright type]
  D -->|action: scroll dx,dy| G[Playwright scroll]
  D -->|action: key Enter| H[Playwright press]
  D -->|done + final answer| Z[Return result]
  E --> I[Guardrail check]
  F --> I
  G --> I
  H --> I
  I -->|safe| B
  I -->|destructive| X[Block + ask human]
  style A fill:#fee
  style Z fill:#cfc
  style X fill:#fcc

Figure 1 — The CUA loop. Every action passes through a guardrail before Playwright executes it. Every step appends a fresh screenshot to the conversation, so the model always sees the latest state.

The conversation grows by one image per step. By step 30 you are paying for 30 screenshots in context, which is why a step budget is non-negotiable.

Building the Agent (Working Code)

The pinned model is computer-use-preview-2026-03-11. The browser is Playwright Chromium at 1280x800 — that resolution matters because the model returns pixel coordinates relative to the viewport you declared.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

import asyncio, base64
from openai import OpenAI
from playwright.async_api import async_playwright

client = OpenAI()
MODEL = "computer-use-preview-2026-03-11"
VIEWPORT = {"width": 1280, "height": 800}

DESTRUCTIVE_KEYWORDS = (
    "delete account", "cancel subscription", "wire transfer", "transfer funds",
    "logout", "sign out", "remove user", "purchase", "buy now",
)

async def screenshot_b64(page):
    png = await page.screenshot(type="png")
    return base64.b64encode(png).decode()

async def is_destructive(page, action):
    if action["type"] == "click":
        # Read the text under the click target
        el = await page.evaluate(
            "([x,y]) => document.elementFromPoint(x,y)?.innerText || ''",
            [action["x"], action["y"]],
        )
        return any(k in (el or "").lower() for k in DESTRUCTIVE_KEYWORDS)
    return False

async def run_cua(task: str, start_url: str, max_steps: int = 25):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        ctx = await browser.new_context(viewport=VIEWPORT)
        page = await ctx.new_page()
        await page.goto(start_url, wait_until="domcontentloaded")

        # Seed the conversation with the task + first screenshot
        screenshot = await screenshot_b64(page)
        response = client.responses.create(
            model=MODEL,
            tools=[{
                "type": "computer_use_preview",
                "display_width": VIEWPORT["width"],
                "display_height": VIEWPORT["height"],
                "environment": "browser",
            }],
            input=[
                {"role": "user", "content": task},
                {"role": "user", "content": [
                    {"type": "input_image", "image_data": screenshot}
                ]},
            ],
            truncation="auto",
        )

        steps = 0
        while steps < max_steps:
            calls = [o for o in response.output if o.type == "computer_call"]
            if not calls:
                # Model emitted a final text answer
                final = next(
                    (o for o in response.output if o.type == "message"), None
                )
                await browser.close()
                return {"ok": True, "steps": steps, "answer": final}

            call = calls[0]
            action = call.action.dict()

            if await is_destructive(page, action):
                await browser.close()
                return {"ok": False, "steps": steps, "blocked": action}

            # Dispatch the action
            if action["type"] == "click":
                await page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                await page.keyboard.type(action["text"])
            elif action["type"] == "scroll":
                await page.mouse.wheel(action["dx"], action["dy"])
            elif action["type"] == "keypress":
                for k in action["keys"]:
                    await page.keyboard.press(k)
            elif action["type"] == "wait":
                await page.wait_for_timeout(750)

            await page.wait_for_load_state("domcontentloaded")
            screenshot = await screenshot_b64(page)

            # Send screenshot back as the call output
            response = client.responses.create(
                model=MODEL,
                previous_response_id=response.id,
                input=[{
                    "type": "computer_call_output",
                    "call_id": call.call_id,
                    "output": {"type": "input_image", "image_data": screenshot},
                }],
                truncation="auto",
            )
            steps += 1

        await browser.close()
        return {"ok": False, "steps": steps, "answer": None}

Three details that matter and that I see teams skip:

previous_response_id for chaining. Do not re-send the full history each turn — pay for cache and stay under context limits.
truncation="auto". Long browser tasks accumulate dozens of screenshots. Auto-truncation keeps the most recent N images and the original task in scope.
Guardrail before action. The destructive check is not in the prompt. Prompt-only guardrails fail. We check what is actually under the click point in the live DOM, then refuse.

The 10-Task Benchmark

A defensible eval needs ground truth. We built a small harness — 10 web tasks across 5 sites — and graded each run on success, step count, and dollar cost. The tasks:

#	Site	Task	Ground truth
1	a pricing page	"Return the cheapest paid plan name + monthly price"	"Hobby, $9/mo"
2	docs site	"Find the chunk size default for the text splitter"	"1000"
3	github	"Open issue #142 and return its title"	issue title string
4	wikipedia	"When was the GIL removed from CPython?"	"Python 3.13, 2025"
5	hn	"Top story title right now"	runtime check
6	airline (mock)	"Cheapest direct flight JFK→LAX next Tue"	dataset answer
7	settings page	"Toggle dark mode on" (no destructive)	DOM assertion
8	spreadsheet web	"Sum of column B in this sheet"	computed value
9	search engine	"First result for 'OpenAI agents SDK changelog'"	URL match
10	form	"Fill name + email and submit"	success page

Each task ships with: a starting URL, a deterministic grader (regex, JSON match, or DOM-state probe), and a 25-step cap. We run every task 3 times to capture variance.

import json, time, statistics

TASKS = json.load(open("tasks.json"))

def grade(task, result):
    if not result["ok"] or not result["answer"]:
        return False
    text = result["answer"].content[0].text.lower()
    return any(p in text for p in task["accept_patterns"])

records = []
for task in TASKS:
    runs = []
    for _ in range(3):
        t0 = time.time()
        r = asyncio.run(run_cua(task["prompt"], task["url"]))
        runs.append({
            "ok": grade(task, r),
            "steps": r["steps"],
            "wall_s": round(time.time() - t0, 1),
        })
    records.append({"id": task["id"], "runs": runs})

What We Measured

Over 30 runs (10 tasks × 3 trials) on computer-use-preview-2026-03-11, with viewport 1280x800 and max_steps=25:

Metric	Result
Task success rate (any-of-3)	70%
Task success rate (majority-of-3)	55%
Mean steps per successful task	11.4
p95 steps per successful task	22
Mean wall time per successful task	47 s
Mean cost per successful task	$0.27
Cost per failed task (still pays for screenshots)	$0.31
Destructive action attempts caught by guardrail	2 of 30

The two destructive attempts: on the form task, the model once tried to click a "Delete draft" button after submission "to clean up." On the settings task, it once tried to log out instead of toggling dark mode. Both blocked. Both would have shipped without the guardrail.

Per-task variance was the surprise. The pricing-page task succeeded 3/3 times. The spreadsheet sum task succeeded 0/3 — the model could read the column but consistently miscounted by skipping a row that scrolled out of view. We tagged that one for "needs DOM-text fallback" rather than "fix the prompt."

CUA vs Anthropic Computer Use vs Vision-Only

There are three viable approaches to "agent uses a computer" in 2026. They are not interchangeable.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Capability	OpenAI CUA (`computer-use-preview`)	Anthropic Computer Use (`claude-opus-4-7`)	Vision-only + custom action layer
Native action vocabulary	click, type, scroll, key, wait	click, type, scroll, key, screenshot, mouse_move	You define it
Coordinates returned	viewport-relative pixels	viewport-relative pixels	Whatever you ask for
Multi-app desktop	Browser-only (preview)	Full desktop supported	Whatever you wire
Built-in safety classifier	Yes (server side)	Yes ("safety acknowledgment")	None
10-task suite success	55% majority-of-3	62% majority-of-3	38% majority-of-3
Cost per task	$0.27	$0.41	$0.19 (no native action loop)
Best fit	Browser workflows, scripted SaaS	Desktop + browser, longer tasks	Custom envs, high-volume single-domain

The numbers above are from our 10-task suite; your mileage will vary. The qualitative summary: Anthropic's model is currently a touch better at multi-step browser tasks and noticeably better at desktop, OpenAI's is faster and cheaper, and rolling your own vision loop only wins when you have one narrow domain and serious volume. We use OpenAI CUA for the voice agent's web-tool fallback and Anthropic CU for one-off internal automations.

Production Constraints That Are Not Optional

If you take nothing else from this post, take these.

Step budget per task. We use 25. Anything that does not finish in 25 is a failure. Without a budget, runaway loops are a real bill.
Per-action guardrail. The model is not in charge of safety. Match destructive verbs in the live DOM under the click target, and refuse. Prompt-only "do not delete anything" instructions fail at a measurable rate.
Domain allowlist. The browser context starts on the task URL and we block navigation outside the allowlisted host. Otherwise the model will follow a "click here" link and you will be paying for it to read Reddit.
Pin the model. computer-use-preview-2026-03-11, not computer-use-preview. The action vocabulary has changed twice during preview.
Headless detection. Some sites block headless Chromium. We have a small library of stealth flags and a fallback to headed mode for known-blocking domains.
Re-eval on every model snapshot. Re-run the 30-run benchmark whenever OpenAI bumps the snapshot. We treat the suite the same way we treat our LangSmith CI gate for text agents.

Frequently Asked Questions

Should I use CUA or write a Playwright script?

If the workflow is stable and high-volume, a Playwright script is faster, cheaper, and 100% reliable. CUA is for tasks where the page changes, where you have many sites with the same shape, or where you cannot maintain selectors. The honest test: if you would rather pay an intern $15/hr to do this 200 times, write the script. If the intern would also fail because the UI varies, CUA is the call.

Why is success only 55%?

Two main failure modes. (1) Long pages where the answer is below the fold and the model declares victory before scrolling enough. (2) Visually crowded interfaces where the click coordinate lands one pixel off the right element. Both improve with each model snapshot, neither is "solved" yet.

Can I run this on real customer accounts?

Only if you (a) sandbox the credentials, (b) keep the destructive guardrail, (c) record every trace, and (d) have a human-in-the-loop review for first-time runs on a new site. We treat CUA on customer data the same way we treat any operator with shell access.

How do I evaluate non-deterministic page content (e.g., HN top story)?

Two-stage grading. Run a deterministic fetch (HTTP) at the same instant the agent starts to capture the ground truth. The grader checks the agent's answer against that snapshot, not against a hardcoded string.

What about latency?

47 seconds mean per successful task is not interactive. CUA is for batch automation, not real-time chat. If a customer is waiting, route to a regular tool-calling agent and call CUA out-of-band.

OpenAI Computer-Use Agents (CUA) in Production: Build + Evaluate a Real Workflow (2026)

TL;DR

Why Computer Use Is Different From Normal Tool Calling

The Loop, Drawn

Building the Agent (Working Code)

The 10-Task Benchmark

What We Measured

CUA vs Anthropic Computer Use vs Vision-Only

Production Constraints That Are Not Optional

Frequently Asked Questions

Should I use CUA or write a Playwright script?

Why is success only 55%?

Can I run this on real customer accounts?

How do I evaluate non-deterministic page content (e.g., HN top story)?

What about latency?

Try CallSphere AI Voice Agents

Related Articles You May Like

Online vs Offline Agent Evaluation: The Pre-Deploy / Post-Deploy Split

Agent Tracing 101: Spans, Sessions, and the Hidden Failure Modes They Reveal

The Agent Evaluation Stack in 2026: From Trace to Eval Score

LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

How to Build a Golden Dataset for Production AI Agents

Regression Testing for AI Agents: Catching Silent Breakage Before Users Do