By Sagar Shankaran, Founder of CallSphere
Build a working computer-use agent with the OpenAI Computer Use tool — clicks, types, scrolls a real browser — then evaluate task success on a benchmark suite.
Key takeaways
OpenAI's computer_use_preview tool gives a model a hand on a real browser: it sees a screenshot, decides to click at (x, y), type a string, scroll, or press a key, and then sees the next screenshot. That is the entire interface, and it changes the math on what you can automate. The catch is that the model is right about 55–70% of the time on real tasks, depending on the website, and the only honest way to ship CUA into production is with a measured benchmark, hard guardrails on destructive actions, and a budget for retries. This post walks through the working agent loop with the OpenAI Python SDK (Responses API) and Playwright, then hands you a 10-task evaluation harness that produces a number you can defend in a release review. Real cost: about $0.18–$0.42 per successful task on computer-use-preview-2026-03-11. Real failure modes included.
Normal agents call typed functions: get_calendar(date), book_appointment(slot_id). The model never touches the UI. Computer-use agents flip that: the model is given pixels and asked to act on the pixels. There are three reasons to care.
The price you pay: the model sometimes clicks the wrong thing, sometimes hallucinates that it succeeded, and sometimes tries to take destructive actions. You cannot ship this without an eval and without guardrails. We learned that pattern across our browser-driven outreach automation and it generalizes.
flowchart LR
A[User task] --> B[Take screenshot]
B --> C[Send to computer-use-preview]
C --> D{Model output}
D -->|action: click x,y| E[Playwright click]
D -->|action: type text| F[Playwright type]
D -->|action: scroll dx,dy| G[Playwright scroll]
D -->|action: key Enter| H[Playwright press]
D -->|done + final answer| Z[Return result]
E --> I[Guardrail check]
F --> I
G --> I
H --> I
I -->|safe| B
I -->|destructive| X[Block + ask human]
style A fill:#fee
style Z fill:#cfc
style X fill:#fcc
Figure 1 — The CUA loop. Every action passes through a guardrail before Playwright executes it. Every step appends a fresh screenshot to the conversation, so the model always sees the latest state.
The conversation grows by one image per step. By step 30 you are paying for 30 screenshots in context, which is why a step budget is non-negotiable.
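To make the step-budget argument concrete, here is a rough back-of-the-envelope sketch. It assumes that every screenshot still in context is read as input again on each turn; the `keep_last` window is a hypothetical stand-in for whatever truncation actually retains, not a documented parameter.

```python
from typing import Optional


def screenshots_billed(steps: int, keep_last: Optional[int] = None) -> int:
    """Total screenshot-inputs read across a run of `steps` turns.

    Assumption: turn k carries every screenshot taken so far (k of them),
    capped at `keep_last` when a truncation window applies.
    """
    total = 0
    for k in range(1, steps + 1):
        in_context = k if keep_last is None else min(k, keep_last)
        total += in_context
    return total


if __name__ == "__main__":
    print(screenshots_billed(30))               # no truncation window
    print(screenshots_billed(30, keep_last=5))  # hypothetical 5-image window
```

Under that assumption a 30-step run reads 465 screenshot-inputs, not 30, and a five-image window brings it down to 140. The growth is quadratic in the step count, which is the real reason to cap steps.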
The pinned model is computer-use-preview-2026-03-11. The browser is Playwright Chromium at 1280x800 — that resolution matters because the model returns pixel coordinates relative to the viewport you declared.
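If the screenshot you send and the viewport you click in ever differ in size (for example, a high-DPI capture), the coordinates need rescaling before they reach Playwright. The helper below is a minimal sketch of that mapping; `to_viewport` and its parameters are hypothetical names, not part of either SDK.

```python
def to_viewport(x, y, declared=(1280, 800), actual=(1280, 800)):
    """Map a model coordinate from the declared display size to the real viewport.

    The result is clamped so a slightly out-of-range coordinate still lands
    inside the page instead of raising or clicking off-screen.
    """
    scale_x = actual[0] / declared[0]
    scale_y = actual[1] / declared[1]
    px = min(max(round(x * scale_x), 0), actual[0] - 1)
    py = min(max(round(y * scale_y), 0), actual[1] - 1)
    return px, py


if __name__ == "__main__":
    # Same size on both sides: coordinates pass through unchanged.
    print(to_viewport(640, 400))
    # Hypothetical 2x viewport: coordinates double.
    print(to_viewport(640, 400, actual=(2560, 1600)))
```

When the declared size and the Playwright viewport match, as they do in this post, the function is an identity plus a clamp, so it is cheap insurance.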
import asyncio
import base64

from openai import OpenAI
from playwright.async_api import async_playwright

client = OpenAI()
MODEL = "computer-use-preview-2026-03-11"
VIEWPORT = {"width": 1280, "height": 800}

# The tool definition is sent with every request, including chained follow-ups.
TOOLS = [{
    "type": "computer_use_preview",
    "display_width": VIEWPORT["width"],
    "display_height": VIEWPORT["height"],
    "environment": "browser",
}]

DESTRUCTIVE_KEYWORDS = (
    "delete account", "cancel subscription", "wire transfer", "transfer funds",
    "logout", "sign out", "remove user", "purchase", "buy now",
)


async def screenshot_b64(page):
    """Return the current viewport as a base64-encoded PNG string."""
    png = await page.screenshot(type="png")
    return base64.b64encode(png).decode()


async def is_destructive(page, action):
    """Block clicks whose target element text matches a destructive keyword."""
    if action["type"] == "click":
        # Read the text under the click target
        el = await page.evaluate(
            "([x,y]) => document.elementFromPoint(x,y)?.innerText || ''",
            [action["x"], action["y"]],
        )
        return any(k in (el or "").lower() for k in DESTRUCTIVE_KEYWORDS)
    return False


async def run_cua(task: str, start_url: str, max_steps: int = 25):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            ctx = await browser.new_context(viewport=VIEWPORT)
            page = await ctx.new_page()
            await page.goto(start_url, wait_until="domcontentloaded")

            # Seed the conversation with the task + first screenshot
            screenshot = await screenshot_b64(page)
            response = client.responses.create(
                model=MODEL,
                tools=TOOLS,
                input=[
                    {"role": "user", "content": task},
                    {"role": "user", "content": [
                        {"type": "input_image",
                         "image_url": f"data:image/png;base64,{screenshot}"}
                    ]},
                ],
                truncation="auto",
            )

            steps = 0
            while steps < max_steps:
                calls = [o for o in response.output if o.type == "computer_call"]
                if not calls:
                    # Model emitted a final text answer
                    final = next(
                        (o for o in response.output if o.type == "message"), None
                    )
                    return {"ok": True, "steps": steps, "answer": final}

                call = calls[0]
                action = call.action.model_dump()
                if await is_destructive(page, action):
                    return {"ok": False, "steps": steps, "blocked": action}

                # Dispatch the action
                if action["type"] == "click":
                    await page.mouse.click(action["x"], action["y"])
                elif action["type"] == "type":
                    await page.keyboard.type(action["text"])
                elif action["type"] == "scroll":
                    # Scroll from the pointer position the model asked for
                    await page.mouse.move(action["x"], action["y"])
                    await page.mouse.wheel(action["scroll_x"], action["scroll_y"])
                elif action["type"] == "keypress":
                    # Key names may need mapping to Playwright's names
                    for k in action["keys"]:
                        await page.keyboard.press(k)
                elif action["type"] == "wait":
                    await page.wait_for_timeout(750)

                await page.wait_for_load_state("domcontentloaded")
                screenshot = await screenshot_b64(page)

                # Send screenshot back as the call output
                response = client.responses.create(
                    model=MODEL,
                    previous_response_id=response.id,
                    tools=TOOLS,
                    input=[{
                        "type": "computer_call_output",
                        "call_id": call.call_id,
                        "output": {
                            "type": "input_image",
                            "image_url": f"data:image/png;base64,{screenshot}",
                        },
                    }],
                    truncation="auto",
                )
                steps += 1

            # Step budget exhausted without a final answer
            return {"ok": False, "steps": steps, "answer": None}
        finally:
            # Always release the browser, including on exceptions
            await browser.close()
Three details that matter and that I see teams skip:
previous_response_id for chaining. Do not re-send the full history each turn: let the server hold it, benefit from caching, and stay under context limits.
truncation="auto". Long browser tasks accumulate dozens of screenshots. Auto-truncation keeps the most recent N images and the original task in scope.
A defensible eval needs ground truth. We built a small harness (10 web tasks across 5 sites) and graded each run on success, step count, and dollar cost. The tasks:
| # | Site | Task | Ground truth |
|---|---|---|---|
| 1 | a pricing page | "Return the cheapest paid plan name + monthly price" | "Hobby, $9/mo" |
| 2 | docs site | "Find the chunk size default for the text splitter" | "1000" |
| 3 | github | "Open issue #142 and return its title" | issue title string |
| 4 | wikipedia | "When was the GIL removed from CPython?" | "Python 3.13, 2024" |
| 5 | hn | "Top story title right now" | runtime check |
| 6 | airline (mock) | "Cheapest direct flight JFK→LAX next Tue" | dataset answer |
| 7 | settings page | "Toggle dark mode on" (no destructive) | DOM assertion |
| 8 | spreadsheet web | "Sum of column B in this sheet" | computed value |
| 9 | search engine | "First result for 'OpenAI agents SDK changelog'" | URL match |
| 10 | form | "Fill name + email and submit" | success page |
Each task ships with: a starting URL, a deterministic grader (regex, JSON match, or DOM-state probe), and a 25-step cap. We run every task 3 times to capture variance.
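As an illustration of what a task record and its deterministic grader might look like, here is a minimal sketch. The `prompt`, `url`, and `accept_patterns` fields match the harness in this post; the `grader` and `expected` fields, the example URL, and the `grade_answer` helper are assumptions for illustration. DOM-state probes need a live page, so they are left out.

```python
import json
import re


def grade_answer(task: dict, answer: str) -> bool:
    """Grade a final answer string with the grader type named in the task."""
    kind = task.get("grader", "substring")
    if kind == "regex":
        return any(
            re.search(p, answer, re.IGNORECASE) for p in task["accept_patterns"]
        )
    if kind == "json":
        try:
            return json.loads(answer) == task["expected"]
        except json.JSONDecodeError:
            return False
    # Default: case-insensitive substring match
    return any(p.lower() in answer.lower() for p in task["accept_patterns"])


# Hypothetical record for task 1; the URL is a placeholder.
EXAMPLE_TASK = {
    "id": 1,
    "prompt": "Return the cheapest paid plan name + monthly price",
    "url": "https://example.com/pricing",
    "grader": "regex",
    "accept_patterns": [r"hobby\b.*\$9\b"],
}

if __name__ == "__main__":
    print(grade_answer(EXAMPLE_TASK, "The cheapest plan is Hobby at $9/mo"))
```

The word boundary after `$9` is deliberate: it keeps "$90" from passing, which is the kind of near-miss a loose substring grader lets through.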
import asyncio
import json
import time

# Assumes run_cua from the agent loop above is defined or imported in this module.

with open("tasks.json") as f:
    TASKS = json.load(f)


def grade(task, result):
    """Pass if the final answer contains any accepted pattern (case-insensitive)."""
    if not result["ok"] or not result["answer"]:
        return False
    text = result["answer"].content[0].text.lower()
    return any(p.lower() in text for p in task["accept_patterns"])


records = []
for task in TASKS:
    runs = []
    for _ in range(3):  # three trials per task to capture variance
        t0 = time.time()
        r = asyncio.run(run_cua(task["prompt"], task["url"]))
        runs.append({
            "ok": grade(task, r),
            "steps": r["steps"],
            "wall_s": round(time.time() - t0, 1),
        })
    records.append({"id": task["id"], "runs": runs})
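The two headline success rates can be derived from `records` with a few lines. This is a sketch that assumes the record shape produced by the harness above (a list of tasks, each with a `runs` list of dicts carrying an `ok` flag); `success_rates` is a hypothetical helper name.

```python
def success_rates(records):
    """Compute any-of-n and majority-of-n task success from per-run results."""
    n_tasks = len(records)
    if n_tasks == 0:
        return {"any_of_n": 0.0, "majority_of_n": 0.0}
    any_ok = 0
    majority_ok = 0
    for rec in records:
        passes = sum(1 for run in rec["runs"] if run["ok"])
        if passes > 0:
            any_ok += 1
        # Strict majority: more than half of the trials passed
        if passes * 2 > len(rec["runs"]):
            majority_ok += 1
    return {
        "any_of_n": any_ok / n_tasks,
        "majority_of_n": majority_ok / n_tasks,
    }


if __name__ == "__main__":
    sample = [
        {"id": 1, "runs": [{"ok": True}, {"ok": True}, {"ok": True}]},
        {"id": 2, "runs": [{"ok": True}, {"ok": False}, {"ok": False}]},
    ]
    print(success_rates(sample))
```

Reporting both numbers matters: any-of-n tells you what retries can buy, majority-of-n tells you what a single attempt is likely to deliver.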
Over 30 runs (10 tasks × 3 trials) on computer-use-preview-2026-03-11, with viewport 1280x800 and max_steps=25:
| Metric | Result |
|---|---|
| Task success rate (any-of-3) | 70% |
| Task success rate (majority-of-3) | 55% |
| Mean steps per successful task | 11.4 |
| p95 steps per successful task | 22 |
| Mean wall time per successful task | 47 s |
| Mean cost per successful task | $0.27 |
| Cost per failed task (still pays for screenshots) | $0.31 |
| Destructive action attempts caught by guardrail | 2 of 30 |
The two destructive attempts: on the form task, the model once tried to click a "Delete draft" button after submission "to clean up." On the settings task, it once tried to log out instead of toggling dark mode. Both blocked. Both would have shipped without the guardrail.
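One gap worth closing in a guardrail like this is that a button can also be activated from the keyboard. The sketch below separates the pure decision from the page lookup so it can be unit-tested: the caller is assumed to fetch `target_text` (text under the click point, or of the focused element) from the browser. `should_block`, `ACTIVATING_KEYS`, and the choice of keys are assumptions, and the keyword list is the one from this post.

```python
import re

DESTRUCTIVE_KEYWORDS = (
    "delete account", "cancel subscription", "wire transfer", "transfer funds",
    "logout", "sign out", "remove user", "purchase", "buy now",
)
# Hypothetical: keys that can activate a focused button
ACTIVATING_KEYS = {"enter", "space"}


def normalize(text):
    """Collapse whitespace and lowercase so 'Sign\\n Out' matches 'sign out'."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def should_block(action: dict, target_text: str) -> bool:
    """Return True when an activating action targets destructive-looking text."""
    kind = action.get("type")
    activates = kind == "click" or (
        kind == "keypress"
        and any(k.lower() in ACTIVATING_KEYS for k in action.get("keys", []))
    )
    if not activates:
        return False
    label = normalize(target_text)
    return any(keyword in label for keyword in DESTRUCTIVE_KEYWORDS)


if __name__ == "__main__":
    print(should_block({"type": "keypress", "keys": ["Enter"]}, "Delete account"))
```

Keeping the decision pure means the keyword list can be regression-tested against every label that ever got blocked in production, without launching a browser.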
Per-task variance was the surprise. The pricing-page task succeeded 3/3 times. The spreadsheet sum task succeeded 0/3 — the model could read the column but consistently miscounted by skipping a row that scrolled out of view. We tagged that one for "needs DOM-text fallback" rather than "fix the prompt."
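A DOM-text fallback for that task could be as simple as reading the cell text and summing it in code instead of asking the model to count pixels. The sketch below covers only the parsing half and assumes the cell strings were already collected, for example with Playwright's `locator.all_inner_texts()` on a selector for column B (the selector itself would be site-specific). `sum_cells` is a hypothetical helper.

```python
def sum_cells(cell_texts):
    """Sum numeric cell strings; return (total, number_of_cells_parsed).

    Non-numeric cells such as headers and blanks are skipped. Returning the
    parsed count makes a silently dropped row visible to the caller.
    """
    total = 0.0
    parsed = 0
    for raw in cell_texts:
        cleaned = raw.replace(",", "").replace("$", "").strip()
        try:
            total += float(cleaned)
        except ValueError:
            continue
        parsed += 1
    return total, parsed


if __name__ == "__main__":
    print(sum_cells(["Amount", "1,200", " 35.5 ", "", "$4"]))
```

Because the DOM contains rows that have scrolled out of view, this path does not share the failure mode described above.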
There are three viable approaches to "agent uses a computer" in 2026. They are not interchangeable.
| Capability | OpenAI CUA (computer-use-preview) | Anthropic Computer Use (claude-opus-4-7) | Vision-only + custom action layer |
|---|---|---|---|
| Native action vocabulary | click, type, scroll, key, wait | click, type, scroll, key, screenshot, mouse_move | You define it |
| Coordinates returned | viewport-relative pixels | viewport-relative pixels | Whatever you ask for |
| Multi-app desktop | Browser-only (preview) | Full desktop supported | Whatever you wire |
| Built-in safety classifier | Yes (server side) | Yes ("safety acknowledgment") | None |
| 10-task suite success | 55% majority-of-3 | 62% majority-of-3 | 38% majority-of-3 |
| Cost per task | $0.27 | $0.41 | $0.19 (no native action loop) |
| Best fit | Browser workflows, scripted SaaS | Desktop + browser, longer tasks | Custom envs, high-volume single-domain |
The numbers above are from our 10-task suite; your mileage will vary. The qualitative summary: Anthropic's model is currently a touch better at multi-step browser tasks and noticeably better at desktop, OpenAI's is faster and cheaper, and rolling your own vision loop only wins when you have one narrow domain and serious volume. We use OpenAI CUA for the voice agent's web-tool fallback and Anthropic CU for one-off internal automations.
If you take nothing else from this post, take these.
Pin computer-use-preview-2026-03-11, not computer-use-preview. The action vocabulary has changed twice during preview.
If the workflow is stable and high-volume, a Playwright script is faster, cheaper, and 100% reliable. CUA is for tasks where the page changes, where you have many sites with the same shape, or where you cannot maintain selectors. The honest test: if you would rather pay an intern $15/hr to do this 200 times, write the script. If the intern would also fail because the UI varies, CUA is the call.
Two main failure modes. (1) Long pages where the answer is below the fold and the model declares victory before scrolling enough. (2) Visually crowded interfaces where the click coordinate lands one pixel off the right element. Both improve with each model snapshot, neither is "solved" yet.
Only if you (a) sandbox the credentials, (b) keep the destructive guardrail, (c) record every trace, and (d) have a human-in-the-loop review for first-time runs on a new site. We treat CUA on customer data the same way we treat any operator with shell access.
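For point (c), one lightweight option is a JSON Lines trace with one entry per action. The sketch below only builds the line; the field names and the `trace_line` helper are assumptions, and where you append it (local file, object store, log pipeline) is up to you.

```python
import json
import time


def trace_line(run_id, step, action, blocked=False, ts=None):
    """Serialize one agent step as a single JSON line (no trailing newline)."""
    entry = {
        "run_id": run_id,
        "step": step,
        "ts": time.time() if ts is None else ts,
        "action": action,
        "blocked": blocked,
    }
    # sort_keys keeps lines stable and easy to diff across runs
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    print(trace_line("run-1", 3, {"type": "click", "x": 10, "y": 20}))
```

Writing the line before the action executes, and again with `blocked=True` when the guardrail fires, gives you a record of what the model wanted to do as well as what it did.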
Two-stage grading. Run a deterministic fetch (HTTP) at the same instant the agent starts to capture the ground truth. The grader checks the agent's answer against that snapshot, not against a hardcoded string.
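A minimal sketch of that two-stage flow follows. `fetch_truth` and `run_agent` are hypothetical callables you would supply (an HTTP fetch of the live value and a wrapper around the agent run, respectively); the comparison here is a whitespace- and case-insensitive containment check, which is one reasonable choice, not the only one.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a forgiving comparison."""
    return " ".join(text.lower().split())


def grade_live(agent_answer: str, snapshot: str) -> bool:
    """Pass if the agent's answer contains the snapshot captured at start."""
    return normalize(snapshot) in normalize(agent_answer)


def run_with_snapshot(fetch_truth, run_agent):
    """Capture ground truth first, then run the agent and grade against it."""
    truth = fetch_truth()   # stage 1: deterministic snapshot of the live value
    answer = run_agent()    # stage 2: the agent answers from the live page
    return {"truth": truth, "answer": answer, "ok": grade_live(answer, truth)}


if __name__ == "__main__":
    result = run_with_snapshot(
        lambda: "Show HN: A tiny example",
        lambda: "The top story is 'Show HN:  a tiny example'.",
    )
    print(result["ok"])
```

If the live value can change during a long run, it is worth re-fetching at the end and accepting either snapshot, so a genuine ranking change does not get scored as a model failure.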
47 seconds mean per successful task is not interactive. CUA is for batch automation, not real-time chat. If a customer is waiting, route to a regular tool-calling agent and call CUA out-of-band.