---
title: "Claude Computer Use: A Real End-to-End Walkthrough"
description: "One workflow from no-API portal pain to a shipped, autonomous Claude computer-use automation — the spec, harness, failures, and metrics that proved it works."
canonical: https://callsphere.ai/blog/claude-computer-use-a-real-end-to-end-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "computer use", "automation", "use case", "case study", "anthropic"]
author: "CallSphere Team"
published: 2026-04-26T17:46:22.000Z
updated: 2026-06-07T01:28:23.417Z
---

# Claude Computer Use: A Real End-to-End Walkthrough

> One workflow from no-API portal pain to a shipped, autonomous Claude computer-use automation — the spec, harness, failures, and metrics that proved it works.

Most writing about computer use stops at the demo: 'look, Claude filled in a form.' The interesting part is everything after the demo — the two weeks of edge cases, the failure that nearly shipped, and the unglamorous harness work that turns a clever trick into something you trust to run while you sleep. So instead of describing the capability in the abstract, this is one workflow followed from the original problem all the way to a deployed, measured automation. The names of the systems are generic on purpose; the path is what matters, and it generalizes.

## The problem: a portal with no API

The team processes supplier certifications. Every week, a few hundred PDF certificates arrive by email, and each one has to be entered into a vendor portal — a clunky web app with no API, no export, and a session that times out every fifteen minutes. A person opens each PDF, reads the certificate number, expiry date, and issuing body, logs into the portal, finds the supplier, opens their record, and types the three fields in. It takes a contractor about four minutes per certificate and it is mind-numbing, error-prone, and impossible to hire for.

This is the canonical computer-use case: real software, no programmatic interface, a repetitive screen-driven task. You cannot script it with a normal integration because there is nothing to integrate with. You can, however, give Claude a browser and the same instructions you would give the contractor — which is exactly the proposition of computer use, and exactly why the team reached for it instead of a brittle scraper.

## Building the harness before the agent

The mistake would have been to start by prompting Claude to 'enter these certificates.' The team started with the harness — the scaffolding that makes the run observable, bounded, and reversible — and only then pointed Claude at it. The flow they built is shown below.

```mermaid
flowchart TD
  A["New certificate PDF arrives"] --> B["Extract 3 fields with Claude"]
  B --> C{"Fields confident?"}
  C -->|No| D["Queue for human review"]
  C -->|Yes| E["Computer use: log in & open supplier"]
  E --> F{"Supplier record found?"}
  F -->|No| D
  F -->|Yes| G["Type fields, screenshot before submit"]
  G --> H{"Shadow mode: human approves?"}
  H -->|Approved| I["Submit & log result"]
  H -->|Rejected| D
```

Two design choices carried the whole project. First, the data extraction and the data entry were split into separate stages, so a low-confidence read never reached the portal — it went to a human queue instead. Second, every run executed in a sandboxed virtual desktop under a portal account scoped to a single supplier group, so even a worst-case wrong click could not touch records outside the job. The harness took longer to build than the prompt and was worth every hour.
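
The split between extraction and entry can be sketched as a small routing function. This is a minimal illustration under stated assumptions, not the team's code: the `ExtractedField` type, the `route` function, and the 0.9 cutoff are hypothetical, since the article does not say how read confidence was scored or what threshold was used.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the article does not give the real threshold.
CONFIDENCE_THRESHOLD = 0.9

# The three fields the contractor used to type in by hand.
REQUIRED_FIELDS = ("cert_number", "expiry_date", "issuer")


@dataclass(frozen=True)
class ExtractedField:
    value: str
    confidence: float  # assumed to be a score in [0, 1] from the extraction stage


def route(fields: dict[str, ExtractedField]) -> str:
    """Return "portal" only if every required field was read confidently."""
    for name in REQUIRED_FIELDS:
        field = fields.get(name)
        if field is None or not field.value.strip():
            return "human_review"  # missing or empty read
        if field.confidence < CONFIDENCE_THRESHOLD:
            return "human_review"  # a shaky read never reaches the portal
    return "portal"
```

The point of the design is that the gate sits between the two stages: the computer-use step is only ever invoked for inputs that return `"portal"`.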

## The first week: where it actually broke

In shadow mode (Claude does everything, but a human clicks the final submit) the failures were instructive. The session timeout caused roughly one in ten runs to land on a re-login screen mid-task; Claude handled it gracefully once the spec told it to expect that screen and re-authenticate. A subtler failure: two suppliers had nearly identical names, and on ambiguous matches Claude occasionally opened the wrong record. That was a wrong-target failure, the dangerous kind, and it never would have surfaced in a demo. The fix was a stop condition: if more than one supplier matches the name, do not guess; queue for a human. After adding it, wrong-target errors went to zero in the eval set.
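
The ambiguity stop condition amounts to an "exactly one match or stop" rule. In the workflow described here that rule lives in the task spec Claude follows; the sketch below only illustrates the same logic as a harness-side check. The `Decision` type, the `resolve_supplier` name, and the idea that search results arrive as a list of record IDs are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Decision:
    action: str               # "open" or "human_queue"
    record_id: Optional[str]  # set only when action == "open"
    reason: str


def resolve_supplier(matches: list[str]) -> Decision:
    """Open a record only when the search returned exactly one match.

    `matches` is assumed to be the record IDs the portal search returned
    for the supplier name (a hypothetical shape).
    """
    if len(matches) == 1:
        return Decision("open", matches[0], "exactly one match")
    if not matches:
        return Decision("human_queue", None, "no supplier matched")
    return Decision(
        "human_queue", None, f"{len(matches)} suppliers matched; refusing to guess"
    )
```

Zero matches and multiple matches both end in the human queue, which mirrors the "If 0 or 2+, STOP" rule in the shipped spec.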

The team also learned to trust the confidence split. About 8% of PDFs were scans poor enough that field extraction was shaky; routing those to humans automatically kept bad data out of the portal entirely and meant the agent only ever acted on data it had read cleanly.

One more failure was almost invisible and worth calling out, because it is the kind that erodes trust quietly. On a handful of runs Claude completed the task correctly but also clicked into an adjacent tab to 'double-check' something — a harmless instinct that nonetheless left the record in an unexpected view and confused the next reviewer. That is scope creep: the task succeeded, but the agent did slightly more than asked. The fix was a tighter spec with an explicit list of the only tabs it should touch and a 'do nothing extra' instruction, plus a per-step log that made the stray clicks obvious in review. Scope creep is rarely catastrophic, but left unwatched it accumulates into a system whose state no one fully trusts.
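
A per-step log makes this kind of stray click easy to find mechanically. The sketch below assumes an illustrative log format, one JSON object per line with `step`, `action`, and `tab` keys; the article does not describe the team's actual log schema, and the allowlist here is inferred from the shipped spec naming only the Certifications tab.

```python
import json

# Hypothetical allowlist, based on the spec touching only the Certifications tab.
ALLOWED_TABS = {"Certifications"}


def find_stray_steps(log_lines: list[str]) -> list[dict]:
    """Return logged steps that touched a tab outside the allowlist.

    Steps with no tab (for example login or search) are ignored.
    """
    stray = []
    for line in log_lines:
        step = json.loads(line)
        tab = step.get("tab")
        if tab is not None and tab not in ALLOWED_TABS:
            stray.append(step)
    return stray
```

A reviewer, or a post-run check, can then flag any run where `find_stray_steps` returns a non-empty list, even if the task itself succeeded.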

## Going from shadow mode to autonomous

The leash loosened in stages, gated by numbers rather than gut feel. The team built an eval set of 40 historical certificates with known-correct portal outcomes and replayed it after every spec change. When the eval pass rate held above their bar across several consecutive runs and the human reviewer in shadow mode had rejected nothing for a full week, they let submits run autonomously for the high-confidence path, while keeping the human queue for low-confidence reads and ambiguous matches. The highest-stakes action here (submitting a record) was reversible enough in this portal that autonomy was defensible; a truly irreversible action such as a payment would have stayed gated indefinitely.

| Stage | Who clicks submit | Gate to advance |
| --- | --- | --- |
| Shadow mode | Human approves every run | Build the eval set, fix early failures |
| Supervised | Human spot-checks a sample | One clean week, zero rejections |
| Autonomous (high-confidence) | Agent; low-confidence reads and ambiguous matches still queued for a human | Eval pass rate holds above bar |
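
The two promotion signals can be combined into one explicit check. This is a sketch under assumptions: the article says only "above their bar", "several consecutive runs", and "a full week", so the 0.95 bar, the three-run window, and the seven-day window below are placeholder values, and `EvalRun` and `ready_for_autonomy` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder thresholds; the article does not state the real numbers.
PASS_RATE_BAR = 0.95
CONSECUTIVE_RUNS_REQUIRED = 3
CLEAN_DAYS_REQUIRED = 7


@dataclass(frozen=True)
class EvalRun:
    passed: int
    total: int  # 40 in the article's eval set

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def ready_for_autonomy(runs: list[EvalRun], daily_rejections: list[int]) -> bool:
    """Both signals must be green: recent eval runs and shadow-mode history.

    `runs` is ordered oldest to newest; `daily_rejections` is the count of
    human rejections per shadow-mode day, also oldest to newest.
    """
    recent_runs = runs[-CONSECUTIVE_RUNS_REQUIRED:]
    evals_ok = len(recent_runs) == CONSECUTIVE_RUNS_REQUIRED and all(
        run.pass_rate >= PASS_RATE_BAR for run in recent_runs
    )
    recent_days = daily_rejections[-CLEAN_DAYS_REQUIRED:]
    shadow_ok = len(recent_days) == CLEAN_DAYS_REQUIRED and sum(recent_days) == 0
    return evals_ok and shadow_ok
```

Requiring a full window for each signal means a short history can never promote the agent by accident; too little data reads as "not ready".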

## Key takeaways

- The best computer-use targets are real apps with no API and a repetitive screen workflow — exactly where scrapers and integrations fail.
- Build the harness (sandbox, scoped account, logging, human queue) before you write the agent prompt.
- Split perception from action: route low-confidence reads to humans so the agent only acts on data it read cleanly.
- The dangerous failures (wrong target, mid-task re-login) appear in week one, not in the demo — find them in shadow mode.
- Loosen autonomy in stages, gated by an eval pass rate and a clean shadow-mode record, never by gut feel.

## The spec that shipped

Here is the working task spec, trimmed. Note how much of it is failure handling rather than the happy path — that ratio is the signal of a production-ready spec.

```
Goal: Record a supplier certificate in the vendor portal.
Inputs: cert_number, expiry_date, issuer, supplier_name
Steps:
  1. Log in. If a timeout/login screen appears at any point, re-authenticate and continue.
  2. Search supplier_name. If exactly one match, open it. If 0 or 2+, STOP -> human queue.
  3. Open Certifications tab, enter the three fields exactly as given.
  4. Screenshot the filled form before submitting.
Never: edit supplier details, delete a certificate, change another field.
Stop_if: any unexpected modal, any field pre-filled with a different value.
On finish: log cert_number + 'submitted' + screenshot path.
```
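
The `Stop_if` rule about pre-filled fields is stated in the spec for Claude to follow on screen. If a harness also had access to the form's current values, the same rule could be double-checked in code, roughly as below. The function name and the dictionary shapes are assumptions for illustration, not part of the shipped system.

```python
def prefill_conflicts(expected: dict[str, str], on_screen: dict[str, str]) -> list[str]:
    """Names of fields that already hold a value different from the input.

    `on_screen` is assumed to be the form values as read before typing
    (a hypothetical shape). Empty fields and matching values are fine.
    """
    conflicts = []
    for name, want in expected.items():
        have = on_screen.get(name, "").strip()
        if have and have != want.strip():
            conflicts.append(name)
    return conflicts
```

Any non-empty result would be treated as a stop: the run goes to the human queue rather than overwriting a value someone else entered.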

## Common pitfalls

- **Prompting before scaffolding.** Starting with 'just do the task' skips the logging and sandbox that make failures visible and safe. Build the harness first.
- **One mega-step instead of perception then action.** Bundling extraction and entry means bad reads reach the live system. Split them and gate on confidence.
- **No stop condition for ambiguity.** Letting the agent guess on a near-duplicate match is how wrong-target errors ship. Make ambiguity a hard stop.
- **Promoting to autonomy on a good demo.** Demos hide the long-tail failures. Promote on eval numbers and a clean shadow week.
- **Ignoring session quirks.** Timeouts, captchas, and re-auth screens are the real world; bake them into the spec instead of pretending they will not happen.

## Frequently asked questions

### How long did this take to ship?

A small team got it from idea to autonomous high-confidence path in a few weeks, with most of the time spent on the harness and shadow-mode hardening rather than the prompt. The capability is fast to demo and slow to trust — budget for the second part.

### Why split extraction and data entry?

Because the two stages have different risk profiles. A bad extraction is recoverable if it never leaves the queue; a bad data entry is in a live system. Splitting them lets you gate on read confidence and keep the agent from ever acting on shaky input.

### Could a normal scraper have done this instead?

Not reliably. The portal had no API, a shifting layout, and session timeouts that break brittle scripts. Computer use adapts to the screen as it finds it, which is the whole reason it fits legacy software that defeats traditional automation.

### What metric finally justified going autonomous?

A stable eval pass rate on 40 known-good cases plus a full week of shadow mode with zero human rejections on the high-confidence path. Two independent signals, both green, before loosening the leash.

## Bringing agentic AI to your phone lines

CallSphere runs this same problem-to-production pattern on **voice and chat** — agents that take a real request, use tools mid-conversation, and ship a booked outcome, hardened the same careful way. See a live walkthrough at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

