A Real Claude Browser-Use Build, Start to Finish

One Claude browser-use automation from a messy business problem to a shipped, supervised outcome — decomposition, dead ends, and what made it stick.

Most writing about computer and browser use stops at the demo. The demo is the easy 20 percent. This post walks the other 80 percent: taking one real, unglamorous problem and following it all the way to something a team relies on every morning. The scenario is a composite, but every step is of the kind teams actually hit. The goal is to show the texture of the work: where it flows, where it fights back, and what specifically turns a clever prototype into a dependable automation.

The problem nobody wanted

A mid-sized services company processed supplier confirmations through a vendor portal that had no API and no intention of ever getting one. Every morning an operations coordinator logged in, found new purchase orders, cross-checked each against an internal spreadsheet, marked matches as confirmed, and flagged mismatches for a buyer. It took ninety minutes a day, it was deadly dull, and it was error-prone precisely because it was dull. This is the ideal shape for a browser-use agent: high-volume, rule-governed, API-less, and forgiving enough that a flagged mismatch is recoverable rather than catastrophic.

The first decision was not technical. It was deciding the success criterion before writing anything: the agent should confirm clean matches autonomously and route everything ambiguous to a human, and it should never mark a mismatch as confirmed. "Never confirm a mismatch" became the north-star invariant the whole build was tuned around.
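One way to make such an invariant hard to violate is to enforce it in code at the point of confirmation, rather than relying on the prompt alone. The following is a minimal sketch under assumptions: the `POFields` fields, the `InvariantViolation` exception, and the `guard_confirmation` name are all hypothetical and are not taken from the build described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class POFields:
    # Hypothetical fields; a real portal will expose its own set.
    po_number: str
    supplier: str
    quantity: int
    total_cents: int


class InvariantViolation(Exception):
    """Raised when a confirmation would break 'never confirm a mismatch'."""


def guard_confirmation(portal: POFields, internal: Optional[POFields]) -> None:
    """Refuse to proceed unless the portal row exactly equals the internal record."""
    if internal is None:
        raise InvariantViolation(f"{portal.po_number}: no internal record found")
    if portal != internal:
        raise InvariantViolation(f"{portal.po_number}: portal and internal fields differ")
```

Calling a guard like this immediately before any confirming action means a misjudged match fails loudly instead of silently confirming.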

Decomposing the task

We broke the morning routine into explicit, checkpointable steps rather than handing Claude the whole job at once. Decomposition is the unglamorous core of a good build: each step has a clear input, a clear output, and a verification.

flowchart TD
  A["Log in to vendor portal"] --> B["List new purchase orders"]
  B --> C["For each PO: read fields"]
  C --> D{"Match internal record?"}
  D -->|Exact match| E["Mark confirmed"]
  D -->|Mismatch| F["Flag for buyer + note diff"]
  E --> G["Verify status updated"]
  F --> G
  G --> H["Write run summary & trace"]
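The idea that each step has an input, an output, and a verification can be sketched as a tiny step runner. This is an illustrative assumption about how such a pipeline might be structured, not the actual implementation; `Step` and `run_steps` are hypothetical names, and the real steps would wrap browser actions rather than lambdas.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]       # takes the previous step's output, returns its own
    verify: Callable[[Any], bool]   # checks the output before the pipeline moves on


def run_steps(steps: List[Step], initial: Any) -> Tuple[Any, List[str]]:
    """Run steps in order; stop at the first step whose output fails verification."""
    value: Any = initial
    trace: List[str] = []
    for step in steps:
        value = step.run(value)
        if not step.verify(value):
            trace.append(f"{step.name}: FAILED verification")
            raise RuntimeError(f"step '{step.name}' failed verification; trace={trace}")
        trace.append(f"{step.name}: ok")
    return value, trace
```

Because every step leaves a line in the trace, a failed run points at the step that broke rather than at the run as a whole.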

We built this with the Claude Agent SDK, giving Claude browser-use access plus a small set of trusted tools: a function to look up the internal record by PO number, and a function to file a buyer flag. That last detail mattered more than it looks. Rather than have the agent navigate yet another UI to flag mismatches, we gave it a clean tool for the consequential write. The principle: let Claude use the browser for reading and the API for the dangerous writing whenever you can. Browser use is for surfaces you cannot reach otherwise, not a reason to do everything through a cursor.
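As a rough illustration of what those two trusted tools might look like, here is a sketch in plain Python. Everything in it is assumed for the example: the record shape, the in-memory `records` and `outbox` stand-ins for real systems, and the `diff_fields` helper. Registering the functions as tools with the Claude Agent SDK is deliberately left out.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


def lookup_internal_record(po_number: str, records: Dict[str, Record]) -> Optional[Record]:
    """Trusted read: fetch the internal record for a PO, or None if it is absent."""
    return records.get(po_number)


def diff_fields(portal: Record, internal: Record) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (portal_value, internal_value)} for every field that differs."""
    keys = sorted(set(portal) | set(internal))
    return {k: (portal.get(k), internal.get(k)) for k in keys if portal.get(k) != internal.get(k)}


def file_buyer_flag(po_number: str, diff: dict, outbox: List[dict]) -> dict:
    """Trusted write: record a flag for the buyer, with the diff attached as the note."""
    flag = {"po_number": po_number, "diff": diff}
    outbox.append(flag)
    return flag
```

The design point is that the consequential write goes through a narrow function with a fixed shape, so there is no UI state for the agent to misread on the way.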

The dead ends that taught us the most

The first version failed in an instructive way. The portal paginated purchase orders, and the agent, satisfied after the first page, reported "all POs processed" while ignoring pages two and three. It was confidently wrong. The fix was not a smarter prompt; it was an explicit completion check — read the total count from the portal header, compare it to the number processed, and refuse to declare done until they matched. Verification against a number the agent did not itself produce caught the class of error that prose instructions never reliably will.
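A completion check of this kind can be very small. The sketch below assumes the portal's total has already been read from the header as an integer; `IncompleteRun` and `assert_run_complete` are hypothetical names used only for illustration.

```python
from typing import Iterable


class IncompleteRun(Exception):
    """Raised when the processed count does not match the portal's own total."""


def assert_run_complete(portal_total: int, processed_po_numbers: Iterable[str]) -> None:
    """Refuse to declare the run done until the counts agree."""
    processed = set(processed_po_numbers)  # de-duplicate in case a row was visited twice
    if len(processed) != portal_total:
        raise IncompleteRun(
            f"portal reports {portal_total} POs but {len(processed)} were processed"
        )
```

The important property is that `portal_total` comes from the page, not from the agent's own tally, so a skipped page shows up as a count mismatch.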

The second dead end was timing. Occasionally the portal showed a row before its status field finished loading, and the agent read an empty status as a mismatch. We added a read-back step: after any action, re-read the affected row and confirm the expected state before moving on. This single pattern — act, then verify the world actually changed the way you intended — eliminated most of the remaining noise. It is the browser-agent equivalent of checking a function's return value instead of assuming success.
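The act-then-verify pattern might look like the following sketch. The function name, the retry count, and the delay are assumptions chosen for the example; `act` and `read_state` stand in for whatever browser action and row read the agent actually performs.

```python
import time
from typing import Callable


def act_and_verify(
    act: Callable[[], None],
    read_state: Callable[[], str],
    expected: str,
    attempts: int = 5,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Perform an action, then re-read until the expected state appears or attempts run out."""
    act()
    for attempt in range(attempts):
        if read_state() == expected:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # give a slow-loading field time to populate
    return False
```

A `False` result is a signal to stop and escalate, not to assume the action worked. Passing `sleep` as a parameter keeps the helper easy to test without real delays.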

Adding the human gate

We deliberately did not let the agent run fully autonomous on day one. Using hooks, every "mark confirmed" action paused for the coordinator to approve in a simple review queue. For two weeks the human approved or corrected each decision, and we logged every disagreement. Those disagreements were the real evaluation set: they showed exactly which match rules the agent misunderstood. After the agreement rate on confirmations held steady and high across a few hundred real POs, we loosened the gate — autonomous confirmation for exact matches, human gate retained only for mismatches and edge cases. The agent earned its autonomy with evidence rather than receiving it on faith.
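The supervised period boils down to logging paired decisions and measuring agreement. Here is a minimal sketch of that bookkeeping. The `Review` shape and the thresholds in `ready_to_loosen` (300 cases, 99 percent agreement) are illustrative assumptions; the source only says "a few hundred" POs and a "steady and high" agreement rate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Review:
    po_number: str
    agent_decision: str  # "confirm" or "flag"
    human_decision: str  # what the coordinator decided


def disagreements(reviews: List[Review]) -> List[Review]:
    """Cases where the human overrode the agent; these double as the eval set."""
    return [r for r in reviews if r.agent_decision != r.human_decision]


def agreement_rate(reviews: List[Review]) -> float:
    """Fraction of reviews where agent and human agreed (0.0 when there is no data)."""
    if not reviews:
        return 0.0
    return 1 - len(disagreements(reviews)) / len(reviews)


def ready_to_loosen(reviews: List[Review], min_cases: int = 300, min_rate: float = 0.99) -> bool:
    """Assumed policy: enough supervised confirmations, with high enough agreement."""
    confirms = [r for r in reviews if r.agent_decision == "confirm"]
    return len(confirms) >= min_cases and agreement_rate(confirms) >= min_rate
```

Only the agent's proposed confirmations count toward the threshold here, since those are the actions being considered for autonomy.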

What shipping actually looked like

The shipped system was less a single agent and more a small, observable pipeline. Each morning's run produced a summary the coordinator could skim in about two minutes, plus a full trace stored for audit. End to end, the ninety-minute task became a five-minute review. Critically, the value was not just time saved; it was consistency. The agent never got bored on PO number forty and missed a field, which had been the quiet source of real downstream errors.

The lessons generalize. Pick a forgiving, repetitive, API-less task. Define an invariant you will never violate. Decompose into verifiable steps. Use real tools for dangerous writes and the browser only where you must. Verify against ground truth the agent did not generate. And earn autonomy through a supervised period whose disagreements become your eval set. None of that is exotic, and all of it is what separates a build that lasts from a demo that impresses once.

Frequently asked questions

What kind of task is the right first browser-use project?

High-volume, rule-governed, API-less work where mistakes are recoverable — like portal data reconciliation. Avoid first projects that are irreversible or customer-facing; you want a forgiving surface while you learn the failure modes.

Why mix browser use with regular tool calls?

Browser use is for surfaces with no API. For the consequential writes — flagging, confirming, paying — a clean tool or API is safer and more reliable than driving a UI, so use the browser for reading and proper tools for dangerous actions where you can.

How did the team know the agent was ready for autonomy?

A supervised period where a human approved each consequential action and every disagreement was logged. When agreement on confirmations stayed high across hundreds of real cases, autonomy was loosened gradually: exact matches became autonomous first, while mismatches and edge cases stayed gated.

What single technique prevented the most errors?

Verifying against ground truth the agent did not produce itself — comparing processed counts to the portal's own total, and reading back row state after each action. It catches confident-wrongness that no prompt phrasing reliably prevents.

Bringing agentic AI to your phone lines

This same problem-to-shipped arc is how CallSphere builds voice and chat agents that answer every call, use tools mid-conversation, and book work 24/7 — supervised first, autonomous once they have earned it. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
