Claude Computer Use Patterns: Prompts, Tools, Context

Getting a computer-use agent to click a button once is a demo. Getting it to complete a 40-step task reliably, ten times in a row, is engineering. The difference is almost entirely in the patterns you wrap around the raw loop: how you write the system prompt, how you mix visual control with structured tools, how you manage the flood of screenshots, and how you let the agent recover when reality disagrees with its plan. This post collects the reusable patterns I keep reaching for.

Key takeaways

Give the agent an explicit operating procedure in the system prompt — verify before acting, take a screenshot after each step, stop when a stated condition is met.
Mix the computer tool with structured tools so Claude only uses pixels when no API exists.
Manage screenshots aggressively: keep the latest, prune the old, and summarize progress as text.
Build recovery patterns — re-screenshot, scroll to find, retry once — directly into the prompt.
Use a done-signal tool so the agent ends deliberately instead of looping until the step cap.

Pattern 1 — The operating-procedure system prompt

The single highest-leverage change you can make is to stop treating the system prompt as a vague description and start treating it as a procedure manual. Tell the agent how to behave step by step: look first, then act, then verify. Concretely:

SYSTEM = """You control a desktop via the computer tool.
Procedure for every action:
1. Take a screenshot and describe what you see in one line.
2. Decide the single next action toward the goal.
3. Execute exactly one action, then screenshot again.
4. Confirm the screen changed as expected before continuing.
If an action has no visible effect, do NOT repeat it blindly —
scroll or look elsewhere first. Call done() when the goal is
visibly achieved."""

This costs a few hundred tokens and dramatically reduces the two worst failure modes: clicking blind and repeating a failed action. The agent now narrates its perception, which also makes your logs readable when something goes wrong.

Pattern 2 — Hybrid tools: pixels only when needed

Computer use is the most expensive, least reliable way to accomplish anything that has an API. The pattern that scales is to expose both the visual tool and ordinary structured tools, and to instruct Claude to prefer the structured ones. If you can read the database directly, give it a query_orders tool; reserve clicking for the legacy admin panel that has no API.

flowchart TD
  A["Goal"] --> B{"Does an API tool exist?"}
  B -->|Yes| C["Call structured tool"]
  C --> D["Use exact result"]
  B -->|No| E["Take screenshot"]
  E --> F["Predict coordinate & click"]
  F --> G["Screenshot to verify"]
  G --> H{"Verified?"}
  H -->|No| E
  H -->|Yes| D

This routing keeps the slow visual path as a fallback, not the default. In practice a well-designed hybrid agent spends most of its turns on fast structured calls and only drops to pixels for the genuinely GUI-only steps.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

Pattern 3 — Screenshot lifecycle management

Screenshots are heavy. A long task accumulates dozens of full-resolution images, and the context cost grows linearly while older images add little value — Claude mostly needs the current screen. The reusable pattern is a sliding window: keep the most recent N screenshots in full, replace older image blocks with a short text placeholder, and let a running summary carry the history.

def prune(messages, keep=3):
    seen = 0
    for msg in reversed(messages):
        for block in msg.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_result":
                img = [c for c in block["content"] if c["type"] == "image"]
                if img:
                    seen += 1
                    if seen > keep:
                        block["content"] = [{"type": "text",
                            "text": "[older screenshot pruned]"}]
    return messages

Running this before each API call keeps cost bounded on long sessions while preserving recent visual context. Pair it with a text scratchpad where the agent records what it has accomplished so pruning never loses the thread.

Pattern 4 — Recovery and self-correction

Real UIs misbehave: a modal pops up, a page is still loading, an element is below the fold. Bake recovery into the prompt so the agent does not need a perfect plan. Three reusable moves cover most cases. First, re-observe: if the screen is not what was expected, take another screenshot before doing anything. Second, scroll-to-find: if a target element is not visible, scroll and re-screenshot rather than guessing a coordinate. Third, retry-once-then-escalate: attempt a failed action one more time, and if it still fails, report the blocker instead of looping.

Encoding these as explicit rules turns a brittle agent into a resilient one, because the model has a script for the moments where its mental model and the screen diverge.

Pattern 5 — The explicit done signal

Letting an agent end by simply not calling a tool is unreliable — it may keep poking at a finished screen until the step cap fires. A cleaner pattern is a done tool the agent must call, optionally returning a structured summary. This gives you a clean termination signal, a place to capture the result, and a natural spot to run a verification check before accepting success.

{"name": "done", "description": "Call when the goal is visibly complete.",
 "input_schema": {"type": "object",
   "properties": {"summary": {"type": "string"},
     "success": {"type": "boolean"}},
   "required": ["success"]}}

Now the loop ends when done is called, you log summary, and you can gate acceptance on success plus your own screenshot check.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

When to use each pattern

Pattern	Solves	Cost
Operating-procedure prompt	Blind clicks, blind retries	A few hundred tokens
Hybrid tools	Slow, flaky visual steps	Some integration work
Screenshot pruning	Runaway context cost	A little bookkeeping
Recovery rules	UI surprises	Prompt length
Done signal	Looping past completion	One tool definition

Common pitfalls

Treating the system prompt as decoration. A vague prompt produces a vague agent. Write an explicit step-by-step procedure.
Defaulting to pixels. If you only give the computer tool, every step is slow and risky. Add structured tools and instruct preference.
Never pruning screenshots. Context cost climbs with each image and the agent gets slower. Use a sliding window.
No recovery script. Without retry/re-observe rules, one unexpected modal derails the whole run.
Implicit termination. Relying on "it'll just stop" wastes steps. Make the agent call a done tool.

Frequently asked questions

How long should the system prompt be?

Long enough to state the operating procedure, tool-preference rule, and recovery moves — typically a few hundred to a thousand tokens. That investment pays for itself many times over in reliability.

Should I always combine computer use with structured tools?

Almost always. Pixels are the fallback for interfaces with no API. The more of the task you can route through exact, fast structured tools, the better the agent behaves.

Does pruning screenshots hurt accuracy?

Not if you keep the most recent few and maintain a text summary of progress. Claude mainly needs the current screen; distant history is better captured as text than as heavy images.

Why use a done tool instead of letting the model stop?

An explicit done signal gives a clean termination point, a structured result, and a hook to verify success before accepting it — far more reliable than waiting for the model to simply stop acting.

The same playbook, for conversations

These patterns — explicit procedures, tool preference, recovery, clean termination — are exactly what CallSphere's voice and chat agents run on, calling tools mid-conversation and confirming before they commit. Hear it in action at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Claude Computer Use Patterns: Prompts, Tools, Context

Key takeaways

Pattern 1 — The operating-procedure system prompt

Pattern 2 — Hybrid tools: pixels only when needed

Pattern 3 — Screenshot lifecycle management

Pattern 4 — Recovery and self-correction

Pattern 5 — The explicit done signal

When to use each pattern

Common pitfalls

Frequently asked questions

How long should the system prompt be?

Should I always combine computer use with structured tools?

Does pruning screenshots hurt accuracy?

Why use a done tool instead of letting the model stop?

The same playbook, for conversations

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Cowork is heading and how to prepare

Where Claude Code GTM engineering is heading next

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild