Prompt & Context Design for Claude Computer Use Agents
Context design for Claude computer use agents: system prompt structure, screenshot budgeting, grounding hints, and what to leave out of the window and why.
Two computer-use agents with identical code can behave completely differently because of one thing: what is in their context. The model acts only on what it can see in the conversation — the system prompt, the tool definitions, the goal, and the screenshots you have fed back. Get the context right and a mediocre setup becomes reliable; get it wrong and the best harness in the world still mis-clicks, loops, or burns your budget on stale images. This post is about deliberate context design: what belongs in the window, what to leave out, and why each choice matters.
Key takeaways
- The context window is the agent's entire world — design it on purpose, don't let it accumulate by accident.
- Put the goal, operating procedure, and grounding hints in; keep them tight and unambiguous.
- Leave out stale screenshots, raw secrets, and irrelevant history — they cost tokens and degrade focus.
- Budget screenshots explicitly: recent images in full, older ones summarized as text.
- Add grounding hints — known UI landmarks, expected states — so the model predicts coordinates more accurately.
The context window is the whole world
It is worth stating plainly: a Claude computer-use agent knows nothing about the machine it is driving except what is present in the conversation context at the moment it takes a turn. There is no hidden memory of earlier screens, no awareness of files it cannot see, no sense of time passing. Every decision is a function of the current window. That reframes context design from a tuning detail into the core of the system — you are literally constructing the agent's reality on each call.
So the question is never just "what should I say to Claude" but "what should the model be looking at, in total, when it decides the next action." Everything that follows is about answering that well.
What to put in
Four things earn their place near the top of the context. The goal, stated concretely and with a clear success condition. The operating procedure — look, act, verify, one action at a time — so behavior is consistent. Grounding hints about the environment: the app being used, where key controls live, what "done" looks like. And the tool definitions, with descriptions written for the model, not for a human reviewer.
```python
SYSTEM = """Goal: export the August invoice from the billing app as PDF.
Environment: a desktop with the billing app already open.
Known landmarks: top nav has 'Invoices'; export is a button
labeled 'Download' on the invoice detail page.
Success: a file dialog confirms 'invoice-august.pdf' saved.
Procedure: screenshot, describe, take ONE action, screenshot,
verify the change before continuing."""
```
Landmark hints like these tend to improve coordinate accuracy because the model no longer has to guess what the UI contains: you have told it what to expect, and it confirms that against the screenshot.
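If you write many such prompts, it can help to build them from named fields so that no piece is forgotten. The sketch below is one possible shape, not a prescribed one: `TaskBrief` and `build_system_prompt` are hypothetical names, and the only rule it enforces is that a goal and a success condition are both present.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical container for the pieces that belong in the system prompt."""
    goal: str
    environment: str
    success: str
    landmarks: list[str] = field(default_factory=list)
    procedure: str = (
        "screenshot, describe, take ONE action, screenshot, "
        "verify the change before continuing."
    )


def build_system_prompt(brief: TaskBrief) -> str:
    """Render the brief as labeled lines, one concern per line."""
    if not brief.goal.strip() or not brief.success.strip():
        raise ValueError("goal and success condition are both required")
    lines = [
        f"Goal: {brief.goal}",
        f"Environment: {brief.environment}",
    ]
    if brief.landmarks:
        lines.append("Known landmarks: " + "; ".join(brief.landmarks))
    lines.append(f"Success: {brief.success}")
    lines.append(f"Procedure: {brief.procedure}")
    return "\n".join(lines)


brief = TaskBrief(
    goal="export the August invoice from the billing app as PDF.",
    environment="a desktop with the billing app already open.",
    success="a file dialog confirms 'invoice-august.pdf' saved.",
    landmarks=[
        "top nav has 'Invoices'",
        "export is a button labeled 'Download' on the invoice detail page",
    ],
)
SYSTEM = build_system_prompt(brief)
```

Failing fast on a missing success condition is a deliberate choice here: a prompt without one can still run, but the agent has no clear signal for when to stop.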
What to leave out — and why
Just as important is what you keep out. Three categories actively hurt. Stale screenshots from many steps ago add large token cost while contributing almost nothing — the agent needs the current screen, not the tenth-most-recent one. Raw secrets have no business in the window; they bloat it and risk being echoed. And irrelevant history — long unrelated tangents, verbose logs — dilutes the model's attention on what matters now.
```mermaid
flowchart TD
    A["New turn begins"] --> B["Keep: goal + procedure + hints"]
    B --> C["Keep: last 2-3 screenshots in full"]
    C --> D["Replace: older images with text summary"]
    D --> E["Drop: secrets, dead tangents"]
    E --> F["Assemble trimmed context"]
    F --> G["Call model for next action"]
```
Think of this as context hygiene you run before every API call. The result is a window that stays small, focused, and cheap even on a long task.
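The "drop secrets" step of that pass can be sketched as a small scrubber that masks secret-looking strings in text blocks before the messages are sent. This is a minimal illustration under stated assumptions: the regular expressions are examples only and will miss many real credential formats, `scrub_messages` is a hypothetical helper, and messages are assumed to be dicts whose `content` is either a string or a list of typed blocks.

```python
import re

# Illustrative patterns only; a real deployment would need its own list.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{16,}"),
    re.compile(r"Bearer\s+[A-Za-z0-9._-]{16,}"),
    re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+"),
]


def redact(text: str) -> str:
    """Mask anything that matches one of the secret patterns."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def scrub_messages(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with secrets masked in every text block."""
    cleaned = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            cleaned.append({**message, "content": redact(content)})
            continue
        # Non-text blocks (for example images) are passed through untouched.
        blocks = [
            {**block, "text": redact(block["text"])}
            if block.get("type") == "text" else block
            for block in content
        ]
        cleaned.append({**message, "content": blocks})
    return cleaned
```

A scrubber like this is a safety net, not the primary defense; the more reliable fix is to keep credentials out of the conversation in the first place.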
Budgeting screenshots
Images dominate the cost of computer-use context. A practical budget: keep the most recent two or three screenshots at full fidelity, because the model reasons over the current and immediately prior state, and replace older image blocks with a one-line text note of what that step accomplished. A short running summary — "navigated to Invoices, opened August, found Download button" — preserves continuity at a fraction of the token cost of the original images.
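A sliding window of this kind might look like the sketch below. It assumes screenshots arrive as `image` blocks inside `tool_result` blocks, which is how tool output is typically returned to the model in a Messages-style conversation; `prune_screenshots` and the `notes` mapping (tool-use id to a one-line summary) are hypothetical names introduced for illustration.

```python
import copy


def prune_screenshots(messages, keep_last=3, notes=None):
    """Keep the newest `keep_last` screenshots; swap older ones for text notes."""
    notes = notes or {}
    # Work on a copy so the caller's full history stays intact.
    pruned = copy.deepcopy(messages)
    seen = 0
    # Walk newest to oldest so the most recent screenshots are counted first.
    for message in reversed(pruned):
        content = message.get("content")
        if not isinstance(content, list):
            continue
        for block in reversed(content):
            if block.get("type") != "tool_result":
                continue
            inner = block.get("content")
            if not isinstance(inner, list):
                continue
            if not any(b.get("type") == "image" for b in inner):
                continue
            seen += 1
            if seen > keep_last:
                note = notes.get(block["tool_use_id"], "older screenshot omitted")
                block["content"] = [{"type": "text", "text": f"[{note}]"}]
    return pruned
```

Replacing the image inside the existing `tool_result`, rather than deleting the whole block, keeps each tool result paired with the tool call that produced it.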
The discipline is to never let images accumulate unboundedly. On a 40-step task, full-history screenshots can dwarf everything else in the window; a sliding window plus a text summary keeps the agent both affordable and focused on now.
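A rough back-of-the-envelope calculation shows the scale of the difference. The numbers rest on assumptions that are not from this post: 1024x768 screenshots, an approximate cost of width * height / 750 tokens per image (a commonly cited estimate for Claude image inputs that may differ by model), the whole history resent on every turn, and no prompt caching. Text notes and the system prompt are ignored.

```python
def image_tokens(width_px: int, height_px: int) -> int:
    # Assumed estimate: roughly width * height / 750 tokens per image.
    return round(width_px * height_px / 750)


def cumulative_image_tokens(steps: int, per_image: int, keep_last=None) -> int:
    """Sum the image tokens sent across all turns of a task."""
    total = 0
    for step in range(1, steps + 1):
        # Without a window, turn N carries all N screenshots taken so far.
        in_window = step if keep_last is None else min(step, keep_last)
        total += in_window * per_image
    return total


per_image = image_tokens(1024, 768)                                # 1049
full = cumulative_image_tokens(40, per_image)                      # 860180
windowed = cumulative_image_tokens(40, per_image, keep_last=3)     # 122733
```

Under these assumptions the full-history approach sends about seven times as many image tokens over the task as a three-image window, and the gap keeps widening as the task gets longer.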
Grounding hints that improve accuracy
Coordinate prediction is the riskiest part of computer use, and context is where you de-risk it. Beyond naming landmarks, you can describe expected states ("after clicking Download a save dialog appears"), warn about distractors ("ignore the promo banner at the top"), and tell the agent what success looks like so it stops at the right moment. These hints do not replace the screenshot — the model still confirms visually — but they bias its interpretation toward the right elements and cut the rate of confident wrong clicks.
The deeper point is that grounding hints turn an open-ended perception problem into a guided one. You are giving the model a map, then asking it to match the map to the territory it sees.
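One way to keep hints organized is to store them by kind and render them into a block of text for the prompt. The sketch below is illustrative only: `GroundingHints` and its field names are hypothetical, and the closing reminder line is one possible way to state that hints guide perception rather than replace it.

```python
from dataclasses import dataclass, field


@dataclass
class GroundingHints:
    """Hypothetical grouping of hints by the role they play."""
    landmarks: list[str] = field(default_factory=list)
    expected_states: list[str] = field(default_factory=list)
    distractors: list[str] = field(default_factory=list)
    done_when: str = ""

    def render(self) -> str:
        sections = [
            ("Known landmarks", self.landmarks),
            ("Expected states", self.expected_states),
            ("Ignore", self.distractors),
        ]
        lines = []
        for title, items in sections:
            if items:  # skip empty sections so the prompt stays tight
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        if self.done_when:
            lines.append(f"Stop when: {self.done_when}")
        lines.append(
            "Confirm every hint against the current screenshot before acting."
        )
        return "\n".join(lines)


hints = GroundingHints(
    landmarks=["top nav has 'Invoices'"],
    expected_states=["after clicking Download a save dialog appears"],
    distractors=["the promo banner at the top"],
    done_when="a file dialog confirms 'invoice-august.pdf' saved",
)
hint_text = hints.render()
```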
Keep vs. cut: a quick reference
| Item | Keep or cut | Why |
|---|---|---|
| Goal + success condition | Keep | Defines when to stop |
| Operating procedure | Keep | Consistent behavior |
| Last 2-3 screenshots | Keep | Current perception |
| Screenshots from 10 steps ago | Cut → summarize | High cost, low value |
| API keys / tokens | Cut | Risk + bloat |
| Unrelated logs / tangents | Cut | Dilutes attention |
Common pitfalls
- Letting context grow unbounded. Accumulated screenshots blow the budget and slow every turn. Prune on a sliding window.
- Vague goals. Without a concrete success condition the agent doesn't know when to stop. State exactly what done looks like.
- No grounding hints. Forcing the model to discover the UI cold raises mis-clicks. Name landmarks and expected states.
- Secrets in the prompt. They bloat context and risk being echoed in output. Resolve credentials server-side.
- Keeping full image history. Old screenshots crowd out the current one. Summarize them as text once they are stale.
Design your context in 5 steps
1. Write the goal with an explicit, verifiable success condition at the top.
2. Add a short operating procedure: observe, act once, verify.
3. List grounding hints: app, landmarks, expected states, distractors to ignore.
4. Set a screenshot budget: full fidelity for the latest two or three, summaries for the rest.
5. Run a context-hygiene pass before every call to drop secrets, stale images, and tangents.
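These steps can also be turned into a cheap pre-call check that flags a context which has drifted from the plan. The sketch below is a crude lint under stated assumptions: it expects the system prompt to use `Success:` and `Procedure:` labels as in the example earlier, it counts `image` blocks anywhere in the message list, and its secret pattern is illustrative. `check_context` and `walk_blocks` are hypothetical names.

```python
import re

# Illustrative pattern: a credential-like key followed by ':' or '='.
SECRET_HINT = re.compile(r"(?i)\b(api[_-]?key|password|token)\s*[:=]")


def walk_blocks(content):
    """Yield every content block, descending into nested tool results."""
    if isinstance(content, list):
        for block in content:
            yield block
            yield from walk_blocks(block.get("content"))


def check_context(system: str, messages: list[dict], max_images: int = 3):
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    if "success:" not in system.lower():
        problems.append("system prompt has no explicit success condition")
    if "procedure:" not in system.lower():
        problems.append("system prompt has no operating procedure")

    blocks = [b for m in messages for b in walk_blocks(m.get("content"))]
    images = sum(1 for b in blocks if b.get("type") == "image")
    if images > max_images:
        problems.append(f"{images} screenshots in context, budget is {max_images}")

    texts = [system]
    texts += [m["content"] for m in messages if isinstance(m.get("content"), str)]
    texts += [b.get("text", "") for b in blocks if b.get("type") == "text"]
    if any(SECRET_HINT.search(t) for t in texts):
        problems.append("possible secret in context")
    return problems
```

Running a check like this in development or logging its output in production makes context drift visible before it shows up as mis-clicks or cost.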
Frequently asked questions
How many screenshots should I keep in context?
Usually the most recent two or three at full fidelity. The model reasons over the current and immediately prior state; older images are better replaced with a one-line text summary of what they accomplished.
Do grounding hints really improve click accuracy?
They tend to. Naming UI landmarks and expected states biases the model's interpretation toward the right elements, which reduces confident wrong clicks. The model still confirms against the screenshot, so hints guide perception rather than override it.
Why keep secrets out of context if the tools need them?
Because the model doesn't need them — the tool handler or MCP server injects credentials server-side. Putting them in context wastes tokens and risks the model echoing them in its output.
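A minimal sketch of that separation follows. Everything specific in it is hypothetical: the `BILLING_API_TOKEN` variable, the `billing.example.com` URL, and both function names are placeholders for illustration. The point is only that the credential is read on the server when the request is built, and the text returned to the model never includes it.

```python
import os


def build_request(tool_input: dict, environ=os.environ) -> dict:
    """Server-side only: attach the credential when building the request."""
    token = environ.get("BILLING_API_TOKEN")  # hypothetical variable name
    if not token:
        raise RuntimeError("BILLING_API_TOKEN is not set on the server")
    return {
        # Placeholder URL for illustration.
        "url": f"https://billing.example.com/invoices/{tool_input['invoice_id']}",
        "headers": {"Authorization": f"Bearer {token}"},
    }


def result_for_model(tool_input: dict, status: int) -> str:
    """Only non-sensitive facts go back into the conversation."""
    return f"invoice {tool_input['invoice_id']}: HTTP {status}"
```

The model supplies the invoice id through the tool call and sees only the outcome; the token lives in the handler's environment and never enters the context window.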
What's the simplest way to stop context from exploding?
A sliding window: before each call, keep the latest few screenshots, replace older image blocks with short text notes, and maintain a running progress summary. That alone keeps long sessions affordable.
Designing context for live conversations
CallSphere applies this same context discipline to voice and chat agents — feeding them just the right goal, history, and tool results so they act accurately mid-call without drowning in noise. See the approach at work at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.