Prompt and Context Design for Claude Computer Use
Context engineering for Claude computer use: what to include, what to leave out, rolling summaries, prompt structure, and how to tune it empirically.
The hardest engineering problem in computer use is not getting Claude to click — it is deciding what to show the model on each turn. Feed it too little and it acts blind; feed it too much and you drown the signal, blow the context window, and pay for tokens that make the agent worse, not better. Context design is where computer-use agents are won or lost, and it is the part most teams under-invest in.
This post is a focused treatment of that problem: what belongs in a computer-use agent's context, what to deliberately exclude, and how to structure the prompt so the model stays grounded over a long task.
Context is a budget, not a bucket
The instinct is to give the model everything — every screenshot, the full DOM, the entire action history — on the theory that more information helps. It does not. Models attend better to lean, relevant context, and every extra token is both a cost and a distraction. The right mental model is a budget you allocate deliberately: a fixed slice for the system prompt, a slice for the current visual state, a small slice for recent history, and a hard cap on the rest.
This matters even with very large context windows. A bigger window lets you go longer before pruning, but it does not repeal the principle that irrelevant content degrades attention and inflates cost. Treat context as expensive even when it is technically available, and spend it on what the model needs to decide the next action.
What to put in
Four things earn a place in nearly every turn. First, a tight system prompt: the agent's role, the hard rules, and the recovery instructions. Second, the current visual state — the latest screenshot, and for browser work a trimmed accessibility tree of just the interactive elements. Third, a short rolling summary of what has happened so far, so the model has continuity without carrying every frame. Fourth, the concrete success criterion, restated, so the agent always knows what "done" means.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Assemble turn context"] --> B["System: role + rules + recovery"]
A --> C["Current screenshot"]
A --> D["Trimmed element list"]
A --> E["Rolling text summary of prior steps"]
A --> F["Restated success criterion"]
B --> G["Send to Claude"]
C --> G
D --> G
E --> G
F --> G
G --> H["Model returns next action"]
Notice what each input is for. The screenshot grounds spatial reasoning; the element list makes targeting precise; the summary preserves narrative; the criterion prevents drift. Each earns its tokens by changing the quality of the next decision.
What to leave out
The discipline is in the exclusions. Drop old screenshots — keep only the last one or two at full resolution and replace the rest with one-line summaries, because stale frames mislead the model into reasoning about a screen that no longer exists. Drop the full raw DOM; ship only the interactive, visible elements, since a megabyte of markup buries the three buttons that matter. Drop verbose internal logs and stack traces from tool errors; the model needs a clean structured error, not your server's traceback.
Also leave out anything the model should never see: credentials, secrets, and unrelated user data. This is not only security hygiene — it is context hygiene. Sensitive data in context is a prompt-injection target and pure noise for the task at hand. The best computer-use agents are aggressive about what they refuse to include.
Structure the prompt for grounding
Order and labeling matter. Put the stable system prompt first, then clearly delimited blocks for visual state, history, and goal, each labeled so the model knows what it is looking at. When you include both a screenshot and an element list, tell the model how they relate — that the list enumerates the clickable items in the image — so it cross-references rather than treating them as separate worlds.
Be explicit about freshness. State plainly that the screenshot reflects the current screen as of this turn and that anything in the summary is historical. Agents lose the plot when they cannot tell what is now from what was earlier; an explicit temporal frame keeps them grounded. These are small wording choices with outsized effects on accuracy.
Manage history with rolling summaries
The naive history strategy — append every turn forever — fails on any task longer than a handful of steps. The robust pattern is a rolling summary: as old turns age out of the full-fidelity window, compress them into a running narrative the model can carry cheaply. "Navigated to catalog, filtered to desks under $300, three candidates found, comparing prices" is a few tokens and tells the model everything it needs about the past without ten screenshots.
Decide what the summary must preserve: decisions made, facts discovered, and the current sub-goal. Decide what it can drop: the pixel-level detail of screens already acted on. A good summary is the agent's working memory, and maintaining it well is what lets a computer-use agent run for fifty steps without losing coherence or exhausting its budget.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Test context choices empirically
Context design is not a thing you reason your way to once and freeze. Build a small eval set of representative tasks and measure how changes to context affect success rate, step count, and cost. Try shipping the element list without the screenshot on web tasks; try a tighter summary; try a smaller image resolution. The right context recipe is specific to your environment, and the only way to find it is to measure. Teams that treat context as a tunable, evaluated parameter end up with agents that are both cheaper and more reliable than teams that guess — which is the whole game in computer use.
Frequently asked questions
Should I always include a screenshot?
Not always. For browser tasks where a trimmed accessibility tree captures the interactive elements, you can often skip or downsize the screenshot and save significant tokens. Keep full screenshots for visually complex or canvas-based interfaces where the element list is insufficient.
How much history should I keep at full fidelity?
Usually just the last one or two turns with full screenshots; compress everything older into a rolling text summary. This keeps context roughly flat over a long task and prevents stale frames from misleading the model.
Does a bigger context window mean I can skip pruning?
No. A larger window buys you headroom before pruning, but irrelevant context still degrades attention and inflates cost. Spend context deliberately regardless of how much is technically available.
How do I know my context recipe is good?
Measure it. Run a small eval set and track success rate, step count, and token cost as you vary what you include. The optimal recipe is environment-specific, so empirical tuning beats intuition every time.
Bringing agentic AI to your phone lines
CallSphere applies the same context discipline to voice and chat agents — show the model exactly what it needs to decide the next move — so they answer every call and message and book work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.