# Claude's Computer Use Hits 72.5% on OSWorld — Approaching Human-Level Desktop Operation
Claude Sonnet 4.6 scores 72.5% on the OSWorld benchmark for desktop computer operation, up from under 15% in late 2024, nearly matching human performance.
## From 15% to 72.5% in 15 Months
Claude's ability to operate a computer like a human has improved dramatically, with Sonnet 4.6 scoring 72.5% on OSWorld — up from under 15% in late 2024. The benchmark measures an AI's ability to complete real desktop tasks.
## What OSWorld Tests
OSWorld evaluates whether an AI can:
- Navigate complex spreadsheets
- Complete web forms
- Switch between applications
- Follow multi-step instructions
- Handle unexpected dialog boxes and errors
A score of 72.5% means Claude can successfully complete nearly three-quarters of these real-world desktop tasks — approaching the level of a competent human operator.
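Conceptually, a benchmark like this scores an agent by running each task in a controlled desktop environment and checking the final state with a programmatic verifier. The sketch below is a simplified illustration of that scoring shape, not OSWorld's actual harness; the `Task` type, the dict-based "desktop state", and `fake_agent` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    name: str
    # Verifier inspects the final environment state and says pass/fail.
    check: Callable[[Dict], bool]

def run_benchmark(tasks, run_agent):
    """run_agent(task) returns the final environment state (a dict here).
    The score is the fraction of tasks whose verifier passes."""
    passed = sum(1 for t in tasks if t.check(run_agent(t)))
    return passed / len(tasks)

# Toy example: two verifiers over a fake "desktop state".
tasks = [
    Task("fill_form", lambda state: state.get("form_submitted", False)),
    Task("update_sheet", lambda state: state.get("cell_B2") == 42),
]

def fake_agent(task):
    # Pretends the form was submitted but the spreadsheet edit was wrong.
    return {"form_submitted": True, "cell_B2": 41}

score = run_benchmark(tasks, fake_agent)  # 1 of 2 tasks pass -> 0.5
```

A 72.5% score is the same fraction computed over hundreds of real desktop tasks instead of two toy ones.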
```mermaid
flowchart TD
HUB(("From 15% to 72.5% in 15<br/>Months"))
HUB --> L0["What OSWorld Tests"]
style L0 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
HUB --> L1["How They Got Here"]
style L1 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
HUB --> L2["Comparison Across Models"]
style L2 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
HUB --> L3["Practical Implications"]
style L3 fill:#e0e7ff,stroke:#6366f1,color:#1e293b
style HUB fill:#4f46e5,stroke:#4338ca,color:#fff
```
## How They Got Here
Two key factors drove the improvement:
- Model training improvements in the 4.6 generation focused on spatial understanding and interaction patterns
- Vercept acquisition — the desktop AI startup whose team and technology now contribute directly to Claude's computer use capabilities
## Comparison Across Models
| Model | OSWorld Score |
|---|---|
| Claude Sonnet 4.6 | 72.5% |
| Claude Opus 4.6 | 72.7% |
| Previous generation | ~50% |
| Late 2024 | <15% |
## Practical Implications
At this performance level, Claude can realistically automate routine desktop work: data entry, form filling, report generation, and application navigation. The gap between "demo impressive" and "production useful" has closed.
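Computer-use agents of this kind generally run an observe-think-act loop: capture the screen, ask the model for the next UI action, execute it, repeat until done. The sketch below shows only that control flow; `capture_screen`, `model_next_action`, and `execute` are hypothetical stand-ins, not Anthropic's API.

```python
def capture_screen():
    # A real agent would grab an actual screenshot here.
    return "<screenshot bytes>"

def model_next_action(goal, screenshot, history):
    # A real agent would call a vision-capable model. This stub clicks
    # once, then reports the task as done.
    if history:
        return {"type": "done"}
    return {"type": "click", "x": 100, "y": 200}

def execute(action):
    pass  # would drive the mouse/keyboard

def run_task(goal, max_steps=50):
    """Observe-think-act loop with a step budget."""
    history = []
    for _ in range(max_steps):
        action = model_next_action(goal, capture_screen(), history)
        if action["type"] == "done":
            return history
        execute(action)
        history.append(action)
    raise TimeoutError("step budget exhausted")

steps = run_task("Fill in the expense form")  # one click, then done
```

The step budget matters in practice: at ~70% task success, the failure mode to engineer around is an agent looping on an unexpected dialog rather than crashing outright.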
Source: Anthropic | NxCode | DataCamp | Natural 20
## The operator perspective on Claude's 72.5% OSWorld result
Behind the headline sits a smaller, more useful question: which production constraint just got cheaper to solve: first-token latency, language coverage, structured outputs, or tool-call reliability? For CallSphere (Twilio + OpenAI Realtime + ElevenLabs + NestJS + Prisma + Postgres, 37 agents across 6 verticals), the bar for adopting any new model or API is unsentimental: does it shorten the inner loop on a real call, or just on a benchmark?
## What AI news actually moves the needle for SMB call automation
Most AI news is noise. A new benchmark score, a leaderboard reshuffle, a leaked memo: none of it changes whether your AI receptionist books appointments without dropping the call. The handful of things that *do* move production AI voice and chat are concrete:

- Realtime API stability: does the WebSocket survive 5+ minutes without a stall?
- Language coverage: does it handle 57+ languages with usable accents, or is English the only first-class citizen?
- Tool-use reliability: does the model actually call the right function with the right argument types under load?
- Multi-agent handoffs: do specialist agents receive structured context, or just transcripts?
- Latency under load: p95 first-token under 800ms when 200 concurrent calls hit the same endpoint?

The CallSphere rule on news: if it doesn't move at least one of those five numbers in a measurable eval, it's a blog post, not a product change. What to track: provider changelogs for realtime endpoints, tool-call schema changes, language-add announcements, and any deprecation that pins your stack to a sunset date. What to ignore: leaderboard wins on tasks that don't map to your call flow, "agentic" benchmarks that don't measure tool latency, and demos that work because the prompt was hand-tuned for the demo. The teams that ship fastest treat AI news the same way ops teams treat CVE feeds: read everything, act on the small fraction that touches your runtime, archive the rest.
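The "p95 first-token under 800ms" check above is a one-function computation once you have load-test telemetry. This sketch uses the nearest-rank percentile method and made-up latency samples; real numbers would come from your own traffic.

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest value."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

# Illustrative first-token latencies (ms) from a hypothetical load test.
latencies = [420, 510, 390, 640, 780, 950, 460, 500, 530, 610,
             700, 480, 450, 520, 560, 590, 610, 430, 470, 820]

passes_gate = p95(latencies) <= 800  # p95 here is 820 ms, so the gate fails
```

Note the mean of these samples is well under 800ms; gating on p95 rather than the mean is what catches the tail stalls that actually drop calls.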
## FAQs
**Q: Why isn't Claude's 72.5% OSWorld score an automatic upgrade for a live call agent?**
A: Most of the time it isn't, and that's the right starting assumption. The relevant test is whether it improves at least one of: p95 first-token latency, tool-call argument accuracy on noisy inputs, multi-turn handoff stability, or per-session cost. Healthcare deployments, for example, use 14 vertical-specific tools alongside post-call sentiment scoring and lead-quality classification.
**Q: How do you sanity-check a result like the 72.5% OSWorld score before pinning a model version?**
A: The eval gate is unsentimental: a regression suite that simulates real call traffic (noisy ASR, partial inputs, tool-call timeouts) measures four numbers, and a candidate has to win on three of the four without losing badly on the fourth. Anything else is treated as a blog post, not a stack change.
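The "win on three of four without losing badly on the fourth" rule reduces to a few lines of code. The metric names and the 5-point regression limit below are illustrative assumptions, not CallSphere's actual thresholds; all metrics are treated as higher-is-better rates in [0, 1].

```python
def passes_eval_gate(baseline, candidate, regress_limit=0.05):
    """Candidate must beat baseline on >= 3 of 4 metrics, and its worst
    regression on any metric must stay within regress_limit."""
    wins = sum(candidate[k] > baseline[k] for k in baseline)
    worst_regression = max(baseline[k] - candidate[k] for k in baseline)
    return wins >= 3 and worst_regression <= regress_limit

baseline  = {"tool_acc": 0.91, "handoff": 0.88, "latency_ok": 0.95, "cost_ok": 0.80}
candidate = {"tool_acc": 0.94, "handoff": 0.90, "latency_ok": 0.97, "cost_ok": 0.78}

ok = passes_eval_gate(baseline, candidate)  # 3 wins, 0.02 loss on cost -> passes
```

A candidate that wins big on two metrics but regresses on the other two fails the gate regardless of how large the wins are; the rule is deliberately asymmetric toward not breaking production.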
**Q: Where does a capability jump like the OSWorld result fit in CallSphere's 37-agent setup?**
A: In a CallSphere deployment, new model and API capabilities land first in the post-call analytics pipeline (lower stakes, async, easy to roll back) and only later in the live realtime path. Today the verticals most likely to absorb new capability first are Salon and Real Estate, which already run the largest share of production traffic.
## See it live
Want to see salon agents handle real traffic? Walk through https://salon.callsphere.tech or grab 20 minutes with the founder: https://calendly.com/sagar-callsphere/new-meeting.