By Sagar Shankaran, Founder of CallSphere
Claude Opus 4.7 leads the Vals AI Finance Agent benchmark at 64.37%. What the test measures, why finance is harder than retail, and what it means for AI buyers.
Key takeaways
At the New York City briefing on May 5, 2026, Anthropic confirmed that Claude Opus 4.7 leads the Vals AI Finance Agent benchmark at 64.37 percent, making it the most capable model on that evaluation as of that date. The number landed alongside ten pre-built finance agent templates and a Moody's data partnership announced earlier in the week.
For buyers comparing models for financial workflows, 64.37 percent is the headline. What it actually measures is more interesting than the score.
Vals AI runs a finance-specific agent benchmark that tests multi-step reasoning on workflows a junior analyst or associate would encounter: reading 10-Ks and 10-Qs, reconciling figures across documents, building pitchbook sections, running KYC checks, and answering questions where the right answer requires combining tabular data, footnotes, and analyst commentary.
The benchmark grades agents end-to-end, not single-shot. The model has to decide which document to open, which tool to call, when to stop, and how to present the result. That is a much harder bar than a single-turn question.
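To make the difference from single-turn grading concrete, here is a minimal sketch of an agent loop that is scored only on its final answer. It is not Vals AI's harness. The tool names, the `scripted_policy` stand-in for the model, and the figures are all hypothetical and exist only for illustration.

```python
from typing import Callable, Optional

# Hypothetical tools; names and return values are illustrative only.
TOOLS: dict[str, Callable[[str], str]] = {
    "open_document": lambda name: f"<contents of {name}>",
    "lookup_figure": lambda key: {"revenue": "100", "debt": "40"}.get(key, "unknown"),
}

def run_episode(policy, question: str, max_steps: int = 6):
    """Run one episode: the policy picks a tool or stops at every step."""
    transcript = []
    for _ in range(max_steps):
        action = policy(question, transcript)  # the model's decision at this step
        if action["type"] == "stop":
            return action["answer"], transcript
        result = TOOLS[action["tool"]](action["arg"])
        transcript.append((action["tool"], action["arg"], result))
    return None, transcript  # step budget exhausted without an answer

def grade_end_to_end(answer: Optional[str], expected: str) -> bool:
    """Pass only if the final answer is right; intermediate steps earn no credit."""
    return answer == expected

def scripted_policy(question, transcript):
    """Stand-in for a model: open a filing, look up one figure, then stop."""
    if len(transcript) == 0:
        return {"type": "tool", "tool": "open_document", "arg": "10-K"}
    if len(transcript) == 1:
        return {"type": "tool", "tool": "lookup_figure", "arg": "debt"}
    return {"type": "stop", "answer": transcript[-1][2]}

answer, transcript = run_episode(scripted_policy, "What is total debt?")
print(grade_end_to_end(answer, "40"), len(transcript))
```

In a setup like this, a wrong choice at any step (the wrong document, the wrong figure, stopping too early) fails the whole episode, which is why end-to-end scores sit below single-turn scores.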
A score in the mid-60s on finance agent work is not what general-purpose benchmarks would suggest. Frontier models routinely score in the 80s and 90s on factual or reasoning tests. Scores on the Vals AI benchmark are lower because multi-step finance work compounds errors.
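A model that is 90 percent reliable per step, with steps failing independently, completes a six-step workflow only about 53 percent of the time and drops below 50 percent at seven steps. Against that arithmetic, 64.37 percent end-to-end is a meaningful result.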
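The compounding effect is easy to check with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability, which real workflows will not satisfy exactly. The function names are made up for this example, and the six-step figure in the last line is purely illustrative, not a property of the benchmark.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps

def steps_until_below(per_step: float, threshold: float) -> int:
    """Smallest number of steps at which end-to-end success drops below threshold."""
    if not (0.0 < per_step < 1.0 and 0.0 < threshold < 1.0):
        raise ValueError("per_step and threshold must be strictly between 0 and 1")
    steps = 1
    while end_to_end_success(per_step, steps) >= threshold:
        steps += 1
    return steps

def implied_per_step(end_to_end: float, steps: int) -> float:
    """Per-step reliability that would produce a given end-to-end rate."""
    return end_to_end ** (1.0 / steps)

print(f"{end_to_end_success(0.90, 6):.3f}")  # 0.531
print(f"{end_to_end_success(0.90, 7):.3f}")  # 0.478
print(steps_until_below(0.90, 0.50))         # 7
# Illustrative only: per-step reliability a six-step task would need to reach 64.37 percent
print(f"{implied_per_step(0.6437, 6):.3f}")
```

Under these assumptions, a hypothetical six-step task would need per-step reliability of roughly 93 percent to reach 64.37 percent end-to-end.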
Retail and customer-service agents have shorter contexts, fewer documents, and more forgiving ground truth. A small phrasing error in a salon booking does not cost anyone money. A small error in a KYC narrative or a leverage ratio does.
Three structural reasons finance is harder for LLM agents:
These are the same reasons general LLM demos that look magical on consumer queries quietly break on real bank desks.
If you are evaluating models for finance work, the useful framing is not "Opus 4.7 is the best." It is "the ceiling for unsupervised single-agent finance work is in the mid-60s today, and Opus 4.7 is at that ceiling."
That has three implications:
The Opus 4.7 generation introduced longer context windows, stronger tool-use reliability, and better grounded-citation behavior. The Moody's data partnership announced the same week gives Claude grounded financial reference data, an area where general models have historically hallucinated.
In short: better model plus better data plumbing plus task-specific templates. That combination is what moves the Vals AI score from "interesting" to "useful in production."
CallSphere is an AI voice and chat agent platform for customer-facing communication: voice, chat, SMS, and WhatsApp, in more than 57 languages, with around 14 function tools and a HIPAA-friendly posture. We are adjacent to finance back-office agents, not in the same category.
Where this news matters for CallSphere customers:
If you want to see a customer-facing agent live before you decide, book a demo.
Q: Is 64.37 percent good or bad? For end-to-end finance agent work, it is the current state of the art. Models score higher on single-turn benchmarks; the Vals AI benchmark is harder because it grades the whole workflow.
Q: Does Opus 4.7 replace human analysts? No. A 64 percent score means roughly one in three workflows is not completed correctly, so a human still has to review the output and finish the ones that fall short. The lift is in speed and first-draft quality, not full autonomy.
Q: Is CallSphere built on Claude Opus 4.7? CallSphere is model-flexible. Where Claude is the best fit for a customer workflow, we use it. The voice path uses real-time speech models tuned for low latency.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.