How to Measure AI Agent Success: Metrics That Prove It Works
Adoption isn't success. The outcome metrics, override rates, cost-per-resolution, and leading indicators that prove a Claude agent is actually working.
The Anthropic Economic Index gives a macro picture of where AI is being adopted across work. But "adoption" tells you nothing about whether your agent is any good. A Claude agent can be heavily used and quietly wrong, busy and unhelpful, fast and untrusted. Adoption is not success. If you cannot tell the difference, you will scale something that should have been killed — or kill something that was working.
This post is about the scorecard. Which metrics actually prove an agent is working, which ones are vanity, and how to build a measurement system that catches degradation before your customers do. We'll cover outcome metrics, quality signals, the cost-per-resolution math, and the leading indicators that tell you trouble is coming.
Key takeaways
- Measure outcomes (resolution, accuracy, cost-per-task), not activity (messages, tokens, "hours saved").
- The single most important number is task success rate against ground truth, sampled from live traffic.
- Track the human override/escalation rate — it is your earliest honest signal of agent quality.
- Watch cost-per-resolved-task, not cost-per-token; a cheap run that fails costs more than an expensive run that works.
- Leading indicators (rising retries, longer trajectories, more tool errors) predict failures before outcome metrics move.
- Build a small continuously-sampled eval so quality is monitored, not assumed.
Vanity metrics versus metrics that matter
The most common measurement failure is counting activity. Number of conversations, tokens consumed, "hours saved" — these go up whether the agent is brilliant or useless. They feel like progress and prove nothing. Worse, "hours saved" is usually back-calculated from an assumption, which means it measures your optimism, not the agent.
The metrics that matter are about outcomes and trust. Did the task get resolved correctly? Did a human have to step in? What did a correct resolution cost end to end? An agent handling 10,000 conversations is worthless if a third of its answers are wrong; an agent handling 200 with a 99% success rate and a falling override rate is a quiet win. Always ask what a metric does when the agent gets worse: if it doesn't move in the wrong direction, it's vanity.
The metric hierarchy
Organize your measurement into three layers: outcome metrics (the truth), quality signals (the texture), and leading indicators (the early warning). Outcome metrics are slow and honest. Leading indicators are fast and noisy. You need all three, because the fast signals tell you to look before the slow signals confirm the damage.
flowchart TD
A["Live agent traffic"] --> B["Sample & grade\nvs ground truth"]
B --> C{"Success rate\nabove bar?"}
C -->|Yes| D["Watch leading\nindicators"]
C -->|No| E["Alert + investigate"]
D --> F{"Retries / tool\nerrors rising?"}
F -->|Yes| E
F -->|No| G["Healthy:\nlog & continue"]
E --> H["Re-gate or fix unit"]
This loop is the operational heart of measuring an agent. You continuously sample live traffic, grade it against ground truth, and only relax when both the outcome metric and the leading indicators are healthy. The leading-indicator branch is what saves you: tool error rates and retry counts climb days before success rate visibly drops, giving you a head start.
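The two decision nodes in the flowchart can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the `WindowStats` fields, the 0.95 success bar, and the 1.25x "rising" tolerance are all assumptions chosen for the example, and you would tune them to your own task.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one monitoring window (field names are illustrative)."""
    success_rate: float      # graded-correct share of the sampled tasks
    mean_retries: float      # average retries per task
    mean_tool_errors: float  # average tool errors per task


def check_health(current: WindowStats, baseline: WindowStats,
                 success_bar: float = 0.95, rise_tolerance: float = 1.25) -> str:
    """Return "alert" or "healthy", following the flowchart's two decisions."""
    # Decision 1: is the success rate above the bar? (0.95 is an assumed bar)
    if current.success_rate < success_bar:
        return "alert"
    # Decision 2: are retries or tool errors rising versus the baseline window?
    retries_rising = current.mean_retries > baseline.mean_retries * rise_tolerance
    errors_rising = current.mean_tool_errors > baseline.mean_tool_errors * rise_tolerance
    if retries_rising or errors_rising:
        return "alert"
    return "healthy"


baseline = WindowStats(success_rate=0.97, mean_retries=0.4, mean_tool_errors=0.5)
this_week = WindowStats(success_rate=0.97, mean_retries=0.9, mean_tool_errors=0.5)
status = check_health(this_week, baseline)  # success rate is fine, retries are not
```

Here the success rate still clears the bar, but retries have more than doubled against the baseline, so the check alerts on the leading indicator alone.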
The four metrics every agent needs
If you track nothing else, track these four. Task success rate, measured against ground truth and sampled from real traffic, is the headline. Human override/escalation rate is how often a person had to correct the agent or take over; it is the most honest vote on quality you will get. Cost per resolved task is the total cost of tokens and tool calls across whole trajectories, divided by the number of successfully resolved tasks rather than the number of calls. Time to resolution is how long it takes to get from request to correct outcome.
| Metric | What it proves | Watch for |
|---|---|---|
| Task success rate | Correctness | Drift below threshold |
| Override / escalation rate | Real-world trust | Slow upward creep |
| Cost per resolved task | Economic viability | Cheap-but-failing runs |
| Time to resolution | Experience & speed | Long trajectories = struggle |
Notice that cost is per resolved task. A common trap is celebrating low token cost while the agent fails half its tasks. Humans end up redoing those failed tasks, so the true cost is far higher than the per-token meter suggests. Always divide by successful outcomes.
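As a rough sketch of how the four metrics could be computed, assume each run is logged as a record with the fields used below. The field names and the `scorecard` and `run` helpers are hypothetical, and token count stands in for total cost here; a fuller version would also price tool calls and human cleanup time.

```python
from statistics import mean


def scorecard(traces: list[dict]) -> dict:
    """Compute the four headline metrics from a list of graded run traces."""
    if not traces:
        raise ValueError("need at least one trace")
    # A run counts as a success only if it resolved AND was graded correct.
    successes = [t for t in traces if t["resolved"] and t["graded_correct"]]
    total_cost = sum(t["cost_tokens"] for t in traces)  # failed runs still cost tokens
    return {
        "task_success_rate": len(successes) / len(traces),
        "override_rate": sum(t["human_override"] for t in traces) / len(traces),
        # Divide ALL spend by successful outcomes only.
        "cost_per_resolved_task": total_cost / len(successes) if successes else float("inf"),
        # Time to resolution is averaged over successful runs only.
        "mean_time_to_resolution_s": (
            mean(t["trajectory_seconds"] for t in successes) if successes else None
        ),
    }


def run(ok: bool, cost: int, secs: float, override: bool = False) -> dict:
    """Build a made-up trace record for the comparison below."""
    return {"resolved": ok, "graded_correct": ok, "human_override": override,
            "cost_tokens": cost, "trajectory_seconds": secs}


# Illustrative numbers only: a cheap agent that fails half its tasks
# versus a pricier agent that resolves everything.
cheap = scorecard([run(True, 5000, 10.0), run(False, 5000, 10.0, override=True)])
careful = scorecard([run(True, 8000, 20.0), run(True, 8000, 20.0)])
```

With these made-up numbers the cheap agent spends 5,000 tokens per run but 10,000 per resolved task, while the careful agent spends 8,000 per run and 8,000 per resolved task. The per-run meter ranks them the wrong way round.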
Reading the trajectory, not just the answer
An agent's path to an answer is as informative as the answer. A correct answer reached after eight tool errors and three retries is a fragile success that is likely to break under load. Instrument the trajectory: count tool calls, retries, and self-corrections per task. Here is the kind of structured trace worth logging on every run.
{
"task_id": "t-5521",
"resolved": true,
"graded_correct": true,
"tool_calls": 4,
"tool_errors": 2,
"retries": 1,
"trajectory_seconds": 31.4,
"human_override": false,
"cost_tokens": 18840
}
Aggregate these and the leading indicators emerge: a rising mean of tool_errors or retries across tasks is your early warning that an upstream system changed or a prompt regressed. You'll see it before graded_correct averages move, which is exactly when you want to intervene.
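One simple way to detect a rising mean is to compare the most recent window of traces with the window before it. The sketch below assumes traces are ordered oldest to newest; the `rising_indicators` name, the window size of 50, and the 1.25x tolerance are illustrative choices, not recommendations.

```python
from statistics import mean


def rising_indicators(traces: list[dict], window: int = 50, tolerance: float = 1.25,
                      fields: tuple[str, ...] = ("tool_errors", "retries")) -> list[str]:
    """Return fields whose recent mean exceeds the previous window's mean by `tolerance`x.

    `traces` must be ordered oldest to newest.
    """
    if len(traces) < 2 * window:
        return []  # not enough history to compare two full windows
    baseline, recent = traces[-2 * window:-window], traces[-window:]
    flagged = []
    for field in fields:
        base_mean = mean(t[field] for t in baseline)
        recent_mean = mean(t[field] for t in recent)
        if recent_mean > base_mean * tolerance:
            flagged.append(field)
    return flagged


# Made-up history: tool errors double in the latest 50 runs, retries stay flat.
history = [{"tool_errors": 1, "retries": 1}] * 50 + [{"tool_errors": 2, "retries": 1}] * 50
print(rising_indicators(history))  # ['tool_errors']
```

A fixed ratio between two windows is deliberately crude. On noisy, low-volume traffic you would likely want a larger window or a proper statistical test before paging anyone.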
Common pitfalls in measuring agents
- Counting activity as success. Conversations and tokens rise whether the agent is good or bad. Anchor on outcomes graded against ground truth.
- Measuring cost per token, not per resolution. Failed cheap runs are expensive because humans redo them. Always divide cost by successful outcomes.
- No continuous sampling. A one-time pre-launch eval goes stale the moment systems or models change. Sample live traffic on an ongoing basis.
- Ignoring the override rate. If you only watch automated metrics, you'll miss the humans quietly fixing the agent's mistakes. The override rate is your truth serum.
- Watching only lagging metrics. Success rate moves last. Without leading indicators like retries and tool errors, you find out from an angry customer.
Build your agent scorecard in six steps
1. Write down what "resolved correctly" means for your task; that is the ground truth you grade against.
2. Set up continuous sampling of live runs and human grading of the sample.
3. Instrument every run with the structured trace: tool calls, errors, retries, time, cost, override.
4. Define your four headline metrics and a threshold for each.
5. Add alerts on leading indicators (rising tool errors/retries) so you look before outcomes drop.
6. Review weekly; re-gate or fix the specific unit whose signal slipped, and add the failure to your eval set.
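The sampling and threshold steps above can be sketched as follows. This assumes each run has a string task ID and that a human grades every sampled run; the 5% sampling rate and both function names are hypothetical. The Wilson score interval is one standard way to check a sampled success rate against a bar, swapped in here as an illustration rather than something the scorecard requires.

```python
import hashlib
import math


def in_sample(task_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of tasks for human grading."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success rate (z=1.96 is ~95%)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom


# Made-up grading result: 190 of 200 sampled runs graded correct.
lower = wilson_lower_bound(190, 200)
```

Hashing the task ID means the same run is always in or out of the sample, so graders and dashboards agree on what was sampled. In the example, the observed rate is 95% but the lower bound is roughly 0.91, so a sample of 200 is not enough evidence that the agent clears a 95% bar.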
Frequently asked questions
What is the single best metric for an AI agent?
Task success rate measured against ground truth on sampled live traffic. It directly answers whether the agent does the job correctly, which no activity metric can. Pair it with the human override rate so you also catch quality problems the grading might miss.
How is measuring agent success different from the Anthropic Economic Index?
The Index measures macro adoption — where and how much AI is used across occupational tasks. Your scorecard measures whether one specific agent works: correctness, cost per resolution, and trust. Adoption can be high while quality is poor, so you need both lenses for different decisions.
Why measure cost per resolved task instead of per token?
Because failed runs don't disappear — a human redoes them, so a cheap-but-wrong agent is expensive once you count the cleanup. Dividing total trajectory cost by successfully resolved tasks reflects the real economics and stops you from optimizing token price at the expense of correctness.
What are leading indicators for an agent and why do they matter?
Leading indicators are fast-moving signals — rising tool errors, more retries, longer trajectories — that predict trouble before outcome metrics like success rate fall. They give you days of warning to investigate an upstream change or prompt regression instead of learning about it from a customer complaint.
Bringing agentic AI to your phone lines
CallSphere instruments every voice and chat agent with exactly this scorecard — resolution rate, override rate, and cost per resolved call — so you can prove the automation is working, not just busy. See the live dashboards at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.