Evals for Agent Skills: Measure Quality, Gate Releases

"It seems better" is not a release criterion, yet it's how most Agent Skills get shipped. Someone tweaks the instructions, runs the Skill on one example, the output looks nicer, and it goes out. Then a different input breaks, a fix for one case quietly regresses three others, and nobody can say whether the Skill is improving or just changing. The cure is an eval loop: a repeatable way to measure quality so that every change is judged against the same bar, and no release ships without clearing it.

This post walks through building an eval loop for a Claude Agent Skill — choosing cases, scoring outputs, accounting for variance, and wiring the whole thing into a release gate. It is the difference between refining a Skill and merely editing it.

Key takeaways

An eval set is a fixed collection of inputs with a way to score each output — build it before you optimize.
Score the right thing: outcomes and tool-call correctness, not surface wording.
Agents are non-deterministic, so run each case several times and look at pass rate, not a single roll.
Use rubric-based LLM judges for open-ended quality and exact checks for anything verifiable.
Gate releases on a threshold and block any change that drops a previously-passing case.

An eval is a repeatable test that runs a Skill against a fixed set of inputs and scores each output against an explicit success criterion, so quality can be tracked as a number across versions rather than judged by impression. Without it, refinement is guesswork; with it, refinement becomes engineering.

What should go into the eval set?

A good eval set is a representative spread, not a pile of easy cases. Include the common path the Skill handles daily, the hard edge cases where it tends to slip, and, crucially, a regression bucket: every bug you've ever fixed, frozen as a case so it can never silently return. Aim for cases diverse enough that passing all of them genuinely means the Skill works, not that it was tuned to a narrow demo.

For each case, decide how it's scored. Some outputs are exactly checkable (did the diff apply? did the JSON validate? was the right tool called with the right argument?). Others are open-ended (is this summary faithful and useful?) and need a rubric. Tag each case with its scoring method up front, because the method shapes how you write the expected result.
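One way to make the tagging concrete is to give every case a small record that carries its bucket and its scoring method. The sketch below is only an illustration: the `EvalCase` fields, the enum names, and the refund-themed sample cases are all hypothetical, not part of any Skill API.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    COMMON = "common"
    EDGE = "edge"
    REGRESSION = "regression"


class Scoring(Enum):
    EXACT = "exact"    # verifiable: diff applied, JSON validates, right tool call
    RUBRIC = "rubric"  # open-ended: graded by an LLM judge against named criteria


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    bucket: Bucket
    scoring: Scoring
    prompt: str
    expected: str  # exact cases: the value to check; rubric cases: reference notes


def bucket_counts(cases: list[EvalCase]) -> dict[str, int]:
    """Count cases per bucket so an empty bucket is easy to spot."""
    counts = {bucket.value: 0 for bucket in Bucket}
    for case in cases:
        counts[case.bucket.value] += 1
    return counts


# Hypothetical sample cases, one per bucket.
cases = [
    EvalCase("refund-basic", Bucket.COMMON, Scoring.EXACT,
             "Refund order 1001", '{"tool": "issue_refund", "order_id": 1001}'),
    EvalCase("refund-policy", Bucket.EDGE, Scoring.RUBRIC,
             "Explain the partial refund policy", "Partial refunds apply within 30 days."),
    EvalCase("bug-missing-order-id", Bucket.REGRESSION, Scoring.EXACT,
             "Refund my order", '{"tool": "ask_for_order_id"}'),
]

print(bucket_counts(cases))  # {'common': 1, 'edge': 1, 'regression': 1}
```

Making the scoring method a required field means a case cannot enter the set until someone has decided how it will be graded.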

flowchart TD
  A["Proposed Skill change"] --> B["Run eval set\nN times per case"]
  B --> C{"Verifiable case?"}
  C -->|Yes| D["Exact check:\npass/fail"]
  C -->|No| E["LLM judge\nscores rubric"]
  D --> F["Aggregate pass rate"]
  E --> F
  F --> G{"Meets threshold &\nno regressions?"}
  G -->|No| H["Block release,\ninspect failures"]
  G -->|Yes| I["Ship & record baseline"]

How do I score open-ended outputs reliably?

For outputs without a single correct answer, use an LLM judge with an explicit rubric. Most of the reliability comes from the rubric rather than the judge model: vague instructions like "rate quality 1-10" produce noisy, unrepeatable scores, while a rubric that names specific criteria produces far more consistent ones. Define each criterion as a yes/no question the judge can answer from the material you hand it (the output, the original question, and any reference notes) without guessing at intent.

You are grading an agent's answer against the question and reference notes.
Score each criterion as PASS or FAIL, then give an overall verdict.

Criteria:
1. Factual: every claim is supported by the reference notes (no invention).
2. Complete: addresses every part of the user's question.
3. Scoped: does not include actions or info outside what was asked.
4. Format: valid JSON matching the required schema.

Return: {"factual":"PASS|FAIL","complete":"...","scoped":"...",
         "format":"...","overall":"PASS|FAIL","reason":"one sentence"}

Two practices keep judges honest. First, periodically spot-check the judge's verdicts against your own — if you disagree often, the rubric is ambiguous, so sharpen it. Second, prefer many narrow criteria over one broad score; narrow criteria are easier for the judge to apply consistently and easier for you to act on when one fails.
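Both practices are easy to mechanize once the judge's reply is parsed. The sketch below assumes the judge returns the JSON shape shown in the rubric; it makes no model call and simply takes the reply as a string. Treating the overall verdict as "every criterion passed" is an assumption of this sketch, as are the function names.

```python
import json

CRITERIA = ("factual", "complete", "scoped", "format")


def parse_verdict(raw: str) -> dict[str, bool]:
    """Turn the judge's JSON reply into booleans; reject anything malformed."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    verdict = {}
    for name in CRITERIA:
        value = data.get(name)
        if value not in ("PASS", "FAIL"):
            raise ValueError(f"criterion {name!r} has invalid value {value!r}")
        verdict[name] = value == "PASS"
    # Assumption: recompute overall from the criteria rather than trust the judge's summary.
    verdict["overall"] = all(verdict[name] for name in CRITERIA)
    return verdict


def agreement_by_criterion(judge: list[dict[str, bool]],
                           human: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of spot-checked outputs where judge and human agree, per criterion."""
    if not judge or len(judge) != len(human):
        raise ValueError("need equal, non-empty lists of verdicts")
    return {
        name: sum(j[name] == h[name] for j, h in zip(judge, human)) / len(judge)
        for name in CRITERIA
    }


# Two made-up judge replies and the matching human spot-check labels.
replies = [
    '{"factual":"PASS","complete":"PASS","scoped":"FAIL","format":"PASS",'
    '"overall":"PASS","reason":"Added an unrequested step."}',
    '{"factual":"PASS","complete":"FAIL","scoped":"PASS","format":"PASS",'
    '"overall":"FAIL","reason":"Skipped the second question."}',
]
judge = [parse_verdict(reply) for reply in replies]
human = [
    {"factual": True, "complete": True, "scoped": False, "format": True},
    {"factual": False, "complete": False, "scoped": True, "format": True},
]

print(judge[0]["overall"])  # False: "scoped" failed, whatever the judge summarized
print(agreement_by_criterion(judge, human))
# {'factual': 0.5, 'complete': 1.0, 'scoped': 1.0, 'format': 1.0}
```

A low agreement number on one criterion points at the exact line of the rubric that needs sharpening, which is the payoff of keeping criteria narrow.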

Why must I run each case more than once?

Because agents are non-deterministic. The same input can pass on one run and fail on the next, so a single pass tells you almost nothing. Run each case several times and report the pass rate. A case that passes 5 of 5 is encouraging, though five runs is still a small sample; one that passes 3 of 5 is fragile and can be expected to fail in production roughly two times in five. Tracking variance also stops you from celebrating a lucky run or condemning an unlucky one.

This is why "I ran it once and it looked great" is a trap: you measured a single sample of a random variable. The eval loop measures the distribution, which is what your users actually experience over many runs.
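Measuring the distribution can be as simple as wrapping one case in a loop. In this sketch, `run_once` is a hypothetical callable that invokes the Skill on a single case and returns whether the check passed; `flaky_case` is a random stand-in so the example runs without a real Skill.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int = 5) -> float:
    """Run one eval case `runs` times and return the fraction that passed."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passes = sum(1 for _ in range(runs) if run_once())
    return passes / runs


# Stand-in for "invoke the Skill, then check the output": passes about 60% of the time.
rng = random.Random(0)


def flaky_case() -> bool:
    return rng.random() < 0.6


print(pass_rate(flaky_case, runs=1))    # a single roll: either 0.0 or 1.0
print(pass_rate(flaky_case, runs=200))  # settles near the underlying rate
```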

How does this become a release gate?

A gate is a rule that a build must satisfy to ship. Define two conditions: an overall threshold (e.g., aggregate pass rate must stay at or above your baseline) and a no-regression rule (no case that previously passed may now fail, where "passed" means its pass rate across runs met the per-case bar you set). Run the eval set on every proposed change to the Skill, compare against the recorded baseline, and block the change if either condition fails. When a change clears the gate, record the new pass rates as the baseline so quality ratchets upward and can't quietly slide back.
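The two conditions fit in a few lines. The sketch below makes some assumptions you may want to change: per-case pass rates are stored as a dict keyed by case ID, the aggregate is the plain mean over the baseline's cases, and a case counts as "passing" when its pass rate reaches `case_bar` (here, every run passed). The `gate` function and its parameters are hypothetical names.

```python
def gate(baseline: dict[str, float], candidate: dict[str, float],
         case_bar: float = 1.0) -> tuple[bool, list[str]]:
    """Return (ship?, reasons). Values are per-case pass rates in [0, 1]."""
    if not baseline:
        raise ValueError("baseline is empty; record one before gating")
    missing = sorted(set(baseline) - set(candidate))
    if missing:
        return False, [f"cases missing from candidate run: {missing}"]

    reasons = []
    # Condition 1: aggregate pass rate must not fall below the baseline.
    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate[case_id] for case_id in baseline) / len(baseline)
    if cand_avg < base_avg:
        reasons.append(f"aggregate pass rate fell from {base_avg:.2f} to {cand_avg:.2f}")

    # Condition 2: no case that met the bar before may miss it now.
    for case_id in sorted(baseline):
        if baseline[case_id] >= case_bar and candidate[case_id] < case_bar:
            reasons.append(
                f"regression: {case_id} fell from "
                f"{baseline[case_id]:.2f} to {candidate[case_id]:.2f}"
            )
    return not reasons, reasons


baseline = {"a": 1.0, "b": 1.0, "c": 0.6}

# The aggregate improves, but case "b" regressed, so the gate blocks the change.
print(gate(baseline, {"a": 1.0, "b": 0.8, "c": 1.0}))
# (False, ['regression: b fell from 1.00 to 0.80'])

# No regressions and a higher aggregate: ship, then record this as the new baseline.
print(gate(baseline, {"a": 1.0, "b": 1.0, "c": 0.8}))
# (True, [])
```

The first call is the case the no-regression rule exists for: a change that looks better on average while quietly breaking something that used to work.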

Common pitfalls

Grading wording instead of outcomes. An exact-string match flags a correct answer phrased differently. Score what matters — was the task accomplished, was the right tool called — not surface text.
A single run per case. Non-determinism makes one run meaningless. Run several and report pass rate, or you'll ship fragile changes that looked fine once.
Vague judge rubrics. "Rate 1-10" is noise. Use named PASS/FAIL criteria the judge can apply using only the output and the references you give it.
Never adding regression cases. If fixed bugs don't become permanent cases, they come back. Every fix should leave a test behind.
Eval set frozen forever. Production surfaces new failure modes; if they never enter the eval set, your gate slowly stops reflecting reality. Feed real failures back in.
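The first pitfall is easy to see with a tool call. The sketch below compares a string match against a check on the parsed call. The `name`/`arguments` shape and the `book_appointment` tool are hypothetical, and allowing extra arguments is a design choice of this sketch that you may want to tighten.

```python
import json


def tool_call_matches(actual: dict, expected_tool: str, expected_args: dict) -> bool:
    """Pass if the right tool was called with the expected arguments.

    Extra arguments are tolerated; only the expected ones are compared.
    """
    if actual.get("name") != expected_tool:
        return False
    args = actual.get("arguments", {})
    return all(key in args and args[key] == value
               for key, value in expected_args.items())


expected_text = ('{"name": "book_appointment", '
                 '"arguments": {"date": "2025-03-01", "slot": "10:00"}}')
# Same call, different key order and spacing.
actual_text = ('{"arguments":{"slot":"10:00","date":"2025-03-01"},'
               '"name":"book_appointment"}')

print(actual_text == expected_text)  # False: string match flags a correct call

expected = json.loads(expected_text)
print(tool_call_matches(json.loads(actual_text),
                        expected["name"], expected["arguments"]))  # True
```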

Stand up an eval loop in 7 steps

1. Collect 15-30 representative inputs spanning common, edge, and previously-broken cases.
2. Tag each as exact-check or rubric-judged and write its success criterion.
3. Build a runner that executes the Skill on every case and collects outputs and tool calls.
4. Run each case N times (start with 3-5) and compute per-case pass rate.
5. Score exact cases by check and open-ended cases with a rubric LLM judge.
6. Set a threshold plus a no-regression rule and record the current results as baseline.
7. Run the gate on every change; ship only on a pass and feed new production failures back in.

Scoring method by output type

Output type	Scoring method	Example check
Code change	Exact	Tests pass, diff in scope
Tool call	Exact	Right tool, right args
Structured data	Exact	Schema validates
Summary / answer	Rubric judge	Faithful, complete, scoped
Multi-step plan	Rubric judge	Steps valid & ordered
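
The table above translates directly into a lookup, and the "schema validates" row can start as a very small exact check. This is a sketch under assumptions: the output-type keys are labels invented here, and `validates` only checks that required fields exist with the right Python type. A fuller setup would likely use a proper JSON Schema validator.

```python
import json

# Mirrors the table above; the keys are hypothetical labels chosen for this sketch.
SCORING_BY_OUTPUT_TYPE = {
    "code_change": "exact",
    "tool_call": "exact",
    "structured_data": "exact",
    "summary": "rubric",
    "multi_step_plan": "rubric",
}


def validates(raw: str, required: dict[str, type]) -> bool:
    """Exact check: output parses as a JSON object with the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), expected_type)
               for key, expected_type in required.items())


required = {"order_id": int, "status": str}

print(SCORING_BY_OUTPUT_TYPE["structured_data"])                         # exact
print(validates('{"order_id": 1001, "status": "refunded"}', required))   # True
print(validates('{"order_id": "1001", "status": "refunded"}', required)) # False
print(validates("Sure! Here is your refund.", required))                 # False
```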

Frequently asked questions

How many cases do I need to start?

Fewer than you think. 15-30 well-chosen cases spanning common, edge, and regression scenarios catch most problems. Grow the set as production reveals new failures rather than front-loading hundreds of shallow cases.

Can an LLM judge be trusted to grade itself?

With a sharp rubric and periodic human spot-checks, yes for open-ended quality. Keep verifiable outputs on exact checks, where there's no judgment to trust, and reserve the judge for genuinely subjective criteria.

What pass rate should gate a release?

Set it to your current baseline and require no regressions, then raise it over time. The exact number matters less than the discipline of never shipping a build that scores below the last one.

How many runs per case is enough?

Start with 3-5. If pass rates are stable across runs, that's plenty; if they swing, add runs until the rate settles. The goal is to measure the distribution your users will actually see.
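
To put a number on "until the rate settles", you can attach a confidence interval to each pass rate. The sketch below uses the Wilson score interval, which is a technique swapped in here for illustration and not something the workflow above prescribes; the 95% level (`z = 1.96`) is likewise an assumption.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true pass rate, given passes out of runs."""
    if runs < 1 or not 0 <= passes <= runs:
        raise ValueError("need runs >= 1 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


for passes, runs in [(5, 5), (3, 5), (50, 50)]:
    low, high = wilson_interval(passes, runs)
    print(f"{passes}/{runs}: plausible pass rate {low:.2f} to {high:.2f}")
```

With five runs, a perfect 5 of 5 is still compatible with a true pass rate as low as roughly 0.57, and 3 of 5 spans about 0.23 to 0.88. Adding runs narrows the interval, which is a practical signal for when to stop.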

Bringing agentic AI to your phone lines

CallSphere ships voice and chat agents the same disciplined way — every change measured against a real eval set before it reaches a customer call — so the assistants that answer your phones and book work 24/7 keep getting better, not just different. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
