---
title: "Test and refine a Claude Agent Skill: a walkthrough"
description: "A concrete walkthrough: scaffold, build an eval set, run it with samples, read trigger vs quality scores, and refine a Claude Agent Skill with skill-creator."
canonical: https://callsphere.ai/blog/test-and-refine-a-claude-agent-skill-a-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "agent skills", "skill-creator", "evals", "anthropic", "walkthrough"]
author: "CallSphere Team"
published: 2026-02-28T08:23:11.000Z
updated: 2026-06-07T01:28:23.227Z
---

# Test and refine a Claude Agent Skill: a walkthrough

> A concrete walkthrough: scaffold, build an eval set, run it with samples, read trigger vs quality scores, and refine a Claude Agent Skill with skill-creator.

You have written an Agent Skill. It worked the first time you tried it, so you shipped it — and a week later a teammate reports it either didn't fire on an obvious request or fired on something unrelated and derailed their session. This is the universal first lesson of skill development: a skill that works once is not a skill that works. This walkthrough takes a single skill from rough draft to reliably tested using `skill-creator`, with the exact moves an engineer makes at each step and what the output looks like.

We'll use a running example: a `changelog` skill that turns a list of merged commits into customer-facing release notes. It is small enough to follow completely and realistic enough that every step matters.

## Key takeaways

- **Start from behavior, not prose**: write the scenarios before polishing the body, so you're optimizing against a target.
- **Include near-misses**: the prompts that look related but should NOT trigger are how you catch over-eager skills.
- **Run with samples**: every scenario runs several times so you see a rate, not a coin flip.
- **Read the two scores apart**: trigger accuracy and execution quality fail for different reasons and need different fixes.
- **One edit per loop**: change a single thing, re-run the same set, confirm the number moved.

## Step 1: scaffold the skill with skill-creator

Invoke skill-creator and describe what you want. It creates the folder, a `SKILL.md` with YAML frontmatter, and stubs for any reference files. The frontmatter is the only part Claude sees during discovery, so it gets the most scrutiny. A first draft looks like this:

```markdown
---
name: changelog
description: Turn merged commits into customer-facing release notes,
  grouped by change type, omitting internal-only work.
---

When invoked, read the provided commit list. Group entries into
Features, Fixes, and Breaking changes. Drop pure refactors, test,
and CI commits. Write each line in user-facing language, not commit
shorthand. End with an Upgrade notes section if anything breaks.
```

This is a starting point, not the answer. Resist the urge to keep polishing the wording now — you don't yet have a way to know whether it's good.
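Because the frontmatter is what discovery works from, it can help to look at it in isolation. The following is a minimal sketch, not part of skill-creator: `read_frontmatter` is a hypothetical helper that handles only the simple `key: value` shape of the draft above (with indented continuation lines), and it is not a full YAML parser.

```python
# The draft SKILL.md from this step, inlined so the example is self-contained.
DRAFT = """\
---
name: changelog
description: Turn merged commits into customer-facing release notes,
  grouped by change type, omitting internal-only work.
---

When invoked, read the provided commit list. Group entries into
Features, Fixes, and Breaking changes.
"""


def read_frontmatter(skill_md: str) -> dict[str, str]:
    """Return top-level frontmatter keys as plain strings.

    Hypothetical helper: supports `key: value` lines and folds indented
    continuation lines into the previous value. Not a general YAML parser.
    """
    lines = skill_md.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    fields: dict[str, str] = {}
    current = None
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing fence: everything after this is the body
        if line[:1].isspace() and current is not None:
            fields[current] += " " + line.strip()  # continuation line
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' fence")


fields = read_frontmatter(DRAFT)
print(fields["name"])
print(fields["description"])
```

Reading the description on its own, without the body underneath it, is a quick way to ask whether those one or two sentences are enough to decide when the skill should load.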

## Step 2: write the eval scenarios first

Before refining anything, build the scenario set. skill-creator can draft these, but you should curate them, because they are your definition of correct. Cover three categories: prompts that must trigger, prompts that must not, and near-misses that sit right on the boundary.

```json
[
  { "prompt": "Write release notes for v4.2 from these 18 commits: ...",
    "should_trigger": true,
    "rubric": ["groups into Features/Fixes/Breaking", "drops refactor+CI commits",
               "user-facing wording", "adds Upgrade notes when API changes"] },
  { "prompt": "Summarize this single PR for the reviewer",
    "should_trigger": false, "rubric": ["does NOT load changelog"] },
  { "prompt": "Turn this sprint's merges into notes for the marketing team",
    "should_trigger": true, "rubric": ["groups by type", "non-technical tone"] }
]
```

The second entry is the one most people forget. Without it, you'd never learn that your description is broad enough to hijack ordinary PR-summary requests.
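A small sanity check can catch a lopsided scenario set before you spend runs on it. This is a sketch under assumptions: the field names (`prompt`, `should_trigger`, `rubric`) follow the example above, while `check_scenarios` is a hypothetical helper and not something skill-creator provides.

```python
import json

# The scenario set from this step; the elided commit list stays elided.
SCENARIOS_JSON = """
[
  { "prompt": "Write release notes for v4.2 from these 18 commits: ...",
    "should_trigger": true,
    "rubric": ["groups into Features/Fixes/Breaking", "drops refactor+CI commits",
               "user-facing wording", "adds Upgrade notes when API changes"] },
  { "prompt": "Summarize this single PR for the reviewer",
    "should_trigger": false, "rubric": ["does NOT load changelog"] },
  { "prompt": "Turn this sprint's merges into notes for the marketing team",
    "should_trigger": true, "rubric": ["groups by type", "non-technical tone"] }
]
"""


def check_scenarios(scenarios: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    for i, s in enumerate(scenarios):
        if not isinstance(s.get("prompt"), str) or not s["prompt"].strip():
            problems.append(f"scenario {i}: missing prompt")
        if not isinstance(s.get("should_trigger"), bool):
            problems.append(f"scenario {i}: should_trigger must be true or false")
        if not s.get("rubric"):
            problems.append(f"scenario {i}: empty rubric")
    # A set with only positives cannot detect over-triggering, and vice versa.
    if not any(s.get("should_trigger") is True for s in scenarios):
        problems.append("no must-trigger scenario")
    if not any(s.get("should_trigger") is False for s in scenarios):
        problems.append("no must-not-trigger scenario")
    return problems


scenarios = json.loads(SCENARIOS_JSON)
print(check_scenarios(scenarios))

# Dropping the negative entry is flagged immediately.
only_positive = [s for s in scenarios if s["should_trigger"]]
print(check_scenarios(only_positive))
```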

## Step 3: run the harness with multiple samples

Run the eval set through skill-creator's harness. Each scenario executes several times in fresh sessions, and the harness records the full transcript per run: did the skill load, what did it produce, and how does each run score against the rubric. The aggregated view is what you read.

```mermaid
flowchart TD
  A["Scenario set"] --> B["Run skill N times each"]
  B --> C["Capture transcripts"]
  C --> D["Judge vs rubric"]
  D --> E{"Trigger correct?"}
  E -->|No| F["Edit description"]
  E -->|Yes| G{"Quality >= gate?"}
  G -->|No| H["Edit body / add example"]
  G -->|Yes| I["Skill passes"]
  F --> B
  H --> B
```
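The sampling part of that loop can be sketched in a few lines. Everything here is an assumption made for illustration: `RunResult`, `run_scenario`, and `scripted_runner` are hypothetical names, and the real harness's interface may look nothing like this. The scripted runner simply replays canned outcomes so the aggregation logic can be run without calling a model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunResult:
    triggered: bool               # did the skill load in this run?
    rubric_pass: dict[str, bool]  # judge verdict for each rubric line


def run_scenario(scenario: dict,
                 run_once: Callable[[str], RunResult],
                 samples: int = 8) -> dict:
    """Run one scenario `samples` times and aggregate counts, not a single verdict."""
    results = [run_once(scenario["prompt"]) for _ in range(samples)]
    fired = sum(r.triggered for r in results)
    # For a must-not-trigger scenario, the correct outcome is staying quiet.
    correct = fired if scenario["should_trigger"] else samples - fired
    rubric = {
        line: sum(r.rubric_pass.get(line, False) for r in results)
        for line in scenario["rubric"]
    }
    return {"samples": samples, "fired": fired,
            "trigger_correct": correct, "rubric": rubric}


def scripted_runner(outcomes: Iterable[RunResult]) -> Callable[[str], RunResult]:
    """Stand-in for a real session: replays pre-recorded outcomes in order."""
    it = iter(outcomes)
    return lambda prompt: next(it)


near_miss = {"prompt": "Summarize this single PR for the reviewer",
             "should_trigger": False, "rubric": ["does NOT load changelog"]}

# Illustrative outcomes: the skill wrongly fires in 3 of 8 runs.
outcomes = [RunResult(triggered=i < 3,
                      rubric_pass={"does NOT load changelog": i >= 3})
            for i in range(8)]

report = run_scenario(near_miss, scripted_runner(outcomes))
print(report)
```

Keeping raw counts alongside the sample size, rather than a single pass/fail flag, is what lets you compare the same scenario across loops.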

A first run on our changelog skill might come back like this: each must-trigger prompt loads the skill in 8/8 runs, but the skill also fires on the "summarize this single PR" near-miss in 3/8 runs, and on the must-trigger marketing prompt only 5/8 runs produce a non-technical tone. Now you have two specific problems with specific addresses.

## Step 4: read the scores and locate the fix

The over-trigger on PR summaries is a frontmatter problem: the description names "release notes" but never says what the skill is not for, so a request to summarize a PR sits close enough to match. The marketing-tone miss is a body problem: the instruction mentions "user-facing language" but never distinguishes a developer audience from a marketing one. Notice how each metric points at exactly one part of the skill and one edit. That is the whole value of separating the scores.

Resist fixing both at once. If you edit the description and the body in the same pass and the numbers move, you won't know which edit did it, and you'll carry a possibly-harmful change forward. Pick the higher-impact one — here, the over-trigger, since a skill that hijacks unrelated work is worse than one with uneven tone.
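The triage rule described above (a trigger failure points at the description, a quality failure points at the body, and trigger failures come first) can be written down as a short function. This is a sketch under assumptions: `locate_fix` and the report shape are hypothetical, the gate of 1.0 is only an example value, and the two failing numbers are the ones from the first run in Step 3.

```python
def locate_fix(reports: list[dict], gate: float) -> list[tuple[str, str, float]]:
    """Return (what failed, which layer to edit, observed rate), worst priority first."""
    findings = []
    for r in reports:
        trigger_rate = r["trigger_correct"] / r["samples"]
        if trigger_rate < gate:
            # Priority 0: wrong triggering is fixed in the frontmatter description.
            findings.append((0, trigger_rate, r["name"], "description"))
            continue  # quality scores mean little while triggering is wrong
        for line, passed in r["rubric"].items():
            rate = passed / r["samples"]
            if rate < gate:
                # Priority 1: loaded-but-weak output is fixed in the body.
                findings.append((1, rate, f'{r["name"]}: {line}', "body"))
    findings.sort()  # trigger problems first, then lowest rate first
    return [(what, where, rate) for _, rate, what, where in findings]


# Counts taken from the first run: near-miss fired in 3/8 (correct in 5/8),
# marketing prompt triggered 8/8 but hit the tone rubric line in only 5/8.
reports = [
    {"name": "single-PR summary", "samples": 8, "trigger_correct": 5, "rubric": {}},
    {"name": "marketing notes", "samples": 8, "trigger_correct": 8,
     "rubric": {"non-technical tone": 5}},
]

findings = locate_fix(reports, gate=1.0)
for what, where, rate in findings:
    print(f"{what}: edit the {where} ({rate:.0%})")
```

Only the first finding is acted on in a given loop; the second waits for the next pass.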

## Step 5: refine one layer, then re-run

Tighten the description to exclude single-PR summaries, then re-run the same scenario set to confirm the change did what you expected.

```yaml
description: Turn a LIST of merged commits (a release or sprint) into
  customer-facing release notes grouped by type. Not for summarizing
  a single PR or diff.
```

Re-running, the near-miss now stays quiet in 8/8 runs and the must-trigger prompts hold at 8/8. Trigger accuracy is solved without touching execution. Next loop, address the marketing-tone miss: add an explicit instruction and a worked example to the body showing the same change rewritten for a non-technical audience. Re-run, watch that rubric line climb to 8/8, and confirm the others didn't slip. Each pass is one hypothesis tested against the same fixed set.
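The keep-or-revert decision at the end of each pass is mechanical enough to sketch. `keep_edit` is a hypothetical helper, not a skill-creator feature; the metric names are made up for illustration, and the counts are successes out of 8 runs, using the before and after numbers from this step.

```python
def keep_edit(before: dict[str, int], after: dict[str, int], target: str) -> bool:
    """Keep an edit only if the targeted metric improved and nothing else dropped."""
    if before.keys() != after.keys():
        # Different metrics means the eval set changed, so the runs are not comparable.
        raise ValueError("runs cover different metrics; the eval set changed")
    improved = after[target] > before[target]
    regressed = [m for m in before if m != target and after[m] < before[m]]
    return improved and not regressed


# Near-miss stayed quiet in 5/8 runs before the description edit, 8/8 after;
# the must-trigger prompts held at 8/8.
before = {"near-miss stays quiet": 5, "must-trigger fires": 8}
after = {"near-miss stays quiet": 8, "must-trigger fires": 8}
print(keep_edit(before, after, target="near-miss stays quiet"))

# Hypothetical counter-example: the tighter description also suppresses
# a legitimate trigger, so the edit should be reverted.
too_tight = {"near-miss stays quiet": 8, "must-trigger fires": 6}
print(keep_edit(before, too_tight, target="near-miss stays quiet"))
```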

## Common pitfalls

- **Polishing the body before writing scenarios**: you'll optimize against your imagination instead of measured behavior.
- **Skipping negative scenarios**: an over-triggering skill quietly degrades unrelated tasks and is the hardest failure to notice in production.
- **Changing the eval set and the skill in the same loop**: now your baseline moved too, and the comparison is meaningless.
- **Treating 6/8 as a pass**: a 75% rate is a flaky skill; decide your gate up front and hold to it.
- **Stopping at the first green run**: re-run after every edit, because a fix for one rubric line can quietly regress another.
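
Two of these pitfalls can be guarded against in code. The sketch below is illustrative only: `fingerprint` and `passes` are hypothetical helpers, and the 0.9 gate is an example value rather than a recommendation. The fingerprint lets you confirm that two runs used the same scenario set before comparing them, and the gate check makes the pass threshold explicit.

```python
import hashlib
import json


def fingerprint(scenarios: list[dict]) -> str:
    """Short, stable hash of the scenario set, independent of key order."""
    canonical = json.dumps(scenarios, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def passes(passed: int, samples: int, gate: float) -> bool:
    """Compare an observed rate against a gate chosen before the run."""
    return passed / samples >= gate


baseline = [{"prompt": "Summarize this single PR for the reviewer",
             "should_trigger": False, "rubric": ["does NOT load changelog"]}]

# Same content with keys in a different order: still the same eval set.
reordered = [{"rubric": ["does NOT load changelog"], "should_trigger": False,
              "prompt": "Summarize this single PR for the reviewer"}]

# An edited rubric is a different eval set, so old numbers no longer compare.
edited = [{"prompt": "Summarize this single PR for the reviewer",
           "should_trigger": False, "rubric": ["stays quiet"]}]

print(fingerprint(baseline) == fingerprint(reordered))
print(fingerprint(baseline) == fingerprint(edited))
print(passes(6, 8, gate=0.9))
print(passes(8, 8, gate=0.9))
```

Storing the fingerprint next to each run's results makes it obvious when a comparison crosses an eval-set change.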

## The five-step loop, condensed

1. Scaffold the skill with skill-creator and write a first-draft description and body.
2. Curate scenarios — must-trigger, must-not, and near-misses — each with a rubric.
3. Run the harness with several samples per scenario and read aggregated rates.
4. Locate the single weakest signal and identify which part of the skill owns it: the description or the body.
5. Make one edit, re-run the same set, keep it only if it improves without regressing.

| Stage | What you produce | What it tells you |
| --- | --- | --- |
| Scaffold | Folder + draft SKILL.md | A testable starting point |
| Scenarios | Pos/neg/near-miss set | Your definition of correct |
| Run | Transcripts + rates | Where it actually fails |
| Refine | One edit per loop | Which change helped |

## Frequently asked questions

### Should I write scenarios before or after the skill body?

Write at least a first scenario set before polishing the body. The scenarios are your target; refining instructions without them means optimizing against guesses rather than measured behavior.

### How do I know whether to edit the description or the body?

Look at which score failed. A wrong or missing trigger is a frontmatter description problem; a loaded-but-weak output is a body or example problem. Each metric points at exactly one part of the skill.

### What if a fix for one rubric line breaks another?

That is exactly why you re-run the full set after every edit, not just the scenario you were targeting. Keep the change only if the targeted number improves and none of the others regress.

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

