Inside skill-creator: architecture of testing Agent Skills
How skill-creator works inside: discovery, the eval harness, trigger vs execution scoring, variance analysis, and the refine loop for Claude Agent Skills.
Most engineers meet Agent Skills as a folder with a SKILL.md and a few scripts, and they assume the hard part is writing good instructions. The hard part is actually the loop that comes after: knowing whether the skill fires when it should, does what you intended, and stays stable across model versions. The skill-creator skill exists to close that loop. It is the meta-skill Claude uses to build, test, and refine other skills, and once you understand its internals you stop treating skill development as prompt guesswork and start treating it as an engineering discipline with measurable inputs and outputs.
This post walks the architecture end to end — how a skill is discovered and loaded, how the eval harness drives it, how scoring turns transcripts into numbers, and how the refine loop feeds those numbers back into edits. The goal is a mental model precise enough that you can reason about why a skill misfires and where in the pipeline to intervene.
Key takeaways
- A skill is metadata plus payload: the YAML frontmatter governs triggering, the body and bundled files govern behavior — and they are tested separately.
- Triggering and execution are two different failure modes; skill-creator measures each with its own eval signal so you fix the right layer.
- The eval harness runs the skill against scenarios, captures full transcripts, and scores them against rubrics rather than exact-match strings.
- Variance is first-class: each scenario runs multiple times so you separate a real regression from sampling noise.
- The refine loop is closed: scores point at a specific section of SKILL.md to edit, and you re-run to confirm the change helped.
What skill-creator actually is
An Agent Skill is a folder Claude loads dynamically when its description matches the task at hand. The skill-creator skill is the same kind of object pointed at itself: a skill whose job is authoring, evaluating, and optimizing other skills. When you ask Claude Code or Claude Cowork to "make a skill that drafts release notes" or "why isn't my invoice skill triggering," skill-creator is what loads — bringing instructions for scaffolding the folder, writing the description, building an eval set, running it, and reading the results.
The reason this matters architecturally is that skill-creator treats a skill as two loosely coupled parts. The first is the frontmatter: the name and a one-to-two-sentence description, which together are all Claude sees at discovery time. The second is the payload: the Markdown body, any reference docs, and executable scripts, which load only once the skill is selected. Almost every skill problem is a problem in exactly one of those halves, and skill-creator's whole design is about telling them apart.
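To make the two halves concrete, here is a minimal Python sketch that splits a SKILL.md into its discovery-time metadata and its payload. It assumes the frontmatter is a leading block delimited by `---` lines containing flat `key: value` pairs; the helper name `split_skill` and the sample skill are hypothetical, and a real YAML parser would be needed for anything richer.

```python
from typing import Dict, Tuple


def split_skill(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, body).

    Assumption: frontmatter is a leading block fenced by '---' lines and
    holds only flat `key: value` pairs.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter: everything is payload
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return {}, text  # unterminated frontmatter: treat it all as payload
    meta: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical skill used only for illustration.
EXAMPLE = """---
name: release-notes
description: Drafts release notes from a list of commits or merged PRs.
---
# Release notes

1. Group changes by type.
2. Omit internal refactors.
"""

meta, body = split_skill(EXAMPLE)
# What discovery sees: the metadata only, never the body.
discovery_view = {"name": meta["name"], "description": meta["description"]}
print(discovery_view)
```

Keeping the two parts in separate values mirrors the testing split: trigger evals exercise only what is in `discovery_view`, while execution evals exercise `body` and whatever files it references.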
How a skill is discovered and loaded
Discovery is the cheapest and most failure-prone moment in a skill's life. Claude scans the available skill descriptions, compares them against the live task, and decides whether to pull one in. If the description is vague, two skills collide, or the trigger language doesn't match how users actually phrase requests, the skill never loads and no amount of body quality can save it. Only after selection does Claude read the body and bundled files into context, then act.
flowchart TD
A["User request"] --> B{"Description matches?"}
B -->|No| C["Skill never loads"]
B -->|Yes| D["Load SKILL.md body + files"]
D --> E["Claude executes steps / tools"]
E --> F["Transcript captured"]
F --> G{"Score vs rubric"}
G -->|Low| H["Refine: edit description or body"]
G -->|High| I["Promote skill"]
H --> B
The diagram makes the two failure modes legible. A request that dies at node B is a triggering failure: fix the description. A request that loads but produces a weak transcript at node E is an execution failure: fix the body, the examples, or the scripts. skill-creator's eval harness instruments both nodes so that a single number doesn't blur the two together.
The eval harness
The eval harness is the engine. Conceptually it takes a set of scenarios — short prompts that represent real usage, including near-misses that should not trigger — and runs the skill against each one in a controlled session. It records the full transcript: whether the skill loaded, which tools were called, what files were read, and the final output. That transcript is the unit of measurement; skill-creator never judges a skill by reading its SKILL.md alone, because the only thing that matters is observed behavior.
A minimal scenario set is just a structured list. The shape below is the kind of thing skill-creator generates and then iterates on:
[
{ "prompt": "Draft release notes for v2.3 from these commits", "should_trigger": true,
"rubric": ["groups changes by type", "omits internal refactors", "includes upgrade notes"] },
{ "prompt": "Summarize this PR for a reviewer", "should_trigger": false,
"rubric": ["does NOT load release-notes skill"] }
]
Each entry carries an expectation about triggering and a rubric describing a good execution. The harness runs every scenario, then a judge (Claude grading against the rubric, not a brittle string match) converts each transcript into a pass/fail or graded score per rubric line. Because LLM output is stochastic, the harness runs each scenario several times and aggregates, so you see a rate ("7 of 8 runs grouped changes correctly") rather than a single lucky or unlucky sample.
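The run-many-times-and-aggregate step can be sketched in a few lines of Python. This is not skill-creator's actual harness: `run_evals`, `Verdict`, and the `run_once` callable are hypothetical names, and the fake runner stands in for a real session plus judge so the example runs on its own.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Verdict:
    """Judge output for one run (hypothetical shape)."""
    triggered: bool          # did the skill load?
    rubric: Dict[str, bool]  # rubric line -> passed?


def run_evals(scenarios: List[dict],
              run_once: Callable[[dict], Verdict],
              samples: int = 8) -> List[dict]:
    """Run each scenario `samples` times and aggregate verdicts into rates."""
    report = []
    for scenario in scenarios:
        trigger_hits = 0
        rubric_passes: Dict[str, int] = defaultdict(int)
        for _ in range(samples):
            verdict = run_once(scenario)
            # A hit means the observed load matched the expectation.
            if verdict.triggered == scenario["should_trigger"]:
                trigger_hits += 1
            for line, passed in verdict.rubric.items():
                rubric_passes[line] += int(passed)
        report.append({
            "prompt": scenario["prompt"],
            "trigger_rate": trigger_hits / samples,
            "rubric_rates": {k: n / samples for k, n in rubric_passes.items()},
        })
    return report


def make_fake_runner() -> Callable[[dict], Verdict]:
    """Deterministic stand-in for session + judge: every 8th run fails."""
    calls = {"n": 0}

    def run(scenario: dict) -> Verdict:
        calls["n"] += 1
        ok = calls["n"] % 8 != 0
        return Verdict(triggered=scenario["should_trigger"],
                       rubric={line: ok for line in scenario["rubric"]})
    return run


scenarios = [{
    "prompt": "Draft release notes for v2.3 from these commits",
    "should_trigger": True,
    "rubric": ["groups changes by type", "omits internal refactors"],
}]
report = run_evals(scenarios, make_fake_runner(), samples=8)
print(report[0]["rubric_rates"])
```

Injecting `run_once` keeps the aggregation logic testable without a live model; in a real setup that callable would start a session, capture the transcript, and hand it to the judge.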
Scoring and the role of variance
Scoring is where raw transcripts become decisions. skill-creator separates two metrics that are easy to conflate. Trigger accuracy measures how often the skill loads exactly when it should and stays quiet when it shouldn't — this is precision and recall over your scenario set. Execution quality measures, given that the skill loaded, how well it satisfied the rubric. A skill can have great execution and terrible triggering, or vice versa, and you must read them apart.
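As a rough illustration of trigger accuracy, the sketch below computes precision and recall from per-run pairs of expected and observed loading. The function name and the input shape are assumptions for this example, and returning 1.0 when a denominator is zero is a convention chosen here, not something the source specifies.

```python
from typing import Dict, Iterable, Tuple


def trigger_metrics(runs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute trigger precision and recall.

    runs: (should_trigger, did_trigger) pairs, one per eval run.
    """
    tp = fp = fn = tn = 0
    for should, did in runs:
        if should and did:
            tp += 1   # loaded when it should
        elif did:
            fp += 1   # loaded on an unrelated task (over-triggering)
        elif should:
            fn += 1   # stayed quiet when it should have loaded
        else:
            tn += 1   # correctly stayed quiet
    # Convention (an assumption): an empty denominator counts as perfect.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"precision": precision, "recall": recall,
            "tp": tp, "fp": fp, "fn": fn, "tn": tn}


# 8 runs of must-trigger scenarios (one miss), 8 of must-not (one false load).
runs = ([(True, True)] * 7 + [(True, False)]
        + [(False, False)] * 7 + [(False, True)])
metrics = trigger_metrics(runs)
print(metrics["precision"], metrics["recall"])
```

Low recall points at a description that misses real phrasing; low precision points at one that is too broad. Execution quality is computed separately, over only the runs where the skill loaded.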
Variance analysis is the part teams skip and regret. If a scenario passes 6 times and fails 2 times, that 75% is information: the skill is unstable on that input, and shipping it means roughly a quarter of similar requests will hit the bad path. Running the scenario once would have reported either "pass" or "fail", a single draw from that distribution masquerading as a result. By running multiple samples, skill-creator reports a stability band, and you can set a release gate like "every must-trigger scenario at 100%, every quality rubric line above 0.9 mean" with confidence that the number means something.
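One way to put a band around a pass rate is a Wilson score interval. That choice is an assumption made for this sketch; the source does not say how skill-creator computes its stability band. The gate function encodes the example release rule quoted above, with hypothetical names throughout.

```python
import math
from typing import Dict, Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


def passes_gate(trigger_rates: Dict[str, float],
                rubric_means: Dict[str, float],
                quality_floor: float = 0.9) -> bool:
    """Release gate: must-trigger scenarios at 100%, rubric means above the floor."""
    triggers_ok = all(rate == 1.0 for rate in trigger_rates.values())
    quality_ok = all(mean > quality_floor for mean in rubric_means.values())
    return triggers_ok and quality_ok


# The 6-of-8 example from the text: the point estimate is 0.75,
# but with only 8 runs the plausible range around it is wide.
low, high = wilson_interval(6, 8)
print(f"rate=0.75 band=({low:.2f}, {high:.2f})")

print(passes_gate({"draft v2.3 notes": 1.0}, {"groups changes by type": 0.95}))
```

The width of the band is the practical argument for more samples: a handful of runs can separate "always works" from "often fails", but it cannot distinguish 0.75 from 0.9 with much confidence.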
The refine loop
The output of scoring is not a grade, it is a pointer. Because triggering and execution are measured separately and the rubric is line-by-line, a failure localizes to a specific edit. A missed trigger on phrasing you didn't anticipate means widening the description with the words real users used. A rubric line that fails consistently — say "omits internal refactors" — means adding an explicit instruction and a worked example to the body. The loop closes when you re-run the same eval set after the edit and watch that specific number move while the others hold.
This is the discipline that separates a skill that works in the demo from one that works in production: every change is a hypothesis, the eval set is the test, and you keep the change only if the numbers improve without regressing the rest.
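The keep-or-revert decision can be written down as a small check. This is a sketch under assumptions: metrics are plain name-to-score dictionaries from two runs of the same eval set, and `keep_change` with its `tolerance` parameter is a hypothetical helper, not part of skill-creator.

```python
from typing import Dict


def keep_change(before: Dict[str, float],
                after: Dict[str, float],
                target: str,
                tolerance: float = 0.0) -> bool:
    """Keep an edit only if the target metric improved and nothing else regressed.

    `tolerance` allows a small drop in other metrics to be treated as noise.
    """
    if after[target] <= before[target]:
        return False  # the hypothesis failed: the targeted number did not move
    return all(
        after.get(name, 0.0) >= score - tolerance
        for name, score in before.items()
        if name != target
    )


before = {"trigger_recall": 1.0, "omits internal refactors": 0.50,
          "groups changes by type": 0.875}
after_good = {"trigger_recall": 1.0, "omits internal refactors": 0.875,
              "groups changes by type": 0.875}
after_bad = {"trigger_recall": 0.75, "omits internal refactors": 1.0,
             "groups changes by type": 0.875}

print(keep_change(before, after_good, target="omits internal refactors"))
print(keep_change(before, after_bad, target="omits internal refactors"))
```

The second case is the one that bites in practice: the targeted rubric line improves, but the edit quietly costs trigger recall, so the change is rejected.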
Common pitfalls
- Judging the skill by reading it: a clean SKILL.md can still misfire. Trust transcripts, not prose.
- Single-run evals: one sample per scenario turns variance into a false verdict. Always run several and read the rate.
- No negative scenarios: if every test should trigger, you can't catch over-triggering, which is how skills start hijacking unrelated tasks.
- Editing description and body together: change one layer per iteration so you know which edit moved which metric.
- Exact-match scoring: rubric-based grading tolerates valid phrasing differences; string matching punishes correct answers and rewards brittle ones.
Apply it in five steps
- Invoke skill-creator and scaffold the skill folder with a first-draft description and body.
- Generate a scenario set that includes positive, negative, and near-miss prompts, each with a rubric.
- Run the eval harness with multiple samples per scenario and capture full transcripts.
- Read trigger accuracy and execution quality separately; find the one weakest signal.
- Make a single targeted edit, re-run the same set, and keep the change only if it improves without regressing.
| Symptom | Failing layer | Fix |
|---|---|---|
| Skill never loads | Frontmatter description | Add real user phrasing, sharpen scope |
| Loads on the wrong task | Description too broad | Narrow wording, add negative scenarios |
| Loads but output is weak | Body / examples | Add explicit steps and a worked example |
| Passes sometimes | Instability (variance) | Constrain the instruction, re-measure rate |
Frequently asked questions
What is the skill-creator skill?
The skill-creator skill is a meta-skill that Claude uses to author, test, and refine other Agent Skills — scaffolding the folder, writing the triggering description, building an eval set, running it, and reading the scores to guide edits.
Why measure triggering and execution separately?
They are independent failure modes. Triggering depends only on the short frontmatter description Claude sees at discovery, while execution depends on the body and scripts loaded after selection. Blending them into one score hides which half to fix.
How many times should each scenario run?
Enough to see stability rather than a single sample — several runs per scenario. The point is to convert a noisy pass/fail into a rate so you can tell a real regression from sampling variance before you ship.
Bringing agentic AI to your phone lines
CallSphere takes these same architecture-first agentic patterns and applies them to voice and chat — assistants that answer every call and message, call tools mid-conversation, and book real work around the clock. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.