Reusable patterns for structuring Claude Agent Skills
Code-level patterns for Agent Skills: description-as-router, progressive disclosure, deterministic scripts, and gradeable instructions tested with skill-creator.
After you have tested a handful of Agent Skills with skill-creator, the failures start to rhyme. The same structural mistakes show up across unrelated skills, and the same structural choices keep evals green. This post is about those reusable patterns — how to shape the description, the body, the bundled scripts, and the way context loads, so that a skill is both effective and measurable. A skill that is hard to evaluate is usually a skill that is badly structured, and fixing the structure fixes both problems at once.
None of this is theoretical. Each pattern below maps directly to a metric skill-creator reports, which is the point: structure the skill so that when something breaks, the eval tells you exactly where.
Key takeaways
- The description is a router, not a summary: write it with the words users say and the boundaries of what it does NOT cover.
- Progressive disclosure keeps context lean: load only the body up front, push reference material into files Claude reads on demand.
- Push determinism into scripts: anything that must be exact belongs in a bundled script, not in prose the model interprets.
- Make every instruction testable: phrase steps so a rubric line can grade them as pass/fail.
- One responsibility per skill: narrow skills trigger cleanly and score sharply; broad ones blur both.
Pattern 1: the description as a router
An Agent Skill is a folder of instructions and resources that Claude loads dynamically when its description matches the task. That makes the description the single most leveraged text in the whole skill — it is the router that decides whether anything else even runs. The pattern that survives evals has three parts: what the skill does, the concrete phrasings users employ, and an explicit negative boundary.
```yaml
description: Generate SQL migration files from a described schema
  change (add column, new table, index, backfill). Use when the user
  says "write a migration", "alter the table", or "add a column".
  NOT for writing application queries or explaining existing schema.
```

The negative clause is what stops over-triggering, and it maps straight to your negative eval scenarios. When skill-creator reports the skill firing on "explain this schema," you already know the fix lives in this clause.
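The three-part shape can be checked mechanically before any eval runs. The sketch below is a hypothetical lint, not part of skill-creator: the function name `lint_description` is invented here, and it assumes user phrasings appear in double quotes and the boundary is written as "NOT for", as in the migration example.

```python
import re


def lint_description(description: str) -> list[str]:
    """Return the structural problems found in a skill description.

    Hypothetical checks: quoted phrases stand in for "the words users say",
    and the literal text "NOT for" stands in for the negative boundary.
    """
    problems = []
    if not re.findall(r'"([^"]+)"', description):
        problems.append("no quoted user phrasings")
    if not re.search(r"\bNOT for\b", description):
        problems.append("no negative boundary")
    return problems


description = (
    "Generate SQL migration files from a described schema change "
    "(add column, new table, index, backfill). Use when the user says "
    '"write a migration", "alter the table", or "add a column". '
    "NOT for writing application queries or explaining existing schema."
)

print(lint_description(description))              # []
print(lint_description("Helps with databases."))  # both problems reported
```

A lint like this only confirms the parts are present. Whether the phrasings match what users actually say is still a question for the trigger evals.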
Pattern 2: progressive disclosure of context
The body of SKILL.md loads into context the moment the skill is selected, so every word there is a tax on every invocation. The pattern is to keep the body short — the steps and the rules — and move long reference material (API docs, style guides, example libraries) into separate files the body tells Claude to read only when needed. This keeps the common path cheap and lets the rare path pull in depth on demand.
```mermaid
flowchart TD
A["Skill selected"] --> B["Load SKILL.md body"]
B --> C{"Task needs deep ref?"}
C -->|No| D["Act from body alone"]
C -->|Yes| E["Read reference/api.md"]
E --> F["Act with full detail"]
D --> G["Output"]
F --> G
```

This structure also makes evals cleaner. If a scenario fails because Claude lacked a detail, the diagnosis narrows to two cases: either the body should have included the detail, or the body failed to point at the reference file. Those are two different edits, and both are localized.
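One of those two failure modes, a body that never points at a reference file, can be caught without running an eval. The following is a minimal sketch under stated assumptions: the helper `unreferenced_files` is hypothetical, it treats a plain substring match on the path as "the body points at the file", and `reference/style.md` is an invented second file alongside the `reference/api.md` used in the flowchart.

```python
def unreferenced_files(body: str, reference_paths: list[str]) -> list[str]:
    """Return reference files that the SKILL.md body never mentions by path."""
    return [path for path in reference_paths if path not in body]


# Assumed body text; a real one would be read from SKILL.md.
body = """\
1. Ask for the schema change.
2. For column type rules, read reference/api.md before writing SQL.
"""

paths = ["reference/api.md", "reference/style.md"]
print(unreferenced_files(body, paths))  # ['reference/style.md']
```

A file that shows up in this list is depth Claude has no instruction to pull in, so it either needs a pointer in the body or does not belong in the skill.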
Pattern 3: push determinism into scripts
Prose instructions are interpreted; scripts are executed. Anything that must be byte-exact — a file naming convention, a date format, a validation rule — should live in a bundled script the skill invokes, not in a sentence the model is asked to follow precisely. Models are excellent at judgment and unreliable at mechanical exactness, so this split plays to the strengths of each.
```python
# bundled in the skill folder as scripts/name_migration.py
import datetime
import sys

# Turn the change summary (first CLI argument) into a lowercase slug.
slug = sys.argv[1].lower().replace(" ", "_")
# Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S")
print(f"{ts}_{slug}.sql")
```

The body then says "run scripts/name_migration.py with the change summary to get the filename" instead of describing a timestamp format Claude might render three different ways across runs. In eval terms, this is how you turn a flaky rubric line into a deterministic one, and flaky lines are exactly what variance analysis flags.
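A script is also easier to test than prose. As a sketch, the same naming logic can be wrapped in a function that takes the clock as a parameter, so a test can pin the timestamp and compare the filename exactly. The function name `migration_filename` and the injected `now` argument are illustrative choices, not something the original script defines.

```python
import datetime


def migration_filename(summary: str, now: datetime.datetime) -> str:
    """Build '<UTC timestamp>_<slug>.sql' from a change summary and a given time."""
    slug = summary.lower().replace(" ", "_")
    return f"{now.strftime('%Y%m%d%H%M%S')}_{slug}.sql"


# A fixed moment stands in for the real clock so the result is repeatable.
fixed = datetime.datetime(2024, 1, 2, 3, 4, 5, tzinfo=datetime.timezone.utc)
print(migration_filename("Add users email", fixed))
# 20240102030405_add_users_email.sql
```

With the time passed in, the same input always yields the same filename, which is the property a rubric line needs in order to grade it as a strict pass or fail.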
Pattern 4: write instructions that a rubric can grade
A skill is easy to refine when its instructions map one-to-one to rubric lines. Vague guidance like "make the output clean" cannot be graded and cannot be improved, because no eval can tell you whether it worked. Concrete, checkable instructions — "group changes under exactly three headings," "never include commits tagged chore" — each become a rubric line that passes or fails. Structure the body as a list of such testable assertions and your eval set almost writes itself.
This is a discipline more than a trick: every time you add a rule to the body, ask "what rubric line would verify this?" If you can't write one, the rule is too fuzzy to be useful and probably too fuzzy for Claude to follow consistently.
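To make the one-rule-one-rubric-line idea concrete, here is a sketch that encodes the two example rules above as pass/fail checks. Everything in it is an assumption for illustration: headings are taken to be lines starting with `## `, a "chore" commit is taken to be a bullet whose text starts with `chore`, and the `RUBRIC` and `grade` names are hypothetical rather than anything skill-creator defines.

```python
def has_exactly_three_headings(output: str) -> bool:
    """Pass when the output contains exactly three '## ' heading lines."""
    headings = [line for line in output.splitlines() if line.startswith("## ")]
    return len(headings) == 3


def excludes_chore_commits(output: str) -> bool:
    """Pass when no bullet's text starts with 'chore' (assumed tag format)."""
    for line in output.splitlines():
        if line.lstrip("-* ").lower().startswith("chore"):
            return False
    return True


# One rubric line per rule in the skill body.
RUBRIC = {
    "exactly three headings": has_exactly_three_headings,
    "no chore commits": excludes_chore_commits,
}


def grade(output: str) -> dict[str, bool]:
    """Run every rubric line against one output."""
    return {name: check(output) for name, check in RUBRIC.items()}


sample = """\
## Added
- feat: export to CSV
## Fixed
- fix: handle empty input
## Changed
- chore: bump dependencies
"""
print(grade(sample))
# {'exactly three headings': True, 'no chore commits': False}
```

A rule like "make the output clean" has no function you could write in this table, which is the practical meaning of ungradeable.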
Pattern 5: one responsibility per skill
The strongest structural lever is scope. A skill that does one thing has a description that routes cleanly, a body short enough to keep in context, and an eval set that is small and sharp. A skill that does five loosely related things has a description that collides with other skills, a bloated body, and an eval set where one weak area drags down an unrelated strong one. When a skill's evals are persistently muddy, the fix is often to split it.
| Concern | Anti-pattern | Pattern that passes evals |
|---|---|---|
| Description | Vague summary | User phrasings + negative boundary |
| Context | Everything in the body | Short body, on-demand reference files |
| Exactness | Format described in prose | Deterministic bundled script |
| Instructions | "Make it clean" | Checkable, one-per-rubric-line rules |
| Scope | One mega-skill | One responsibility per skill |
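Description collisions between skills can be roughly estimated before the evals expose them. The sketch below uses Jaccard overlap of word sets as a crude proxy. This is a swapped-in heuristic, not how Claude routes between skills, and the helper names and the example descriptions are invented for illustration.

```python
import re


def tokens(description: str) -> set[str]:
    """Lowercased alphabetic words in a description."""
    return set(re.findall(r"[a-z]+", description.lower()))


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of two descriptions' word sets (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


migrations = "write a migration to add a column"
queries = "write a query to read a column"
print(overlap(migrations, queries))  # 0.5
```

Common words such as "a" and "to" inflate the score, so treat a high value as a prompt to read the two descriptions side by side, not as a verdict that they collide.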
Common pitfalls
- Description-as-summary: a poetic description that never uses real user words routes poorly and tanks trigger accuracy.
- Stuffing reference docs into the body: it inflates context on every call and buries the actual steps.
- Asking the model to be mechanically exact: formats and conventions belong in scripts; expecting prose to be byte-perfect creates variance.
- Ungradeable instructions: if you can't write a rubric line for a rule, Claude probably can't follow it consistently either.
- Scope creep: bolting extra jobs onto a working skill is how you turn green evals red across the board.
Restructure a skill in five steps
1. Rewrite the description with user phrasings and an explicit "NOT for" boundary.
2. Trim the body to steps and rules; move long reference material into separate files it reads on demand.
3. Move every must-be-exact format or convention into a bundled script.
4. Rephrase each body rule as a checkable assertion and mirror it as a rubric line.
5. If evals stay muddy, split the skill so each one has a single responsibility, and re-run.
Frequently asked questions
How long should a SKILL.md body be?
Long enough to hold the steps and rules, short enough that loading it on every invocation isn't wasteful. Push anything reference-heavy into separate files the body points at, so the common path stays lean.
When should logic live in a script instead of instructions?
Whenever exactness matters — naming conventions, date formats, validation. Models are reliable at judgment and unreliable at mechanical precision, so deterministic work belongs in bundled scripts the skill invokes.
How do I know a skill is doing too much?
When its description collides with other skills, its body is too long to scan, or its eval set has unrelated areas that move together. Those are signals to split it into single-responsibility skills.
Bringing agentic AI to your phone lines
CallSphere applies these structural patterns to voice and chat agents — scoped, tool-backed assistants that answer every call and message, act mid-conversation, and book work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.