Reusable patterns for structuring Claude Agent Skills

After you have tested a handful of Agent Skills with skill-creator, the failures start to rhyme. The same structural mistakes show up across unrelated skills, and the same structural choices keep evals green. This post is about those reusable patterns — how to shape the description, the body, the bundled scripts, and the way context loads, so that a skill is both effective and measurable. A skill that is hard to evaluate is usually a skill that is badly structured, and fixing the structure fixes both problems at once.

None of this is theoretical. Each pattern below maps directly to a metric skill-creator reports, which is the point: structure the skill so that when something breaks, the eval tells you exactly where.

Key takeaways

The description is a router, not a summary: write it with the words users say and the boundaries of what it does NOT cover.
Progressive disclosure keeps context lean: load only the body when the skill is selected, and push reference material into files Claude reads on demand.
Push determinism into scripts: anything that must be exact belongs in a bundled script, not in prose the model interprets.
Make every instruction testable: phrase steps so a rubric line can grade them as pass/fail.
One responsibility per skill: narrow skills trigger cleanly and score sharply; broad ones blur both.

Pattern 1: the description as a router

An Agent Skill is a folder of instructions and resources that Claude loads dynamically when its description matches the task. That makes the description the single most leveraged text in the whole skill — it is the router that decides whether anything else even runs. The pattern that survives evals has three parts: what the skill does, the concrete phrasings users employ, and an explicit negative boundary.

description: Generate SQL migration files from a described schema
  change (add column, new table, index, backfill). Use when the user
  says "write a migration", "alter the table", or "add a column".
  NOT for writing application queries or explaining existing schema.

The negative clause is what stops over-triggering, and it maps straight to your negative eval scenarios. When skill-creator reports the skill firing on "explain this schema," you already know the fix lives in this clause.
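A small lint run before the evals is one way to keep the three parts honest. The sketch below is hypothetical: the function name `lint_description`, the quoted-phrase heuristic, and the literal "NOT for" marker are assumptions made for illustration, not part of skill-creator or any Anthropic tooling.

```python
import re


def lint_description(description: str) -> list[str]:
    """Return the problems found in a skill description (empty list if none).

    The heuristics are assumptions for illustration only:
    - user phrasings appear as double-quoted strings
    - the negative boundary starts with the words "NOT for"
    """
    problems = []
    text = " ".join(description.split())  # collapse wrapped lines into one

    # Part 1: a short statement of what the skill does, before "Use when".
    summary = text.split("Use when")[0].strip()
    if len(summary.split()) < 3:
        problems.append("missing a short statement of what the skill does")

    # Part 2: at least two concrete phrasings users actually say.
    phrasings = re.findall(r'"([^"]+)"', text)
    if len(phrasings) < 2:
        problems.append("fewer than two quoted user phrasings")

    # Part 3: an explicit negative boundary.
    if "NOT for" not in text:
        problems.append('no explicit "NOT for" boundary')

    return problems


description = """Generate SQL migration files from a described schema
  change (add column, new table, index, backfill). Use when the user
  says "write a migration", "alter the table", or "add a column".
  NOT for writing application queries or explaining existing schema."""

print(lint_description(description))              # the example above passes
print(lint_description("Helps with databases."))  # a vague summary does not
```

A lint like this only checks that the parts are present. Whether the phrasings and the boundary are the right ones is still a question for the trigger evals.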

Pattern 2: progressive disclosure of context

The body of SKILL.md loads into context the moment the skill is selected, so every word there is a tax on every invocation. The pattern is to keep the body short — the steps and the rules — and move long reference material (API docs, style guides, example libraries) into separate files the body tells Claude to read only when needed. This keeps the common path cheap and lets the rare path pull in depth on demand.

flowchart TD
  A["Skill selected"] --> B["Load SKILL.md body"]
  B --> C{"Task needs deep ref?"}
  C -->|No| D["Act from body alone"]
  C -->|Yes| E["Read reference/api.md"]
  E --> F["Act with full detail"]
  D --> G["Output"]
  F --> G

This structure also makes evals cleaner. If a scenario fails because Claude lacked a detail, you know whether the body should have included it or whether the body failed to point at the reference file — two different edits, both localized.
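Both failure modes can be caught mechanically before any eval runs. The following is a minimal sketch, assuming reference material lives in a `reference/` folder next to SKILL.md (as in the flowchart) and that a 500-word body budget is acceptable. The function name `check_skill_layout` and the budget are hypothetical choices, not an official limit.

```python
import tempfile
from pathlib import Path


def check_skill_layout(skill_dir: Path, max_body_words: int = 500) -> list[str]:
    """Flag a bloated body and reference files the body never mentions."""
    problems = []
    body = (skill_dir / "SKILL.md").read_text(encoding="utf-8")

    # The body loads on every invocation, so keep it under an assumed budget.
    word_count = len(body.split())
    if word_count > max_body_words:
        problems.append(f"body is {word_count} words (budget {max_body_words})")

    # Every file under reference/ should be named somewhere in the body;
    # otherwise nothing tells Claude when to read it.
    ref_dir = skill_dir / "reference"
    if ref_dir.is_dir():
        for ref in sorted(ref_dir.iterdir()):
            rel = f"reference/{ref.name}"
            if rel not in body:
                problems.append(f"{rel} is never mentioned in SKILL.md")
    return problems


# Build a throwaway skill folder to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp)
    (skill / "reference").mkdir()
    (skill / "reference" / "api.md").write_text("long API notes", encoding="utf-8")
    (skill / "reference" / "style.md").write_text("style guide", encoding="utf-8")
    (skill / "SKILL.md").write_text(
        "1. Draft the migration.\n"
        "2. For column types, read reference/api.md before writing SQL.\n",
        encoding="utf-8",
    )
    layout_problems = check_skill_layout(skill)

print(layout_problems)
```

In this toy folder the body points at `reference/api.md` but never at `reference/style.md`, so the second file is reported as unreachable.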

Pattern 3: push determinism into scripts

Prose instructions are interpreted; scripts are executed. Anything that must be byte-exact — a file naming convention, a date format, a validation rule — should live in a bundled script the skill invokes, not in a sentence the model is asked to follow precisely. Models are excellent at judgment and unreliable at mechanical exactness, so this split plays to the strengths of each.

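# bundled in the skill folder as scripts/name_migration.py
import datetime
import sys

# Lowercase the change summary and replace spaces so it is safe in a filename.
slug = sys.argv[1].strip().lower().replace(' ', '_')
# datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time.
ts = datetime.datetime.now(datetime.timezone.utc).strftime('%Y%m%d%H%M%S')
print(f"{ts}_{slug}.sql")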

The body then says "run scripts/name_migration.py with the change summary to get the filename" instead of describing a timestamp format Claude might render three different ways across runs. In eval terms, this is how you turn a flaky rubric line into a deterministic one — and flaky lines are exactly what variance analysis flags.
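Once the filename comes from a script, the matching rubric line can be a plain pattern check rather than a judgment call. This is a sketch under stated assumptions: the pattern is inferred from the `%Y%m%d%H%M%S` format in the naming script, it assumes change summaries contain only letters, digits, and spaces, and the name `filename_rubric` is hypothetical.

```python
import re

# Assumed convention, inferred from the naming script:
# 14 digits, an underscore, a lowercase slug, then ".sql".
FILENAME_PATTERN = re.compile(r"\d{14}_[a-z0-9_]+\.sql")


def filename_rubric(filename: str) -> bool:
    """Pass/fail rubric line: does the filename follow the scripted convention?"""
    return FILENAME_PATTERN.fullmatch(filename) is not None


print(filename_rubric("20240305142233_add_email_column.sql"))  # True
print(filename_rubric("2024-03-05_add_email_column.sql"))      # False
print(filename_rubric("add_email_column.sql"))                 # False
```

The second and third inputs are the kind of variants a model might produce when the format is only described in prose.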

Pattern 4: write instructions that a rubric can grade

A skill is easy to refine when its instructions map one-to-one to rubric lines. Vague guidance like "make the output clean" cannot be graded and cannot be improved, because no eval can tell you whether it worked. Concrete, checkable instructions — "group changes under exactly three headings," "never include commits tagged chore" — each become a rubric line that passes or fails. Structure the body as a list of such testable assertions and your eval set almost writes itself.

This is a discipline more than a trick: every time you add a rule to the body, ask "what rubric line would verify this?" If you can't write one, the rule is too fuzzy to be useful and probably too fuzzy for Claude to follow consistently.
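The two example rules above can be sketched as rubric functions, one per rule. The output format is an assumption made for illustration: headings are lines starting with `## `, and tagged commits are written as `chore:` or `chore(scope):`. The names `RUBRIC` and `grade` are hypothetical, not a skill-creator API.

```python
def exactly_three_headings(output: str) -> bool:
    """Rubric line: changes are grouped under exactly three headings.

    Assumption: a heading is a line starting with '## '.
    """
    headings = [line for line in output.splitlines() if line.startswith("## ")]
    return len(headings) == 3


def no_chore_commits(output: str) -> bool:
    """Rubric line: no commit tagged chore appears in the output.

    Assumption: tagged commits look like 'chore: ...' or 'chore(scope): ...'.
    """
    return not any(
        line.lstrip("-* ").lower().startswith(("chore:", "chore("))
        for line in output.splitlines()
    )


# One body rule maps to one named, pass/fail rubric line.
RUBRIC = {
    "exactly three headings": exactly_three_headings,
    "no chore commits": no_chore_commits,
}


def grade(output: str) -> dict[str, bool]:
    """Run every rubric line against one skill output."""
    return {name: check(output) for name, check in RUBRIC.items()}


sample = """## Added
- feat: export invoices as CSV
## Fixed
- fix: handle empty cart
## Changed
- chore: bump linter version
"""

print(grade(sample))
```

For this sample the heading rule passes and the chore rule fails, which points at exactly one instruction in the body to tighten.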

Pattern 5: one responsibility per skill

The strongest structural lever is scope. A skill that does one thing has a description that routes cleanly, a body short enough to keep in context, and an eval set that is small and sharp. A skill that does five loosely related things has a description that collides with other skills, a bloated body, and an eval set where one weak area drags down an unrelated strong one. When a skill's evals are persistently muddy, the fix is often to split it.

Concern	Anti-pattern	Pattern that passes evals
Description	Vague summary	User phrasings + negative boundary
Context	Everything in the body	Short body, on-demand reference files
Exactness	Format described in prose	Deterministic bundled script
Instructions	"Make it clean"	Checkable, one-per-rubric-line rules
Scope	One mega-skill	One responsibility per skill
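Breaking pass rates out by area is one way to see whether an eval set is muddy because the skill does too much. The sketch below uses toy results invented purely for illustration. The area labels and the `pass_rates` helper are hypothetical, and skill-creator may report this differently.

```python
from collections import defaultdict


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate per area from (area, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for area, passed in results:
        totals[area] += 1
        passes[area] += int(passed)
    return {area: passes[area] / totals[area] for area in totals}


# Toy data, invented for illustration: one skill covering two unrelated jobs.
results = [
    ("migrations", True), ("migrations", True),
    ("migrations", True), ("migrations", True),
    ("query-explain", False), ("query-explain", False),
    ("query-explain", True), ("query-explain", False),
]

overall = sum(passed for _, passed in results) / len(results)
print(overall)              # 0.625
print(pass_rates(results))  # {'migrations': 1.0, 'query-explain': 0.25}
```

The overall figure hides a perfect area and a failing one. When the breakdown looks like this, splitting the skill along the area boundary gives each half its own description, body, and eval set.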

Common pitfalls

Description-as-summary: a poetic description that never uses real user words routes poorly and tanks trigger accuracy.
Stuffing reference docs into the body: it inflates context on every call and buries the actual steps.
Asking the model to be mechanically exact: formats and conventions belong in scripts; expecting prose to be byte-perfect creates variance.
Ungradeable instructions: if you can't write a rubric line for a rule, Claude can't reliably follow it either.
Scope creep: bolting extra jobs onto a working skill is how you turn green evals red across the board.

Restructure a skill in five steps

1. Rewrite the description with user phrasings and an explicit "NOT for" boundary.
2. Trim the body to steps and rules; move long reference material into separate files it reads on demand.
3. Move every must-be-exact format or convention into a bundled script.
4. Rephrase each body rule as a checkable assertion and mirror it as a rubric line.
5. If evals stay muddy, split the skill so each one has a single responsibility, and re-run.

Frequently asked questions

How long should a SKILL.md body be?

Long enough to hold the steps and rules, short enough that loading it on every invocation isn't wasteful. Push anything reference-heavy into separate files the body points at, so the common path stays lean.

When should logic live in a script instead of instructions?

Whenever exactness matters — naming conventions, date formats, validation. Models are reliable at judgment and unreliable at mechanical precision, so deterministic work belongs in bundled scripts the skill invokes.

How do I know a skill is doing too much?

When its description collides with other skills, its body is too long to scan, or its eval set has unrelated areas that move together. Those are signals to split it into single-responsibility skills.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
