---
title: "Reusable patterns for structuring Claude Agent Skills"
description: "Code-level patterns for Agent Skills: description-as-router, progressive disclosure, deterministic scripts, and gradeable instructions tested with skill-creator."
canonical: https://callsphere.ai/blog/reusable-patterns-for-structuring-claude-agent-skills
category: "Agentic AI"
tags: ["agentic ai", "claude", "agent skills", "skill-creator", "prompt engineering", "anthropic", "patterns"]
author: "CallSphere Team"
published: 2026-02-28T08:46:22.000Z
updated: 2026-06-07T01:28:23.231Z
---

# Reusable patterns for structuring Claude Agent Skills

> Code-level patterns for Agent Skills: description-as-router, progressive disclosure, deterministic scripts, and gradeable instructions tested with skill-creator.

After you have tested a handful of Agent Skills with `skill-creator`, the failures start to rhyme. The same structural mistakes show up across unrelated skills, and the same structural choices keep evals green. This post is about those reusable patterns — how to shape the description, the body, the bundled scripts, and the way context loads, so that a skill is both effective and *measurable*. A skill that is hard to evaluate is usually a skill that is badly structured, and fixing the structure fixes both problems at once.

None of this is theoretical. Each pattern below maps to a signal the skill-creator evals surface (trigger accuracy, rubric pass/fail, run-to-run variance), which is the point: structure the skill so that when something breaks, the eval tells you where.

## Key takeaways

- **The description is a router, not a summary**: write it with the words users say and the boundaries of what it does NOT cover.
- **Progressive disclosure keeps context lean**: load only the body up front, and push reference material into files Claude reads on demand.
- **Push determinism into scripts**: anything that must be exact belongs in a bundled script, not in prose the model interprets.
- **Make every instruction testable**: phrase steps so a rubric line can grade them as pass/fail.
- **One responsibility per skill**: narrow skills trigger cleanly and score sharply; broad ones blur both.

## Pattern 1: the description as a router

An Agent Skill is a folder of instructions and resources that Claude loads dynamically when its description matches the task. That makes the description the single most leveraged text in the whole skill — it is the router that decides whether anything else even runs. The pattern that survives evals has three parts: what the skill does, the concrete phrasings users employ, and an explicit negative boundary.

```
description: Generate SQL migration files from a described schema
  change (add column, new table, index, backfill). Use when the user
  says "write a migration", "alter the table", or "add a column".
  NOT for writing application queries or explaining existing schema.
```

The negative clause is what stops over-triggering, and it maps straight to your negative eval scenarios. When skill-creator reports the skill firing on "explain this schema," you already know the fix lives in this clause.
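
The three-part shape can be checked mechanically before any eval runs. The sketch below is a hypothetical lint, not part of skill-creator: the function name `lint_description` and its heuristics (quoted strings count as user phrasings, the literal text `NOT for` marks the boundary) are assumptions chosen to match the example description above.

```python
import re


def lint_description(description: str) -> list[str]:
    """Return structural problems; an empty list means all three parts were found."""
    problems = []
    text = " ".join(description.split())  # collapse wrapped lines into one

    # Part 1: a statement of what the skill does (heuristic: first sentence).
    first_sentence = text.split(".")[0]
    if len(first_sentence.split()) < 4:
        problems.append("missing or too-short statement of what the skill does")

    # Part 2: concrete user phrasings, assumed to be written as quoted strings.
    phrasings = re.findall(r'"([^"]+)"', text)
    if len(phrasings) < 2:
        problems.append("fewer than two quoted user phrasings")

    # Part 3: an explicit negative boundary, assumed to use the words "NOT for".
    if not re.search(r"\bNOT for\b", text):
        problems.append("no explicit 'NOT for' boundary")

    return problems
```

Run against the migration description above, this returns an empty list; a one-line summary such as "Helps with databases." fails all three checks. It only verifies that the parts exist, so whether the phrasings are the right ones is still a question for the trigger evals.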

## Pattern 2: progressive disclosure of context

The body of `SKILL.md` loads into context the moment the skill is selected, so every word there is a tax on every invocation. The pattern is to keep the body short — the steps and the rules — and move long reference material (API docs, style guides, example libraries) into separate files the body tells Claude to read only when needed. This keeps the common path cheap and lets the rare path pull in depth on demand.

```mermaid
flowchart TD
  A["Skill selected"] --> B["Load SKILL.md body"]
  B --> C{"Task needs deep ref?"}
  C -->|No| D["Act from body alone"]
  C -->|Yes| E["Read reference/api.md"]
  E --> F["Act with full detail"]
  D --> G["Output"]
  F --> G
```

This structure also makes evals cleaner. If a scenario fails because Claude lacked a detail, you know whether the body should have included it or whether the body failed to point at the reference file — two different edits, both localized.
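
Both of those failure modes can be caught statically. The following is a minimal sketch under stated assumptions: reference material lives under a `reference/` folder as `.md` files (as in the diagram), the body mentions those files by relative path, and the 150-line budget is an arbitrary illustrative number rather than a documented limit. The function name `check_disclosure` is hypothetical.

```python
import re


def check_disclosure(body: str, bundled_files: set[str], max_lines: int = 150) -> list[str]:
    """Flag a bloated body and mismatches between the body and bundled reference files."""
    problems = []

    # The body loads whenever the skill is selected, so keep it inside a budget.
    line_count = len(body.splitlines())
    if line_count > max_lines:
        problems.append(f"body is {line_count} lines; budget is {max_lines}")

    # Paths the body tells Claude to read, e.g. reference/api.md.
    pointed = set(re.findall(r"reference/[\w./-]+\.md", body))

    # A pointer to a file that is not bundled can never be followed.
    for path in sorted(pointed - bundled_files):
        problems.append(f"body points at missing file: {path}")

    # A bundled reference file the body never mentions will not be read.
    unmentioned = (f for f in bundled_files - pointed if f.startswith("reference/"))
    for path in sorted(unmentioned):
        problems.append(f"reference file never mentioned in body: {path}")

    return problems
```

A check like this covers the "body failed to point at the reference file" edit; whether a detail belongs in the body in the first place is still a judgment call informed by the failing scenario.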

## Pattern 3: push determinism into scripts

Prose instructions are interpreted; scripts are executed. Anything that must be byte-exact — a file naming convention, a date format, a validation rule — should live in a bundled script the skill invokes, not in a sentence the model is asked to follow precisely. Models are excellent at judgment and unreliable at mechanical exactness, so this split plays to the strengths of each.

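```python
# Bundled in the skill folder as scripts/name_migration.py
import datetime
import sys

# Turn the change summary (first CLI argument) into a snake_case slug.
slug = sys.argv[1].lower().replace(" ", "_")
# Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d%H%M%S")
print(f"{ts}_{slug}.sql")
```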

The body then says "run `scripts/name_migration.py` with the change summary to get the filename" instead of describing a timestamp format Claude might render three different ways across runs. In eval terms, this is how you turn a flaky rubric line into a deterministic one — and flaky lines are exactly what variance analysis flags.
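
One way to make the script itself gradeable is to separate the naming rule from the clock. This is a hypothetical variation on the script above, not the original: `migration_filename` is an illustrative name, the timestamp is passed in so a test can pin it, and the slug rule is tightened to collapse any run of non-alphanumeric characters into one underscore.

```python
import datetime
import re


def migration_filename(summary: str, now: datetime.datetime) -> str:
    """Build '<UTC timestamp>_<slug>.sql' from a change summary and an explicit time."""
    # Collapse every run of non-alphanumeric characters into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", summary.lower()).strip("_")
    if not slug:
        raise ValueError("summary must contain at least one letter or digit")
    return f"{now.strftime('%Y%m%d%H%M%S')}_{slug}.sql"


if __name__ == "__main__":
    import sys

    # In real use the clock is read once, here, and nowhere else.
    if len(sys.argv) > 1:
        print(migration_filename(sys.argv[1], datetime.datetime.now(datetime.timezone.utc)))
```

With a fixed `now`, the same summary always yields the same filename, so a rubric line can compare against an exact string instead of a pattern.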

## Pattern 4: write instructions that a rubric can grade

A skill is easy to refine when its instructions map one-to-one to rubric lines. Vague guidance like "make the output clean" cannot be graded and cannot be improved, because no eval can tell you whether it worked. Concrete, checkable instructions — "group changes under exactly three headings," "never include commits tagged chore" — each become a rubric line that passes or fails. Structure the body as a list of such testable assertions and your eval set almost writes itself.

This is a discipline more than a trick: every time you add a rule to the body, ask "what rubric line would verify this?" If you can't write one, the rule is too fuzzy to be useful and probably too fuzzy for Claude to follow consistently.
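
To illustrate the one-rule-per-rubric-line mapping, here is a sketch that grades the two example rules above as pass/fail predicates. It is not skill-creator's rubric format; the names `RUBRIC` and `grade` are hypothetical, and it assumes the output is markdown with `## ` headings and that chore commits carry the word "chore" in their line.

```python
import re


def has_exactly_three_headings(output: str) -> bool:
    # Counts markdown level-2 headings; the heading level is an assumption.
    return len(re.findall(r"^## ", output, flags=re.MULTILINE)) == 3


def excludes_chore_commits(output: str) -> bool:
    # Treats any standalone "chore" token as a leaked chore commit.
    return re.search(r"\bchore\b", output, flags=re.IGNORECASE) is None


# One rubric line per body rule, each with a check that returns pass/fail.
RUBRIC = {
    "groups changes under exactly three headings": has_exactly_three_headings,
    "never includes commits tagged chore": excludes_chore_commits,
}


def grade(output: str) -> dict[str, bool]:
    """Map each rubric line to True (pass) or False (fail) for one output."""
    return {line: check(output) for line, check in RUBRIC.items()}
```

A rule like "make the output clean" has no predicate that could sit in this table, which is a quick way to spot it as ungradeable.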

## Pattern 5: one responsibility per skill

The strongest structural lever is scope. A skill that does one thing has a description that routes cleanly, a body short enough to keep in context, and an eval set that is small and sharp. A skill that does five loosely related things has a description that collides with other skills, a bloated body, and an eval set where one weak area drags down an unrelated strong one. When a skill's evals are persistently muddy, the fix is often to split it.
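
Description collisions are the easiest of these symptoms to detect ahead of time. The sketch below is a rough, hypothetical check: it assumes descriptions follow the Pattern 1 shape (quoted user phrasings, then a `NOT for` clause) and only catches exact shared phrasings, so it will miss paraphrases. The skill names in the usage note are made up.

```python
import re
from itertools import combinations


def trigger_phrases(description: str) -> set[str]:
    """Extract quoted user phrasings that appear before the 'NOT for' boundary."""
    text = " ".join(description.split())
    positive = text.split("NOT for")[0]  # ignore phrasings in the negative clause
    return {phrase.lower() for phrase in re.findall(r'"([^"]+)"', positive)}


def find_collisions(skills: dict[str, str]) -> dict[tuple[str, str], set[str]]:
    """Return each pair of skill names that share a trigger phrase, with the shared phrases."""
    collisions = {}
    for (name_a, desc_a), (name_b, desc_b) in combinations(sorted(skills.items()), 2):
        shared = trigger_phrases(desc_a) & trigger_phrases(desc_b)
        if shared:
            collisions[(name_a, name_b)] = shared
    return collisions
```

Given a `sql-migration` skill and a `schema-docs` skill that both list "add a column", the result names that pair and the phrase, which points at the description to narrow or the skill to split.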

| Concern | Anti-pattern | Pattern that passes evals |
| --- | --- | --- |
| Description | Vague summary | User phrasings + negative boundary |
| Context | Everything in the body | Short body, on-demand reference files |
| Exactness | Format described in prose | Deterministic bundled script |
| Instructions | "Make it clean" | Checkable, one-per-rubric-line rules |
| Scope | One mega-skill | One responsibility per skill |

## Common pitfalls

- **Description-as-summary**: a poetic description that never uses real user words routes poorly and tanks trigger accuracy.
- **Stuffing reference docs into the body**: it inflates context on every call and buries the actual steps.
- **Asking the model to be mechanically exact**: formats and conventions belong in scripts; expecting prose to be byte-perfect creates variance.
- **Ungradeable instructions**: if you can't write a rubric line for a rule, Claude probably can't follow it consistently either.
- **Scope creep**: bolting extra jobs onto a working skill is how you turn green evals red across the board.

## Restructure a skill in five steps

1. Rewrite the description with user phrasings and an explicit "NOT for" boundary.
2. Trim the body to steps and rules; move long reference material into separate files it reads on demand.
3. Move every must-be-exact format or convention into a bundled script.
4. Rephrase each body rule as a checkable assertion and mirror it as a rubric line.
5. If evals stay muddy, split the skill so each one has a single responsibility, and re-run.

## Frequently asked questions

### How long should a SKILL.md body be?

Long enough to hold the steps and rules, short enough that loading it on every invocation isn't wasteful. Push anything reference-heavy into separate files the body points at, so the common path stays lean.

### When should logic live in a script instead of instructions?

Whenever exactness matters — naming conventions, date formats, validation. Models are reliable at judgment and unreliable at mechanical precision, so deterministic work belongs in bundled scripts the skill invokes.

### How do I know a skill is doing too much?

When its description collides with other skills, its body is too long to scan, or its eval set has unrelated areas that move together. Those are signals to split it into single-responsibility skills.

## Bringing agentic AI to your phone lines

CallSphere applies these structural patterns to **voice and chat** agents — scoped, tool-backed assistants that answer every call and message, act mid-conversation, and book work 24/7. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/reusable-patterns-for-structuring-claude-agent-skills
