Reusable Patterns for Building Claude Agent Skills
Code-level patterns for Claude Agent Skills: checkpointed procedures, deterministic cores, thin bodies, clean composition, and skill testing that scales.
After you've shipped a handful of Agent Skills, you stop thinking about them one at a time and start noticing patterns. The good skills share a shape; the brittle ones share a set of mistakes. This post is a field guide to those reusable patterns — concrete, code-level ways to structure the prompt body, organize tools, and shape context so a skill behaves the same on its hundredth invocation as it did on its first. None of this is exotic; it's the accumulated discipline that separates a demo from something you'd let run unattended.
I'll frame each pattern as a problem and a structure, because that's how you'll actually reach for them. When a skill misbehaves, you'll recognize the symptom and know which pattern fixes it.
The procedure-with-checkpoints pattern
The most common skill failure is the model drifting off-procedure on long tasks — skipping a step, reordering work, declaring victory early. The fix is to write the body as an explicit numbered procedure and insert checkpoints the model must satisfy before moving on. Instead of "clean the data and produce a report," write: "1. Load the file. 2. Validate every row against the schema; if any row fails, stop and list the failures before continuing. 3. Only once all rows pass, aggregate. 4. Render the report."
The checkpoints ("stop and list," "only once all rows pass") turn vague intent into gates. They give the model a clear sense of done-ness at each stage and make its behavior auditable. When you read a trace later, you can see exactly which checkpoint it was at when something went wrong. In my experience, this pattern alone removes much of the flakiness on long tasks.
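To see what the gates buy you, it can help to write the same four steps as code. The following is a minimal sketch, not a prescribed implementation: the `SCHEMA` columns, the `CheckpointFailed` exception, and the function names are all hypothetical, and loading is assumed to have already produced a list of dicts.

```python
class CheckpointFailed(Exception):
    """Raised when a gate is not satisfied; carries the failures to report."""

    def __init__(self, step, failures):
        super().__init__(f"{step}: {len(failures)} failure(s)")
        self.step = step
        self.failures = failures


# Hypothetical schema for illustration: column name -> required type.
SCHEMA = {"region": str, "amount": float}


def validate_rows(rows):
    """Step 2: collect (row_index, problem) for every row that breaks the schema."""
    failures = []
    for i, row in enumerate(rows):
        for column, expected in SCHEMA.items():
            if column not in row:
                failures.append((i, f"missing column {column!r}"))
            elif not isinstance(row[column], expected):
                failures.append((i, f"{column!r} is not {expected.__name__}"))
    return failures


def run_procedure(rows):
    # Step 1 (load) is assumed done by the caller.
    failures = validate_rows(rows)
    # Checkpoint: stop and list the failures before continuing.
    if failures:
        raise CheckpointFailed("validate", failures)
    # Step 3: only reached once all rows pass.
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    # Step 4: render the report.
    return "\n".join(f"{region}: {total:.2f}" for region, total in sorted(totals.items()))
```

The point of the sketch is the shape: aggregation is unreachable until validation returns an empty failure list, which is the same guarantee the prose checkpoint asks the model to uphold.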
The thin-body, deep-reference pattern
Skills accumulate knowledge, and the temptation is to pour all of it into SKILL.md. Resist. Keep the body thin — the core procedure and the decisions the model makes every time — and push exhaustive detail into reference files the body points to conditionally. "For edge cases in date parsing, read reference/dates.md" loads that detail only when the work hits a date edge case.
The reason is context economy. Everything in the body competes with the task for the model's attention on every single run. A 4,000-word body spends that budget whether or not today's task needs the rare branches. Splitting it means common cases stay fast and lean while rare cases still have full guidance available on demand. Structure your folder so the body is a map and the references are the territory.
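One way to keep a body honest is a small lint that counts its words and checks that every reference it points to actually exists. This is a rough sketch under assumptions of my own: that reference files live under `reference/` and end in `.md`, and that 800 words is a reasonable budget. Both are illustrative choices, not rules from any specification.

```python
import re

# Assumed convention: reference files live under reference/ and end in .md.
REFERENCE_PATTERN = re.compile(r"reference/[\w\-/]+\.md")


def body_report(body_text, existing_files, max_words=800):
    """Summarize how thin a body is and whether its pointers resolve.

    existing_files is a set of relative paths present in the skill folder.
    max_words is an arbitrary budget chosen for this example.
    """
    words = len(body_text.split())
    referenced = sorted(set(REFERENCE_PATTERN.findall(body_text)))
    return {
        "words": words,
        "over_budget": words > max_words,
        "referenced": referenced,
        "missing": [path for path in referenced if path not in existing_files],
    }
```

Run against the text of SKILL.md and a listing of the skill folder, the `missing` list flags dangling pointers and `over_budget` flags a body that has started absorbing the territory.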
flowchart TD
A["Task enters skill"] --> B["Run numbered procedure"]
B --> C{"Hit an edge case?"}
C -->|No| D["Continue main path"]
C -->|Yes| E["Load matching reference file"]
E --> D
D --> F{"Checkpoint passed?"}
F -->|No| G["Stop & report failure"]
F -->|Yes| H["Produce structured output"]
The deterministic-core pattern
For anything the model tends to get subtly wrong — counting, exact formatting, schema validation, math — don't ask it to do the work in prose. Wrap that work in a script and have the skill call it. The model orchestrates; the code computes. This is the single highest-leverage reliability pattern, because it converts a probabilistic step into a deterministic one.
A practical heuristic: if you can write a unit test for a step, it probably belongs in a script. Validating that a JSON payload matches a schema, normalizing currency strings, deduplicating a list — these have right answers, so encode them. Reserve the model for steps where judgment is the point: what to include, how to phrase it, when to ask the user. A well-structured skill reads like a thin layer of judgment wrapped around a few small, tested functions.
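Two of the steps named above, normalizing currency strings and deduplicating a list, are small enough to sketch. The function names are hypothetical, and the currency helper assumes US-style input (a `$` sign, comma thousands separators, a dot decimal point), so treat it as an illustration, not a general parser.

```python
from decimal import Decimal, InvalidOperation


def normalize_currency(text):
    """Convert strings like '$1,234.5' to a Decimal with exactly two places.

    Assumes US-style formatting; other locales would need different rules.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"not a currency amount: {text!r}")
    return value.quantize(Decimal("0.01"))


def dedupe(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))
```

Each function has a right answer for any input, so each can carry a unit test, and the body only has to say when to call it.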
Structuring prompts inside the body
The body is itself a prompt, and the same prompt-engineering craft applies. Write in the imperative, address the model directly, and prefer specificity over completeness — one sharp example outperforms three paragraphs of abstract guidance. When you want a particular output shape, show it: include a short example of the exact format you expect, fenced so the model can pattern-match against it.
Be explicit about boundaries, too. State what the skill should not do as clearly as what it should: "Do not invent version numbers; if the version is unknown, ask." Negative instructions close off the failure modes you've actually seen in testing. Treat each one as a scar — a rule earned from a run that went wrong — and your bodies will get tougher over time without getting longer-winded.
Composing skills without coupling
Reusable skills are self-contained and unaware of each other. A data-validation skill and a report-formatting skill should each work alone and also compose when both are relevant, without either referencing the other by name. Achieve this by keeping each skill's contract clean: it takes a well-defined input situation and produces a well-defined output, and it doesn't assume what ran before or after it.
When you do need two skills to cooperate on a pipeline, let the agent's reasoning be the glue rather than hardcoding a dependency. The validation skill produces clean structured data; the formatting skill happens to consume exactly that shape. The coupling lives in the data contract, not in cross-references. This keeps your library modular — you can delete, swap, or upgrade any one skill without auditing the rest.
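If each skill ships a helper script, the data contract can be made concrete as a shared type. Here is a minimal sketch, assuming a hypothetical `ValidatedTable` shape; neither function refers to the other, only to the contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedTable:
    """The shared contract: column names plus rows that already passed validation."""

    columns: tuple
    rows: tuple


def validate(raw_rows, columns):
    """Validation side: emits a ValidatedTable or raises. Knows nothing about reports."""
    rows = []
    for i, raw in enumerate(raw_rows):
        missing = [c for c in columns if c not in raw]
        if missing:
            raise ValueError(f"row {i} is missing {missing}")
        rows.append(tuple(raw[c] for c in columns))
    return ValidatedTable(columns=tuple(columns), rows=tuple(rows))


def format_report(table):
    """Formatting side: accepts any ValidatedTable, whatever produced it."""
    header = " | ".join(table.columns)
    lines = [" | ".join(str(cell) for cell in row) for row in table.rows]
    return "\n".join([header, *lines])
```

Because `format_report` depends only on the shape, you could replace `validate` with a different producer, or build a `ValidatedTable` by hand in a test, without touching the formatter.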
The observable-skill pattern
A skill you can't see inside is a skill you can't trust at scale. The most reliable skills are written so their behavior is legible from a trace — you can read back exactly which step ran, which reference loaded, and which tool was called with what arguments. Achieve this by making each step in the body produce a visible artifact: a checkpoint that prints what it verified, a script that logs its inputs and outputs, a clear statement before each tool call about why it's being made.
This isn't busywork; it's how you debug an agent in production. When a release-notes skill produces a wrong changelog, an observable skill lets you pinpoint the failure in seconds — the parse script returned bad data, or a style rule was misread — instead of staring at a final output with no idea how it got there. Bake the legibility in from the start. Skills designed to be read after the fact are the ones you can actually operate, and the small upfront cost of structured, visible steps repays itself the first time something goes wrong at 2 a.m.
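For the scripts a skill calls, one lightweight way to get this legibility is a decorator that writes a JSON line per call. The sketch below is one possible approach, not a standard: the `traced` name and the record fields are my own invention, and it logs to stderr by default so the script's stdout stays clean for the model to read.

```python
import functools
import json
import sys


def traced(step_name, stream=sys.stderr):
    """Decorator: write one JSON line per call with the step's inputs and result."""

    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            record = {"step": step_name, "args": args, "kwargs": kwargs}
            try:
                result = func(*args, **kwargs)
                record["output"] = result
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # default=repr keeps logging from failing on non-JSON values.
                print(json.dumps(record, default=repr), file=stream)

        return inner

    return wrap


@traced("parse_commits")  # hypothetical step in a release-notes skill
def parse_commits(lines):
    """Return the prefix before the first colon of each commit subject."""
    return [line.split(":", 1)[0] for line in lines]
```

With this in place, a bad changelog can be traced back to the exact inputs and outputs of the parse step instead of being inferred from the final text.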
Versioning and testing skills like code
Skills are text and scripts, so treat them like any other code asset: keep them in version control, write down what each is for, and build a small evaluation set. A skill's eval set is just a handful of representative tasks plus the expected behavior for each: the right output, the right refusal, the right question back. Run them after every edit. Because a skill's files and procedure are fixed even though the model's output is probabilistic, a failing eval after an edit usually points straight at that edit: a description tweak that stops the skill from firing, a body edit that drops a checkpoint.
The discipline pays off most as a library grows. Without tests, every change risks silent breakage in skills you forgot existed. With even a thin eval per skill, you edit confidently and ship often. The teams that get real leverage from Skills are the ones that treat them as a tested, versioned part of the codebase — not as throwaway prompts.
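A thin eval harness does not need much machinery. The sketch below assumes a hypothetical `run_skill` callable that sends a task to your agent with the skill installed and returns its text output; how you implement that depends on your setup, and nothing here is an official API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[str], bool]  # returns True when the behavior is acceptable


def run_evals(run_skill, cases):
    """Run every case through run_skill and return the names of failing cases."""
    failed = []
    for case in cases:
        output = run_skill(case.task)
        if not case.check(output):
            failed.append(case.name)
    return failed


# Illustrative cases for a release-notes skill: a right output and a right question back.
CASES = [
    EvalCase("has_heading", "Write notes for v1.2.0", lambda out: out.startswith("## v1.2.0")),
    EvalCase("asks_when_unknown", "Write notes, version unknown", lambda out: out.rstrip().endswith("?")),
]
```

Since model output varies from run to run, checks that test properties (a heading is present, a question is asked) tend to hold up better than exact string matches.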
Frequently asked questions
When should logic be a script versus prose in the body?
If a step has a right answer you could unit-test — counting, formatting, schema validation, math — put it in a script. Reserve the prose body for judgment: what to include, tone, when to ask the user. The script makes the deterministic part reliable; the body orchestrates around it.
How do I keep a skill from drifting on long tasks?
Use the procedure-with-checkpoints pattern: write the body as numbered steps with explicit gates the model must satisfy before continuing ("stop and list failures before aggregating"). Checkpoints give a clear sense of done-ness at each stage and make traces auditable.
How do skills compose without becoming tangled?
Keep each skill self-contained and unaware of the others. Let cooperation happen through clean data contracts — one skill's output is the shape another consumes — rather than hardcoded cross-references. That keeps the library modular and lets you upgrade any skill in isolation.
Do skills need their own tests?
Yes. Give each skill a small eval set of representative tasks plus expected behavior, kept in version control. Run it after every edit. It's the cheapest way to catch a description tweak that breaks firing or a body edit that drops a checkpoint.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.