Prompt and context design for reliable Agent Skills

The most common reason a well-intentioned Agent Skill underperforms isn't a missing instruction — it's too many of them. Engineers pour every edge case, every reference doc, and every reminder into SKILL.md, and the result is a skill that is slower, more expensive, and paradoxically less reliable, because the signal that matters drowns in context that doesn't. Designing a skill's context is an exercise in subtraction as much as addition: deciding what earns a place in the prompt, what belongs in a file loaded on demand, and what should be left out entirely. This post is about making those calls deliberately and verifying them with skill-creator.

Context is a budget, not a free resource. Everything you load competes for the model's attention and adds latency and cost to every invocation. The skills that stay reliable across model versions are the ones whose context is curated, not accumulated.

Key takeaways

Context is a budget: every token in the body taxes every call, so put only what earns its place.
Lead with the task, not the caveats: the core instruction goes first; edge cases come after or in reference files.
Use progressive disclosure: keep the body lean and let Claude pull deep references only when a task needs them.
Examples beat adjectives: one concrete worked example steers behavior better than a paragraph of "be thorough."
Measure what you trim: use skill-creator to confirm that removing context didn't drop a rubric line.

Context is a budget, not a backpack

It's tempting to treat SKILL.md like a backpack — keep adding things in case they're useful. But every instruction loaded at selection time stays in context for the whole task, competing with the user's actual request for the model's attention. Past a point, more instructions reduce reliability: the model has to weigh a dozen rules of varying relevance on every step, and the important ones lose salience. The discipline is to ask of each line, "does this change behavior on a real scenario?" If you can't point to a scenario it fixes, it's probably noise.

This is why the testing loop and the context-design loop are the same loop. You don't trim by intuition; you trim and re-run the eval set, and you keep the cut only if no rubric line regressed. Subtraction, verified.
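The keep-or-revert decision can be sketched in a few lines. This is a minimal illustration, not skill-creator's actual API: the function names and the dictionary of per-rubric-line pass rates are assumptions made for the example.

```python
def regressed_lines(before, after):
    """Return rubric lines whose pass rate dropped after a trim.

    `before` and `after` map a rubric line name to a pass rate in [0, 1].
    This data shape is hypothetical, chosen only for illustration.
    A line missing from `after` is treated as a regression.
    """
    return sorted(
        name for name, rate in before.items()
        if after.get(name, 0.0) < rate
    )


def keep_cut(before, after):
    """Keep the trim only if no rubric line regressed."""
    return not regressed_lines(before, after)


baseline = {"user-facing wording": 1.0, "no commit hashes": 0.8}
trimmed = {"user-facing wording": 1.0, "no commit hashes": 0.6}

print(regressed_lines(baseline, trimmed))
print(keep_cut(baseline, trimmed))
```

Here the trim would be reverted, because one rubric line passed less often after the cut.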

What goes in, what stays out

A useful sorting rule has three bins. The body holds the task definition, the must-always-apply rules, and the steps — the things needed on nearly every invocation. Reference files hold depth used occasionally: full API specs, long style guides, exhaustive example libraries. And some things stay out entirely — context the MCP server or a script should own, like credentials, or generic model knowledge Claude already has and doesn't need restated.


flowchart TD
  A["Candidate context"] --> B{"Needed on most calls?"}
  B -->|Yes| C["Put in SKILL.md body"]
  B -->|No| D{"Needed sometimes?"}
  D -->|Yes| E["Move to reference file"]
  D -->|No| F{"Owned elsewhere?"}
  F -->|Yes| G["Server / script holds it"]
  F -->|No| H["Leave it out"]

Run every candidate piece of context through this filter. Most things engineers reflexively put in the body actually belong in a reference file or don't belong at all, and moving them out is what keeps the common path fast and focused.
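The filter above is simple enough to write down as a function. The sketch below mirrors the flowchart's three questions in order; the `Bin` enum and `sort_context` name are hypothetical, introduced only for this example.

```python
from enum import Enum


class Bin(Enum):
    BODY = "Put in SKILL.md body"
    REFERENCE = "Move to reference file"
    EXTERNAL = "Server / script holds it"
    OMIT = "Leave it out"


def sort_context(needed_on_most_calls, needed_sometimes, owned_elsewhere):
    """Apply the three questions in the same order as the flowchart."""
    if needed_on_most_calls:
        return Bin.BODY
    if needed_sometimes:
        return Bin.REFERENCE
    if owned_elsewhere:
        return Bin.EXTERNAL
    return Bin.OMIT


# A full API spec: not needed on most calls, but needed sometimes.
print(sort_context(False, True, False).value)
# Credentials: never needed in context, owned by the server.
print(sort_context(False, False, True).value)
```

The order of the checks matters: frequency of use is asked first, so something needed on most calls lands in the body even if another component could also hold it.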

Order and emphasis inside the prompt

Where something sits in the body matters. The core task and the highest-priority rules belong near the top, stated plainly, because they should anchor everything that follows. Edge cases, exceptions, and "only if" branches belong lower, after the model already has the main frame. Burying the central instruction under three paragraphs of caveats is a reliable way to get a model that handles the edge cases beautifully and the common case poorly.
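One way to make the ordering hard to get wrong is to assemble the body from labeled parts. The helper below is a hypothetical sketch (`assemble_body` is not part of any skill tooling); it only illustrates putting the task first, core rules second, and edge cases last.

```python
def assemble_body(task, core_rules, edge_cases):
    """Build body text with the task first and edge cases last."""
    parts = [task.strip(), ""]
    parts += [f"- {rule}" for rule in core_rules]
    if edge_cases:
        parts += ["", "Edge cases:"]
        parts += [f"- {case}" for case in edge_cases]
    return "\n".join(parts)


body = assemble_body(
    "Write release notes for the given changelog.",
    ["Never modify files outside the target directory."],
    ["If the changelog is empty, say so instead of inventing entries."],
)
print(body)
```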

Emphasis is a tool, used sparingly. Marking a genuinely critical constraint — "never modify files outside the target directory" — as a hard rule works precisely because most lines aren't marked that way. If everything is emphasized, nothing is.

Show, don't tell: examples as context

Adjectives are weak instructions. "Write concise, professional release notes" leaves enormous room for interpretation; one short worked example showing exactly the format, tone, and level of detail you want collapses that ambiguity. Token for token, a single good example is often worth more than a paragraph of description, and it's far easier to grade against: your rubric can simply check that the output matches the demonstrated shape.

# In SKILL.md, a worked example beats a description:

Example output for one feature:
  ## Features
  - Bulk export now supports CSV and Parquet. Large exports
    stream in the background; you'll get an email when ready.

Match this voice: user benefit first, mechanism second, no commit hashes.

That single example does the work of several abstract rules — and when skill-creator grades "user-facing wording" or "no commit hashes," the model has a concrete target to imitate rather than an adjective to interpret.
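To show what "gradeable" can look like, here is a small sketch of mechanical checks against the demonstrated shape. It is an assumption-laden illustration, not how skill-creator grades: the function name, the check names, and the commit-hash heuristic (7 to 40 lowercase hex characters containing at least one digit and one letter) are all invented for this example.

```python
import re

# Heuristic only: a lowercase hex run of 7-40 characters that mixes
# digits and letters. Real hashes that happen to be all digits or all
# letters would slip through.
COMMIT_HASH = re.compile(
    r"\b(?=[0-9a-f]*[0-9])(?=[0-9a-f]*[a-f])[0-9a-f]{7,40}\b"
)


def check_release_notes(text):
    """Return a pass/fail result per (hypothetical) rubric line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return {
        "has features heading": any(ln.strip() == "## Features" for ln in lines),
        "uses bullets": any(ln.lstrip().startswith("- ") for ln in lines),
        "no commit hashes": COMMIT_HASH.search(text) is None,
    }


good = (
    "## Features\n"
    "- Bulk export now supports CSV and Parquet. Large exports\n"
    "  stream in the background; you'll get an email when ready.\n"
)
bad = "## Features\n- Fixed export bug (a1b2c3d)\n"

print(check_release_notes(good))
print(check_release_notes(bad))
```

Checks like these cover only the mechanical part of the shape; tone and "user benefit first" still need a human or model grader.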


Measuring context decisions with skill-creator

Every context choice is a hypothesis you can test. Removed a paragraph you suspected was noise? Re-run the eval set; if all rubric lines hold, the cut was free and you've made the skill faster. Added an example to fix a tone miss? Re-run and watch that one rubric line climb. Because the harness samples each scenario several times, it also tells you whether a context change improved stability, not just the average — a trimmed, focused body often reduces variance even when the mean score barely moves.
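The difference between "average" and "stability" is easy to see with a small calculation. The sketch below assumes you have a list of scores per rubric line, one per sampled run; that data shape and the `summarize` helper are hypothetical, and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def summarize(samples):
    """Map each rubric line to (mean score, spread across samples)."""
    return {name: (mean(s), pstdev(s)) for name, s in samples.items()}


# Illustrative scores on a 0-1 scale, four samples per scenario.
before_trim = {"tone": [0.9, 0.5, 1.0, 0.6]}
after_trim = {"tone": [0.8, 0.7, 0.8, 0.7]}

for label, data in (("before", before_trim), ("after", after_trim)):
    avg, spread = summarize(data)["tone"]
    print(f"{label}: mean={avg:.2f} spread={spread:.2f}")
```

In this made-up data the mean is the same before and after, but the spread shrinks, which is the kind of improvement a single-sample comparison would miss.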

Context type	Default location	Why
Task definition + core rules	SKILL.md body	Needed on nearly every call
Worked example	SKILL.md body	Steers behavior cheaply and gradeably
Full API / style spec	Reference file	Deep but only sometimes needed
Credentials / tokens	MCP server	Must stay out of context entirely
Generic model knowledge	Nowhere	Already known; restating wastes budget

Common pitfalls

The kitchen-sink body: dumping every edge case into SKILL.md dilutes the rules that matter and slows every call.
Caveats before the core task: leading with exceptions makes the model good at edge cases and bad at the common one.
Over-emphasis: marking everything critical makes nothing critical; reserve hard rules for the few that truly are.
Adjectives instead of examples: "be clear and professional" is unmeasurable; a worked example is concrete and gradeable.
Trimming without re-running: cutting context by feel can silently drop a behavior; always verify against the eval set.

Tune a skill's context in five steps

1. List every piece of context currently in the body and run each through the in/out filter.
2. Move occasionally needed depth into reference files that the body points to and Claude loads on demand.
3. Reorder the body so the core task and top rules come first, edge cases after.
4. Replace at least one vague instruction with a concrete worked example.
5. Re-run the skill-creator eval set with several samples per scenario; keep only the changes that hold or improve scores without increasing variance.

Frequently asked questions

How do I decide what to put in SKILL.md versus a reference file?

Ask how often it's needed. Context required on nearly every call belongs in the body; depth used only sometimes belongs in a reference file that the body points to and Claude reads on demand, which keeps the common path lean.

Do longer skill prompts make a skill more reliable?

Past a certain point, usually the opposite. Extra instructions compete for the model's attention and bury the rules that matter, so a curated, focused context is more reliable than an exhaustive one. Trim and verify with evals.

Why prefer examples over descriptive instructions?

A concrete worked example removes ambiguity that adjectives leave open, and it's directly gradeable — your rubric can check the output matches the demonstrated shape rather than interpreting words like "concise."

Bringing agentic AI to your phone lines

CallSphere applies this same context discipline to voice and chat agents — lean, well-prompted assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
