Debugging Agent Skills: Loops, Wrong Tools, Bad Args

The first time a Skill you wrote for Claude Code spins in a loop — reading the same file, calling the same tool, never converging — you learn something uncomfortable: a Skill is a program, and like any program it can have bugs. But the bugs don't show up as stack traces. They show up as behavior. Claude keeps retrying. It calls a tool with a malformed argument. It invents a flag that doesn't exist. Debugging these failures is a different discipline from debugging code, and most engineers approach it backwards.

This post is a practical guide to the three failure modes you will actually hit when refining a Skill with the skill-creator workflow: loops, wrong tool calls, and hallucinated arguments. For each, we cover how to recognize it in a trace, what usually causes it, and the specific change to your SKILL.md that fixes it.

Key takeaways

Most Skill failures trace back to ambiguous instructions or under-specified tool contracts, not model weakness.
Loops almost always mean the Skill lacks a clear stop condition or success signal.
Wrong tool calls usually come from overlapping tool descriptions — fix the descriptions, not the prompt.
Hallucinated arguments are a schema problem: tighten the JSON schema and add one worked example.
Always reproduce a failure on a fixed transcript before changing anything, so you can confirm the fix.

An Agent Skill is a folder of instructions, scripts, and resources that Claude loads dynamically when a task matches the Skill's description, extending what the agent can do without retraining the model. Because the Skill is just text and files, every failure is debuggable by reading the transcript and editing the source.

How do you read an agent trace to find the bug?

Start by getting the full transcript: the system prompt, the loaded Skill content, every tool call with its arguments, and every tool result. In Claude Code you can inspect this directly; with the Agent SDK, log the raw message list. Do not summarize it. Read the literal token stream, because the bug is in something specific that was said or returned, and summaries hide it.

Look for the turn where the run first goes wrong — not where it visibly fails. A loop that becomes obvious at turn 12 usually started at turn 3, when the model formed a wrong belief and never got contradicting evidence. The diagnostic question at each suspect turn is: given exactly what Claude could see at this point, was its action reasonable? If yes, the bug is upstream — in the context it was given. If no, the bug is in the instruction that should have steered it.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart TD
  A["Run fails or stalls"] --> B["Capture full transcript"]
  B --> C{"Find first wrong turn"}
  C --> D{"Was the action reasonable\ngiven visible context?"}
  D -->|Yes| E["Bug is upstream:\nfix context or tool result"]
  D -->|No| F["Bug is local:\nfix SKILL.md instruction"]
  E --> G["Edit, replay same transcript"]
  F --> G
  G --> H{"Failure gone?"}
  H -->|No| C
  H -->|Yes| I["Lock with a regression eval"]

Why does my Skill get stuck in a loop?

Loops have one root cause in three disguises: the agent has no unambiguous signal that it is done. The disguises are (1) no explicit success criterion, so Claude keeps polishing; (2) a retry instruction with no retry budget, so a failing tool gets called forever; and (3) two steps that undo each other, so the agent oscillates.

The fix for all three is to make termination explicit. Give the Skill a definition of done it can check, and cap retries with a hard number. Concretely, add a block like this to your SKILL.md:

## Stopping rules
- You are DONE when: tests pass AND the diff touches only files listed in scope.md.
- If a tool call fails, retry at most twice. On the 3rd failure, stop and
  report the exact error plus the command you ran. Do NOT try alternative tools.
- Never run the same shell command twice in a row with identical arguments.
  If you would, you are looping — stop and explain what you are missing.

That last rule is deceptively powerful. By naming the loop pattern and instructing the model to treat it as a signal, you convert an infinite loop into a useful error message. The model is good at noticing "I am about to repeat myself" once you tell it that repetition is meaningful.

Why is Claude calling the wrong tool?

When an agent reaches for grep where you wanted a semantic search, or edits a file when it should have run a script, the instinct is to add a paragraph of prose telling it which tool to prefer. That rarely works, because the model selects tools primarily from their descriptions, not from buried prose. If two tools have descriptions that both plausibly fit the task, selection becomes a coin flip the prompt can't reliably override.

The durable fix is to make tool descriptions mutually exclusive. Each description should state exactly when to use it and, critically, when not to. Compare a vague pair against a disambiguated pair:

// Before — overlapping, ambiguous
{ "name": "search_code",  "description": "Search the codebase" }
{ "name": "read_file",    "description": "Read a file" }

// After — mutually exclusive, with negative guidance
{ "name": "search_code",
  "description": "Find WHERE a symbol or string lives across the repo when you do
   NOT yet know the file path. Returns file:line matches, not full contents." }
{ "name": "read_file",
  "description": "Read the full contents of ONE file whose exact path you already
   know. Do not use to search; if you lack the path, call search_code first." }

After tightening descriptions, re-run the failing transcript. If selection is still wrong, the next lever is ordering: put the tool you want preferred earlier in the tool list and reference it by name in the Skill's procedure, e.g. "Step 1: locate the handler with search_code."

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Why does it hallucinate arguments?

Hallucinated arguments — an invented parameter, a date in the wrong format, a flag the CLI doesn't support — are a schema and example problem. Models fill gaps with plausible-looking values. The cure is to leave no gaps: make the JSON schema strict, enumerate allowed values, and show one fully-worked call so the model has a concrete pattern to copy rather than invent.

Use enum for any field with a fixed value set, and the model can't drift outside it.
Mark every required field required and forbid extras with additionalProperties: false so unknown keys are rejected loudly instead of silently passed.
Put an exact example invocation in the tool description: e.g. {"status":"open","limit":20}. One example beats three paragraphs of rules.

Common pitfalls

Changing two things at once. You edit the instructions and the tool schema in one pass, the failure goes away, and you don't know which fix mattered. Change one variable, replay, then change the next.
Debugging on a live run. Live runs are non-deterministic, so you can't tell if your fix worked or you got lucky. Freeze the failing transcript and replay against it.
Patching symptoms with more prose. Adding "please don't loop" to a 2,000-word Skill dilutes every instruction. Find the missing stop condition or the ambiguous tool instead.
Ignoring the tool result. Half of "model" bugs are actually a tool returning an error string the model treats as data. Read what the tool returned, not just what the model did.
No regression net. You fix a loop, then a later edit reintroduces it. Capture each fixed transcript as an eval case so the bug can't silently return.

Debug a Skill failure in 6 steps

Reproduce the failure and save the complete transcript to a file.
Scan forward to the first turn where the agent's belief or action first went wrong.
Ask whether that action was reasonable given only the context visible at that turn.
Classify it: missing stop condition (loop), ambiguous tool (wrong call), or loose schema (bad args).
Make exactly one targeted edit — a stopping rule, a disambiguated description, or a stricter schema with an example.
Replay the same transcript, confirm the fix, and promote the case into your regression eval set.

Quick reference: failure mode to fix

Symptom	Likely cause	Fix
Repeats same action	No stop condition	Explicit "done when" + retry cap
Oscillates between steps	Steps undo each other	Order steps; forbid back-tracking
Picks wrong tool	Overlapping descriptions	Mutually exclusive descriptions
Invents a parameter	Loose schema, no example	enum + required + worked example
Bad date/format value	Unspecified format	State format in schema, show one

Frequently asked questions

How do I tell a model bug from an instruction bug?

Ask if the action was reasonable given the visible context. If yes, your instructions or tool results were the problem; the model did its best with bad inputs. Genuine model errors are rarer than they feel.

Should I lower temperature to stop hallucinated args?

It can reduce frequency but doesn't fix the root cause. A strict schema with enum and additionalProperties: false structurally prevents the invalid call, which is more reliable than nudging sampling.

My loop only happens sometimes — how do I debug it?

Run the same prompt several times, save every transcript, and diff the runs that loop against the ones that don't. The divergence point reveals the fragile instruction or the tool result that flips behavior.

Does a bigger model make these go away?

A stronger model like Opus 4.8 tolerates ambiguity better, but it won't invent a stop condition you forgot to write. Fixing the Skill makes every model on it more reliable and cheaper.

Bringing agentic AI to your phone lines

CallSphere takes the same debugging discipline — clear stop conditions, tight tool contracts, replayable traces — and applies it to voice and chat agents that handle every call and message, use tools mid-conversation, and book work around the clock. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Debugging Agent Skills: Loops, Wrong Tools, Bad Args

Key takeaways

How do you read an agent trace to find the bug?

Why does my Skill get stuck in a loop?

Why is Claude calling the wrong tool?

Why does it hallucinate arguments?

Common pitfalls

Debug a Skill failure in 6 steps

Quick reference: failure mode to fix

Frequently asked questions

How do I tell a model bug from an instruction bug?

Should I lower temperature to stop hallucinated args?

My loop only happens sometimes — how do I debug it?

Does a bigger model make these go away?

Bringing agentic AI to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Cowork is heading and how to prepare

Where Claude Code GTM engineering is heading next

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild