Debugging Agent Skills: Loops, Wrong Tools, Bad Args
Diagnose and fix the three Agent Skill failure modes on Claude — loops, wrong tool calls, and hallucinated arguments — with concrete traces and edits.
The first time a Skill you wrote for Claude Code spins in a loop — reading the same file, calling the same tool, never converging — you learn something uncomfortable: a Skill is a program, and like any program it can have bugs. But the bugs don't show up as stack traces. They show up as behavior. Claude keeps retrying. It calls a tool with a malformed argument. It invents a flag that doesn't exist. Debugging these failures is a different discipline from debugging code, and most engineers approach it backwards.
This post is a practical guide to the three failure modes you will actually hit when refining a Skill with the skill-creator workflow: loops, wrong tool calls, and hallucinated arguments. For each, we cover how to recognize it in a trace, what usually causes it, and the specific change to your SKILL.md that fixes it.
Key takeaways
- Most Skill failures trace back to ambiguous instructions or under-specified tool contracts, not model weakness.
- Loops almost always mean the Skill lacks a clear stop condition or success signal.
- Wrong tool calls usually come from overlapping tool descriptions — fix the descriptions, not the prompt.
- Hallucinated arguments are a schema problem: tighten the JSON schema and add one worked example.
- Always reproduce a failure on a fixed transcript before changing anything, so you can confirm the fix.
An Agent Skill is a folder of instructions, scripts, and resources that Claude loads dynamically when a task matches the Skill's description, extending what the agent can do without retraining the model. Because the Skill is just text and files, every failure is debuggable by reading the transcript and editing the source.
How do you read an agent trace to find the bug?
Start by getting the full transcript: the system prompt, the loaded Skill content, every tool call with its arguments, and every tool result. In Claude Code you can inspect this directly; with the Agent SDK, log the raw message list. Do not summarize it. Read the messages verbatim, because the bug is in something specific that was said or returned, and summaries hide it.
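One way to keep a transcript unsummarized is to write every message to disk exactly as it is. The sketch below assumes the messages are plain JSON-serializable dicts; the function names and the one-object-per-line layout are illustrative choices, not part of Claude Code or the Agent SDK.

```python
import json
from pathlib import Path


def save_transcript(messages, path):
    """Write one JSON object per line so each turn can be read and diffed on its own."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for turn, message in enumerate(messages):
            # Store the turn index next to the untouched message; nothing is summarized.
            record = {"turn": turn, "message": message}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_transcript(path):
    """Read the file back into the original list of messages, in order."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line)["message"] for line in fh if line.strip()]
```

Keeping the turn index in each record makes it easy to point at "turn 3" when you later discuss where the run first went wrong.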
Look for the turn where the run first goes wrong — not where it visibly fails. A loop that becomes obvious at turn 12 usually started at turn 3, when the model formed a wrong belief and never got contradicting evidence. The diagnostic question at each suspect turn is: given exactly what Claude could see at this point, was its action reasonable? If yes, the bug is upstream — in the context it was given. If no, the bug is in the instruction that should have steered it.
flowchart TD
A["Run fails or stalls"] --> B["Capture full transcript"]
B --> C{"Find first wrong turn"}
C --> D{"Was the action reasonable\ngiven visible context?"}
D -->|Yes| E["Bug is upstream:\nfix context or tool result"]
D -->|No| F["Bug is local:\nfix SKILL.md instruction"]
E --> G["Edit, replay same transcript"]
F --> G
G --> H{"Failure gone?"}
H -->|No| C
H -->|Yes| I["Lock with a regression eval"]
Why does my Skill get stuck in a loop?
Loops have one root cause in three disguises: the agent has no unambiguous signal that it is done. The disguises are (1) no explicit success criterion, so Claude keeps polishing; (2) a retry instruction with no retry budget, so a failing tool gets called forever; and (3) two steps that undo each other, so the agent oscillates.
The fix for all three is to make termination explicit. Give the Skill a definition of done it can check, and cap retries with a hard number. Concretely, add a block like this to your SKILL.md:
## Stopping rules
- You are DONE when: tests pass AND the diff touches only files listed in scope.md.
- If a tool call fails, retry at most twice. On the 3rd failure, stop and
  report the exact error plus the command you ran. Do NOT try alternative tools.
- Never run the same shell command twice in a row with identical arguments.
  If you would, you are looping: stop and explain what you are missing.
That last rule is deceptively powerful. By naming the loop pattern and instructing the model to treat it as a signal, you convert an infinite loop into a useful error message. The model is good at noticing "I am about to repeat myself" once you tell it that repetition is meaningful.
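The same stopping rules can be checked after the fact against a saved trace. The sketch below is a minimal example under stated assumptions: each tool call is a dict with hypothetical `name`, `args`, and `failed` keys, and the budget of two retries mirrors the rule above.

```python
import json


def call_key(call):
    """Canonical form of a tool call: the name plus its arguments with sorted keys."""
    return call["name"], json.dumps(call["args"], sort_keys=True)


def find_loop_signals(calls, max_retries=2):
    """Return (turn, reason) pairs for back-to-back repeats and exhausted retry budgets.

    `calls` is a list of dicts with hypothetical keys: name, args, failed.
    """
    signals = []
    failures = {}
    previous = None
    for turn, call in enumerate(calls):
        key = call_key(call)
        if key == previous:
            signals.append((turn, "identical call repeated back to back"))
        if call.get("failed"):
            failures[key] = failures.get(key, 0) + 1
            # The first attempt plus max_retries retries is the whole budget.
            if failures[key] == max_retries + 1:
                signals.append((turn, "retry budget exhausted"))
        previous = key
    return signals
```

Running this over a looping transcript tells you the turn at which the Skill should have stopped, which is usually the turn worth reading most closely.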
Why is Claude calling the wrong tool?
When an agent reaches for grep where you wanted a semantic search, or edits a file when it should have run a script, the instinct is to add a paragraph of prose telling it which tool to prefer. That rarely works, because the model selects tools primarily from their descriptions, not from buried prose. If two tools have descriptions that both plausibly fit the task, selection becomes a coin flip the prompt can't reliably override.
The durable fix is to make tool descriptions mutually exclusive. Each description should state exactly when to use it and, critically, when not to. Compare a vague pair against a disambiguated pair:
// Before: overlapping, ambiguous
{ "name": "search_code", "description": "Search the codebase" }
{ "name": "read_file", "description": "Read a file" }
// After: mutually exclusive, with negative guidance
{
  "name": "search_code",
  "description": "Find WHERE a symbol or string lives across the repo when you do NOT yet know the file path. Returns file:line matches, not full contents."
}
{
  "name": "read_file",
  "description": "Read the full contents of ONE file whose exact path you already know. Do not use to search; if you lack the path, call search_code first."
}
After tightening descriptions, re-run the failing transcript. If selection is still wrong, the next lever is ordering: put the tool you want preferred earlier in the tool list and reference it by name in the Skill's procedure, e.g. "Step 1: locate the handler with search_code."
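A crude lint can catch the vaguest descriptions before a run ever fails. The sketch below is a heuristic and nothing more: it assumes tools are dicts with `name` and `description` keys, and it only flags descriptions that are very short or that contain no negative guidance. The word threshold and the negation pattern are arbitrary choices for illustration.

```python
import re

# Words that suggest the description says when NOT to use the tool.
NEGATION = re.compile(r"\b(not|never|don't|instead of)\b", re.IGNORECASE)


def lint_tool_descriptions(tools, min_words=8):
    """Return {tool name: [warnings]} for descriptions likely to overlap with others."""
    report = {}
    for tool in tools:
        description = tool.get("description", "")
        warnings = []
        if len(description.split()) < min_words:
            warnings.append("too short to say when to use it")
        if not NEGATION.search(description):
            warnings.append("no guidance on when NOT to use it")
        if warnings:
            report[tool["name"]] = warnings
    return report
```

Applied to the two pairs above, this flags both of the "before" descriptions and neither of the "after" ones. It cannot judge whether two long descriptions still overlap in meaning, so treat a clean report as a minimum bar, not proof.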
Why does it hallucinate arguments?
Hallucinated arguments — an invented parameter, a date in the wrong format, a flag the CLI doesn't support — are a schema and example problem. Models fill gaps with plausible-looking values. The cure is to leave no gaps: make the JSON schema strict, enumerate allowed values, and show one fully-worked call so the model has a concrete pattern to copy rather than invent.
- Use enum for any field with a fixed value set, and the model can't drift outside it.
- Mark every required field required and forbid extras with additionalProperties: false so unknown keys are rejected loudly instead of silently passed.
- Put an exact example invocation in the tool description: e.g. {"status":"open","limit":20}. One example beats three paragraphs of rules.
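To see what "rejected loudly" can look like, here is a hand-rolled check that sits in front of a tool. It is a sketch, not a full JSON Schema validator: it handles only `required`, `enum`, and `additionalProperties: false`, and the `LIST_ISSUES_SCHEMA` tool schema is hypothetical, built to match the example call above.

```python
def check_args(args, schema):
    """Return a list of problems with `args`; an empty list means the call is acceptable.

    Handles only `required`, `enum`, and `additionalProperties: false`.
    """
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in properties:
            # Unknown keys are errors only when the schema forbids extras.
            if schema.get("additionalProperties") is False:
                problems.append(f"unknown field: {name}")
            continue
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {allowed}")
    return problems


# Hypothetical schema for an issue-listing tool.
LIST_ISSUES_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed"]},
        "limit": {"type": "integer"},
    },
    "required": ["status"],
    "additionalProperties": False,
}
```

Returning the problems as text, rather than raising, lets you hand them straight back to the model as the tool result, so its next attempt has a specific error to correct.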
Common pitfalls
- Changing two things at once. You edit the instructions and the tool schema in one pass, the failure goes away, and you don't know which fix mattered. Change one variable, replay, then change the next.
- Debugging on a live run. Live runs are non-deterministic, so you can't tell if your fix worked or you got lucky. Freeze the failing transcript and replay against it.
- Patching symptoms with more prose. Adding "please don't loop" to a 2,000-word Skill dilutes every instruction. Find the missing stop condition or the ambiguous tool instead.
- Ignoring the tool result. Half of "model" bugs are actually a tool returning an error string the model treats as data. Read what the tool returned, not just what the model did.
- No regression net. You fix a loop, then a later edit reintroduces it. Capture each fixed transcript as an eval case so the bug can't silently return.
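The "ignoring the tool result" pitfall is easier to avoid when a failure cannot be mistaken for output. A minimal sketch, assuming your tools are ordinary Python callables; the envelope keys `ok`, `result`, and `error` and the `read_config` tool are hypothetical names chosen for illustration.

```python
def run_tool(tool, **kwargs):
    """Call `tool` and return an envelope that marks failure explicitly."""
    try:
        return {"ok": True, "result": tool(**kwargs)}
    except Exception as exc:
        # Surface the failure as a flagged error, never as ordinary output text.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def read_config(path):
    """Hypothetical tool used only for the demonstration."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```

With the envelope in place, a trace shows at a glance which results were errors, and the Skill's stopping rules can refer to a failed call unambiguously.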
Debug a Skill failure in 6 steps
- Reproduce the failure and save the complete transcript to a file.
- Scan forward to the first turn where the agent's belief or action went wrong.
- Ask whether that action was reasonable given only the context visible at that turn.
- Classify it: missing stop condition (loop), ambiguous tool (wrong call), or loose schema (bad args).
- Make exactly one targeted edit — a stopping rule, a disambiguated description, or a stricter schema with an example.
- Replay the same transcript, confirm the fix, and promote the case into your regression eval set.
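The last step, promoting a fixed case into a regression set, can be as small as a list of saved transcripts with one check each. The sketch below assumes you supply your own `replay` function that re-runs a saved transcript and returns the resulting tool calls; that function, the case keys, and the example check are all hypothetical.

```python
def run_regression(cases, replay):
    """Replay each saved case and return the names of the cases whose check fails.

    `replay` is supplied by the caller: it takes a saved transcript and returns
    the tool calls of a fresh run. Each case has keys: name, transcript, check.
    """
    failed = []
    for case in cases:
        calls = replay(case["transcript"])
        if not case["check"](calls):
            failed.append(case["name"])
    return failed


def no_back_to_back_repeat(calls):
    """Example check: no tool call is immediately repeated with identical arguments."""
    return all(a != b for a, b in zip(calls, calls[1:]))
```

Because live runs are non-deterministic, you may want to replay each case several times and treat any failing run as a failure of the case.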
Quick reference: failure mode to fix
| Symptom | Likely cause | Fix |
|---|---|---|
| Repeats same action | No stop condition | Explicit "done when" + retry cap |
| Oscillates between steps | Steps undo each other | Order steps; forbid back-tracking |
| Picks wrong tool | Overlapping descriptions | Mutually exclusive descriptions |
| Invents a parameter | Loose schema, no example | enum + required + worked example |
| Bad date/format value | Unspecified format | State format in schema, show one |
Frequently asked questions
How do I tell a model bug from an instruction bug?
Ask if the action was reasonable given the visible context. If yes, your instructions or tool results were the problem; the model did its best with bad inputs. Genuine model errors are rarer than they feel.
Should I lower temperature to stop hallucinated args?
It can reduce frequency but doesn't fix the root cause. A strict schema with enum and additionalProperties: false, checked before the tool runs, rejects the invalid call structurally, which is more reliable than nudging sampling.
My loop only happens sometimes — how do I debug it?
Run the same prompt several times, save every transcript, and diff the runs that loop against the ones that don't. The divergence point reveals the fragile instruction or the tool result that flips behavior.
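Finding the divergence point by eye gets tedious past a few turns. A minimal sketch, assuming each run has been reduced to a list of comparable items such as (tool name, arguments) tuples; the function name is illustrative.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return the index of the first turn where two runs differ, or None if identical."""
    # zip_longest pads the shorter run with None, so a run that simply
    # continues past the end of the other diverges at that point.
    for turn, (a, b) in enumerate(zip_longest(run_a, run_b)):
        if a != b:
            return turn
    return None
```

Compare one looping run against one healthy run, then read both transcripts at the returned turn and the turn before it; the tool result or instruction that differs there is the one to examine.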
Does a bigger model make these go away?
A stronger model like Opus 4.8 tolerates ambiguity better, but it won't invent a stop condition you forgot to write. Fixing the Skill makes every model that runs it more reliable and cheaper to run.
Bringing agentic AI to your phone lines
CallSphere takes the same debugging discipline — clear stop conditions, tight tool contracts, replayable traces — and applies it to voice and chat agents that handle every call and message, use tools mid-conversation, and book work around the clock. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.