Securing Claude Skills: sandboxing, secrets, injection
Harden Claude Agent Skills and MCP tools: sandboxing, least privilege, secrets in the tool layer, and layered prompt-injection defense for 2026.
The capabilities that make Claude Skills powerful are the same ones that make them a security problem. A Skill can run scripts, an MCP server can read your database and call external APIs, and the agent decides at runtime which of those powers to invoke based on natural-language instructions it pulled from somewhere. Give an autonomous system the ability to execute code and touch real data, and you have built an attack surface that did not exist before. Hardening that surface is not optional once you move past a demo — it is the difference between a useful agent and an incident. This post lays out the concrete defenses that matter.
One definition to anchor the discussion. Prompt injection is an attack in which adversarial instructions, hidden inside content the agent reads — a web page, an email, a document, a tool result — hijack the agent into doing something its operator never intended. It is the SQL injection of the agent era: untrusted data crosses into the instruction channel, and the model, which cannot natively distinguish "data to process" from "commands to obey," follows the attacker's text. Every other defense in this post exists partly to contain the blast radius when injection succeeds.
Sandboxing: assume the code is hostile
If a Skill can execute scripts or an MCP tool can run shell commands, run that execution in a sandbox with no implicit access to the host. The sandbox should have no ambient credentials, a read-only or scratch filesystem, no network access except to an explicit allowlist, and a wall-clock and memory limit. Containerization or a microVM gives you this cheaply. The principle is simple: assume any code the agent runs could be attacker-controlled, because if injection succeeds, it is.
Network egress is the most overlooked control. An agent that can make arbitrary outbound requests can exfiltrate whatever it reads — your secrets, your customer data — to an attacker's endpoint, which is the payoff of many injection attacks. Default-deny egress and allowlist only the specific hosts a Skill legitimately needs. If a documentation Skill has no business calling the internet, it should be unable to.
Least privilege for tools and Skills
Every tool you expose to Claude is a capability you have granted. The discipline is to grant the minimum. A Skill that summarizes invoices needs read access to invoices and nothing else — not write, not delete, not access to the customer table. Scope each MCP server's credentials to exactly the operations its tools require, and split broad servers into narrow ones so a compromise of one cannot reach everything.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Separate read from write, and gate destructive actions. Reads can usually run autonomously; writes, deletes, payments, and external sends deserve a confirmation step or a human approval, especially when the inputs trace back to untrusted content. Design your Skills so the dangerous verbs require an explicit, logged decision rather than firing in the middle of an autonomous loop.
flowchart TD
A["Untrusted content enters context"] --> B{"Contains hidden instructions?"}
B -->|Maybe| C["Treat as DATA, never commands"]
C --> D{"Action requested?"}
D -->|Read only| E["Run in sandbox, allowlist egress"]
D -->|Write/delete/send| F["Require approval + log"]
E --> G["Scoped least-privilege creds"]
F --> G
G --> H["Audit trail of every tool call"]
Secrets: keep them out of the model's reach
The cardinal rule of secrets in agentic systems: the model should never see them. API keys, database passwords, and tokens belong in the tool layer — the MCP server holds the credential and uses it to make the call, returning only the result. Claude asks for "the customer record," the server authenticates with a key the model never receives. If a secret never enters the context window, no amount of prompt injection can leak it through the model.
Avoid putting credentials in Skill files, system prompts, or environment variables that the agent can echo. Inject them at the infrastructure layer, rotate them on a schedule, and scope each to one purpose so a leak is contained and revocable. Log secret use, not secret values. And review your tool results for accidental credential leakage — a misconfigured tool that returns a raw config object can spill a key into context, where the model might repeat it.
Defending against prompt injection
You cannot fully eliminate injection, so you architect to survive it. The core stance is to treat all tool results and external content as untrusted data, never as instructions — and to design so that even a fully hijacked agent cannot do irreversible harm because its privileges and egress are already constrained. Sandboxing, least privilege, and human-gated writes are your injection defenses precisely because they cap the damage when the model is fooled.
Add detection at the boundaries. Claude itself can act as a classifier on incoming content, flagging text that looks like embedded instructions before it reaches the acting agent. Strip or quarantine suspicious markup. For high-risk flows, run a moderation pass on both inputs and the agent's proposed actions, and block or escalate anything that tries to reach outside the task's intended scope. None of these is perfect alone; layered, they raise the cost of a successful attack sharply.
Be especially careful with content that mixes data and instructions by nature — emails, support tickets, scraped pages. A Skill that processes inbound email should assume every message may contain an attack and should never let message content trigger a privileged action without a gate. The pattern to internalize: data from the outside world informs decisions; it must never directly command tools.
Observability and the audit trail
You cannot secure what you cannot see. Log every tool call with its arguments, the result, the active Skill, and a request id, and keep those logs immutable. When something goes wrong — an unexpected write, a strange egress attempt — the audit trail is how you reconstruct what the agent did and why. It is also how you build detections: once you know what normal tool usage looks like, anomalies stand out.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Rehearse the failure. Run red-team prompts that attempt injection through every content channel your Skills touch, confirm the sandbox holds, confirm secrets never appear in context, and confirm destructive actions demand approval. Security for agents is not a one-time setup; it is a posture you test continuously as you add Skills and tools, because every new capability is a new door.
Frequently asked questions
Can prompt injection be fully prevented?
No. Because models cannot reliably separate instructions from data, you defend in depth instead: treat external content as untrusted, sandbox execution, enforce least privilege, gate destructive actions, and constrain egress so a successful injection cannot cause irreversible harm.
Where should API keys and secrets live?
In the tool or MCP-server layer, never in the model's context. The server holds the credential and returns only results, so the secret never enters the context window and cannot be leaked through the model.
Do I need a sandbox if my Skills only read data?
Yes if any Skill executes code or shell commands, and read access still deserves scoping and egress controls — a read-only agent that can call arbitrary external hosts can still exfiltrate whatever it reads. Default-deny egress regardless.
How do I gate destructive actions without killing autonomy?
Let reads run autonomously and require an explicit, logged approval only for writes, deletes, payments, and external sends — especially when inputs trace to untrusted content. Most of the agent's work stays automatic while the dangerous verbs stay controlled.
Bringing agentic AI to your phone lines
CallSphere builds these same guardrails — sandboxed tools, least-privilege access, and injection-aware design — into voice and chat agents that handle every call and message and book work 24/7 without exposing your systems. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.