
Securing Claude Agents: Sandboxing & Least Privilege (Building Agents With Skills)

Harden Claude agents with sandboxing, least-privilege tool scopes, secret isolation, and layered defenses against prompt injection.

An agent is software that takes untrusted input, decides on its own what to do, and then acts in the real world — running code, calling APIs, touching files. That combination is exactly what a security engineer loses sleep over. The moment a skill can execute a shell command or hit an internal endpoint, the agent stops being a chatbot and becomes a privileged actor whose instructions can come, in part, from whatever document or web page it just read. Building agents with Claude Agent Skills safely means treating the model as a powerful but fundamentally untrusted component and designing the blast radius around it accordingly.

The core principle is unchanged from decades of systems security: assume the controlled thing can be subverted, and make subversion cheap to contain. You do not need the model to be perfectly resistant to manipulation; you need the system around it to be built so that a manipulated model cannot do much damage. That reframing — from "make the AI safe" to "make the AI's environment unforgiving" — is the foundation of everything below.

Least privilege for tools and skills

Every tool you expose to an agent is a capability an attacker can try to invoke. The discipline is least privilege: grant each agent only the tools it genuinely needs, scoped as narrowly as possible. An agent that summarizes support tickets does not need a tool that deletes records. If it needs to read orders, give it a read-only order-lookup tool, not a general database client. The narrower each tool's contract, the smaller the set of things a hijacked agent can accomplish.

Scope the credentials behind those tools just as tightly. If a skill calls an internal API, the token it uses should carry only the permissions that call requires (read-only wherever the tool only reads) and be limited to the specific resources in play, never a broad admin key that happens to be lying around. When a tool must perform a sensitive write, such as issuing a refund, sending an external email, or modifying production data, route it through an explicit confirmation or human approval step rather than letting the model trigger it autonomously. The model proposes; a guarded gate disposes.

flowchart TD
  A["Untrusted input enters agent"] --> B["Model plans an action"]
  B --> C{"Action sensitivity?"}
  C -->|Read-only| D["Run in sandbox, scoped token"]
  C -->|Write/external| E{"Approved by gate?"}
  E -->|No| F["Block & log"]
  E -->|Yes| D
  D --> G["Result returned to model"]
  G --> B
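The decision in the flowchart can be sketched as a small default-deny gate. This is a minimal illustration under stated assumptions: the tool names in `READ_ONLY_TOOLS` and `SENSITIVE_TOOLS`, the `gate_action` function, and the `approver` callback (standing in for a human approval step) are all hypothetical.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("agent.gate")

# Hypothetical classification of tools by sensitivity.
READ_ONLY_TOOLS = {"lookup_order"}
SENSITIVE_TOOLS = {"issue_refund", "send_external_email"}


def gate_action(
    tool_name: str,
    args: dict,
    approver: Optional[Callable[[str, dict], bool]] = None,
) -> bool:
    """Return True if the proposed action may run, False if it is blocked.

    Read-only tools pass straight through. Sensitive tools run only when an
    approver callback (for example, a human review step) says yes. Anything
    unclassified is blocked, so the gate is default-deny.
    """
    if tool_name in READ_ONLY_TOOLS:
        return True
    if tool_name in SENSITIVE_TOOLS and approver is not None:
        approved = bool(approver(tool_name, args))
        logger.info("gate decision for %s: approved=%s", tool_name, approved)
        return approved
    logger.warning("blocked %s with args %r", tool_name, args)
    return False
```

The design choice worth noting is the last branch: a tool that nobody classified is treated as blocked, so adding a new tool never widens the agent's reach by accident.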

Sandbox the code an agent runs

When a skill includes scripts the agent executes — and many of the most useful ones do — that execution must happen in a sandbox, not on your host with full network and filesystem access. A proper sandbox gives the agent an isolated, ephemeral environment: a constrained filesystem, no ambient credentials, and egress limited to an explicit allowlist of destinations. If the model is talked into running something destructive, it destroys a throwaway container, not your infrastructure.
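As a rough illustration of "no ambient credentials" and a throwaway working directory, the sketch below runs a script in a child process with a scrubbed environment. It is not a sandbox on its own: a child process still shares the host kernel, filesystem, and network, so real isolation would need a container, VM, or similar boundary around it. The function name `run_untrusted_script` is hypothetical, and the hard-coded `PATH` assumes a POSIX host.

```python
import subprocess
import sys
import tempfile


def run_untrusted_script(source: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run Python source in a child process with a scratch dir and minimal env.

    Only illustrates two properties: the parent's environment variables
    (and any secrets in them) are not inherited, and the working directory
    is deleted afterwards. It does NOT restrict filesystem or network access.
    """
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode, ignores PYTHON* vars
            cwd=scratch,                           # ephemeral working directory
            env={"PATH": "/usr/bin:/bin"},         # nothing inherited from the parent
            capture_output=True,
            text=True,
            timeout=timeout_s,                     # raises TimeoutExpired on runaway scripts
        )
```

A script that tries to read a credential from its environment finds nothing there, because the parent's variables were never passed down.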

Egress control matters as much as filesystem isolation. A common exfiltration path is an agent persuaded to read a secret and then POST it to an attacker-controlled URL. If the sandbox can only reach the handful of hosts the task legitimately requires, that path closes. Default-deny on outbound network, allow the specific endpoints the skill needs, and log everything that tries to leave.
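The default-deny policy itself is simple to express. The sketch below shows only the policy check; in practice enforcement belongs at the network layer (a proxy or firewall around the sandbox), since code inside the sandbox cannot be trusted to call this function. The host names and the `is_egress_allowed` helper are assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the only hosts this skill legitimately needs.
ALLOWED_HOSTS = frozenset({"api.example.internal", "orders.example.com"})


def is_egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Default-deny check: allow only HTTPS requests to exactly-matching hosts."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URLs are denied
    if parts.scheme != "https":
        return False
    host = parts.hostname  # lowercased; excludes userinfo and port
    return host is not None and host in allowed_hosts
```

Matching on the parsed hostname rather than on a substring of the URL is deliberate: lookalikes such as `orders.example.com.evil.net` or `orders.example.com@evil.net` resolve to a different host and are denied.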

Keep secrets out of the model's context

The cleanest rule for secrets is that the model should never see them. API keys, database passwords, and signing tokens belong in the execution environment, injected into tools at call time, not pasted into prompts or skill files where they become part of the context the model reads and might echo back. If a credential is in the context window, assume it can leak — into a log, into an error message, into a response crafted by a clever injection. Tools should reference secrets by name and have the runtime resolve them outside the model's view.
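Resolving secrets by name might look something like the following. This is a sketch under assumptions: `TOOL_CONFIG`, `prepare_call`, and the `ORDERS_API_TOKEN` variable name are hypothetical, and the secret name is declared in operator-written configuration rather than chosen by the model.

```python
import os
from typing import Mapping, Tuple

# Hypothetical operator-written config: each tool names the secret it needs.
TOOL_CONFIG = {"orders_api": {"secret_ref": "ORDERS_API_TOKEN"}}


def prepare_call(
    tool_name: str,
    model_args: dict,
    env: Mapping[str, str] = os.environ,
) -> Tuple[dict, dict]:
    """Build the real request and the model-visible record of it.

    The token is looked up by name at call time and placed only in the
    outbound request. The record that goes back into the model's context
    carries the tool name and parameters, never the credential.
    """
    secret_name = TOOL_CONFIG[tool_name]["secret_ref"]
    token = env[secret_name]  # KeyError if the secret was not provisioned
    request = {
        "headers": {"Authorization": f"Bearer {token}"},
        "params": dict(model_args),
    }
    visible_to_model = {"tool": tool_name, "params": dict(model_args)}
    return request, visible_to_model
```

The model supplies only ordinary parameters such as an order ID; it never sees, names, or selects the credential.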

This also means scrubbing tool results before they return to the model. If an upstream API includes a token or internal hostname in its response, strip it at the tool boundary. The model should receive the data it needs to reason and nothing that would be dangerous to surface in its next message.
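A scrubbing step at the tool boundary could be as small as a list of redaction patterns. The patterns below are illustrative assumptions, not a complete detector: they catch bearer tokens, simple `key=value` style credentials, and host names under a hypothetical `.internal` domain, and would need tuning for real payloads.

```python
import re

# Illustrative patterns only; a real deployment would tune these to its own data.
_REDACTIONS = [
    # "Bearer <token>" in headers or echoed error messages.
    (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]{8,}"), "Bearer [REDACTED]"),
    # key=value or key: value pairs for common credential names.
    (re.compile(r"(?i)\b(api[_-]?key|token|password)\b(\s*[:=]\s*)\S+"), r"\1\2[REDACTED]"),
    # Host names under a hypothetical ".internal" domain.
    (re.compile(r"(?i)\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.internal\b"), "[INTERNAL-HOST]"),
]


def scrub(text: str) -> str:
    """Apply each redaction in order before the text returns to the model."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

For example, `scrub("token=abc123 host db1.corp.internal")` yields `"token=[REDACTED] host [INTERNAL-HOST]"`.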

Vet the skills and tools you install

Skills and MCP servers are code and instructions you are inviting into a privileged loop, and they deserve the same scrutiny you would give any dependency. A skill from an untrusted source can contain instructions that quietly steer the agent toward exfiltration, or scripts that reach for credentials the moment they run. Before adopting a third-party skill, read it the way you would review a pull request: what tools does it call, what does it ask the model to do, what would happen if its instructions were adversarial? Pin versions so a skill cannot silently change under you, and prefer skills whose source you can actually inspect over opaque bundles.
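Pinning can be made concrete by recording a digest of the skill bundle you reviewed and refusing to load anything that does not match. The sketch below assumes the skill is distributed as a single file and that you store its SHA-256 digest yourself; `sha256_of` and `verify_pinned` are hypothetical helper names.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large bundles do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, pinned_hex: str) -> bool:
    """Return True only if the file's digest matches the reviewed, pinned digest."""
    return hmac.compare_digest(sha256_of(path), pinned_hex.lower())
```

A loader would call `verify_pinned` before installing or updating a skill and treat a mismatch as a reason to stop and re-review, not as a warning to log and ignore.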

The same applies to MCP servers, which broker the agent's access to external systems. Treat each one as a trust boundary. Run it with the narrowest permissions it needs, isolate it from systems it has no business reaching, and log what it does. The convenience of dropping in a community skill or connector is real, but so is the fact that it now sits inside the privileged path between your model and your infrastructure. Supply-chain discipline is security discipline.

Be especially wary of skills that bundle their own scripts and reach for the network on load. A benign-looking formatting skill that quietly phones home is a classic trojan, and because the agent follows a skill's instructions as if they were yours, you cannot count on the model to flag the behavior as suspicious. Run anything new in your sandbox first, watch what it tries to read and where it tries to connect, and only then promote it to a context where it touches real data or real credentials.

Defending against prompt injection

Prompt injection is the signature attack against agents: adversarial instructions hidden inside content the agent processes (a web page, an email, a PDF, a code comment) that cause it to take actions its operator never intended. Because the model cannot perfectly distinguish your instructions from instructions hidden in the data it reads, you cannot fully solve injection at the prompt layer; you contain it at the architecture layer.

The containment strategy follows from everything above. Treat all tool-returned and externally-sourced content as untrusted data, never as commands. Keep the agent's privileges minimal so a hijacked agent can do little. Gate every sensitive action behind confirmation so injected instructions cannot silently trigger a refund or a data export. And monitor: log the full reasoning and tool-call trace so that when something anomalous happens you can see it, alert on it, and replay it. Defense in depth — least privilege, sandboxing, secret isolation, and human gates — is what makes injection an annoyance rather than a breach.
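The monitoring piece can start as an append-only trace of tool calls that you can alert on and replay later. The sketch below writes one JSON object per line; the record fields and the `log_tool_call` and `replay` helpers are assumptions for illustration, and arguments should be scrubbed of secrets before they are logged.

```python
import json
import time
from typing import IO, Iterable, List


def log_tool_call(sink: IO[str], step: int, tool: str, args: dict, outcome: str) -> dict:
    """Append one tool-call record to the trace as a single JSON line."""
    record = {
        "ts": time.time(),   # wall-clock timestamp of the attempt
        "step": step,        # position in the agent's loop
        "tool": tool,
        "args": args,        # assumed already scrubbed of secrets
        "outcome": outcome,  # e.g. "allowed", "blocked", "error"
    }
    sink.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def replay(lines: Iterable[str]) -> List[dict]:
    """Parse a trace back into records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]
```

Logging blocked attempts alongside allowed ones matters here: a burst of blocked sensitive calls is often the first visible sign that an injection is being tried.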

Frequently asked questions

Can prompt injection be fully prevented?

Not at the prompt layer alone. Because the model processes instructions and data in the same channel, a sufficiently crafted injection can sometimes influence behavior. The durable defense is architectural: minimal privileges, sandboxed execution, secrets kept out of context, and human approval on sensitive actions, so a successful injection still cannot accomplish much.

What is the minimum sandboxing an agent needs?

If a skill executes code, it should run in an isolated, ephemeral environment with a constrained filesystem, no ambient credentials, and default-deny outbound networking limited to an explicit allowlist. That combination prevents both host damage and data exfiltration even when the model is manipulated into running something it should not.

Where should API keys and secrets live?

In the execution environment, never in the model's context. Tools reference secrets by name and the runtime resolves them at call time, outside the model's view. Anything that lands in the context window — including verbose tool results — should be scrubbed of credentials and internal identifiers first.

How do I let an agent perform risky actions safely?

Gate them. Let the model propose a sensitive write — a refund, an external email, a production change — but require an explicit confirmation or human approval step before it executes. The model decides what to attempt; a separate, trusted gate decides whether it actually happens, and every attempt is logged.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
