---
title: "Securing Claude agents: sandboxing, secrets, injection"
description: "Harden Claude agents on Skills and MCP: sandboxing, least privilege, secret handling, and layered prompt-injection defense that contains the blast radius."
canonical: https://callsphere.ai/blog/securing-claude-agents-sandboxing-secrets-injection
category: "Agentic AI"
tags: ["agentic ai", "claude", "security", "prompt injection", "mcp", "sandboxing", "least privilege"]
author: "CallSphere Team"
published: 2026-02-18T11:46:22.000Z
updated: 2026-06-06T21:47:44.786Z
---

# Securing Claude agents: sandboxing, secrets, injection

> Harden Claude agents on Skills and MCP: sandboxing, least privilege, secret handling, and layered prompt-injection defense that contains the blast radius.

The moment you give an agent tools, you give it the ability to act — and an attacker who can influence what the agent reads can influence what the agent does. A Claude agent that browses a web page, opens a support ticket, or reads an email is consuming untrusted text, and that text can contain instructions. When you extend Claude with Skills and MCP servers that can write files, hit APIs, or move money, the security model shifts from "protect the prompt" to "contain the actions." This is the discipline of hardening agentic systems, and it deserves the same rigor you'd apply to any system with real-world side effects.

This post covers the four pillars that matter most: sandboxing what the agent can touch, granting least privilege, handling secrets so they never reach the model, and defending against prompt injection — the signature attack of the agentic era.

## The agentic threat model in one paragraph

Classic application security assumes code does what its author wrote. Agentic security cannot assume that, because the agent's behavior is shaped at runtime by whatever data it ingests. The core threat is that untrusted content — a web page, a document, a tool result, an inbound message — carries instructions that the model follows as if they came from you. The blast radius is whatever the agent's tools can do. So the defining question of agent security is not "can the model be tricked?" — assume it can — but "what is the worst thing it can do once tricked?"

A useful definition: prompt injection is an attack in which an adversary embeds instructions inside data the model processes, causing the agent to take actions the operator never intended. Because the data and the instructions share the same channel — text — you cannot fully filter injection out. You design so that a successful injection still can't do much damage.

## Sandboxing: contain what the agent can reach

Sandboxing is the first and strongest control because it limits blast radius regardless of what the model decides. Run tool execution — especially anything that executes code or touches the filesystem — inside an isolated environment: a container or VM with no access to the host, a scoped working directory, and tight network egress rules. If a code-execution tool can only see a temp directory and can only reach an allowlist of hosts, an injected "exfiltrate the database" instruction has nowhere to send the data.

```mermaid
flowchart TD
  A["Untrusted content enters run"] --> B["Claude proposes a tool call"]
  B --> C{"Action mutates or sends data?"}
  C -->|Read-only| D["Run in sandbox, allowlisted hosts"]
  C -->|Write/spend/send| E{"Within granted scope & policy?"}
  E -->|No| F["Block & require human approval"]
  E -->|Yes| G["Execute with scoped short-lived token"]
  D --> H["Return sanitized result to model"]
  G --> H
```

Network egress control deserves special attention. Many real-world exfiltration paths in agentic systems run through an outbound request the agent was allowed to make. Default-deny egress and allowlist only the hosts each MCP server legitimately needs. The same goes for the filesystem: give each run a fresh, scoped directory rather than access to a shared workspace where secrets or other tenants' data might live.

## Least privilege for tools and MCP servers

Every tool you expose is a capability you must defend. The fix is least privilege at two layers. First, scope the toolset to the task: a customer-support agent does not need a tool that deletes production records, so don't load it. Skills help here because they bundle exactly the tools a job needs, and nothing else sits in context tempting the model.

Second, scope the credentials behind each tool. If an MCP server reads orders, its API key should have read-only access to orders and nothing else. Resist the convenience of one all-powerful service account shared across every tool. When a tool can be tricked, you want its underlying permissions to be the narrowest possible, so the worst case is bounded by design rather than by hoping the model behaves.

For any irreversible or high-impact action — issuing a refund, sending an external email, deleting data, spending money — require an explicit confirmation step. A human-in-the-loop gate on the dangerous subset of actions costs little and turns a silent compromise into a blocked request that someone reviews.

## Secrets the model should never see

A hard rule: API keys, tokens, and passwords should never enter the prompt or the model's context. The agent decides *that* a tool should be called; the execution layer holds the credentials and injects them when it makes the actual API request. Keep secrets in a vault or the environment of the tool runner, not in skill files, not in system prompts, and not in tool arguments.

Prefer short-lived, scoped tokens minted per run over long-lived static keys. If a token leaks through a log or a verbose tool response, a fifteen-minute lifetime and a narrow scope sharply limit the damage. And scrub tool outputs before they return to the model: if an upstream API echoes a credential or a PII field in its response, strip it at the server boundary so it never lands in a transcript you might later store or replay.

## Defending against prompt injection in depth

You cannot perfectly detect injection, so you defend in layers. Separate trusted instructions from untrusted data clearly in the prompt structure, and label external content as data the model should treat skeptically, not as commands. Be especially careful with tool results: a web page or document fetched mid-run is untrusted, and the agent should not obey instructions it finds there.

Then assume one layer fails and lean on the others. Per-action policy checks at the execution boundary catch a tricked agent before it acts: if a refund exceeds a threshold or a destination address isn't on the allowlist, block it regardless of how confidently the model asked. Log every tool call with its arguments so post-incident review is possible. And monitor for anomalies — a support agent suddenly attempting bulk deletes is a signal worth alerting on. Defense in depth means no single trick, and no single bad model decision, is enough to cause real harm.

## Hardening as an ongoing practice

Security for agents is not a one-time review. Each new MCP server, each new skill, each new tool widens the attack surface, so make a lightweight threat review part of adding any capability: what can it touch, what credentials does it hold, what's the worst case if it's tricked? Maintain a suite of adversarial test cases — documents and inputs that try to hijack the agent — and run them against the system the way you'd run security regression tests. The systems that stay safe are the ones where containment, not detection, does the heavy lifting.

## Frequently asked questions

### Can I just filter out prompt injection with a classifier?

A classifier raises the bar but can't be your only defense, because injection shares the same text channel as legitimate data and attackers adapt. Treat detection as one layer and rely primarily on containment: sandboxing, least privilege, and per-action policy checks that limit what a tricked agent can actually do.

### Where should API keys for MCP tools live?

In the execution layer or a secrets vault, never in the prompt, skill files, or tool arguments. The model should decide which tool to call; the runner attaches credentials when making the real request. Prefer short-lived, narrowly scoped tokens over long-lived static keys.

### Which actions deserve a human-in-the-loop gate?

Anything irreversible or high-impact: spending money, deleting data, sending external communications, or changing access. Read-only and low-risk actions can run autonomously, while the dangerous subset gets an explicit confirmation, keeping friction low and blast radius bounded.

### Does sandboxing hurt agent capability?

Done well, barely. Scope each run to the directories, hosts, and permissions it genuinely needs rather than locking it out of everything. The agent keeps the access required to do its job while losing the access an attacker would exploit.

## Safe agents on every call and message

When an agent handles live customer conversations, containment and least privilege aren't optional. CallSphere builds these hardening patterns into its **voice and chat** agents, so assistants can use tools mid-conversation without ever overstepping their scope. See it live at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/securing-claude-agents-sandboxing-secrets-injection
