Skip to content
Agentic AI
Agentic AI9 min read0 views

Securing Claude Agents: Sandboxing and Injection Defense

Harden Claude Managed Agents: sandbox tools, least-privilege scopes, keep secrets off the model, and contain prompt injection so hijacks stay harmless.

An agent that can call tools is an agent that can take actions in the real world — send emails, move money, delete records, run shell commands. That is exactly what makes Claude Managed Agents useful, and exactly what makes them a security surface unlike anything a chatbot ever was. The threat model is not "the model says something rude." It is "an attacker hides an instruction in a web page the agent reads, and the agent dutifully exfiltrates your data using a tool you gave it."

Securing an agent is fundamentally about limiting blast radius. You assume the model can be manipulated — because, given untrusted input, it can — and you build the system so that even a fully hijacked agent cannot do catastrophic damage. This post walks through the four pillars that get you there: sandboxing, least privilege, secrets handling, and prompt-injection defense.

Key takeaways

  • Prompt injection is the defining agent threat: untrusted content the agent reads can carry instructions it may follow.
  • Defense is containment, not just prevention — design so a hijacked agent still cannot reach sensitive actions.
  • Run tool execution in a sandbox with no ambient network or filesystem access beyond what the task needs.
  • Give each agent the minimum tools and scopes for its job; never hand it broad credentials "to be safe."
  • Keep secrets out of the context window entirely; inject them at the tool boundary, never in the prompt.

The agent threat model

A useful definition: prompt injection is an attack where adversarial instructions are embedded in content the agent processes — a document, a web page, an email, a tool result — causing the agent to act on those instructions as if they came from its operator. The agent cannot reliably tell the difference between "data to analyze" and "commands to follow" because to a language model both are just text.

This is not a hypothetical edge case; it is the central design constraint. Any agent that reads content from outside your trust boundary — the web, user uploads, third-party APIs, inbound email — must be treated as potentially carrying hostile instructions. The corollary is that you cannot rely on the model refusing. You assume the instruction lands and the agent tries to act on it, then make sure the system stops the action from causing harm.

Pillar one: sandbox tool execution

Tools that run code, touch the filesystem, or make network calls must execute in an isolated environment. The sandbox is what stands between a hijacked agent and your infrastructure. A well-built sandbox gives the agent exactly the resources its task requires and nothing else: a scratch directory it can write to, the specific network endpoints it legitimately needs, and no path to your internal services, credentials, or other tenants' data.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Agent decides to call a tool"] --> B{"Tool runs code or hits network?"}
  B -->|No| C["Execute read-only tool"]
  B -->|Yes| D["Dispatch into sandbox"]
  D --> E["Apply egress allowlist + scoped filesystem"]
  E --> F{"Action within policy?"}
  F -->|No| G["Block + log + warn operator"]
  F -->|Yes| H["Execute, return structured result"]
  H --> I["Strip secrets from result before context"]

The key property is egress control. Many real exfiltration attacks work by getting the agent to make an outbound request to an attacker-controlled URL with your data in the query string. An egress allowlist — the sandbox can only reach approved hosts — defeats this even if the injection succeeds, because the data has nowhere hostile to go. Treat "the agent can reach arbitrary URLs" as a vulnerability, not a feature.

Pillar two: least privilege for tools and scopes

The most common over-provisioning mistake is handing an agent a powerful, broadly-scoped credential because it is convenient. An agent that needs to read calendar availability should not hold a token that can also delete events and read every mailbox in the organization. Scope the credentials to the exact operations the agent's job requires, and prefer read-only access wherever the task does not require writes.

Apply the same discipline to the tool catalog itself. Every tool you expose is a capability an injected instruction could try to invoke. If the agent never legitimately needs to delete records, do not give it a delete tool — not because you expect it to misbehave, but because a tool that does not exist cannot be abused. Separate high-risk actions behind an approval gate: the agent can propose a refund or a deletion, but a human or a policy engine must confirm it before it executes. This single pattern converts the worst-case outcome from "catastrophe" to "a suspicious proposal that got rejected."

Pillar three: keep secrets out of the context

A secret that enters the context window is a secret that can be leaked, because anything in context can end up in an output, a log, or a tool argument the agent constructs. The rule is absolute: API keys, database passwords, and tokens never appear in the prompt. Instead, the tool layer holds the credentials and injects them at execution time. The agent calls send_invoice(customer_id); it never sees, and never needs, the payment provider's secret key — the tool implementation attaches it server-side.

This boundary also protects you from exfiltration. If the agent is hijacked and instructed to "print all credentials," there are no credentials in its context to print. Equally important, sanitize tool results before they re-enter context: if a backend response happens to include a secret field, strip it at the tool boundary so it never reaches the model. Defense in depth means assuming each layer can fail and ensuring the secret simply is not present where it could leak.

Pillar four: defend against prompt injection

Since you cannot perfectly prevent injection, you layer mitigations. First, structurally separate untrusted content: wrap external documents in clear delimiters and instruct the agent to treat their contents strictly as data to analyze, never as instructions to follow. This does not guarantee compliance, but it meaningfully raises the bar and pairs well with the containment controls above.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Second, constrain what the agent can do after reading untrusted content. A strong pattern is to drop the agent's privileges once it has consumed external data — an agent that has just read the open web should not, in the same turn, be allowed to call a money-moving tool. Third, monitor for the signatures of a successful injection: sudden attempts to reach unusual hosts, tool calls that do not fit the task, or arguments that reference data the agent should not be acting on. The combination of separation, post-read privilege reduction, and monitoring is far more robust than any single guardrail.

ControlStopsEven if injection succeeds?
Egress allowlistData exfiltration to attacker URLsYes — data has nowhere to go
Least-privilege scopesDestructive or out-of-scope actionsYes — capability absent
Secrets at tool boundaryCredential leakageYes — secret not in context
Human approval gateHigh-risk actions executingYes — action requires confirmation
Content delimitingSome injection attemptsPartial — raises the bar

Common pitfalls

  • Trusting the model to refuse injected instructions. Treat refusal as a bonus, never a control. Build containment that holds when refusal fails.
  • Giving agents broad credentials for convenience. Scope every token to the minimum operations; prefer read-only and gate writes.
  • Putting secrets in the system prompt. They will eventually surface in a log or output. Inject at the tool boundary instead.
  • Allowing arbitrary network egress from tools. Without an allowlist, exfiltration is one successful injection away.
  • No human gate on irreversible actions. Refunds, deletions, and external sends should require confirmation, not just the agent's judgment.

Harden an agent in 6 steps

  1. Inventory every tool and the exact capability and scope it grants; remove anything not strictly required.
  2. Move all secrets out of the prompt and into the tool layer, injected server-side at execution.
  3. Run code-executing and network-touching tools in a sandbox with a strict egress allowlist.
  4. Put irreversible or high-value actions behind a human or policy approval gate.
  5. Delimit untrusted content as data and reduce privileges after the agent reads external input.
  6. Monitor tool calls and egress for injection signatures, and alert on out-of-policy attempts.

Frequently asked questions

Can prompt injection be fully prevented?

No technique reliably prevents it today, because the model processes instructions and data as the same medium. The realistic goal is containment: assume an injection may land and architect the system — least privilege, sandboxing, approval gates — so the worst a hijacked agent can do is harmless. Prevention measures reduce frequency; containment limits damage.

Where should secrets live if not in the prompt?

In your tool implementation or a secrets manager the tool layer reads at execution time. The agent references a resource by ID and the tool attaches the credential server-side. The model never sees the secret, so it cannot leak it through outputs, logs, or constructed tool arguments.

Do read-only agents need this much hardening?

They need less, but not none. A read-only agent can still be induced to exfiltrate the data it can read by making an outbound request, so egress control and content delimiting still matter. The biggest savings is that you can skip approval gates for actions that do not exist — which is itself the least-privilege principle working for you.

How do I test injection defenses?

Red-team the agent with adversarial inputs — documents and web pages that embed instructions to exfiltrate data, call forbidden tools, or reach external URLs — and verify the containment controls block the action even when the model is fooled. Treat any successful exfiltration in testing as a release blocker, not a tuning note.

Bringing agentic AI to your phone lines

CallSphere runs voice and chat agents under the same hardening — least privilege, sandboxed tools, and secrets kept off the model — so agents can act mid-conversation and book work 24/7 without expanding your attack surface. See it live at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.