Security Hardening for Claude Multi-Agent Systems
Sandbox execution, grant least privilege, keep secrets out of context, and defend against prompt injection in Claude multi-agent systems.
The moment you give an agent tools, you have given it the ability to act, and the moment you connect several agents together, you have built a system where a single poisoned input can ripple outward through tool calls you never explicitly authorized. A multi-agent Claude system that reads from the web, writes to a database, and runs code is not a chatbot — it is an automated operator with credentials, and it deserves the same security scrutiny you would apply to any service that holds those powers. Hardening these systems is not optional polish; it is what separates a demo from something you can point at production data.
This post lays out the layers that matter: sandboxing what agents can execute, granting least privilege per agent, keeping secrets out of model context, and defending against prompt injection, which is the threat unique to systems driven by language models.
The expanded attack surface of multi-agent systems
A single agent has one context window and one set of tools. A multi-agent system multiplies the surface in two ways. First, every subagent is another entity with its own tool access, so a misconfigured subagent is a new hole. Second — and this is the subtle one — subagents pass data to each other, which means a malicious instruction smuggled into one agent's input can travel to another agent as trusted content. The orchestrator believes it is reading a faithful summary; it is actually reading an attacker's payload that a subagent ingested from a web page.
A working definition to anchor on: prompt injection is any attack where untrusted content consumed by the model is interpreted as instructions rather than as data. In multi-agent systems this is amplified, because untrusted content fetched by one agent can become the operating instructions of the next. Treat every piece of data that originated outside your system — web pages, documents, emails, API responses, even tool outputs — as untrusted until proven otherwise.
Sandboxing: contain what agents can do
If an agent can run code or shell commands, that execution must happen inside a sandbox with no path back to anything that matters. Run it in an isolated container with no host filesystem access, no ambient cloud credentials, and egress restricted to an explicit allowlist of destinations. The default posture is deny: an agent's code execution environment should be able to reach exactly the resources its task requires and nothing else. If a research agent only needs to read three internal APIs, those three are the entire network it can see.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Untrusted input arrives"] --> B{"From trusted source?"}
B -->|No| C["Tag as untrusted data"]
C --> D["Run in sandbox: no host, no creds"]
D --> E{"Action requests a tool?"}
E -->|High-risk| F["Require human approval"]
E -->|Low-risk & in scope| G["Allow via scoped token"]
E -->|Out of scope| H["Deny & log"]
B -->|Yes| G
Sandboxing also limits blast radius when an injection succeeds despite your defenses. If a poisoned document convinces an agent to run a destructive command, a properly sandboxed environment means the worst outcome is a wrecked ephemeral container, not a deleted production volume. Assume some attacks will get through and design so that getting through still does not cost you anything irreversible.
Least privilege per agent
Each agent should hold the narrowest set of tools and permissions its role requires. The research subagent gets read-only tools; the writer subagent gets the one write path it needs; no agent gets a tool it does not use. This is not just tidiness — it directly bounds what an injection can accomplish. An attacker who hijacks a read-only research agent can make it read things, which is bad but recoverable; they cannot make it delete records, because the deletion tool was never in scope.
Scope credentials the same way. A subagent that needs to query one table should hold a credential that can query only that table, not a broad service account that can touch everything. When you map tools to agents, ask for each pairing what the worst case is if this exact agent is fully compromised. If the answer is unacceptable, the agent has too much power and you split the role or narrow the credential. High-risk actions — irreversible writes, financial operations, anything touching customer data at scale — should route through an explicit human approval step rather than being granted to the model directly.
Secrets: keep them out of the model
The cardinal rule is that secrets never enter a model's context window. An agent should never see an API key, a database password, or a token in its prompt or its tool arguments. Instead, the execution layer holds the secret and injects it at call time: the agent calls a tool by name with non-secret arguments, and your tool runner attaches the credential on the way out, outside the model's view. The model knows a tool called send_invoice exists; it never knows the key that authenticates the call.
This matters because anything in context can surface in an output, a log, or a transcript handed to another agent. A leaked key that lived only in your execution layer is a non-event; a leaked key that the model echoed into a summary is an incident. Audit your tool definitions and result handling to confirm no credential ever round-trips through Claude, and scrub tool results for accidental secret material before they re-enter context.
Defending against prompt injection
Prompt injection has no single silver bullet, so you layer defenses. Separate trusted instructions from untrusted data structurally — make clear in the prompt which portion is your instruction and which is fetched content the model should treat as inert data, never as commands. Constrain the agent's authority so that even a fully successful injection cannot do real damage, which is exactly what least privilege and sandboxing buy you. And put a verification step between an agent that consumed untrusted content and any consequential action, so a hijacked agent's request to do something dangerous is caught before it executes.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
For multi-agent systems specifically, sanitize at the handoff. When a subagent that touched untrusted data returns a result to the orchestrator, treat that result as untrusted too, not as a trusted summary, because the subagent may be relaying an injected instruction. A monitoring agent — a separate cheap Claude check that reviews outputs for signs of manipulation or scope violation — adds a layer that is hard for an in-band attacker to subvert. Log every tool call with its arguments so that when something does slip through, you have the forensic trail to understand the breach and close the gap.
Frequently asked questions
What makes multi-agent security harder than single-agent?
Subagents pass data to each other, so a malicious instruction one agent ingests from untrusted content can reach another agent as if it were trusted. The attack surface grows with every agent and every handoff, and the orchestrator can end up acting on an attacker's payload it believes is a faithful summary.
How do I keep secrets out of Claude's context?
Hold credentials in your execution layer and inject them at call time. The agent invokes a tool by name with non-secret arguments, and your tool runner attaches the key outside the model's view. Nothing the model can see, log, or pass to another agent should ever contain a secret.
Can prompt injection be fully prevented?
No single defense eliminates it, so you layer them: structurally separate instructions from untrusted data, enforce least privilege so a successful injection cannot do damage, sandbox execution to bound blast radius, and verify consequential actions before they run. The goal is to make a successful injection harmless, not merely rare.
Should high-risk actions be fully automated?
Irreversible writes, financial operations, and bulk access to sensitive data should route through human approval rather than being granted directly to the model. Automate the reversible and the low-stakes; gate the rest behind a person who can refuse.
Bringing agentic AI to your phone lines
CallSphere runs hardened multi-agent voice and chat — sandboxed tools, least-privilege credentials, and injection defenses — so agents can act on real systems safely. See the live system at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.