By Sagar Shankaran, Founder of CallSphere
OWASP made unexpected code execution a top-tier risk for agentic AI. Here is how to pick between microVMs, gVisor, and hardened containers - and what we run.
Key takeaways
TL;DR — OWASP Agentic AI Top 10 lists "Unexpected Code Execution" (ASI05) as a top-tier risk: never execute agent-generated code without strict sandboxing, input validation, and allowlisting. MicroVMs (Firecracker) win on isolation, gVisor wins on speed, containers should be reserved for trusted code.
Three failure modes from real 2026 incidents:
/etc or /var and persists across sessions.
Plain Docker doesn't stop any of these reliably: shared kernel, default network access, weak resource caps. The 2025 incident where three coding agents leaked secrets through one shared injection happened because none had proper sandbox isolation between tenants.
flowchart LR
A[Agent] -->|generates code| B[Sandbox Manager]
B -->|spawn| C[Firecracker microVM]
C -->|exec| D[Workload]
D -->|stdout/files| E[Capture]
F[Egress Allowlist] --> C
G[CPU/Mem/Time Cap] --> C
H[FS Workspace] --> C
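The three controls feeding the microVM in the diagram (egress allowlist, CPU/memory/time caps, and a confined workspace) can be sketched as a single policy object that the sandbox manager consults before it spawns anything. This is a minimal illustration, not the API of Firecracker, E2B, or any other product: the class name, field names, and default values are all assumptions, and a real deployment would enforce these rules at the network and hypervisor layer rather than in Python.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical policy object; every name and default here is illustrative."""
    allowed_hosts: frozenset = frozenset()
    workspace: str = "/workspace"
    cpu_seconds: int = 10
    memory_mb: int = 512
    wall_clock_seconds: int = 30

    def egress_allowed(self, url: str) -> bool:
        # Deny by default: only exact hostnames on the allowlist pass.
        host = urlsplit(url).hostname
        return host is not None and host in self.allowed_hosts

    def path_allowed(self, path: str) -> bool:
        # Lexical check only: collapses ".." but does not follow symlinks,
        # so the guest filesystem must still be isolated by the sandbox itself.
        resolved = posixpath.normpath(posixpath.join(self.workspace, path))
        return resolved == self.workspace or resolved.startswith(self.workspace + "/")


policy = SandboxPolicy(allowed_hosts=frozenset({"pypi.org"}))
```

With this policy, `policy.egress_allowed("https://pypi.org/simple/")` passes while a lookalike such as `https://pypi.org.evil.example/` is rejected, and `policy.path_allowed("../etc/passwd")` fails because the normalized path lands outside the workspace.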
Red-team your sandbox with these probes: try fork() bomb, try writing to root FS, try outbound HTTP to a non-allowlisted URL, try reading /proc/self/environ for secrets, try mmap-bomb, try kernel exploit (CVE-2024-XXXX class). Each should fail and emit a clear log.
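One way to turn the probe list above into a repeatable check is a small harness that hands each payload to the sandbox and records anything that was not blocked. The sketch below assumes a hypothetical `execute` callable that wraps whatever sandbox is under test and returns a `ProbeResult`; that callable, the result shape, and the probe payloads are illustrative, not part of any vendor's test suite. The payloads are plain strings and are never run by the harness itself.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List

log = logging.getLogger("sandbox.redteam")


@dataclass(frozen=True)
class ProbeResult:
    blocked: bool  # True when the sandbox stopped the probe
    detail: str    # whatever the sandbox reported (error, exit code, ...)


@dataclass(frozen=True)
class Probe:
    name: str
    payload: str   # code handed to the sandbox, never executed locally


# Illustrative payloads only; the executor decides what counts as "blocked".
PROBES = [
    Probe("fork-bomb", "import os\nwhile True:\n    os.fork()"),
    Probe("write-root-fs", "open('/etc/probe', 'w').write('x')"),
    Probe("egress-non-allowlisted",
          "import urllib.request\n"
          "urllib.request.urlopen('http://example.com', timeout=5)"),
    Probe("read-environ", "print(open('/proc/self/environ').read())"),
]


def run_probes(execute: Callable[[str], ProbeResult],
               probes: Iterable[Probe]) -> List[str]:
    """Run every probe and return the names of those that escaped."""
    escaped = []
    for probe in probes:
        result = execute(probe.payload)
        if result.blocked:
            log.info("probe %s blocked: %s", probe.name, result.detail)
        else:
            log.error("probe %s ESCAPED: %s", probe.name, result.detail)
            escaped.append(probe.name)
    return escaped
```

A CI job could fail the build whenever `run_probes(...)` returns a non-empty list, which keeps the "each should fail and emit a clear log" rule enforceable rather than aspirational.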
Use Northflank's sandbox test suite or build your own. Track: cold start time, p99 cleanup time, isolation level (kernel vs user-space vs container), egress controls, max concurrent sandboxes.
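For the timing metrics in that list, tail percentiles matter more than averages, since one slow cleanup can hold a sandbox slot hostage. The following sketch computes a nearest-rank percentile over recorded samples; the function names and the millisecond unit are assumptions, and the sample values in the usage note are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(cold_start_ms, cleanup_ms):
    """Collapse raw timings into the figures worth tracking per sandbox backend."""
    return {
        "cold_start_p50_ms": percentile(cold_start_ms, 50),
        "cold_start_p99_ms": percentile(cold_start_ms, 99),
        "cleanup_p99_ms": percentile(cleanup_ms, 99),
    }
```

Calling `summarize([100, 120, 110], [30, 900, 40])` reports a p99 cleanup of 900 ms, which a mean would have hidden behind the two fast runs.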
CallSphere doesn't expose code execution to end-users — but our internal agent harness uses E2B microVMs for any agent that runs Python, and Cloudflare Workers isolates for JavaScript-shaped tools. We never run agent-generated code in our main app namespace.
Across our 37 agents, 90+ tools, 115+ DB tables, and 6 verticals, tools are pre-defined and code-reviewed. Where agents need to compute (e.g., a custom report), we route to a hardened sidecar with no DB access.
/workspace, nothing else.
Is gVisor safe enough for untrusted code? Mostly. It blocks most kernel attacks, but exploits against gVisor itself are rare, not nonexistent. MicroVMs are stronger.
MicroVM cold start is slow — workaround? Pre-warmed pools. E2B and Modal both keep warm instances.
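The pre-warmed pool idea can be sketched in a few lines. This is a generic illustration and not how E2B or Modal implement it: `WarmPool`, `boot`, and `top_up` are hypothetical names, and `boot` stands in for whatever call actually starts a microVM. Sandboxes are treated as single-use, so an acquired instance is never returned to the pool.

```python
import queue
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class WarmPool(Generic[T]):
    """Hypothetical pool that boots sandboxes ahead of demand."""

    def __init__(self, boot: Callable[[], T], size: int):
        self._boot = boot
        self._size = size
        self._ready: "queue.Queue[T]" = queue.Queue()
        self.top_up()

    def top_up(self) -> None:
        # Meant to be called from a background worker, off the request path.
        while self._ready.qsize() < self._size:
            self._ready.put(self._boot())

    def acquire(self) -> T:
        try:
            # Fast path: hand out an instance that has already booted.
            return self._ready.get_nowait()
        except queue.Empty:
            # Pool drained: fall back to a cold start rather than failing.
            return self._boot()
```

The design choice worth noting is that `acquire` never refills the pool itself; refilling inline would put the cold start right back on the request path that the pool exists to protect.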
Can I just use Docker with seccomp? No. Shared kernel = shared attack surface. Use it only for trusted internal code.
How do I handle GPU workloads? GPU passthrough into Firecracker is fragile; consider Kata Containers or gVisor's NVIDIA GPU support (nvproxy).
Where does CallSphere expose code execution? We don't, by design. Tools are pre-defined and reviewed. See our demo for the agent shapes; pricing lists exposed surfaces.
Choosing a safe code execution sandbox for AI agents sounds like a single decision, but in production it splits into eval design, prompt cost, and observability. The deeper you push toward live traffic, the more those three pull against each other: better evals catch silent failures, prompt cost limits how often you can re-run them, and weak observability hides which retries are actually saving conversations versus burning latency budget.
The big fork is managed (OpenAI Realtime, ElevenLabs Conversational AI) versus self-hosted on GPUs you operate. Managed wins on cold-start, model freshness, and zero-ops; self-hosted wins on unit economics past a certain conversation volume and on data residency for regulated verticals. CallSphere runs hybrid: Realtime for live calls, self-hosted Whisper + a hosted LLM for async, both routed through a Go gateway that enforces per-tenant rate limits.
Latency budgets are non-negotiable on voice. End-to-end target is sub-800ms ASR-to-first-token and sub-1.4s first-audio-out; anything beyond that and turn-taking feels stilted. GPU residency in the same region as your TURN servers matters more than choosing a slightly bigger model.
Observability is the unglamorous backbone — every conversation produces logs, traces, sentiment scoring, and cost attribution piped to a per-tenant dashboard. HIPAA + SOC 2 aligned isolation keeps healthcare traffic separated from salon traffic at the storage layer, not just the API.
How does this apply to a CallSphere pilot specifically? CallSphere runs 37 production agents and 90+ function tools across 115+ database tables in 6 verticals, so most workflows you'd want already have a template. For a topic like "Safe Code Execution Sandboxes for AI Agents: A 2026 Architecture Guide", that means you're not starting from scratch — you're configuring an agent template that's already been hardened across thousands of conversations.
What does the typical first-week implementation look like? Day one is integration mapping (scheduler, CRM, messaging) and prompt tuning against your top 20 real call transcripts. Day two through five is shadow-mode running, where the agent transcribes and recommends but a human still answers, so you can compare side-by-side. Go-live is the moment your eval pass-rate clears your internal bar.
Where does this break down at scale? The honest answer: it scales until your tool catalog gets stale. The agent is only as good as the integrations it can actually call, so the operational discipline is keeping schemas, webhooks, and fallback paths green. The platform handles the rest — observability, retries, multi-region routing — without your team owning the GPU layer.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.