How to sandbox AI agents that execute code in production
To sandbox an AI agent that executes code in production, give it an ephemeral, isolated runtime, task-scoped credentials, a network egress allowlist, and reversible actions. The goal is not to predict every bad output the model might produce. It is to shrink the blast radius so that when the agent makes a wrong decision - and it will - the damage is contained to a throwaway environment instead of your production database, your secrets, or your cloud bill.
Most teams ship agent features by wiring a model to a set of tools: run SQL, call an internal API, execute a Python snippet, hit a shell. That works in the demo. The problem shows up the first time the model decides to run DELETE FROM users because a prompt was ambiguous, or fetches a URL an attacker planted in a support ticket. The model does not need to be malicious to cause an incident. It only needs to be wrong once, with real permissions.
What blast radius means for an AI agent
Blast radius is the set of things a single agent action can touch. In a normal service you scope this with code review, typed interfaces, and deploy gates. An agent removes all three: the "code" is generated at runtime, the interface is natural language, and there is no human reading the action before it fires.
So the question shifts. You stop asking "will the model always do the right thing" (it will not) and start asking "what is the worst thing this action could do, and is that acceptable." If the honest answer is "drop a table" or "exfiltrate every customer record," you do not have a prompt problem. You have a containment problem, and no amount of prompt engineering fixes it.
The OWASP Top 10 for Agentic Applications, published at the end of 2025, lists unexpected code execution as its own risk category precisely because this pattern is now everywhere: code interpreters, data-analysis agents, automation bots, and internal copilots that were given real tools to be useful. Useful and unbounded are not the same thing.
Why containers and an approval button are not enough
Two common answers fall short on their own.
The first is "we run it in a container." A standard container shares the host kernel, so a container escape reaches the host, and any credential mounted into the container is readable by whatever runs inside it. When the workload is untrusted code you did not write, a container is a packaging boundary, not a security boundary. For agent execution you want a deeper isolation layer - a microVM (Firecracker-style), a user-space kernel like gVisor, or a WebAssembly runtime for pure compute - so a break inside the sandbox does not reach the machine.
The second is "a human approves every action." Approval gates are valuable for genuinely destructive, low-frequency operations, and we have written about where human-in-the-loop approval fits in an agent product. But approval does not scale to the hundreds of small tool calls an agent makes per task, and humans rubber-stamp what they cannot fully read. Approval is a policy control. Sandboxing is a technical control. You need both, and you should never lean on approval to cover a missing sandbox.
The four boundaries every agent execution needs
A practical containment stack has four independent layers. Each one assumes the others might fail.
Isolate the runtime, and make it disposable
Run every code-executing action in a fresh sandbox that is destroyed when the task ends. Ephemerality is half the value: if the environment lives for one task and then disappears, a poisoned file, a lingering process, or a modified dependency cannot follow the agent to the next request. Managed microVM sandboxes (E2B and similar) exist for exactly this, but the principle matters more than the vendor - short-lived, kernel-isolated, and clean on every run.
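The lifecycle described above can be sketched in a few lines of Python. This is only an illustration of the create, run, destroy pattern: a temporary directory plus a subprocess is not kernel isolation, and in production the body of this function would create and tear down a microVM or gVisor sandbox instead. The function name, the timeout, and the environment passthrough list are all assumptions made for the example.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumption: only these variables are passed through so the interpreter can
# start. Everything else, including any secrets, stays out of the child.
PASSTHROUGH_ENV = ("PATH", "LD_LIBRARY_PATH", "SYSTEMROOT")


def run_in_disposable_workspace(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run generated code in a fresh directory that is deleted afterwards.

    NOTE: this is NOT a security boundary. It only shows the lifecycle;
    swap the subprocess for a microVM or gVisor sandbox in real use.
    """
    env = {k: os.environ[k] for k in PASSTHROUGH_ENV if k in os.environ}
    with tempfile.TemporaryDirectory(prefix="agent-task-") as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code, encoding="utf-8")
        return subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            cwd=workdir,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # a runaway task is killed, not left running
        )
    # Leaving the with-block removes the directory and anything written to it.
```

Each call starts from an empty directory, so a file written by one task is gone before the next one begins. The same shape applies when the sandbox is a managed microVM: create it per task, never reuse it.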
Scope credentials to the task, not the agent
The most damaging incidents come from an agent holding standing credentials with broad rights. Instead, mint short-lived, task-scoped tokens at the moment the action runs. A support agent answering a billing question gets read access to that one customer's invoices for the next 60 seconds - not a service-role key that can read every tenant. If a token leaks or the model is tricked, the attacker inherits only what that task needed. This is the single highest-leverage change most teams can make, and it usually costs nothing but plumbing.
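A minimal sketch of what "task-scoped" can mean in code, assuming a hand-rolled HMAC-signed token. The function names, the claim fields, and the hard-coded signing key are hypothetical; a real system would more likely use its cloud provider's short-lived credentials or an established token library, with the key held in a secrets manager.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"  # assumption: stored in a secrets manager in practice


def mint_task_token(customer_id: str, scope: str, ttl_s: int = 60,
                    now: Optional[float] = None) -> str:
    """Issue a token bound to one customer, one scope, and a short lifetime."""
    issued = time.time() if now is None else now
    claims = {"sub": customer_id, "scope": scope, "exp": issued + ttl_s}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def authorize(token: str, customer_id: str, scope: str,
              now: Optional[float] = None) -> bool:
    """Accept the token only for the exact customer and scope, before expiry."""
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # tampered or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = time.time() if now is None else now
    return (
        claims["sub"] == customer_id
        and claims["scope"] == scope
        and current < claims["exp"]
    )
```

A token minted for one customer's invoices fails the check for any other customer, any other scope, and any time after the 60-second window, which is the property the paragraph above is describing.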
Lock down network egress
Default the sandbox to no outbound network, then allowlist only the specific hosts the task requires. Open egress is how a prompt injection buried in retrieved content turns into data exfiltration: the model reads a malicious instruction, then POSTs your data to an attacker's endpoint. If the only reachable hosts are your own API and the two vendors the task needs, that exfiltration path is closed no matter what the model was convinced to do.
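The allowlist decision itself is small enough to show. The sketch below is an application-level check with hypothetical host names; it illustrates the matching logic only. Generated code can bypass any check that lives in the same process, so the rule that actually protects you has to be enforced outside the sandbox, at a firewall or egress proxy.

```python
from urllib.parse import urlsplit

# Hypothetical hosts: your own API plus one vendor the task needs.
ALLOWED_HOSTS = frozenset({"api.internal.example", "billing.vendor.example"})


def egress_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for HTTPS requests to an exactly matching host."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").rstrip(".")
    except ValueError:
        return False  # malformed URL: deny by default
    if parts.scheme != "https":
        return False
    # Exact match, not suffix or substring match, so look-alike hosts
    # such as "api.internal.example.attacker.example" are rejected.
    return host in allowed_hosts
```

Exact host matching is a deliberate choice here: substring and prefix checks are the usual way an allowlist gets bypassed.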
Make side effects reversible
For tools that change state - write to a database, send a message, refund a charge - design the action so a mistake is recoverable. Prefer soft deletes over hard deletes, idempotency keys over blind retries, and a dry-run mode that returns the intended change without committing it. Log every tool call with its inputs and the identity it ran as, so an incident is auditable in minutes rather than reconstructed from guesses. Reversibility is what lets you sleep after enabling autonomy.
A concrete example: a support agent that runs SQL
Say you are adding an agent that answers customer questions by querying your production database. The naive version connects with the app's database user and runs whatever SQL the model writes. The blast radius is your entire schema, every tenant, read and write.
The contained version looks different. The agent runs inside a disposable sandbox with no standing network access. When it needs data, it does not get a database connection at all - it calls a narrow internal tool, lookup_invoices(customer_id), that runs a parameterized, read-only query behind a role scoped to a single tenant. The SQL the model "writes" never touches the database directly; it is confined to a reviewed function with a fixed shape. If the model hallucinates a destructive query, there is no code path that can execute it.
Now the worst case is bounded: the agent can, at most, read invoices for the customer it was asked about. That is a feature you can ship. The version with a raw database connection is an incident waiting for its trigger. The design work is choosing narrow tools over broad access - the same discipline we cover in designing tools for production AI agents.
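A sketch of what a narrow tool like lookup_invoices(customer_id) could look like. The schema and column names are invented for illustration, and SQLite with PRAGMA query_only stands in for the read-only, tenant-scoped database role the example describes; in a real deployment that restriction would come from the database's own role and row-level permissions.

```python
import sqlite3


def open_readonly(conn: sqlite3.Connection) -> sqlite3.Connection:
    """Stand-in for a read-only role: reject writes on this connection."""
    conn.execute("PRAGMA query_only = ON")
    return conn


def lookup_invoices(conn: sqlite3.Connection, customer_id: str) -> list:
    """The only query the agent can trigger. Its shape is fixed and reviewed.

    The model supplies customer_id as a bound parameter; it never
    supplies SQL text.
    """
    rows = conn.execute(
        "SELECT id, amount_cents, status FROM invoices "
        "WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    ).fetchall()
    return [{"id": r[0], "amount_cents": r[1], "status": r[2]} for r in rows]
```

Because the customer ID is bound as a parameter, an input such as `x' OR '1'='1` is treated as a literal string and matches nothing, and there is no argument through which a DELETE could be passed.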
How to roll this out without rebuilding everything
You do not need a platform team to start. Rank your agent's tools by worst-case damage: which single call could delete data, move money, or leak records. Contain those first. A read-only analytics agent needs far less than one that can execute shell commands or trigger deployments, so spend your effort where the blast radius is largest.
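The ranking step can be as simple as a scored list. The tool names, capability flags, and weights below are made up for illustration; the point is only that writing the worst case down per tool gives you an ordered backlog.

```python
from dataclasses import dataclass

# Assumed weights: tune these to your own notion of damage.
WEIGHTS = {
    "runs_generated_code": 4,
    "can_move_money": 3,
    "can_write": 2,
    "reads_cross_tenant": 2,
}


@dataclass(frozen=True)
class Tool:
    name: str
    runs_generated_code: bool = False
    can_move_money: bool = False
    can_write: bool = False
    reads_cross_tenant: bool = False


def worst_case_score(tool: Tool) -> int:
    """Sum the weights of every dangerous capability the tool has."""
    return sum(w for flag, w in WEIGHTS.items() if getattr(tool, flag))


def containment_order(tools: list) -> list:
    """Highest worst-case damage first: contain these before the rest."""
    return sorted(tools, key=worst_case_score, reverse=True)


tools = [
    Tool("lookup_invoices"),
    Tool("analytics_query", reads_cross_tenant=True),
    Tool("refund_charge", can_move_money=True, can_write=True),
    Tool("run_shell", runs_generated_code=True, can_write=True),
]
ordered = [t.name for t in containment_order(tools)]
```

With these assumed weights the shell tool and the refund tool come out ahead of the read-only ones, which matches the advice to spend effort where the blast radius is largest.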
From there, add the layers in order of leverage: task-scoped credentials first (biggest risk reduction for the least work), then egress allowlisting, then a disposable sandbox for anything that runs generated code, then reversibility on write tools. Wire these into the same failure handling and logging you use elsewhere. The same instincts that keep MCP servers from failing silently apply here, and containment belongs in your pre-launch checks alongside the rest of your AI feature guardrails.
None of this is exotic. It is standard least-privilege and isolation practice, applied to a component that happens to write its own code at runtime. The teams that get burned are the ones who treated the agent as trusted because it was helpful. The teams that ship confidently treated every agent action as untrusted input and bounded it accordingly.
Frequently asked questions
Do I need a sandbox if my agent only calls approved APIs and never runs raw code?
You need less, but not nothing. If the agent cannot execute arbitrary code, runtime isolation matters less. But side-effecting API calls still need task-scoped credentials, egress control, and reversibility, because a wrong tool call with broad permissions is its own incident. The blast-radius question applies to any action, not just code execution.
Is a Docker container enough to sandbox agent code?
For untrusted, model-generated code, a plain container is a weak boundary because it shares the host kernel. Use a stronger isolation layer - a microVM, gVisor, or a WebAssembly runtime - for anything that executes generated code, and keep the container for packaging. Pair it with no default network egress and task-scoped secrets regardless of the runtime you pick.
Won't all this isolation slow the agent down?
Disposable microVM sandboxes now start in well under a second, and task-scoped tokens and egress rules add negligible latency. The real cost is design time, not runtime. Weighed against a single incident that exposes customer data or deletes production records, the overhead is not the expensive part.
Where should a small team start?
Start with the one tool whose worst case is the most damaging, and give it task-scoped credentials so it can only touch what the current task needs. That single change removes most of the risk for a fraction of the effort. Add disposable sandboxing and egress allowlists as the agent gains more powerful tools.