Sandboxes aren’t new, and I’m not going to pretend they are. We all know what they do: run code in isolation so whatever chaos happens in there stays in there and doesn’t touch the host.
But it’s worth stepping back for a second. What’s more interesting is what we were doing before sandboxes became a thing in the AI world. gVisor and Firecracker showed up around 2018, both built internally at big cloud providers to solve a very specific, very scary problem — multiple customers running untrusted code on the same physical hardware. Docker containers share the host kernel, so one kernel exploit and someone could break out and own the whole server. That’s the problem those two were quietly solving long before you could just download either of them yourself.
Back to the present day, models write code now. Decent code, honestly. But they have no idea what your production environment looks like, no real grasp of intent, and absolutely no sense of consequences. Which brings me to the part where you’re probably thinking: why would you ever run AI-generated code in prod without someone reviewing it first?
A few reasons actually. Sometimes when you want to define your own logic at runtime. Think of things like custom transformations, dynamic queries, or user-defined workflows. You don’t know ahead of time what the code needs to do because it depends on the input, so the system has to generate and execute it on the fly. You could add human oversight, but the goal in these systems is to automate that loop while still trusting the execution. Then there’s the obvious one: anything involving arbitrary code execution needs a sandbox. And if you’ve come across code mode in agentic frameworks where a model generates code to call your tools, that’s a sandbox use case hiding in plain sight. The generated code needs to be scoped, locked down, and given only what it needs to do its job.
That’s the “why”. Sandboxes are what make all of this not terrifying. Think of them like a confined space where you can go a little crazy, but you define the confinement. Without that, you’re one inbox-deletion away from a very bad day.

If you haven’t seen the openclaw incident where a personal AI assistant went ahead and wiped a user’s entire inbox. Yeah, that happened. And a well-scoped sandbox is exactly the kind of thing that prevents it. Restrict deletions, be careful with writes, lock down network access where you don’t need it, and never expose env variables or credentials. Reads are mostly fine, everything else deserves a second thought.
That’s what we’re going to dig into here: what we're building in Flyte around sandboxing and why we built it the way we did.
flyte.sandbox: a cleaner way to run arbitrary code
Flyte 2.0 made writing agentic workloads a lot less painful. Tasks, traces, recovery logs, and asyncio primitives are there and they're clean. It's still Python, just far less of it compared to the DSL-heavy world of Flyte 1.0.
So why sandboxes? The answer starts with our users figuring it out before we did. We’ve always had `ContainerTask`. It lets you run code in isolated containers, any language, arbitrary execution. It was originally built more for non-Pythonic workloads but it's a solid primitive and it's been there for a while.
When people started building pipelines that had models generating code as part of them, they naturally gravitated towards `ContainerTask`. And fair enough, it just worked as it provided isolation out of the box and arbitrary code execution with no fuss.
Worth being upfront about something. What we ship today is container-based isolation, not the kernel-level hardening you’d get from Firecracker microVMs or gVisor. For most agentic workloads running generated code as part of a pipeline, it does the job well. But if your threat model demands stronger guarantees, that’s worth knowing upfront. Deeper isolation is something we’re actively thinking about for the future.
But `ContainerTask` still doesn’t feel like a sandbox unless you already know that’s what you’re using it for. So we wanted to close that gap by giving you something that looks and feels like what you’d actually expect. That’s where `flyte.sandbox.create()` came from.
Ephemeral code execution
E2B, Daytona and others come with strong sandbox primitives like interactive sessions, snapshotting, and runtime control. These make sense when you’re dealing with long-lived environments that need persistence.
What we're shipping today is different. Flyte’s code sandbox is a one-off execution model. Spin up a container, do the thing, throw it away with no frills and overhead. It's deliberately lightweight and targets a different set of use cases: when you just need arbitrary code to run safely and cleanly, without all the machinery that comes with a full-blown interactive sandbox.
It also supports stricter isolation when needed, including the ability to block network access entirely and prevent any outbound traffic.
We’re actively exploring richer capabilities like interactive sessions and snapshotting, especially for scenarios where execution is expensive and rerunning work isn’t ideal.
Automatic code generation
Okay, slight detour, but I’d feel bad not mentioning this because it’s directly relevant.
We have a plugin built on top of the sandbox called automatic code generation. You describe what the code should do, pass in some sample data and constraints, and get back a validated, runnable script. Here's what that looks like:
Under the hood, it generates code, infers dependencies, builds a sandbox image, writes and runs tests, and iterates on failures. Logic errors are patched, missing packages trigger rebuilds, and test failures update expectations. Everything runs inside isolated sandboxes.
You can also opt for the agent backend which uses the Claude agent SDK today to automate the entire loop.
Now, code generation is expensive. Multiple LLM calls, image builds, sandbox executions, and it adds up. Flyte handles this through two things: replay logs, which record every task execution so if the pipeline crashes midway it resumes from the failure point without redoing completed work; and caching, which skips re-execution entirely when the same inputs show up across separate runs. Together they make crash recovery and iterative development pretty cheap.
One more thing on the security side. You don’t need to send real data for code generation. Samples are usually enough for the model to do its job. You run the resulting code on actual data separately, so the model never sees it. That matters for a lot of use cases.
Just to set expectations, good code generation isn’t “prompt in, code out”. It hits issues, works through them, and improves. What you end up with is something usable, not just something that looks right.
So code generation gets you validated, runnable code. But what happens when that code needs to call other functions, your actual business logic, at runtime? That's where things get really interesting.
Programmatic tool calling
When this first came up, it wasn’t obvious what it would look like in practice. That’s also where Monty comes in. Monty is a Rust-based sandboxed Python interpreter built by Pydantic. It’s lightweight and can execute pure Python — variables, loops, conditionals, function calls — but has no access to the filesystem, network, imports, or the OS.
It’s intentionally very confined. This isn't your regular sandbox though. It's less about running arbitrary code and more about orchestrating workloads where your actual logic lives in functions you want to call at specific times, and those functions can execute in a separate environment, a code sandbox if needed.
You start with everything and try to remove the dangerous parts […] Monty's approach is the opposite: start from nothing, then selectively grant capabilities. The default is zero access — no filesystem, no network, no environment variables, strict resource limits. You explicitly opt in to each capability via external functions that you wrote, you control, and you can audit. — Pydantic blog, Monty
That’s a very different way to think about running code safely in a secure environment. You could even argue that, if your use case fits this model, where access is limited to a carefully selected set of functions or tools, you may not need a traditional code sandbox at all. The tradeoff is that safety now depends on how much you trust the tools, their implementations, and the boundaries around what they can do.
What makes Flyte a good fit for this model
A few things make Flyte a natural fit for this. First, Flyte spins up containers with specific permissions, secrets, and resources for each task, so infrastructure scoping is already built in. Second, models are genuinely good at Python. They are trained on enough code that generating reliable orchestration logic, control flow, conditionals, loops, is well within reach. And third, Monty starts in microseconds, so there's none of the cold start overhead you’d get from spinning up a VM or a full container just to run orchestration code.
What you end up with is a clean separation. The model generates orchestration code, and Flyte tasks handle the actual work such as data access, computation, and external APIs, each running in its own isolated container.
This is where Flyte 2.0 makes things more interesting. So you can treat your functions or tools as tasks, and have a parent task run model-generated code that calls them. That's programmatic tool calling, what some call code mode.
The key constraint here is that the model-generated code should only orchestrate. It should call the tasks it’s been given and nothing else. No network access, no going off-script, and no pulling in anything it wasn’t explicitly allowed to use. Isolation helps, but that alone isn’t enough. You also don’t want it spinning up arbitrary tasks or doing anything beyond what it was explicitly allowed to do.
How Flyte implements code-mode for agentic workloads
Your generated code runs inside one or more Monty sandboxes embedded within a worker container, with a bridge layer sitting between the two to coordinate execution.
The bridge handles routing calls and passing data around without exposing anything it shouldn’t.
When code inside Monty calls an external task, Monty pauses. The bridge picks it up and executes the task, either directly in the outer Python process or as a remote call through the Flyte controller. Once that finishes, Monty resumes with the result. From the code’s point of view, it just looks like a normal function call returning a value. It doesn’t know anything about what happened underneath.
Opaque IO types like `File`, `Dir`, and `DataFrame` are handled by the bridge layer and pass through the sandbox untouched. Your code can route them between tasks but can’t read or modify the contents.
And if you want to apply this to your regular tasks, `@env.sandbox.orchestrator` decorator does exactly that. It locks it down to orchestration only with no overhead worth worrying about.
Conclusion
Sandboxing is an old tale, I know. But what we’ve built in Flyte, especially the workflow sandbox primitive for programmatic tool calling, is something we think fills a real gap.
There’s a lot of noise around sandboxes right now and it's easy to get lost in it. The thing worth remembering is that there's no universal answer here. The right sandbox depends entirely on your use case, and hopefully this gave you some first-principles grounding to think that through: what sandboxes actually are, why they exist, and what problem each flavor is solving.
The two sandbox primitives we shipped, workflow sandboxes and code sandboxes, are the ones we felt you’d reach for most often, but we'd love to hear what would make building on Flyte feel even better.
We go deeper into sandbox usage patterns in the docs.
Thanks for reading. Hope this gave you a clearer sense of how we’re thinking about sandboxes in Flyte and what we’ve built around it. If you’re running agentic workloads and want to give it a try, we’d love to hear how it goes.




