
Behind the scenes of Modal sandboxes
One thing I’ve learned from my toddler is that everyone wants a sandbox. In the past twelve months, the number of sandbox environments has exploded; sandboxes are no longer a curiosity but a core primitive. Modal’s sandbox product started as a weekend prototype three years ago, but now it’s handling hundreds of thousands of concurrent environments for some of the most demanding workloads in AI.
This post is a behind-the-scenes look at how they built it: the use cases that drove the design, the infrastructure decisions that made it possible, and the scheduling problems that emerged once "a lot of sandboxes" turned into "an absurd number of sandboxes."{{sandboxes-fn-1}}
Everyone is training coding agents, and they need sandboxes to do it
Sandboxes are isolated environments that can be dynamically defined at runtime. In a sense they are like VMs, but designed to be highly ephemeral and secure by default.
Many people think sandboxes exist exclusively for coding agents. Lovable (running on Modal), Ramp (also running on Modal), and Cursor have all discussed their sandbox requirements in recent posts: their agent writes code, that code runs in a sandbox, and the user gets a result. However, a new use case is emerging that is often more infrastructure-intensive and complex: reinforcement learning. RL demands scale that few anticipated, and supporting that scale has required interesting and challenging engineering work.
Today’s coding models benefit from reinforcement learning from verifiable rewards. With verifiable rewards, you no longer need a complex reward model. The model generates code during rollout, runs the generated code against a test harness, and receives objective feedback: did the tests pass? Benchmarks like SWE-bench and terminal-bench provide structured environments with different container images and test suites that researchers can integrate directly into their RL training loop.
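This verification loop is simple to sketch in miniature. The toy function below is not Modal's harness or any real RL framework; all names are hypothetical, and it uses an in-process `exec()` purely for illustration (a real harness would run the code inside an isolated sandbox):

```python
def verifiable_reward(generated_code: str, test_code: str) -> float:
    """Toy verifiable reward: 1.0 if the generated code passes the tests, else 0.0.

    Illustrative only -- a real harness would execute the candidate inside
    an isolated sandbox, not in the current process.
    """
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # "rollout": define the candidate solution
        exec(test_code, namespace)       # "verification": run the test harness
    except Exception:
        return 0.0
    return 1.0


# A correct and an incorrect candidate for the same task:
good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print(verifiable_reward(good, tests))  # passing solution gets reward 1.0
print(verifiable_reward(bad, tests))   # failing solution gets reward 0.0
```

The objectivity is the point: there is no learned reward model to game, just a binary signal from running the tests.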
However, to complete this verification loop, the agent must actually execute the generated code. In a typical training or evaluation pass, the agent may generate anywhere from dozens to thousands of trajectories, each consisting of a sequence of code generations, tool invocations, and environment interactions. For reliable evaluation, those trajectories often need to be executed in isolated or resettable environments so the outcome of one run does not contaminate another. Because these are multi-step agents, verification cannot be limited to the final output alone: each step that reads files, runs commands, modifies state, or calls tools may affect subsequent behavior and therefore has to be executed and checked in context. As a result, the total number of executions scales multiplicatively across several dimensions: the number of tasks, the number of sampled trajectories per task, and the number of steps within each trajectory. What appears to be a small evaluation at the task level quickly expands into a huge volume of sandboxed executions. What starts as a unit test quickly turns into a stress test…for the researcher.
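The multiplication is easy to underestimate, so here is the back-of-envelope arithmetic with purely illustrative numbers (not from any real training run):

```python
# Rough arithmetic for how executions multiply across dimensions.
# All numbers below are illustrative, not from any real training run.
tasks = 500                 # e.g. SWE-bench-style tasks in the eval set
trajectories_per_task = 16  # sampled rollouts per task
steps_per_trajectory = 30   # tool calls / code executions per rollout

executions = tasks * trajectories_per_task * steps_per_trajectory
print(f"{executions:,} sandboxed executions")  # 240,000
```

Five hundred tasks already implies hundreds of thousands of sandboxed executions per pass, before accounting for repeated passes during training.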
After executing the trajectories and identifying higher-reward outcomes, a policy update step is performed. The sandboxed environments are used during both rollout and evaluation to generate and score trajectories, while the actual policy optimization typically occurs separately on GPU-based training infrastructure. Because policy updates are performed on data collected from a moving policy, reducing the latency between trajectory generation and optimization decreases the degree of off-policy drift. This results in training data that is more closely aligned with the current policy and therefore improves update quality. Consequently, faster sandbox provisioning and execution increase the rate at which fresh, on-policy data can be incorporated into training, directly impacting overall learning efficiency.
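The latency argument can be made concrete with hypothetical numbers. The sketch below shows only the basic relationship between per-trajectory time and the rate of fresh data; the figures are invented for illustration:

```python
# Illustrative only: how sandbox latency caps the rate of fresh, on-policy data.
# If each rollout costs sandbox provisioning plus execution time, cutting that
# latency directly raises the samples collected per policy version.
def samples_per_hour(n_parallel_sandboxes: int, seconds_per_trajectory: float) -> float:
    return n_parallel_sandboxes * 3600 / seconds_per_trajectory

slow = samples_per_hour(10_000, seconds_per_trajectory=120)  # 2 min per rollout
fast = samples_per_hour(10_000, seconds_per_trajectory=60)   # halved latency
print(slow, fast)  # halving trajectory time doubles on-policy throughput
```

The same halving also shrinks the wall-clock gap between the policy that generated a trajectory and the policy being updated, which is the off-policy drift the paragraph above describes.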
At scale, RL training becomes a problem of orchestrating massive numbers of sandboxes. One of Modal’s customers, a major AI lab, is already running on the order of 100,000 concurrent sandboxes for RL workloads, with a stated goal of reaching 1 million. At some level, this is just the natural consequence of how learning happens. More trajectories yield a stronger learning signal, leading to better models. And because improving coding ability via RL tends to transfer to broader reasoning performance, the demand for this kind of infrastructure is only increasing. Meta recently released an open weights model for code generation where the RL was done with Modal.
So why use sandboxes at all, instead of just building this yourself with VMs? The first, and most important, reason is isolation. When sampling tens of thousands of trajectories that involve generating and executing arbitrary code, each run can mutate state, interfere with other runs, or attempt unsafe operations. Without strong isolation, one trajectory can corrupt the environment for others or, in the worst case, impact the host system itself.
In practice, sandbox logs routinely show agents attempting destructive or undefined behavior, because they will try anything that appears to improve reward, regardless of side effects. This is why each trajectory needs a clean, isolated, and resettable environment. It is the same reason developers avoid testing in production, except here it is happening at orders-of-magnitude higher concurrency, with agents that will readily execute whatever sequence of actions appears locally optimal.
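As a rough local analogy (this is not how Modal implements isolation, which relies on gVisor and full sandbox environments), the snippet below gives each trajectory step a throwaway working directory, so one run's file mutations can't leak into the next:

```python
import subprocess
import sys
import tempfile

def run_trajectory_step(code: str) -> str:
    """Run one agent step in a throwaway working directory.

    A crude local stand-in for a resettable sandbox: each call gets a fresh
    directory, so one run's file mutations can't leak into the next. Real
    sandboxes also isolate processes, networking, and the filesystem root.
    """
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir, capture_output=True, text=True, timeout=30,
        )
        return result.stdout.strip()

# A "destructive" step that clobbers a file only sees its own scratch dir:
run_trajectory_step("open('state.txt', 'w').write('corrupted')")
out = run_trajectory_step("import os; print(os.path.exists('state.txt'))")
print(out)  # the second run starts clean: prints False
```

Multiply this by tens of thousands of concurrent agents trying locally-optimal but destructive actions, and the need for real isolation (rather than a shared filesystem and a prayer) becomes obvious.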
A second important reason is everything around your VMs. For these kinds of sandbox use cases, you need to spin up several different images quickly. You also need to build a layer that orchestrates all of these VMs: capturing logs, proxying commands to execute, and returning their statuses. In short, it’s a PITA.
But the third, and less discussed, reason is resource efficiency. Each sandbox only needs a slice of a CPU core and a bit of RAM. Spinning up a full VM for every trajectory would add a huge amount of overhead in memory and startup time, and it just doesn’t make sense at this scale.
Modal sandboxes v1
There are entire startups just doing sandboxes. Modal's sandbox product, now leading the market, had a v1 built by Akshat (CTO) in just a weekend{{sandboxes-fn-2}}.
That statement deserves some context, because it is both true and misleading. This was possible because Modal had already spent years building the hard primitives: fast container startup, a custom filesystem (two actually), gVisor-based isolation, a scheduling layer that can pack workloads efficiently across a fleet of machines and clouds, and a storage layer that supports dynamic attachment. Sandboxes are built from those same primitives, just exposed through a different interface.
The core Modal product is declarative. A function is defined, decorated, and deployed, and Modal handles scaling the containers up and down. Users think in terms of inputs and outputs; they don’t manage individual containers. In contrast, sandboxes are imperative. A sandbox is created via Sandbox.create(), which returns a handle that can be managed directly: executing commands, attaching storage, reading files, and exposing ports. Each sandbox can have a different container image, resource profile, and storage configuration. These are runtime decisions about environment configuration, which match how the agent harness dynamically creates and interacts with execution environments for each trajectory.
By default, sandboxes have no credentials or privileges to interact with the rest of your Modal ecosystem. A Modal function can spin up other functions, access secrets, and chain calls across your infrastructure. A sandbox cannot, unless you explicitly hand it those credentials. You can buy your child the Lego set, but the kid can't go buy it themselves.
So for these reasons, most of the actual engineering work for v1 was API design. Modal spent their time figuring out the right SDK for how people would want to create, configure, and interact with these environments. The infrastructure was already there. The question was: what's the right interface for an imperative, agent-driven model of compute?
Scaling past the weekend prototype
Though the initial development of Modal sandboxes was fast, keeping up with the changing needs of AI researchers and engineers has been more arduous.
As sandbox adoption took off — first with coding app platforms like Lovable mid-last year, then with the wave of background coding agents, then with RL workloads from AI labs — the shape of the infrastructure problem changed drastically.
Scheduling sandboxes at scale
One of the most interesting parts of the Modal architecture is their scheduler, which matches customer workloads with actually available infrastructure from cloud providers.
With the core Modal product, the number of active containers is relatively bounded and predictable; the platform manages their lifecycle and scheduling. In contrast, sandboxes are explicitly created, directly addressable environments, often with heterogeneous resource requirements and lifetimes. The scheduler is no longer dealing with a uniform pool of short-lived function invocations but with potentially hundreds of thousands of independently managed containers, each with its own CPU, memory, filesystem, and networking configuration, possibly distributed across regions.
For this reason, scheduling becomes a real-time systems problem. The system needs a continuously updated view of cluster state, since every placement decision depends on what resources are actually available at that moment: a sandbox can’t be placed on a node without sufficient free capacity. Those decisions must be made quickly and repeatedly as new sandboxes are created and existing ones change state.
Modal’s architecture uses a control plane backed by a database as the source of truth, but the scheduler operates over an in-memory view of cluster state to make low-latency placement decisions. That works well at smaller scales, but they’re now rethinking the design to handle orders-of-magnitude more concurrent sandboxes. At some point, a single scheduler becomes a bottleneck, and the system has to be partitioned. The challenge is that placement decisions depend on a sufficiently up-to-date view of available capacity, but maintaining even an approximately consistent view across machines gets harder as the number of nodes, sandboxes, and state transitions grows. Modal’s new architecture is designed to push through these constraints, unlocking sandbox orchestration at massive scale.
Dealing with capacity across regions
Some of Modal’s competitors operate within a single cloud region, which lets them aggressively cache container images and keep warm capacity close to the scheduler. They don’t mention this when they report strong performance on cold-start benchmarks. In comparison, Modal runs across multiple regions because their users need it — for things like data residency (e.g. EU-only deployments), lower-latency execution near end users, and access to aggregate capacity that doesn’t exist in any single region. Of course, that capability makes ensuring fast cold starts strictly harder. Caches are fragmented across regions; images may need to be pulled into a region on demand; and placement decisions must account for both resource availability and location constraints. Despite this, Modal’s cold-start performance remains state-of-the-art.
On top of that, Modal supports extremely fast cold starts for arbitrary container images rather than a small, curated set optimized for specific workloads. That makes fast startup harder to do well, since images are less likely to be pre-cached and environment setup can vary significantly between runs. But this is exactly the kind of systems problem Modal’s engineering team excels at, and they’ve built the infrastructure to make it work at scale.
Modal also supports GPU-backed sandboxes, which expands the set of possible workloads but adds another layer of scheduling complexity. GPUs are scarce, regionally fragmented resources with stricter placement constraints. But they’re necessary for use cases like automated kernel generation, ML experimentation loops, and video-processing agents that rely on hardware-accelerated pipelines like FFmpeg.
Another emerging use case is AI research agents. These agents generate hypotheses, design experiments, run training or evaluation loops, and synthesize results. Each step may require spinning up a GPU-backed environment with specific dependencies and data, often for short-lived but stateful workloads. GPU-backed sandboxes make these iterative, programmatic research loops possible.
Updates to the storage layer
This has been one of Modal’s most significant investments. Modal supports a few different ways to persist and reuse sandbox state, each designed for a different use case.
Filesystem snapshots capture the full filesystem of a sandbox at a point in time. They can be used to resume work later or to start new sandboxes from that exact state. Because they only store what changed from the base image, they remain efficient and integrate with Modal’s fast startup path. This is the right primitive for long-running jobs or workflows that need to hibernate between interactions and pick up where they left off.
Directory snapshots are more lightweight. Instead of saving the whole environment, they capture a specific working directory. This makes it possible to reuse project files or intermediate outputs across different sandboxes, while swapping out the underlying container image or environment.
Memory snapshots go a step further by capturing both the filesystem and the in-memory state of a running sandbox. In principle, this allows execution to resume from the exact same point, although the feature is still in alpha and comes with limitations for now.
Volumes are separate from snapshots. They provide shared, persistent storage that can be mounted across sandboxes and functions, and are better suited for longer-lived data than for checkpointing execution.
For AI engineers and researchers – and in particular those running agents in prod – this matters because agent workflows are inherently stateful. A trajectory is not a single function call but a sequence of steps that modify files, environments, and intermediate results. These primitives make that state explicit and controllable. Instead of rerunning entire trajectories, it becomes possible to checkpoint progress, branch from a shared starting point, and explore multiple continuations in parallel. This leads to faster experimentation, cleaner comparisons, and easier debugging.
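As a rough local analogy for directory snapshots (Modal's real snapshots only store deltas from the base image; the full copies here are purely illustrative), branching two continuations from one checkpoint looks like this:

```python
import shutil
import tempfile
from pathlib import Path

# Local analogy for directory snapshots: checkpoint a working directory,
# then branch several continuations from the same state in parallel.
base = Path(tempfile.mkdtemp())
(base / "project.py").write_text("ATTEMPTS = []\n")

# "Snapshot" the shared starting point once...
snapshot = Path(tempfile.mkdtemp()) / "snap"
shutil.copytree(base, snapshot)

# ...then branch two continuations that diverge without interfering.
branches = []
for strategy in ("refactor", "rewrite"):
    branch = Path(tempfile.mkdtemp()) / strategy
    shutil.copytree(snapshot, branch)
    (branch / "project.py").write_text(f"ATTEMPT = {strategy!r}\n")
    branches.append(branch)

print([(b / "project.py").read_text().strip() for b in branches])
```

The snapshot stays untouched while each branch mutates its own copy, which is exactly the checkpoint-and-branch pattern described above, minus the efficiency tricks.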
Improving packing efficiency
A single VM can host hundreds of sandboxes, depending on their CPU and memory requirements. Each sandbox typically consumes only a fraction of a core and a small amount of RAM, allowing many to be packed onto the same machine. This is far more efficient than provisioning a VM per environment, where fixed overhead dominates. At that point, packing and scheduling become critical, since overall system efficiency depends on how well available resources are utilized without causing contention. This matters because sandbox density directly determines how many trajectories can be executed in parallel, which in turn sets the upper bound on data throughput for training and evaluation.
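A back-of-envelope calculation shows why density matters. The numbers below are illustrative, not Modal's actual machine shapes or allocations:

```python
# Back-of-envelope packing: how many sandboxes fit on one host.
# All figures are illustrative, not Modal's actual allocations.
host_cores, host_mem_gib = 64, 256
sandbox_cpu, sandbox_mem_gib = 0.125, 0.5  # a slice of a core, a bit of RAM
reserved_fraction = 0.10                   # headroom for the host agent, etc.

by_cpu = int(host_cores * (1 - reserved_fraction) / sandbox_cpu)
by_mem = int(host_mem_gib * (1 - reserved_fraction) / sandbox_mem_gib)
density = min(by_cpu, by_mem)              # whichever resource binds first
print(density, "sandboxes per host")
```

Hundreds of sandboxes per host versus one VM per trajectory is the difference between a tractable fleet and an absurd one, and it is why bin-packing quality feeds directly into training throughput.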
Try Modal sandboxes yourself!
At the time of writing, Modal can spin up hundreds of sandboxes per second for a single customer. But customers are asking for quite a bit more. One customer wants a million concurrently. These colossal numbers are driven by real RL training workloads where sandbox throughput is a bottleneck on model improvement.
If you're building RL training pipelines, coding agents, or any workflow that needs fast, ephemeral, isolated compute, you can get started with Modal Sandboxes via $30 worth of free monthly compute. And if this kind of systems work sounds interesting to you, Modal is hiring.




