Summary
When the timeout set via Sandbox.exec(..., timeout=N) fires, only the top-level process receives the kill signal. All child and grandchild processes are reparented to PID 1 and continue running indefinitely, consuming GPU/CPU resources until the sandbox itself is terminated.
This is a significant problem for any use case where exec runs a shell that spawns subprocesses (agents, training scripts, build pipelines, etc.).
Reproduction
Minimal reproduction using modal==1.3.4:
import asyncio
import modal

async def main():
    app = await modal.App.lookup.aio("repro-orphan-test", create_if_missing=True)
    sb = await modal.Sandbox.create.aio(app=app, image=modal.Image.debian_slim(), timeout=300)
    # Run a process that spawns a child
    proc = await sb.exec.aio("bash", "-c", "sleep 300 & echo CHILD=$!; wait", timeout=10)
    # Wait for timeout to kill it
    rc = await proc.wait.aio()
    print(f"rc={rc}")  # rc=-1 (killed by timeout)
    await asyncio.sleep(2)
    # Probe what's still alive
    ps = await sb.exec.aio("ps", "-eo", "pid,ppid,stat,args", "--no-headers", timeout=5)
    print(await ps.stdout.read.aio())
    await sb.terminate.aio()

asyncio.run(main())
Expected: After the 10s timeout kills bash, the child sleep 300 should also be terminated.
Actual: sleep 300 is reparented to PID 1 and continues running:
PID  PPID  STAT  ARGS
1    0     Ss    /bin/dumb-init -- /bin/sh -c sleep infinity
2    1     Ss    /bin/sh -c sleep infinity
3    2     S     sleep infinity
4    0     Ss    bash -lc sleep 300 & echo CHILD=$!; wait   <-- orphaned, PPID=0
7    4     S     sleep 300                                  <-- still alive
Tested with deeper process trees
We also tested with nested processes (bash → python3 → sleep) and with real agent tooling (bash → codex exec → python3 → subprocess). In all cases, the entire subtree survives:
| Scenario | Timeout fires? | Direct process killed? | Children survive? |
| --- | --- | --- | --- |
| bash → sleep 300 | Yes (rc=-1) | Yes | Yes |
| bash → python3 → sleep 300 | Yes (rc=-1) | Yes | Yes (both python3 and sleep) |
| bash → codex exec → python3 → subprocess | Yes (rc=-1) | Yes | Yes (entire tree: node, codex, python3, child) |
Full process table after the codex test showing 5 orphaned processes:
4    0     Ss    bash -lc env CODEX_API_KEY=... codex exec ...
7    4     Sl    node /usr/bin/codex exec ...
37   7     Sl    /usr/.../codex/codex exec ...
70   37    Ss    python3 /tmp/long_job.py
74   70    S     python3 -c import time; [time.sleep(1) for _ in range(300)]
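A table like this can also be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original report: it parses output in the `ps -eo pid,ppid,stat,args --no-headers` format and lists every descendant of a given PID. The helper names `parse_ps` and `descendants` are hypothetical.

```python
from collections import defaultdict


def parse_ps(output):
    """Parse `ps -eo pid,ppid,stat,args --no-headers` text into (pid, ppid, stat, args) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        # Skip blank lines, header lines, and anything that is not a full row
        if len(parts) < 4 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        rows.append((int(parts[0]), int(parts[1]), parts[2], parts[3]))
    return rows


def descendants(rows, root_pid):
    """Return the sorted PIDs of all direct and indirect children of root_pid."""
    children = defaultdict(list)
    for pid, ppid, _stat, _args in rows:
        children[ppid].append(pid)
    seen, stack = set(), [root_pid]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen and child != root_pid:
                seen.add(child)
                stack.append(child)
    return sorted(seen)


# The process table from the codex test above
sample = """\
4 0 Ss bash -lc env CODEX_API_KEY=... codex exec ...
7 4 Sl node /usr/bin/codex exec ...
37 7 Sl /usr/.../codex/codex exec ...
70 37 Ss python3 /tmp/long_job.py
74 70 S python3 -c import time; [time.sleep(1) for _ in range(300)]
"""
print(descendants(parse_ps(sample), 4))  # [7, 37, 70, 74]
```

Running the same check after a timed-out exec gives a quick pass/fail signal: an empty list means nothing below the exec'd process survived.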
Suggested fix
When the exec timeout fires, kill the process group (or walk the process tree) rather than just the single PID. The standard approach:
- Start the exec'd process in its own session/process group via setsid
- On timeout, send SIGTERM then SIGKILL to the negative PGID (kill -TERM -- -$PGID)
This is the same pattern used by systemd (KillMode=control-group), Docker's stop command, and CI runners. We've implemented this as an application-layer workaround in our codebase, but it belongs in Modal's runtime so all users benefit.
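The two steps above can be sketched with the Python standard library on a local POSIX machine. This illustrates the general pattern only and says nothing about how Modal's runtime is implemented. The function name `run_with_group_timeout` and the `grace` parameter are hypothetical.

```python
import os
import signal
import subprocess
import time


def _signal_group(pgid, sig):
    """Send sig to every process in the group, ignoring an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_with_group_timeout(cmd, timeout, grace=2.0):
    """Run cmd in its own session; on timeout, signal the whole process group."""
    # start_new_session=True makes the child call setsid(), so its PID doubles
    # as the ID of a new process group that its descendants inherit.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    _signal_group(pgid, signal.SIGTERM)
    # Give the group leader up to `grace` seconds to exit cleanly
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline and proc.poll() is None:
        time.sleep(0.05)
    # SIGKILL whatever is still left in the group, then reap the leader
    _signal_group(pgid, signal.SIGKILL)
    return proc.wait()
```

For example, `run_with_group_timeout(["bash", "-c", "sleep 300 & wait"], timeout=10)` should terminate both `bash` and the backgrounded `sleep`. One limitation of this sketch: a descendant that calls setsid itself leaves the group and is not reached by the group signal, which is why walking the process tree or using cgroups is the more thorough option.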
Workaround
Until this is fixed server-side, wrap commands in setsid and track the PGID for manual cleanup:
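from uuid import uuid4

# Assumes sb, escaped_command, and timeout are defined by the caller,
# and that this runs inside an async function.
cmd_id = uuid4().hex[:8]
pid_file = f"/tmp/.exec_{cmd_id}"
# setsid starts the command in a new session, so it leads its own process group;
# the PID written to pid_file is later used as the PGID.
wrapped = (
    f"setsid bash -lc '{escaped_command}' &\n"
    f"echo $! > {pid_file}\n"
    f"wait $!\n"
    f"exit $?"
)
proc = await sb.exec.aio("bash", "-c", wrapped, timeout=timeout)
try:
    rc = await proc.wait.aio()
except Exception:
    # Kill the process group on timeout
    kill_proc = await sb.exec.aio(
        "bash", "-c",
        f"PGID=$(cat {pid_file}); kill -TERM -- -$PGID; sleep 2; kill -KILL -- -$PGID",
        timeout=10,
    )
    await kill_proc.wait.aio()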
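The snippet above does not show how escaped_command is produced. One possible approach, assuming the command is an arbitrary shell string, is to let `shlex.quote` do the quoting. It adds its own surrounding quotes when needed, so this sketch drops the manual single quotes from the template. The helper name `build_wrapped` is hypothetical.

```python
import shlex


def build_wrapped(command, pid_file):
    """Build the setsid wrapper script with the command and pid file safely quoted."""
    quoted_cmd = shlex.quote(command)        # handles embedded single quotes
    quoted_pid_file = shlex.quote(pid_file)  # left unchanged when already shell-safe
    return (
        f"setsid bash -lc {quoted_cmd} &\n"
        f"echo $! > {quoted_pid_file}\n"
        "wait $!\n"
        "exit $?"
    )


print(build_wrapped("echo 'hi there'", "/tmp/.exec_ab12cd34"))
```

The returned string can be passed as the script argument in place of `wrapped` in the workaround.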
Environment
- modal==1.3.4
- Tested on A100-40GB sandboxes
- Linux container with dumb-init entrypoint