
Sandbox.exec timeout kills only the direct process — child/grandchild processes survive as orphans #3992

Description

@wise-east

Summary

When Sandbox.exec(..., timeout=N) fires, only the top-level process receives the kill signal. All child and grandchild processes are reparented to PID 1 and continue running indefinitely, consuming GPU/CPU resources until the sandbox itself is terminated.

This is a significant problem for any use case where exec runs a shell that spawns subprocesses (agents, training scripts, build pipelines, etc.).

Reproduction

Minimal reproduction using modal==1.3.4:

import asyncio
import modal

async def main():
    app = await modal.App.lookup.aio("repro-orphan-test", create_if_missing=True)
    sb = await modal.Sandbox.create.aio(app=app, image=modal.Image.debian_slim(), timeout=300)

    # Run a process that spawns a child
    proc = await sb.exec.aio("bash", "-c", "sleep 300 & echo CHILD=$!; wait", timeout=10)

    # Wait for timeout to kill it
    rc = await proc.wait.aio()
    print(f"rc={rc}")  # rc=-1 (killed by timeout)

    await asyncio.sleep(2)

    # Probe what's still alive
    ps = await sb.exec.aio("ps", "-eo", "pid,ppid,stat,args", "--no-headers", timeout=5)
    print(await ps.stdout.read.aio())

    await sb.terminate.aio()

asyncio.run(main())

Expected: After the 10s timeout kills bash, the child sleep 300 should also be terminated.

Actual: sleep 300 is reparented to PID 1 and continues running:

  PID  PPID STAT ARGS
    1     0 Ss   /bin/dumb-init -- /bin/sh -c sleep infinity
    2     1 Ss   /bin/sh -c sleep infinity
    3     2 S    sleep infinity
    4     0 Ss   bash -lc sleep 300 & echo CHILD=$!; wait    <-- orphaned, PPID=0
    7     4 S    sleep 300                                    <-- still alive

Tested with deeper process trees

We also tested with nested processes (bash → python3 → sleep) and with real agent tooling (bash → codex exec → python3 → subprocess). In all cases, the entire subtree survives:

Scenario                                  Timeout fires?  Direct process killed?  Children survive?
bash → sleep 300                          Yes (rc=-1)     Yes                     Yes
bash → python3 → sleep 300                Yes (rc=-1)     Yes                     Yes (both python3 and sleep)
bash → codex exec → python3 → subprocess  Yes (rc=-1)     Yes                     Yes (entire tree: node, codex, python3, child)

Full process table after the codex test showing 5 orphaned processes:

    4     0 Ss   bash -lc env CODEX_API_KEY=... codex exec ...
    7     4 Sl   node /usr/bin/codex exec ...
   37     7 Sl   /usr/.../codex/codex exec ...
   70    37 Ss   python3 /tmp/long_job.py
   74    70 S    python3 -c import time; [time.sleep(1) for _ in range(300)]
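
To make the surviving subtree easier to confirm programmatically, the ps output can be parsed into a parent-to-children map and walked from the exec'd PID. The following is a minimal sketch, not part of Modal's API: the helper name descendants is hypothetical, and it assumes the `pid,ppid,stat,args` column order used in the reproduction above. The sample input is the process table from the codex test, copied as shown.

```python
from collections import defaultdict

# Process table from the codex test above (pid, ppid, stat, args).
PS_OUTPUT = """\
    4     0 Ss   bash -lc env CODEX_API_KEY=... codex exec ...
    7     4 Sl   node /usr/bin/codex exec ...
   37     7 Sl   /usr/.../codex/codex exec ...
   70    37 Ss   python3 /tmp/long_job.py
   74    70 S    python3 -c import time; [time.sleep(1) for _ in range(300)]
"""


def descendants(ps_output: str, root_pid: int) -> list[int]:
    """Return the PIDs of every descendant of root_pid found in ps_output."""
    children = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip blank lines and anything that does not start with two integers.
        if len(fields) < 2 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, ppid = int(fields[0]), int(fields[1])
        children[ppid].append(pid)

    found, stack = [], [root_pid]
    while stack:
        # pop() removes each visited entry, so malformed input cannot loop forever.
        for child in children.pop(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found


print(descendants(PS_OUTPUT, 4))  # → [7, 37, 70, 74]
```

Any PID returned here after the timeout has fired is a process that outlived the exec call.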

Suggested fix

When the exec timeout fires, kill the process group (or walk the process tree) rather than just the single PID. The standard approach:

  1. Start the exec'd process in its own session/process group via setsid
  2. On timeout, send SIGTERM then SIGKILL to the negative PGID (kill -TERM -- -$PGID)

This is the same pattern used by systemd (KillMode=control-group), Docker's stop command, and CI runners. We've implemented this as an application-layer workaround in our codebase, but it belongs in Modal's runtime so all users benefit.
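
For illustration, the two steps above can be sketched with Python's standard library on a local Linux machine. This is not Modal's implementation, only a sketch of the pattern: the function name run_with_group_timeout and the grace parameter are hypothetical. `start_new_session=True` makes the child call setsid(), so its PID is also the PGID of everything it spawns (unless a descendant creates its own session).

```python
import os
import signal
import subprocess
import time


def run_with_group_timeout(argv, timeout, grace=2.0):
    """Run argv in its own session; on timeout, signal the whole process group."""
    # The child becomes leader of a new session and process group (PGID == PID).
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    os.killpg(pgid, signal.SIGTERM)  # equivalent to: kill -TERM -- -$PGID

    # Give the group up to `grace` seconds to exit on its own.
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        proc.poll()  # reap the leader if it has already exited
        try:
            os.killpg(pgid, 0)  # signal 0 only checks whether the group still exists
        except ProcessLookupError:
            break
        time.sleep(0.05)

    # Force-kill whatever is left; the group may already be gone.
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return proc.wait()


if __name__ == "__main__":
    # bash spawns a background child; both are in the same group and both get signalled.
    rc = run_with_group_timeout(["bash", "-c", "sleep 300 & wait"], timeout=1)
    print(f"rc={rc}")
```

A negative return value from `Popen.wait()` means the direct process was terminated by that signal number. Signalling the group rather than the single PID is what reaches the background sleep here.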

Workaround

Until this is fixed server-side, wrap commands in setsid and track the PGID for manual cleanup:

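from uuid import uuid4

# Runs inside an async function. Assumes `sb` is a running Sandbox, `timeout` is the
# exec timeout in seconds, and `escaped_command` is the command already escaped for
# embedding inside single quotes.
cmd_id = uuid4().hex[:8]
pid_file = f"/tmp/.exec_{cmd_id}"
# setsid makes the command the leader of a new session and process group,
# so the PID written to pid_file doubles as the PGID.
wrapped = (
    f"setsid bash -lc '{escaped_command}' &\n"
    f"echo $! > {pid_file}\n"
    f"wait $!\n"
    f"exit $?"
)
proc = await sb.exec.aio("bash", "-c", wrapped, timeout=timeout)
try:
    rc = await proc.wait.aio()
except Exception:
    rc = -1
if rc == -1:
    # Timeout (the reproduction above reports rc=-1 rather than an exception):
    # kill the whole process group, SIGTERM first, then SIGKILL after a grace period
    kill_proc = await sb.exec.aio(
        "bash", "-c",
        f"PGID=$(cat {pid_file}); kill -TERM -- -$PGID; sleep 2; kill -KILL -- -$PGID",
        timeout=10,
    )
    await kill_proc.wait.aio()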
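
The snippet above leaves the construction of escaped_command open. One way to avoid hand-rolled escaping is to let `shlex.quote` produce the quoted argument. This is a sketch under that assumption, with a hypothetical helper name build_wrapped; it emits the same four-line wrapper script as above, but with the quotes supplied by shlex rather than written into the template.

```python
import shlex


def build_wrapped(command: str, pid_file: str) -> str:
    """Build the setsid wrapper script for `bash -c`, quoting the command safely."""
    # shlex.quote returns a single shell word, including any quotes it needs,
    # so the template must not add its own surrounding quotes.
    quoted_command = shlex.quote(command)
    quoted_pid_file = shlex.quote(pid_file)
    return (
        f"setsid bash -lc {quoted_command} &\n"
        f"echo $! > {quoted_pid_file}\n"
        "wait $!\n"
        "exit $?"
    )


print(build_wrapped("echo 'hello world'", "/tmp/.exec_demo"))
```

The result can be passed as the script argument in `sb.exec.aio("bash", "-c", wrapped, timeout=timeout)`, in place of the f-string template.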

Environment

  • modal==1.3.4
  • Tested on A100-40GB sandboxes
  • Linux container with dumb-init entrypoint
