Direct-mode GPU accounting and inherited Slurm/PMI environment break multi-node GPU workflows #249

@nkeilbart

Summary

There appear to be multiple issues in Torc's direct-mode behavior for GPU workflows inside multi-node Slurm allocations.

I hit a panic in JobRunner when running in a 2-node allocation with 4 GPUs per node:

thread 'main' panicked at src/client/job_runner.rs:1825:9:
assertion failed: self.resources.num_gpus >= 0

While debugging this, it also became clear that direct mode has additional limitations / correctness issues for GPU jobs in multi-node allocations.

Environment / Topology

  • Slurm allocation: 2 nodes
  • GPUs per node: 4
  • Total GPUs in allocation: 8
  • Execution mode: direct
  • Workload type: single-node GPU jobs (num_gpus: 1)
  • Also investigated behavior for true multi-node GPU jobs

What I observed

1. Multi-node GPU accounting panic

In a 2-node allocation, the runner process on the first node saw:

echo $SLURM_JOB_GPUS
0,1,2,3

echo $CUDA_VISIBLE_DEVICES
0,1,2,3

nvidia-smi showed the correct GPU hardware on the node.

From reading the code, Torc appears to:

  1. derive allocation resources from Slurm startup env
  2. then later override GPU count from visible-device env vars like CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS

In a multi-node allocation, those env vars appear to be node-local, not allocation-wide. So a single runner in a 2-node allocation can incorrectly collapse the total GPU pool from 8 to 4, which then makes multi-node GPU accounting inconsistent and can trigger the num_gpus >= 0 assertion.
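
To make the suspected fix concrete, here is a minimal sketch of how the allocation-level count could be preserved. I have not matched this to Torc's actual code: reconcile_gpu_count, its parameters, and the "only trust visible-device env on a single node" rule are all my assumptions.

```rust
/// Count entries in a comma-separated device list such as "0,1,2,3".
fn count_visible_devices(list: &str) -> u32 {
    list.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .count() as u32
}

/// Hypothetical helper (not Torc's API): pick the GPU total a runner
/// should account for.
///
/// `slurm_total` is the allocation-wide count derived at startup,
/// `visible` is the value of CUDA_VISIBLE_DEVICES / SLURM_JOB_GPUS if set.
fn reconcile_gpu_count(slurm_total: u32, num_nodes: u32, visible: Option<&str>) -> u32 {
    if num_nodes > 1 {
        // Assumption: in a multi-node allocation the env vars describe
        // only the local node, so they must not replace the total.
        return slurm_total;
    }
    match visible {
        // Single node: the visible list is a valid refinement.
        Some(list) => count_visible_devices(list),
        None => slurm_total,
    }
}
```

With the topology from this report, reconcile_gpu_count(8, 2, Some("0,1,2,3")) keeps the total at 8 instead of collapsing it to 4.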

2. Direct mode with one runner cannot use GPUs on other nodes

My understanding after debugging this is:

  • In mode: direct, if start_one_worker_per_node is not enabled, there is one runner for the whole allocation.
  • That runner executes jobs directly on its own host.
  • It does not place jobs on the other nodes in the allocation.

If that understanding is correct, then a single direct-mode runner in a multi-node allocation cannot actually use remote-node GPUs, even if the resource accounting says they exist.

This means direct mode without start_one_worker_per_node: true underutilizes multi-node GPU allocations.

3. Direct mode may oversubscribe local GPUs when total allocation GPUs > visible local GPUs

Again, if my reading is correct:

  • the scheduler/accounting can think there are 8 GPUs total in the 2-node allocation
  • but the local runner only has 4 visible GPUs on its host
  • after the first 4 local GPU assignments, Torc falls back to GPU reuse / round-robin

If so, direct mode with one runner can assign multiple jobs to the same visible GPU while still believing it is using the full allocation.
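
As an illustration of the alternative to GPU reuse, the sketch below hands out only locally visible device IDs and refuses a request once they are exhausted, so the caller can leave the job queued. LocalGpuPool and its methods are hypothetical names of mine, not anything in Torc.

```rust
/// Hypothetical pool of GPU IDs visible on the runner's own host.
struct LocalGpuPool {
    free: Vec<String>,
}

impl LocalGpuPool {
    /// Build the pool from a list like "0,1,2,3" (e.g. CUDA_VISIBLE_DEVICES).
    fn from_visible(list: &str) -> Self {
        let free = list
            .split(',')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .map(String::from)
            .collect();
        LocalGpuPool { free }
    }

    /// Take `n` device IDs, or return None if fewer than `n` are free.
    /// Nothing is handed out twice, so there is no round-robin reuse.
    fn acquire(&mut self, n: usize) -> Option<Vec<String>> {
        if n > self.free.len() {
            return None;
        }
        let at = self.free.len() - n;
        Some(self.free.split_off(at))
    }

    /// Return device IDs to the pool when a job finishes.
    fn release(&mut self, ids: Vec<String>) {
        self.free.extend(ids);
    }
}
```

With 4 visible GPUs, the fifth acquire(1) returns None until a running job releases its device. The granted IDs could then be joined with commas and set as the child's CUDA_VISIBLE_DEVICES.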

4. start_one_worker_per_node helps for single-node GPU jobs, but direct mode still leaks PMI/Slurm task env

When I enabled start_one_worker_per_node: true, I then hit errors like:

[PE_1]:inet_recv:inet_recv: recv error on nid001320 from nid001317 (fd=3) Connection reset by peer
[PE_1]:_pmi_network_barrier:_pmi_inet_recv from target 0 failed pmi errno -1
[PE_1]:control_nets_join:network_barrier failed
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

and later:

[PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=63002 err='Address already in use']
[PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
[PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
agent: ../src/mpid/common/cray/cray_pmi_utils.c:364: mpid_cray_pmi_init: Assertion `PMI2_Initialized()' failed.

The job command being run by Torc was just:

./agent ./inputs > output

I did not add srun inside the job script.

This suggests that in direct mode, child jobs inherit PMI / PMIx / Slurm task environment from the outer worker launch (for example when per-node workers are launched under srun --ntasks-per-node=1), and MPI-linked binaries may then try to initialize against the wrong launcher context.
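
A rough sketch of what scrubbing could look like with std::process::Command follows. The variable list is an assumption on my part: the PMI_ and PMIX_ prefixes come from the errors above, and the Slurm names are step/task variables that srun sets. I have not verified that removing exactly this set fixes the Cray PMI failure. is_launcher_var and scrubbed_command are hypothetical names.

```rust
use std::process::Command;

/// Assumed set of launcher-scoped variables; illustrative, not exhaustive.
fn is_launcher_var(name: &str) -> bool {
    const PREFIXES: [&str; 4] = ["PMI_", "PMIX_", "SLURM_STEP_", "SLURM_SRUN_COMM_"];
    const EXACT: [&str; 4] = ["SLURM_PROCID", "SLURM_LOCALID", "SLURM_STEPID", "SLURM_TASK_PID"];
    PREFIXES.iter().any(|p| name.starts_with(p)) || EXACT.contains(&name)
}

/// Build a Command whose environment is the parent's minus launcher vars.
/// Allocation-level variables such as SLURM_JOB_ID are kept.
fn scrubbed_command<I>(program: &str, parent_env: I) -> Command
where
    I: IntoIterator<Item = (String, String)>,
{
    let mut cmd = Command::new(program);
    // Start from an empty environment, then re-add only what is allowed.
    cmd.env_clear();
    cmd.envs(parent_env.into_iter().filter(|(k, _)| !is_launcher_var(k)));
    cmd
}
```

A runner could call scrubbed_command("./agent", std::env::vars()) before adding arguments and spawning. Filtering at spawn time leaves the runner's own environment untouched, so it can still read the Slurm variables it needs.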

Questions / suspected design limitations

Based on the current behavior, I think the following may be true:

  1. Direct mode should preserve allocation-level GPU counts in multi-node allocations instead of replacing them with node-local visible-device env counts.
  2. Direct mode child jobs should scrub inherited PMI_, PMIX_, and Slurm step/task env before spawn.
  3. Direct mode without start_one_worker_per_node is not a good fit for multi-node GPU allocations, because one runner cannot actually execute on remote nodes.
  4. Direct mode likely cannot correctly support:
    • true multi-node GPU jobs
    • multi-GPU jobs that need GPUs across more than one node

If that understanding is correct, it would be helpful to document explicitly that:

  • mode: direct + start_one_worker_per_node: true is for many single-node jobs spread across nodes
  • mode: slurm is required for true multi-node GPU jobs / MPI-style jobs / jobs that need coordinated launch across nodes

Repro context

  • 2-node Slurm allocation
  • 4 GPUs per node
  • mode: direct
  • observed SLURM_JOB_GPUS=0,1,2,3 and CUDA_VISIBLE_DEVICES=0,1,2,3 on the node where the runner started
  • nvidia-smi showed the expected hardware on the node
  • panic triggered from GPU accounting in src/client/job_runner.rs

Requested outcome

At minimum, I think Torc should:

  • avoid panicking in this scenario
  • handle multi-node direct-mode GPU accounting consistently
  • document direct-mode limitations for multi-node GPU workloads
  • scrub PMI / PMIx / Slurm step env for direct-mode child job launch
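
For the "avoid panicking" item, one possible shape is to replace the assertion with checked arithmetic and a recoverable error. This is only a sketch: Resources here is a stand-in I made up with a single field, and GpuAccountingError and reserve_gpus are hypothetical; the real struct in src/client/job_runner.rs will differ.

```rust
/// Hypothetical error returned instead of asserting.
#[derive(Debug, PartialEq)]
struct GpuAccountingError {
    requested: u32,
    available: u32,
}

/// Stand-in for the runner's resource tracker (assumed, simplified).
struct Resources {
    num_gpus: u32,
}

impl Resources {
    /// Reserve `n` GPUs, failing cleanly if the pool would go negative.
    fn reserve_gpus(&mut self, n: u32) -> Result<(), GpuAccountingError> {
        match self.num_gpus.checked_sub(n) {
            Some(remaining) => {
                self.num_gpus = remaining;
                Ok(())
            }
            // Underflow: report it and leave the count unchanged.
            None => Err(GpuAccountingError {
                requested: n,
                available: self.num_gpus,
            }),
        }
    }
}
```

The caller could log the error and skip or requeue the job, which would turn the inconsistency described above into a diagnosable message rather than a crash.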
