
Conversation

sjpb (Collaborator) commented on Oct 9, 2025

  1. For nodes with non-MIG/vGPU NVIDIA GPUs and an image built including the cuda role, GRES can now be configured by setting:

    # environments/site/inventory/group_vars/all/openhpc.yml:
    openhpc_gres_autodetect: nvml

    Note that:

    • Setting GresTypes in openhpc_config_extra is no longer required in any case.
    • For nvml autodetection (only), the conf option in gres entries in openhpc_nodegroups is no longer required.
  2. Enables nvml autodetection for the .caas environment, so Azimuth Slurm clusters only need an appropriately-built image to autoconfigure NVIDIA GPUs (see the sketch below).

For full details see stackhpc/ansible-role-openhpc#202.
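
For the .caas change in (2), the enablement amounts to setting the same variable in that environment's group vars. A minimal sketch; the exact file path within the .caas environment is an assumption here, the variable itself is as shown in (1):

    # environments/.caas/inventory/group_vars/all/openhpc.yml (path is an assumption):
    openhpc_gres_autodetect: nvml

Azimuth Slurm clusters based on .caas then only need an image built including the cuda role for NVIDIA GPUs to be autoconfigured.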

sjpb commented Oct 10, 2025

CI failed because this needs to be rebased on top of #818.

@sjpb sjpb marked this pull request as ready for review October 10, 2025 11:15
@sjpb sjpb requested a review from a team as a code owner October 10, 2025 11:15
@sjpb sjpb changed the title from "wip - bump openhpc role for testing" to "Support automatic GRES configuration for NVIDIA GPUs" Oct 10, 2025
@sjpb sjpb marked this pull request as draft October 10, 2025 13:51
sjpb commented Oct 22, 2025

OK, tested using azimuth-config @ f68771d2255656b9bac4a3fe43943e852bf1c014, with the demo environment configured as follows:

# environments/demo/inventory/group_vars/all/overrides.yml:
# to save time, we're only testing Slurm:
community_images_default: {}
harbor_enabled: false
azimuth_caas_stackhpc_slurm_appliance_enabled: true
azimuth_caas_repo2docker_enabled: false
azimuth_caas_stackhpc_workstation_enabled: false
azimuth_caas_stackhpc_rstudio_enabled: false

# use cuda image (no nvidia-fabricmanager though):
azimuth_caas_stackhpc_slurm_appliance_image: d704b6e1-7cbd-4f79-b043-c53a342fb9a3 # openhpc-250917-1009-b0fc55a4 with cuda
azimuth_caas_stackhpc_slurm_appliance_git_version: feat/auto-gres #feat/caas-compute-vnic-types

With this:

[azimuth@slurm-v4-login-0 ~]$ scontrol show node
NodeName=slurm-v4-compute-standard-0 Arch=x86_64 CoresPerSocket=64 
   CPUAlloc=0 CPUEfctv=256 CPUTot=256 CPULoad=0.05
   AvailableFeatures=nodegroup_standard
   ActiveFeatures=nodegroup_standard
   Gres=gpu:H200:8(S:0-1)

and:

  • Starting a session using srun --pty bash -i shows no GPUs with nvidia-smi
  • Starting a session using srun --pty --gres gpu:H200:1 bash -i shows one GPU with nvidia-smi, and similarly two GPUs with --gres gpu:H200:2 (see the sketch below).
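
For reference, a sketch of the interactive checks described above; the compute-node prompt is illustrative and the nvidia-smi output is omitted:

[azimuth@slurm-v4-login-0 ~]$ srun --pty bash -i                    # no GRES requested
[compute]$ nvidia-smi                                               # reports no GPUs
[compute]$ exit
[azimuth@slurm-v4-login-0 ~]$ srun --pty --gres gpu:H200:1 bash -i  # request 1 GPU
[compute]$ nvidia-smi                                               # reports 1 GPU (2 with --gres gpu:H200:2)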

@sjpb sjpb marked this pull request as ready for review October 22, 2025 11:36