
Conversation

sjpb (Collaborator) commented on Oct 9, 2025

  1. For nodes with non-MIG/vGPU NVIDIA GPUs and an image built including the cuda role, GRES can now be configured by setting:

    # environments/site/inventory/group_vars/all/openhpc.yml:
    openhpc_gres_autodetect: nvml

    Note that:

    • Setting GresTypes in openhpc_config_extra is no longer required in any case.
    • For nvml autodetection (only), the conf option in gres entries in openhpc_nodegroups is no longer required.
  2. Enables nvml autodetection for the .caas environment, so Azimuth Slurm clusters only need an appropriately-built image to autoconfigure NVIDIA GPUs (see the sketch below).

For full details see stackhpc/ansible-role-openhpc#202.
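
For the .caas change in (2), the enablement amounts to setting the same variable in that environment's group vars. A minimal sketch; the exact file path within the .caas environment is an assumption here, the variable itself is as shown in (1):

    # environments/.caas/inventory/group_vars/all/openhpc.yml (path is an assumption):
    openhpc_gres_autodetect: nvml

Azimuth Slurm clusters based on .caas then only need an image built including the cuda role for NVIDIA GPUs to be autoconfigured.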

sjpb commented Oct 10, 2025

CI failed because this needs to be rebased on top of #818.

@sjpb sjpb marked this pull request as ready for review October 10, 2025 11:15
@sjpb sjpb requested a review from a team as a code owner October 10, 2025 11:15
@sjpb sjpb changed the title from "wip - bump openhpc role for testing" to "Support automatic GRES configuration for NVIDIA GPUs" Oct 10, 2025
@sjpb sjpb marked this pull request as draft October 10, 2025 13:51
sjpb commented Oct 22, 2025

OK, tested using azimuth-config @ f68771d2255656b9bac4a3fe43943e852bf1c014, with the demo environment configured as follows:

# environments/demo/inventory/group_vars/all/overrides.yml:
# to save time, we're only testing Slurm:
community_images_default: {}
harbor_enabled: false
azimuth_caas_stackhpc_slurm_appliance_enabled: true
azimuth_caas_repo2docker_enabled: false
azimuth_caas_stackhpc_workstation_enabled: false
azimuth_caas_stackhpc_rstudio_enabled: false

# use cuda image (no nvidia-fabricmanager though):
azimuth_caas_stackhpc_slurm_appliance_image: d704b6e1-7cbd-4f79-b043-c53a342fb9a3 # openhpc-250917-1009-b0fc55a4 with cuda
azimuth_caas_stackhpc_slurm_appliance_git_version: feat/auto-gres #feat/caas-compute-vnic-types

With this:

[azimuth@slurm-v4-login-0 ~]$ scontrol show node
NodeName=slurm-v4-compute-standard-0 Arch=x86_64 CoresPerSocket=64 
   CPUAlloc=0 CPUEfctv=256 CPUTot=256 CPULoad=0.05
   AvailableFeatures=nodegroup_standard
   ActiveFeatures=nodegroup_standard
   Gres=gpu:H200:8(S:0-1)

and:

  • Starting a session using srun --pty bash -i shows no GPUs with nvidia-smi
  • Starting a session using srun --pty --gres gpu:H200:1 bash -i shows one GPU with nvidia-smi, and similarly two GPUs with --gres gpu:H200:2 (see the sketch below).
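
For reference, a sketch of the interactive checks described above; the compute-node prompt is illustrative and the nvidia-smi output is omitted:

[azimuth@slurm-v4-login-0 ~]$ srun --pty bash -i                    # no GRES requested
[compute]$ nvidia-smi                                               # reports no GPUs
[compute]$ exit
[azimuth@slurm-v4-login-0 ~]$ srun --pty --gres gpu:H200:1 bash -i  # request 1 GPU
[compute]$ nvidia-smi                                               # reports 1 GPU (2 with --gres gpu:H200:2)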

@sjpb sjpb marked this pull request as ready for review October 22, 2025 11:36