Contributor

@claudia-lola commented Oct 22, 2025

Modifies ansible/adhoc/cudatests.yml to run the NVIDIA nvbandwidth test. This replaces the older bandwidthTest CUDA Samples utility removed in #687.

@claudia-lola requested a review from a team as a code owner October 22, 2025 14:59
@claudia-lola self-assigned this Oct 22, 2025
@sjpb changed the title from "Adds bandwidth.yml playbook to download, build, and run nvbandwidth." to "Adds bandwidth.yml playbook for NVIDIA nvbandwidth" Oct 22, 2025
gather_facts: true
tags: cuda_samples
tasks:
- ansible.builtin.import_role:
Collaborator

Given we don't even run deviceQuery, I think we should just remove this task entirely, TBH. But leave the role in for now while we think about it more!

cuda_persistenced_state: started
# variables for nvbandwidth (for bandwidth.yml tasks run in cudatests.yml)
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v0.8.tar.gz"
Collaborator

I'd rather break the version out into its own variable here and then use that var in the creates: on the "Download CUDA bandwidth test release" task.
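
A minimal sketch of that split. The variable name `cuda_bandwidth_version` is hypothetical (the PR only defines the full URL), and the use of `ansible.builtin.unarchive` for the download task is an assumption, since the task body is not shown in this thread:

```yaml
---
# Role defaults: version broken out so the URL and paths stay in sync.
cuda_bandwidth_version: "0.8"  # hypothetical variable name
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v{{ cuda_bandwidth_version }}.tar.gz"
---
# Tasks: the creates: path now follows the version variable.
- name: Download CUDA bandwidth test release
  ansible.builtin.unarchive:  # assumed module; the real task may differ
    src: "{{ cuda_bandwidth_release_url }}"
    dest: "{{ cuda_bandwidth_path }}"
    remote_src: true
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    creates: "{{ cuda_bandwidth_path }}/nvbandwidth-{{ cuda_bandwidth_version }}"
```

The same variable could then replace the hardcoded `nvbandwidth-0.8` in the `chdir:` and `creates:` paths of the build and run tasks.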


- name: Build CUDA bandwidth test
ansible.builtin.shell:
cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
Collaborator

You can do this more readably using one of the many multiline yaml options, e.g.:

Suggested change
cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
cmd: >-
source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
module load Boost/1.82.0-GCC-12.3.0 &&
. /etc/profile.d/sh.local &&
cmake .. &&
make -j {{ ansible_processor_vcpus }}

chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build"
creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/nvbandwidth"

- name: Run CUDA bandwidth test
Collaborator

I think this needs changed_when: true to subdue the check error
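
For illustration, a sketch of the task with that keyword added. The `./nvbandwidth` command line is a placeholder, since the real task body is not visible in this hunk:

```yaml
- name: Run CUDA bandwidth test
  ansible.builtin.shell:
    cmd: ./nvbandwidth  # placeholder; the actual command in the PR is longer
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
  register: cuda_bandwidth_output
  # The test runs on every invocation, so always report the task as changed.
  changed_when: true
```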

@@ -1,3 +1,3 @@
---
- hosts: cuda
become: true
Collaborator

Do we actually need become: true now? See below too re. path.

group: "{{ ansible_user }}"
creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8"

- name: Creates CUDA bandwidth test build directory
Collaborator

Can you make the name: consistent with the name: on the first task please?

chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
register: cuda_bandwidth_output

- name: Save CUDA bandwidth output to bandwidth_results.txt
Collaborator

Is there no useful summary we can do here?
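
One possible shape for a summary, as a sketch only. It assumes nvbandwidth prints one line per test beginning with `SUM ` followed by the test name and a total; that output format is an assumption here and should be checked against a real run before relying on it:

```yaml
- name: Summarise CUDA bandwidth results
  ansible.builtin.debug:
    # Keep only the per-test total lines (assumed to start with "SUM ").
    msg: "{{ cuda_bandwidth_output.stdout_lines | select('match', '^SUM ') | list }}"
```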

- name: Save CUDA bandwidth output to bandwidth_results.txt
ansible.builtin.copy:
content: "{{ cuda_bandwidth_output.stdout }}"
dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results.txt"
Collaborator

When the cuda group contains multiple nodes, they will all write to the same file.
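
A sketch of one way to avoid the collision by putting the host name in the file name. The `delegate_to: localhost` line is an assumption (that the results are wanted on the deploy host, and that the `cudatests` directory already exists there); it may or may not match the PR:

```yaml
- name: Save CUDA bandwidth output to a per-host results file
  ansible.builtin.copy:
    content: "{{ cuda_bandwidth_output.stdout }}"
    # inventory_hostname makes the destination unique for each node.
    dest: "{{ appliances_environment_root }}/cudatests/{{ inventory_hostname }}_bandwidth_results.txt"
  delegate_to: localhost  # assumption: write on the deploy host
```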

register: cuda_bandwidth_output

- name: Save CUDA bandwidth output to bandwidth_results.txt
ansible.builtin.copy:
Collaborator

Given this is fetching a file, why does this not use ansible.builtin.fetch?
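
A sketch of the fetch-based alternative. It assumes the run task has been changed to write its output to a file on the GPU node; the remote file name `bandwidth_results.txt` inside the build directory is hypothetical:

```yaml
- name: Fetch CUDA bandwidth results
  ansible.builtin.fetch:
    # Hypothetical remote path; assumes the run task redirects output here.
    src: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/bandwidth_results.txt"
    dest: "{{ appliances_environment_root }}/cudatests/"
```

With the default `flat: false`, fetch stores each file under a directory named after the host inside `dest`, so results from multiple nodes would not overwrite each other.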

creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/nvbandwidth"

- name: Run CUDA bandwidth test
ansible.builtin.shell: |
Collaborator

So:

  1. Rather than using export you can use the environment keyword - docs
  2. Why do we have to mess with LD_LIBRARY_PATH?
  3. If we really do, this approach won't work b/c it is e.g. hardcoding the microarch (zen4), which will definitely break (e.g. when using an Intel processor), and versions, which doesn't seem robust.

Is it not sufficient to just activate EESSI again? And maybe load some EESSI modules?
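
A rough sketch of what points 1 and the last question could look like together. It reuses the EESSI init and `module load` lines from the build task, on the assumption (untested) that the same modules are enough at run time. `EXAMPLE_VAR` is a placeholder showing the `environment` keyword, not a variable the test needs:

```yaml
- name: Run CUDA bandwidth test
  ansible.builtin.shell:
    cmd: >-
      source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
      module load Boost/1.82.0-GCC-12.3.0 &&
      ./nvbandwidth
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
    executable: /bin/bash  # "source" is a bash builtin
  environment:
    EXAMPLE_VAR: "value"  # placeholder; replaces any "export ..." lines
  register: cuda_bandwidth_output
  changed_when: true
```

Activating EESSI this way avoids hardcoding a microarchitecture or library versions in `LD_LIBRARY_PATH`.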
