-
Notifications
You must be signed in to change notification settings - Fork 37
Adds bandwidth.yml playbook for NVIDIA nvbandwidth #834
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bandwidth.yml is ran via cudatests.yml
454756e to
cc61ed3
Compare
| gather_facts: true | ||
| tags: cuda_samples | ||
| tasks: | ||
| - ansible.builtin.import_role: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we don't even run devicequery, I think we should just remove this task entirely TBH. But leave the role pending thinking more!
ansible/roles/cuda/defaults/main.yml
Outdated
| cuda_persistenced_state: started | ||
| # variables for nvbandwidth (for bandwidth.yml tasks run in cudatests.yml) | ||
| cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth" | ||
| cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v0.8.tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather break the version out here and then use that var in the creates: on the "Download CUDA bandwith test release" task.
|
|
||
| - name: Build CUDA bandwidth test | ||
| ansible.builtin.shell: | ||
| cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can do this more readably using one of the many multiline yaml options, e.g.:
| cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }} | |
| cmd: >- | |
| source /cvmfs/software.eessi.io/versions/2023.06/init/bash && | |
| module load Boost/1.82.0-GCC-12.3.0 && | |
| . /etc/profile.d/sh.local&& | |
| cmake ..&& | |
| make -j {{ ansible_processor_vcpus }} |
| chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build" | ||
| creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/nvbandwidth" | ||
|
|
||
| - name: Run CUDA bandwidth test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs changed_when: true to subdue the check error
| @@ -1,3 +1,3 @@ | |||
| --- | |||
| - hosts: cuda | |||
| become: true | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need become: true now? See below too re. path.
| group: "{{ ansible_user }}" | ||
| creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8" | ||
|
|
||
| - name: Creates CUDA bandwidth test build directory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make the name: consistent with the name: on the first task please?
| chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/" | ||
| register: cuda_bandwidth_output | ||
|
|
||
| - name: Save CUDA bandwidth output to bandwidth_results.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no useful summary we can do here?
| - name: Save CUDA bandwidth output to bandwidth_results.txt | ||
| ansible.builtin.copy: | ||
| content: "{{ cuda_bandwidth_output.stdout }}" | ||
| dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When cuda group contains multiple nodes they will all write to the same file.
| register: cuda_bandwidth_output | ||
|
|
||
| - name: Save CUDA bandwidth output to bandwidth_results.txt | ||
| ansible.builtin.copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is fetching a file, why does this not use ansible.builtin.fetch?
| creates: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/nvbandwidth" | ||
|
|
||
| - name: Run CUDA bandwidth test | ||
| ansible.builtin.shell: | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So:
- Rather than using
exportyou can use theenvironmentkeyword - docs - Why do we have to mess with LD_LIBRARY_PATH?
- If we really do, this approach won't work b/c it is e.g. hardcoding the microarch (zen4), which will definitely break (e.g. when using an Intel processor), and versions, which doesn't seem robust.
Is it not sufficent to just activate eessi again? And maybe load some eeesi modules?
Modifies
ansible/adhoc/cudatests.ymlto run the NVIDIA nvbandwidth test. This replaces the olderbandwidthTestCUDA Samples utility removed in #687.