Add SLES support for AMD gpu-operator #365
…opriate AMD GPU driver versions
* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
  - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/)
  - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/)
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <[email protected]>
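The mapping described in this commit message can be sketched roughly as follows. This is a minimal, hypothetical illustration and not the PR's actual code: the names `slesName` and `defaultDriverVersion`, the regular expression, and the handling of an OS image string without a service pack are all assumptions. Only the example input/output pair and the driver versions come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
)

// slesVersionRe captures the major version and the optional service pack
// from a string such as "SUSE Linux Enterprise Server 15 SP6".
// The exact pattern is an assumption made for this sketch.
var slesVersionRe = regexp.MustCompile(`(?i)SUSE Linux Enterprise Server (\d+)(?:[ -]SP(\d+))?`)

// slesName is a hypothetical stand-in for slesCMNameMapper: it turns an
// OS image string into a name like "sles-15.6".
func slesName(osImage string) (string, error) {
	m := slesVersionRe.FindStringSubmatch(osImage)
	if m == nil {
		return "", fmt.Errorf("not a SLES OS image string: %q", osImage)
	}
	if m[2] == "" {
		// Assumption: without a service pack, only the major version is used.
		return "sles-" + m[1], nil
	}
	return "sles-" + m[1] + "." + m[2], nil
}

// slesDefaultDrivers holds the defaults listed in the commit message.
var slesDefaultDrivers = map[string]string{
	"sles-15.5": "6.2.2",
	"sles-15.6": "7.0.2",
	"sles-15.7": "7.0.2",
}

// defaultDriverVersion is a hypothetical stand-in for
// SLESDefaultDriverVersionsMapper; ok is false for unlisted releases.
func defaultDriverVersion(name string) (version string, ok bool) {
	version, ok = slesDefaultDrivers[name]
	return version, ok
}

func main() {
	name, err := slesName("SUSE Linux Enterprise Server 15 SP6")
	if err != nil {
		panic(err)
	}
	version, _ := defaultDriverVersion(name)
	fmt.Println(name, version)
}
```

Keeping the name mapping and the driver lookup as two separate steps means a new service pack only needs a new map entry, assuming the OS image string keeps the same shape.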
… AMD GPU drivers on SLES
* also embed the template via go:embed and add SLES case logic

Co-authored-by: alex-isv <[email protected]>
Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7. Please also note that I haven’t yet tested these changes on a SLES 15 host with an AMD GPU. That is in the works!
Hi @Priyankasaggu11929, thanks for raising the PR; we will review it. Please also let us know once you have done some verification on a cluster with real AMD GPU hardware. Thanks!
Yes, I'll keep posting updates. Thank you!
example/deviceconfig_example.yaml
Outdated
# IMPORTANT for SLES: Base images must come from registry.suse.com
# Uncomment and set for SLES 15 SP5/SP6 deployments:
#imageBuild:
# baseImageRegistry: "registry.suse.com"
One minor suggestion: since the controller can parse the OS image and detect that the workers are SLES-based, you can let the controller set the baseImageRegistry for the detected SLES-based worker nodes.
PTAL at the function resolveDockerfile.
Addressed in 4da60d3.
The default is now "registry.suse.com" when OS == "sles", but a user-defined spec.driver.imageBuild.baseImageRegistry (for example "custom-image-registry") still takes precedence. I added some minor tests to verify the behavior.
With the above, I dropped the docs changes in example/deviceconfig_example.yaml.
Please review again. Thank you!
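The precedence rule described in this reply can be sketched as a small helper. This is a hypothetical illustration, not the code from 4da60d3: in the PR the logic lives in `resolveDockerfile`, whereas the function name `baseImageRegistry`, its parameters, and the `fallback` argument for non-SLES nodes (with "docker.io" as a placeholder value) are assumptions made here.

```go
package main

import "fmt"

// slesRegistry is the default named in the PR for SLES nodes.
const slesRegistry = "registry.suse.com"

// baseImageRegistry is a hypothetical helper showing the precedence order:
//  1. a user-specified registry always wins,
//  2. otherwise SLES nodes default to registry.suse.com,
//  3. otherwise the caller-supplied fallback is used (an assumption of
//     this sketch; the operator's real default is not shown in the PR text).
func baseImageRegistry(osName, userRegistry, fallback string) string {
	if userRegistry != "" {
		return userRegistry
	}
	if osName == "sles" {
		return slesRegistry
	}
	return fallback
}

func main() {
	// "docker.io" is only a placeholder fallback for this example.
	fmt.Println(baseImageRegistry("sles", "", "docker.io"))
	fmt.Println(baseImageRegistry("sles", "custom-image-registry", "docker.io"))
	fmt.Println(baseImageRegistry("ubuntu", "", "docker.io"))
}
```

Checking the user-specified value first keeps the SLES default from overriding an explicit setting in the device config.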
yansun1996
left a comment
One minor suggestion; the rest of the PR looks good.
Let us know when you have finished the verification with hardware.
…sles"
* although, user-specified `BaseImageRegistry` still takes precedence
* also extend tests in `internal/kmmodule/kmmodule_test.go` to test the above changes in the `resolveDockerfile` func
d46ce29 to 4da60d3
yansun1996
left a comment
Thanks @Priyankasaggu11929, good job. Please open the same PR against the staging branch as well; we manage PRs in this order: staging ---> main ---> release-vx.x.x
Once you have confirmed that verification on the AMD GPU setup is done, we can discuss further details of a release plan with SLES support with the product team.
Created PR for staging branch - #371
Thank you so much! Regarding "the verification on AMD GPU setup": I'm still in discussion about getting the required lab infra access, so there are no updates on this as of now, but I will post updates as soon as I am able to run some tests.
Motivation
This PR aims to add support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.
Technical Details
* 781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions
  - add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
  - add `SLESDefaultDriverVersionsMapper` to select driver versions
* 0170a9a - add SLES Dockerfile template (`DockerfileTemplate.sles`) for building AMD GPU drivers on SLES (currently, I've skipped adding the GIM Dockerfile template for SLES; will tackle it once this goes through).
* c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped
* 4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"
  - user-specified `BaseImageRegistry` still takes precedence
  - extend tests in `internal/kmmodule/kmmodule_test.go` to test above changes in `resolveDockerfile` func
Test Plan
Test Result
* truncated output of `make unit-test` after newly added tests in b625441
* output from tests added as part of 4da60d3
Submission Checklist