Priyankasaggu11929 commented Oct 23, 2025

Motivation

This PR adds support for SUSE Linux Enterprise Server (SLES) 15 SP5+ to the AMD GPU operator.

Technical Details

  • 781c5b5 - add support for detecting SLES nodes and automatically selecting appropriate AMD GPU driver versions

  • 0170a9a - add SLES Dockerfile template (DockerfileTemplate.sles) for building AMD GPU drivers on SLES. For now I've skipped the GIM Dockerfile template for SLES and will tackle it once this goes through.

    • also embed the template via go:embed and add SLES case logic
  • c2dce44 - docs: update example/deviceconfig_example.yaml <- dropped

  • 4da60d3 - use "registry.suse.com" as the default base image registry if OS == "sles"

    • a user-specified BaseImageRegistry still takes precedence
    • also extend tests in internal/kmmmodule/kmmmodule_test.go to cover the above changes in the resolveDockerfile func

Test Plan

  • b625441 - tests: update internal/utils_test.go for added support for SLES 15 SP*

Test Result

  • truncated output of make unit-test, including the tests newly added in b625441

    > make unit-test
    ...
    ...
    === RUN   TestSLESDefaultDriverVersionsMapper
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP6
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP7
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP5
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_SP4
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_base
    === RUN   TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format
    --- PASS: TestSLESDefaultDriverVersionsMapper (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP6 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP7 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP5 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_SP4 (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_base (0.00s)
        --- PASS: TestSLESDefaultDriverVersionsMapper/SLES_15_with_dash_format (0.00s)
    PASS
    coverage: 48.6% of statements
    ok  	github.com/ROCm/gpu-operator/internal	0.019s	coverage: 48.6% of statements
    === RUN   TestAPIs
    Running Suite: Controller Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/controllers
    ==========================================================================================================================
    Random Seed: 1761223798
    
    Will run 15 of 15 specs
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 7.9% of statements
    ok  	github.com/ROCm/gpu-operator/internal/controllers	(cached)	coverage: 7.9% of statements
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761223798
    
    Will run 5 of 5 specs
    testing multiple valid homogeneous nodes
    testing multiple valid heterogeneous nodes
    testing multiple valid heterogeneous nodes + one unsupported node
    testing multiple unsupported nodes
    testing empty node list
    •<moduleName>
    <amdgpu>
    •<moduleName>
    <amdgpu>
    •••
    
    Ran 5 of 5 Specs in 0.005 seconds
    SUCCESS! -- 5 Passed | 0 Failed | 0 Pending | 0 Skipped
    --- PASS: TestAPIs (0.01s)
    PASS
    coverage: 32.3% of statements
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	(cached)	coverage: 32.3% of statements
    
    •••••••••••••••
    
    Ran 15 of 15 Specs in 0.008 seconds
    SUCCESS! -- 15 Passed | 0 Failed | 0 Pending | 0 Skipped
    

  • output from tests added as part of 4da60d3

    ❯ go test ./internal/kmmmodule/... -v -ginkgo.focus="resolveDockerfile" -ginkgo.v
    === RUN   TestAPIs
    Running Suite: KMMModule Suite - /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule
    =======================================================================================================================
    Random Seed: 1761548380
    
    Will run 3 of 8 specs
    SSSS
    ------------------------------
    resolveDockerfile should use correct default registry when not specified by user
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:683
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should respect user-specified BaseImageRegistry for all OS types
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:702
    • [0.000 seconds]
    ------------------------------
    resolveDockerfile should return error for unsupported OS
    /home/psaggu/work-suse/amd-gpu-operator-work/oct-23-pr/gpu-operator/internal/kmmmodule/kmmmodule_test.go:727
    • [0.000 seconds]
    ------------------------------
    S
    
    Ran 3 of 8 Specs in 0.000 seconds
    SUCCESS! -- 3 Passed | 0 Failed | 0 Pending | 5 Skipped
    --- PASS: TestAPIs (0.00s)
    PASS
    ok  	github.com/ROCm/gpu-operator/internal/kmmmodule	0.022s
    

Submission Checklist

Priyankasaggu11929 and others added 2 commits October 23, 2025 18:30
…opriate AMD GPU driver versions

* add new `slesCMNameMapper` to parse SLES version strings like 'SUSE Linux Enterprise Server 15 SP6' to 'sles-15.6'
* add `SLESDefaultDriverVersionsMapper` to select driver versions
  - SLES 15 SP6/SP7 -> driver 7.0.2 (ref: https://repo.radeon.com/amdgpu-install/7.0.2/sle/)
  - SLES 15 SP5 -> driver 6.2.2 (ref: https://repo.radeon.com/amdgpu-install/6.2.2/sle/)
* register both 'sles' and 'suse' identifiers in mappers

Co-authored-by: alex-isv <[email protected]>
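
The mapping described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `slesKey` and `defaultDriverVersion` are hypothetical, the dash form ("15-SP6") and the key used for a release without a service pack ("sles-15") are assumptions, and versions not listed in the commit message are simply reported as having no default.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// slesVersionRe matches the product name followed by "15", "15 SP6" or "15-SP6".
// Accepting the dash form is an assumption made for this sketch.
var slesVersionRe = regexp.MustCompile(
	`(?i)suse linux enterprise server\s+(\d+)(?:(?:\s+|-)sp(\d+))?`)

// slesKey (hypothetical name) turns an OS image string such as
// "SUSE Linux Enterprise Server 15 SP6" into a key such as "sles-15.6".
// A release without a service pack becomes "sles-<major>" (an assumption).
func slesKey(osImage string) (string, error) {
	m := slesVersionRe.FindStringSubmatch(strings.TrimSpace(osImage))
	if m == nil {
		return "", fmt.Errorf("not a SLES image string: %q", osImage)
	}
	if m[2] == "" {
		return "sles-" + m[1], nil
	}
	return fmt.Sprintf("sles-%s.%s", m[1], m[2]), nil
}

// defaultDriverVersions holds only the pairs stated in the commit message.
var defaultDriverVersions = map[string]string{
	"sles-15.5": "6.2.2",
	"sles-15.6": "7.0.2",
	"sles-15.7": "7.0.2",
}

// defaultDriverVersion (hypothetical name) returns the default driver version
// for an OS image string, or false when no default is known in this sketch.
func defaultDriverVersion(osImage string) (string, bool) {
	key, err := slesKey(osImage)
	if err != nil {
		return "", false
	}
	v, ok := defaultDriverVersions[key]
	return v, ok
}

func main() {
	images := []string{
		"SUSE Linux Enterprise Server 15 SP6",
		"SUSE Linux Enterprise Server 15-SP7",
		"SUSE Linux Enterprise Server 15 SP5",
		"SUSE Linux Enterprise Server 15",
	}
	for _, img := range images {
		key, _ := slesKey(img)
		v, ok := defaultDriverVersion(img)
		fmt.Printf("%q -> key=%s driver=%q known=%v\n", img, key, v, ok)
	}
}
```

Keeping the string parsing separate from the version table means a new service pack only needs a new map entry.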
… AMD GPU drivers on SLES

* also embed the template via go:embed and add SLES case logic

Co-authored-by: alex-isv <[email protected]>
@Priyankasaggu11929
Author

Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7.
Please review and let me know if/where any changes are needed.

Also please note - I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in the works!

@yansun1996
Member

yansun1996 commented Oct 23, 2025

> Hello @yansun1996, I’ve opened this PR to get early feedback on the approach for adding support for SLES 15 SP6/SP7. Please review and let me know if/where any changes are needed.
>
> Also please note - I haven’t tested these changes yet on a SLES 15 host with an AMD GPU. That is in the works!

Hi @Priyankasaggu11929, thanks for raising the PR. We will review it.

Please also let us know once you have done some verification on a real AMD GPU hardware based cluster. Thanks!

@Priyankasaggu11929
Author

> Hi @Priyankasaggu11929, thanks for raising the PR. We will review it.
> Please also let us know once you have done some verification on a real AMD GPU hardware based cluster. Thanks!

Yes, I'll keep posting updates. Thank you!

# IMPORTANT for SLES: Base images must come from registry.suse.com
# Uncomment and set for SLES 15 SP5/SP6 deployments:
#imageBuild:
# baseImageRegistry: "registry.suse.com"
Member

One minor suggestion:

Since the controller can parse the OS image and detect that the workers are SLES based, you can let the controller set the baseImageRegistry for the detected SLES worker nodes.

PTAL at the resolveDockerfile function.

Author

addressed in 4da60d3

The default is now "registry.suse.com" when OS == "sles", but a user-defined value still takes precedence, e.g. spec.driver.imageBuild.baseImageRegistry = "custom-image-registry". I added some minor tests to verify the behavior.

With the above, I dropped the docs changes in example/deviceconfig_example.yaml.

Please review again. Thank you!
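
The precedence described above could look roughly like the following sketch. It is not the actual `resolveDockerfile` code: the helper name `resolveBaseImageRegistry` is hypothetical, and the "docker.io" fallback for other operating systems is an assumption made only so the example is complete.

```go
package main

import "fmt"

const (
	// suseRegistry is the SLES default named in this PR.
	suseRegistry = "registry.suse.com"
	// fallbackRegistry is an assumed default for non-SLES nodes.
	fallbackRegistry = "docker.io"
)

// resolveBaseImageRegistry (hypothetical helper) picks the registry for the
// driver build base image. A non-empty user value always wins; otherwise the
// detected OS decides the default.
func resolveBaseImageRegistry(osName, userRegistry string) string {
	if userRegistry != "" {
		return userRegistry
	}
	if osName == "sles" {
		return suseRegistry
	}
	return fallbackRegistry
}

func main() {
	// No user override on a SLES node: the SUSE registry is used.
	fmt.Println(resolveBaseImageRegistry("sles", ""))
	// A user override takes precedence, even on SLES.
	fmt.Println(resolveBaseImageRegistry("sles", "custom-image-registry"))
	// Other operating systems keep the (assumed) fallback.
	fmt.Println(resolveBaseImageRegistry("ubuntu", ""))
}
```

Checking the user value first keeps the override behavior identical for every OS, which is what the "all OS types" test case exercises.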

Member

@yansun1996 yansun1996 left a comment

One minor suggestion; the rest of the PR looks good.
Let us know when you have finished the verification with hardware.

…sles"

* although, user-specified `BaseImageRegistry` still takes precedence

* also extend tests in `internal/kmmmodule/kmmmodule_test.go` to test above changes in `resolveDockerfile` func
Member

@yansun1996 yansun1996 left a comment

Thanks @Priyankasaggu11929, good job. Please open the same PR against the staging branch as well; we manage PRs in this order: staging ---> main ---> release-vx.x.x

Once you have confirmed that the verification on an AMD GPU setup is done, we can discuss further details of a release plan with SLES support with the product team.

@Priyankasaggu11929
Author

> Thanks @Priyankasaggu11929, good job. Please open the same PR against the staging branch as well; we manage PRs in this order: staging ---> main ---> release-vx.x.x

Created PR for staging branch - #371

> Once you have confirmed that the verification on an AMD GPU setup is done, we can discuss further details of a release plan with SLES support with the product team.

Thank you so much!

Regarding "the verification on AMD GPU setup": I'm still in discussions to get the required lab infrastructure access, so there are no updates yet. I will post updates as soon as I am able to run some tests.
