OCPKUEUE-571: Handle version skew between DRA and Kueue on OCP #1578

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:main from PannagaRao:add-dra-version-skew
Mar 18, 2026

Conversation

@PannagaRao
Contributor

@PannagaRao PannagaRao commented Mar 12, 2026

Signed-off-by: Pannaga Rao Bhoja Ramamanohara

Signed-off-by: Pannaga Rao Bhoja Ramamanohara
@openshift-ci

openshift-ci Bot commented Mar 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: PannagaRao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2026
@coderabbitai

coderabbitai Bot commented Mar 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8ca7256c-9855-433e-a61d-7613b08675f6

📥 Commits

Reviewing files that changed from the base of the PR and between a1140f1 and da49b58.

📒 Files selected for processing (3)
  • pkg/configmap/configmap.go
  • pkg/configmap/configmap_test.go
  • pkg/operator/target_config_reconciler.go

Walkthrough

Adds runtime detection of Dynamic Resource Allocation (DRA) support to TargetConfigReconciler, stores it in a new draSupported field, conditions configmap generation on that flag, and updates BuildConfigMap and tests to accept the new boolean parameter. Missing DRA API triggers warnings/events and omits DRA config.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Operator reconciler: pkg/operator/target_config_reconciler.go | Adds draSupported bool to TargetConfigReconciler. Detects resource.k8s.io/v1 DeviceClass API during sync, sets draSupported, emits warning/event and appends a DRA-related degraded status when unavailable. Passes flag into configmap build. |
| ConfigMap generation: pkg/configmap/configmap.go | Updates signatures: BuildConfigMap(..., draSupported bool), buildFeatureGates(..., draSupported bool), and defaultKueueConfigurationTemplate(..., draSupported bool). Feature gates enable DynamicResourceAllocation only when draSupported is true and deviceClass mappings exist. |
| Tests: pkg/configmap/configmap_test.go | Adds draSupported field to test cases and updates calls to BuildConfigMap(..., draSupported). Adds test scenarios for DRA enabled vs unsupported clusters and adjusts expected YAML accordingly. |

Sequence Diagram(s)

sequenceDiagram
    participant TargetConfigReconciler
    participant K8sDiscovery as K8s API Discovery
    participant ConfigProcessor as Config Processor

    TargetConfigReconciler->>TargetConfigReconciler: Start reconciliation sync
    TargetConfigReconciler->>K8sDiscovery: Check for DeviceClass API (resource.k8s.io/v1)

    alt DRA API Available
        K8sDiscovery-->>TargetConfigReconciler: API found
        TargetConfigReconciler->>TargetConfigReconciler: Set draSupported = true
    else DRA API Not Available
        K8sDiscovery-->>TargetConfigReconciler: API not found
        TargetConfigReconciler->>TargetConfigReconciler: Set draSupported = false
        TargetConfigReconciler->>TargetConfigReconciler: Emit warning & DRAUnsupported event
    end

    TargetConfigReconciler->>ConfigProcessor: Create derived kueueCfg

    alt DRA Not Supported
        ConfigProcessor->>ConfigProcessor: Strip deviceClass resources from config
    end

    TargetConfigReconciler->>ConfigProcessor: Apply (possibly stripped) config to ConfigMap

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Structure And Quality | ⚠️ Warning | Pull request introduces 2352 lines of new reconciler code without corresponding unit tests and existing tests lack meaningful assertion messages and explicit timeouts. | Add comprehensive unit tests for TargetConfigReconciler with proper setup/cleanup, explicit timeouts, meaningful failure messages, and single responsibility per test. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only production code files to implement DRA capability detection. No Ginkgo test files or test definitions were added or modified, making the test name stability check not applicable. |
| Title check | ✅ Passed | The title 'OCPKUEUE-571: Handle version skew between DRA and Kueue on OCP' clearly and specifically describes the main change: handling version skew between DRA (Dynamic Resource Allocation) and Kueue on OpenShift, which aligns with the changeset that adds DRA capability detection and conditional config stripping based on DRA support status. |

@PannagaRao PannagaRao mentioned this pull request Mar 12, 2026
@PannagaRao PannagaRao changed the title Handle version skew between DRA and Kueue on OCP OCPKUEUE-571: Handle version skew between DRA and Kueue on OCP Mar 12, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 12, 2026
@openshift-ci-robot

openshift-ci-robot commented Mar 12, 2026

@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue.

Details

In response to this:

Signed-off-by: Pannaga Rao Bhoja Ramamanohara

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Mar 12, 2026

@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue.

Details

In response to this:

Signed-off-by: Pannaga Rao Bhoja Ramamanohara

Summary by CodeRabbit

Release Notes

  • New Features
  • System now detects Dynamic Resource Allocation (DRA) support and automatically adjusts device class configurations when DRA APIs are unavailable.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

// Copy the config so we can strip unsupported features without mutating the cached object.
kueueCfg := kueue.Spec.Config
if !c.draSupported {
	kueueCfg.Resources = kueuev1.Resources{}
Contributor

I think we should consider turning off the feature gate for DRA for Kueue in the config map.

Contributor Author

I was thinking that when we zero out Resources, buildFeatureGates() returns nil, so the feature gate is effectively off. But based on @sohankunkerkar's feedback below, I'm reconsidering whether we should strip the config at all.

// Copy the config so we can strip unsupported features without mutating the cached object.
kueueCfg := kueue.Spec.Config
if !c.draSupported {
	kueueCfg.Resources = kueuev1.Resources{}
Member

The Degraded condition and event are good for visibility, but stripping deviceClassMappings from the ConfigMap means the config is lost at runtime. If the cluster later gets upgraded to 4.21, the user has to re-apply their Kueue CR to get DRA working.

Member

I'd prefer we leave the config in the ConfigMap and just have Kueue's controller handle the missing API gracefully (which it already does; Kueue won't crash without DRA APIs, it just won't do DRA quota management). It's also worth checking v1beta1, not just v1, since OCP 4.19 with TechPreview could have DRA available via the beta API.

Contributor Author

Good point. I'll remove the stripping — the config will stay in the configmap. On upgrade to 4.21+, the operator will pick it up on the next reconciliation without the user needing to re-apply.

Contributor Author

Upstream Kueue imports only k8s.io/api/resource/v1 (the GA API) for all DRA code paths; it doesn't import or use v1beta1 or v1beta2. So even if OCP 4.19 has DRA available via v1beta1 with TechPreview, I don't think Kueue can use it. Do you think checking v1beta1 would give a false positive?

Contributor

We can only check v1 in this case.

}
// Check DRA API availability if deviceClassMappings are configured.
// Kueue's DRA integration requires resource.k8s.io/v1 (Kubernetes 1.34+ / OCP 4.21+).
c.draSupported = false
Member

This is set unconditionally before the len(deviceClassMappings) > 0 check, so on a cluster that supports DRA but has no mappings configured, draSupported is still false and the manageConfigMap path would zero out Resources regardless.

Contributor Author

Changed the code to remove the config stripping from manageConfigMap. The flag is now only used to control the feature gate in buildFeatureGates, which checks len(mappings) > 0 && draSupported — so when no mappings are configured, the flag value doesn't matter. Defaulting it to false to account for discovery errors.
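
For readers following the thread, the gating described in this reply can be sketched roughly as below. This is a simplified illustration, not the operator's actual code: the type deviceClassMapping and the function buildFeatureGatesSketch are hypothetical stand-ins, and the real buildFeatureGates takes other inputs and may handle other gates.

```go
package main

import "fmt"

// deviceClassMapping is a hypothetical stand-in for the mapping type in the Kueue CR.
type deviceClassMapping struct {
	Name             string
	DeviceClassNames []string
}

// buildFeatureGatesSketch returns the feature gates to render into the config.
// The DRA gate is set only when mappings are configured AND the cluster serves
// the DRA API. The mappings themselves are never removed here.
func buildFeatureGatesSketch(mappings []deviceClassMapping, draSupported bool) map[string]bool {
	if len(mappings) > 0 && draSupported {
		return map[string]bool{"DynamicResourceAllocation": true}
	}
	return nil
}

func main() {
	mappings := []deviceClassMapping{
		{Name: "example.com/gpus", DeviceClassNames: []string{"gpu.example.com"}},
	}
	fmt.Println("mappings, supported:  ", buildFeatureGatesSketch(mappings, true))
	fmt.Println("mappings, unsupported:", buildFeatureGatesSketch(mappings, false))
	fmt.Println("no mappings:          ", buildFeatureGatesSketch(nil, true))
}
```

With no mappings the flag has no effect, which is why defaulting it to false is harmless in that case.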

@openshift-ci-robot

openshift-ci-robot commented Mar 18, 2026

@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue.

Details

In response to this:

Signed-off-by: Pannaga Rao Bhoja Ramamanohara

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/configmap/configmap_test.go (1)

540-540: Add a draSupported=false case to this table.

Line 540 hardcodes true, so this table never exercises the new unsupported-cluster behavior. That leaves the key regression unchecked: resources.deviceClassMappings should stay rendered while featureGates.DynamicResourceAllocation is omitted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/configmap/configmap_test.go` at line 540, The test call to BuildConfigMap
currently always passes draSupported=true, so add a table case that sets
draSupported=false and assert the unsupported-cluster behavior: call
BuildConfigMap("test", tc.configuration, tc.gvrToKind, false) for that case and
verify that resources.deviceClassMappings remains rendered even when
featureGates.DynamicResourceAllocation is omitted; update the test table (the
case entry) and its assertions to include this draSupported=false scenario to
cover the regression.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/target_config_reconciler.go`:
- Around line 303-317: The discovery error path for isResourceRegistered
currently falls through and treats any error as "DRA unsupported"; change the
logic so that if err != nil you only log the error (klog.Errorf) and do not emit
the DRAUnsupported event, do not append to missingDependencies, and do not flip
c.draSupported; only when err == nil and found == true set c.draSupported =
true, and when err == nil and found == false emit the DRAUnsupported event and
append the missing dependency (use isResourceRegistered, c.discoveryClient,
c.draSupported, c.eventRecorder, missingDependencies to locate and update the
code).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 88e84dc1-6439-4edf-9a1e-c9c81bd5776f

📥 Commits

Reviewing files that changed from the base of the PR and between ea5ccb3 and 78cd7ba.

📒 Files selected for processing (3)
  • pkg/configmap/configmap.go
  • pkg/configmap/configmap_test.go
  • pkg/operator/target_config_reconciler.go

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/configmap/configmap_test.go (1)

397-458: ⚠️ Potential issue | 🟠 Major

Unsupported-DRA test still expects deviceClassMappings to be present.

Line 398 sets draSupported: false, but Lines 450-455 still assert resources.deviceClassMappings in the generated YAML. That validates the opposite behavior for version-skew handling and can mask regressions.

Proposed fix
 namespace: test
-resources:
-  deviceClassMappings:
-  - deviceClassNames:
-    - gpu.example.com
-    - gpu-large.example.com
-    name: example.com/gpus
 webhook:
   port: 9443
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/configmap/configmap_test.go` around lines 397 - 458, The test case with
draSupported: false incorrectly expects resources.deviceClassMappings in the
generated ConfigMap; update the test in pkg/configmap/configmap_test.go by
removing the deviceClassMappings section from the wantCfgMap YAML for that case
(the entry under the test map keyed "dra with device class mappings on
unsupported cluster") so the expected output matches the branch where dra
support is disabled; locate the test case by the draSupported variable and the
wantCfgMap constant used in the test and ensure the YAML under
"controller_manager_config.yaml" no longer contains the
resources.deviceClassMappings block.
🧹 Nitpick comments (1)
pkg/configmap/configmap_test.go (1)

608-614: Check err before diffing YAML output.

At Line 609, got.Data[...] is accessed before validating err. Fail fast on unexpected errors first to avoid brittle failure modes and improve diagnostics.

Suggested refactor
 got, err := BuildConfigMap("test", tc.configuration, tc.gvrToKind, tc.draSupported)
-if diff := cmp.Diff(got.Data["controller_manager_config.yaml"], tc.wantCfgMap.Data["controller_manager_config.yaml"]); len(diff) != 0 {
-	t.Errorf("Unexpected buckets (-want,+got):\n%s", diff)
-}
 if err != nil && tc.wantErr == nil {
-	t.Errorf("Unexpected error: want=%v, got=%v", tc.wantErr, err)
+	t.Fatalf("Unexpected error: want=%v, got=%v", tc.wantErr, err)
+}
+if diff := cmp.Diff(got.Data["controller_manager_config.yaml"], tc.wantCfgMap.Data["controller_manager_config.yaml"]); len(diff) != 0 {
+	t.Errorf("Unexpected buckets (-want,+got):\n%s", diff)
 }

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/configmap/configmap_test.go` around lines 608 - 614, The test accesses
got.Data["controller_manager_config.yaml"] before validating err from
BuildConfigMap; change the order to check err first (e.g., if err != nil &&
tc.wantErr == nil { t.Fatalf("Unexpected error: want=%v, got=%v", tc.wantErr,
err) } or t.Errorf + return) and only then compute the cmp.Diff between
got.Data[...] and tc.wantCfgMap.Data[...] so you fail fast on BuildConfigMap
errors; update the block around variables got, err, tc.wantCfgMap and the
cmp.Diff call in the BuildConfigMap test accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8c8188d8-1f34-4038-9519-b4488b6c8b30

📥 Commits

Reviewing files that changed from the base of the PR and between 78cd7ba and a1140f1.

📒 Files selected for processing (2)
  • pkg/configmap/configmap_test.go
  • pkg/operator/target_config_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/target_config_reconciler.go

Signed-off-by: Pannaga Rao Bhoja Ramamanohara
@PannagaRao PannagaRao force-pushed the add-dra-version-skew branch from a1140f1 to da49b58 Compare March 18, 2026 15:51
@PannagaRao
Contributor Author

/retest

Contributor

@kannon92 kannon92 left a comment

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 18, 2026
@openshift-ci

openshift-ci Bot commented Mar 18, 2026

@PannagaRao: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit d5c08da into openshift:main Mar 18, 2026
19 checks passed

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

4 participants