OCPKUEUE-571: Handle version skew between DRA and Kueue on OCP#1578
Conversation
Signed-off-by: Pannaga Rao Bhoja Ramamanohara
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: PannagaRao The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughAdds runtime detection of Dynamic Resource Allocation (DRA) support to TargetConfigReconciler, stores it in a new Changes
Sequence Diagram(s)sequenceDiagram
participant TargetConfigReconciler
participant K8sDiscovery as K8s API Discovery
participant ConfigProcessor as Config Processor
TargetConfigReconciler->>TargetConfigReconciler: Start reconciliation sync
TargetConfigReconciler->>K8sDiscovery: Check for DeviceClass API (resource.k8s.io/v1)
alt DRA API Available
K8sDiscovery-->>TargetConfigReconciler: API found
TargetConfigReconciler->>TargetConfigReconciler: Set draSupported = true
else DRA API Not Available
K8sDiscovery-->>TargetConfigReconciler: API not found
TargetConfigReconciler->>TargetConfigReconciler: Set draSupported = false
TargetConfigReconciler->>TargetConfigReconciler: Emit warning & DRAUnsupported event
end
TargetConfigReconciler->>ConfigProcessor: Create derived kueueCfg
alt DRA Not Supported
ConfigProcessor->>ConfigProcessor: Strip deviceClass resources from config
end
TargetConfigReconciler->>ConfigProcessor: Apply (possibly stripped) config to ConfigMap
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| // Copy the config so we can strip unsupported features without mutating the cached object. | ||
| kueueCfg := kueue.Spec.Config | ||
| if !c.draSupported { | ||
| kueueCfg.Resources = kueuev1.Resources{} |
There was a problem hiding this comment.
I think we should consider turning off the feature gate for DRA for Kueue in the config map.
There was a problem hiding this comment.
I was thinking that when we zero out Resources, buildFeatureGates() returns nil, so the feature gate is effectively off. But based on @sohankunkerkar feedback below, I'm reconsidering whether we should strip the config at all.
| // Copy the config so we can strip unsupported features without mutating the cached object. | ||
| kueueCfg := kueue.Spec.Config | ||
| if !c.draSupported { | ||
| kueueCfg.Resources = kueuev1.Resources{} |
There was a problem hiding this comment.
The Degraded condition and event are good for visibility, but stripping deviceClassMappings from the ConfigMap means the config is lost at runtime. If the cluster later gets upgraded to 4.21, the user has to re-apply their Kueue CR to get DRA working
There was a problem hiding this comment.
I'd prefer we leave the config in the ConfigMap and just have Kueue's controller handle the missing API gracefully (which it already does; Kueue won't crash without DRA APIs, it just won't do DRA quota management). Also worth checking v1beta1 too, not just v1, since OCP 4.19 with TechPreview could have DRA available via the beta API.
There was a problem hiding this comment.
Good point. I'll remove the stripping — the config will stay in the configmap. On upgrade to 4.21+, the operator will pick it up on the next reconciliation without the user needing to re-apply.
There was a problem hiding this comment.
Upstream kueue imports only k8s.io/api/resource/v1 (the GA API) for all DRA code paths. It doesn't import or use v1beta1 or v1beta2. So even if OCP 4.19 has DRA available via v1beta1 with TechPreview I don't think Kueue can use it. Do you think it will give a false positive if we check v1beta1?
There was a problem hiding this comment.
We can only check v1 in this case.
| } | ||
| // Check DRA API availability if deviceClassMappings are configured. | ||
| // Kueue's DRA integration requires resource.k8s.io/v1 (Kubernetes 1.34+ / OCP 4.21+). | ||
| c.draSupported = false |
There was a problem hiding this comment.
This is set unconditionally before the len(deviceClassMappings) > 0 check, so on a cluster that supports DRA but has no mappings configured, draSupported is still false and the manageConfigMap path would zero out Resources regardless.
There was a problem hiding this comment.
Changed the code to remove the config stripping from manageConfigMap. The flag is now only used to control the feature gate in buildFeatureGates, which checks len(mappings) > 0 && draSupported — so when no mappings are configured, the flag value doesn't matter. Defaulting it to false to account for discovery errors.
|
@PannagaRao: This pull request references OCPKUEUE-571 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/configmap/configmap_test.go (1)
540-540: Add adraSupported=falsecase to this table.Line 540 hardcodes
true, so this table never exercises the new unsupported-cluster behavior. That leaves the key regression unchecked:resources.deviceClassMappingsshould stay rendered whilefeatureGates.DynamicResourceAllocationis omitted.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/configmap/configmap_test.go` at line 540, The test call to BuildConfigMap currently always passes draSupported=true, so add a table case that sets draSupported=false and assert the unsupported-cluster behavior: call BuildConfigMap("test", tc.configuration, tc.gvrToKind, false) for that case and verify that resources.deviceClassMappings remains rendered even when featureGates.DynamicResourceAllocation is omitted; update the test table (the case entry) and its assertions to include this draSupported=false scenario to cover the regression.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/target_config_reconciler.go`:
- Around line 303-317: The discovery error path for isResourceRegistered
currently falls through and treats any error as "DRA unsupported"; change the
logic so that if err != nil you only log the error (klog.Errorf) and do not emit
the DRAUnsupported event, do not append to missingDependencies, and do not flip
c.draSupported; only when err == nil and found == true set c.draSupported =
true, and when err == nil and found == false emit the DRAUnsupported event and
append the missing dependency (use isResourceRegistered, c.discoveryClient,
c.draSupported, c.eventRecorder, missingDependencies to locate and update the
code).
---
Nitpick comments:
In `@pkg/configmap/configmap_test.go`:
- Line 540: The test call to BuildConfigMap currently always passes
draSupported=true, so add a table case that sets draSupported=false and assert
the unsupported-cluster behavior: call BuildConfigMap("test", tc.configuration,
tc.gvrToKind, false) for that case and verify that resources.deviceClassMappings
remains rendered even when featureGates.DynamicResourceAllocation is omitted;
update the test table (the case entry) and its assertions to include this
draSupported=false scenario to cover the regression.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 88e84dc1-6439-4edf-9a1e-c9c81bd5776f
📒 Files selected for processing (3)
pkg/configmap/configmap.gopkg/configmap/configmap_test.gopkg/operator/target_config_reconciler.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/configmap/configmap_test.go (1)
397-458:⚠️ Potential issue | 🟠 MajorUnsupported-DRA test still expects
deviceClassMappingsto be present.Line 398 sets
draSupported: false, but Lines 450-455 still assertresources.deviceClassMappingsin the generated YAML. That validates the opposite behavior for version-skew handling and can mask regressions.Proposed fix
namespace: test -resources: - deviceClassMappings: - - deviceClassNames: - - gpu.example.com - - gpu-large.example.com - name: example.com/gpus webhook: port: 9443🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/configmap/configmap_test.go` around lines 397 - 458, The test case with draSupported: false incorrectly expects resources.deviceClassMappings in the generated ConfigMap; update the test in pkg/configmap/configmap_test.go by removing the deviceClassMappings section from the wantCfgMap YAML for that case (the entry under the test map keyed "dra with device class mappings on unsupported cluster") so the expected output matches the branch where dra support is disabled; locate the test case by the draSupported variable and the wantCfgMap constant used in the test and ensure the YAML under "controller_manager_config.yaml" no longer contains the resources.deviceClassMappings block.
🧹 Nitpick comments (1)
pkg/configmap/configmap_test.go (1)
608-614: Checkerrbefore diffing YAML output.At Line 609,
got.Data[...]is accessed before validatingerr. Fail fast on unexpected errors first to avoid brittle failure modes and improve diagnostics.Suggested refactor
got, err := BuildConfigMap("test", tc.configuration, tc.gvrToKind, tc.draSupported) -if diff := cmp.Diff(got.Data["controller_manager_config.yaml"], tc.wantCfgMap.Data["controller_manager_config.yaml"]); len(diff) != 0 { - t.Errorf("Unexpected buckets (-want,+got):\n%s", diff) -} if err != nil && tc.wantErr == nil { - t.Errorf("Unexpected error: want=%v, got=%v", tc.wantErr, err) + t.Fatalf("Unexpected error: want=%v, got=%v", tc.wantErr, err) +} +if diff := cmp.Diff(got.Data["controller_manager_config.yaml"], tc.wantCfgMap.Data["controller_manager_config.yaml"]); len(diff) != 0 { + t.Errorf("Unexpected buckets (-want,+got):\n%s", diff) }As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/configmap/configmap_test.go` around lines 608 - 614, The test accesses got.Data["controller_manager_config.yaml"] before validating err from BuildConfigMap; change the order to check err first (e.g., if err != nil && tc.wantErr == nil { t.Fatalf("Unexpected error: want=%v, got=%v", tc.wantErr, err) } or t.Errorf + return) and only then compute the cmp.Diff between got.Data[...] and tc.wantCfgMap.Data[...] so you fail fast on BuildConfigMap errors; update the block around variables got, err, tc.wantCfgMap and the cmp.Diff call in the BuildConfigMap test accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@pkg/configmap/configmap_test.go`:
- Around line 397-458: The test case with draSupported: false incorrectly
expects resources.deviceClassMappings in the generated ConfigMap; update the
test in pkg/configmap/configmap_test.go by removing the deviceClassMappings
section from the wantCfgMap YAML for that case (the entry under the test map
keyed "dra with device class mappings on unsupported cluster") so the expected
output matches the branch where dra support is disabled; locate the test case by
the draSupported variable and the wantCfgMap constant used in the test and
ensure the YAML under "controller_manager_config.yaml" no longer contains the
resources.deviceClassMappings block.
---
Nitpick comments:
In `@pkg/configmap/configmap_test.go`:
- Around line 608-614: The test accesses
got.Data["controller_manager_config.yaml"] before validating err from
BuildConfigMap; change the order to check err first (e.g., if err != nil &&
tc.wantErr == nil { t.Fatalf("Unexpected error: want=%v, got=%v", tc.wantErr,
err) } or t.Errorf + return) and only then compute the cmp.Diff between
got.Data[...] and tc.wantCfgMap.Data[...] so you fail fast on BuildConfigMap
errors; update the block around variables got, err, tc.wantCfgMap and the
cmp.Diff call in the BuildConfigMap test accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8c8188d8-1f34-4038-9519-b4488b6c8b30
📒 Files selected for processing (2)
pkg/configmap/configmap_test.gopkg/operator/target_config_reconciler.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/target_config_reconciler.go
Signed-off-by: Pannaga Rao Bhoja Ramamanohara
a1140f1 to
da49b58
Compare
|
/retest |
|
@PannagaRao: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Signed-off-by: Pannaga Rao Bhoja Ramamanohara