cluster-init: continue processing remaining clusters on error by joepvd · Pull Request #4996 · openshift/ci-tools

joepvd · 2026-03-07T14:05:59Z

Summary

- The `cluster-init onboard config update` command previously aborted on the first cluster failure, preventing subsequent clusters from being processed.
- This masked whether failures were isolated to a single cluster (e.g. expired AWS STS credentials on build07) or affected multiple clusters.
- Collect errors from all clusters and report them together at the end so that one broken cluster does not block updates to the others.

Context

The breaking-changes presubmit has been failing consistently because build07's AWS STS/OIDC credentials return a 401 on ec2:DescribeSecurityGroups. Since build07 is processed before build11 (alphabetical order), the fatal error prevents us from knowing whether build11 is also affected.

Test plan

- Verify the breaking-changes presubmit progresses past build07 and processes build11.
- Verify all cluster errors are reported in the final output.

Made with Cursor

Summary by CodeRabbit

Improvements
- Cluster configuration error handling now collects errors from every cluster and reports them together, rather than halting at the first error.
- Each reported error now names the cluster it came from.

The `cluster-init onboard config update` command previously aborted on the first cluster failure, preventing subsequent clusters from being processed. This masked whether failures were isolated to a single cluster or affected multiple clusters. Collect errors from all clusters and report them together at the end so that one broken cluster does not block updates to the others.

Made-with: Cursor

openshift-ci-robot · 2026-03-07T14:06:01Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai · 2026-03-07T14:06:20Z

Walkthrough

Modified error handling in a cluster configuration update function to aggregate multiple errors across cluster processing iterations rather than returning immediately on the first failure. The function now collects wrapped errors and returns them all using errors.Join.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Cluster Config Error Aggregation `cmd/cluster-init/cmd/onboard/config/update.go` | Changed error handling from fail-fast approach to error aggregation; collects per-cluster errors with formatted context ("cluster : ") during runtime info and config update steps, returning all errors via `errors.Join` instead of early returns. Added "errors" import. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: modifying cluster-init to continue processing remaining clusters instead of aborting on the first failure. |
| Stable And Deterministic Test Names | ✅ Passed | The custom check regarding stable and deterministic Ginkgo test names is not applicable to this PR. The PR only modifies cmd/cluster-init/cmd/onboard/config/update.go, a command-line tool implementation file. There is no test file and no Ginkgo tests are introduced or modified. |
| Test Structure And Quality | ✅ Passed | This pull request modifies production code in update.go to improve error aggregation, not Ginkgo test code. The custom check for Ginkgo test quality is not applicable. |


coderabbitai

🧹 Nitpick comments (1)

cmd/cluster-init/cmd/onboard/config/update.go (1)
94-101: Consider adding kubeconfig failures to the aggregated errors.

The error from newKubeClients (lines 98-101) is logged but not added to the errs slice, whereas errors from addClusterInstallRuntimeInfo and runConfigSteps are collected. Because of this inconsistency, kubeconfig failures are not surfaced to the caller in the returned error.

If this is intentional (e.g., missing kubeconfig is a configuration vs. runtime issue), a brief comment would clarify the design decision.
♻️ Optional: collect kubeconfig errors for consistent reporting
 	var errs []error
 	for clusterName, clusterInstall := range clusterInstalls {
 		ctrlClient, kubeClient, config, err := newKubeClients(kubeconfigs, clusterName)
-		clusterInstall.Config = config
 		if err != nil {
+			errs = append(errs, fmt.Errorf("kubeconfig for cluster %s: %w", clusterName, err))
 			log.WithField("cluster", clusterName).WithError(err).Warn("Skipping cluster due to missing or invalid kubeconfig")
 			continue
 		}
+		clusterInstall.Config = config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cluster-init/cmd/onboard/config/update.go` around lines 94 - 101, The
kubeconfig error from newKubeClients is currently only logged and not appended
to the errs slice, causing kubeconfig failures to be omitted from the aggregated
return; update the loop handling (where newKubeClients is called and
clusterInstall.Config is set) to append the returned err to errs (same
aggregation used for addClusterInstallRuntimeInfo and runConfigSteps) before
continuing, or if skipping is intentional add a clarifying comment explaining
why kubeconfig errors are excluded; reference newKubeClients, errs,
clusterInstall, addClusterInstallRuntimeInfo, and runConfigSteps to locate the
change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d19f1f9e-6794-44a4-8c04-6f51d3b046fc

📥 Commits

Reviewing files that changed from the base of the PR and between 49a6a77 and e68bd3e.

📒 Files selected for processing (1)

cmd/cluster-init/cmd/onboard/config/update.go

openshift-ci-robot · 2026-03-07T15:13:55Z

Scheduling required tests:
/test e2e

psalajova · 2026-03-09T08:51:48Z

/lgtm

openshift-ci · 2026-03-09T08:52:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: joepvd, psalajova

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [psalajova]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

joepvd · 2026-03-09T13:02:01Z

/retest-required

openshift-ci-robot · 2026-03-09T14:08:51Z

Scheduling required tests:
/test e2e

joepvd · 2026-03-10T09:02:47Z

/test e2e

joepvd · 2026-03-10T11:34:51Z

/test e2e

joepvd · 2026-03-11T07:31:25Z

/override e2e

openshift-ci · 2026-03-11T07:31:40Z

@joepvd: joepvd unauthorized: /override is restricted to Repo administrators, approvers in top level OWNERS file, and the following github teams:openshift: openshift-release-oversight openshift-staff-engineers openshift-sustaining-engineers.

Details

In response to this:

/override e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

joepvd · 2026-03-11T07:32:20Z

/test e2e

joepvd · 2026-03-11T19:41:20Z

/test e2e

joepvd · 2026-03-12T06:57:44Z

/test e2e

openshift-ci · 2026-03-12T09:00:15Z

@joepvd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/breaking-changes | `e68bd3e` | link | false | `/test breaking-changes` |
| ci/prow/images | `e68bd3e` | link | unknown | `/test images` |

Full PR test history. Your PR dashboard.


openshift-ci bot requested review from deepsm007 and smg247 March 7, 2026 14:06

coderabbitai bot reviewed Mar 7, 2026

openshift-ci bot assigned psalajova Mar 9, 2026

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2026

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 9, 2026
