🌱 (e2e): test: skip terminating pods in catalog image update test #3703
Conversation
Filter out pods with a DeletionTimestamp set to avoid false failures during pod rollouts when old pods are being deleted. Fixes a flaky test failure where the test found 2 pods (1 terminating, 1 active) and incorrectly failed on the pod count instead of verifying the actual requirement: that the catalog image was updated. Assisted-by: Cursor
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Expect(registryPods).ShouldNot(BeNil(), "nil registry pods")
Expect(registryPods.Items).To(HaveLen(1), "unexpected number of registry pods found")
Do we want to remove these checks? Would it be reasonable to expect that there's at least one registry pod?
Yes, because they are not part of what we want to check here.
Also, downstream we do not run these tests in serial, so we can have more than 1 pod.
The check:
Expect(err).ShouldNot(HaveOccurred(), "error awaiting registry pod")
is enough to validate what we want to validate.
Looking at the failure, it seems that I introduced a flake (downstream) with this test.
That was my motivation for the changes here.
Extra checks never hurt.

> And in downstream we do not execute those in SERIAL so we can have more than 1 pod.

These are namespace scoped, right?
My ideal solution here would be to not remove these checks in this PR, observe whether we're still getting flaky results, and then look into why we are (possibly) "still failing".
My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read).
Maybe it's ok to remove them, since the checks are embedded in the podCheck function. For awaitPodsWithInterval to succeed, podCheck needs to return true, which only happens when there's at least one pod running the new image.
Wdyt about using something like slices.DeleteFunc to filter the terminating pods out of podList.Items, and returning true when len == 1 and the pod image != registry image? (See the sketch below.)
Then maybe just add a comment saying that the podCheck function checks the requirements; if it succeeds, so does the test.
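A minimal sketch of that suggestion, not the actual test helper: the function name podCheck and the updatedImage parameter are illustrative assumptions, and it compares against the new image directly rather than against the old registry image, which amounts to the same requirement.

```go
package e2e

import (
	"slices"

	corev1 "k8s.io/api/core/v1"
)

// podCheck sketches the suggested predicate: drop terminating pods with
// slices.DeleteFunc, then require exactly one remaining pod running the
// updated catalog image.
func podCheck(podList *corev1.PodList, updatedImage string) bool {
	// Clone first so the caller's slice is not mutated; DeleteFunc edits in place.
	active := slices.DeleteFunc(slices.Clone(podList.Items), func(p corev1.Pod) bool {
		return p.DeletionTimestamp != nil // pod is terminating; ignore it
	})
	return len(active) == 1 &&
		len(active[0].Spec.Containers) > 0 &&
		active[0].Spec.Containers[0].Image == updatedImage
}
```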
> My main concern is that removing these checks reduces the sanctity of this test

That is not true. We are still checking that there is only 1 pod.
By looking at the failure we can already know the error.
What we want to check with this test is still being checked.
This argument is really confusing me. These tests have always run in parallel. By this logic, we should never have seen this test passing. Do we have any evidence to back up the claim that running these tests in parallel causes the pod count to be more than 1, rendering the check useless? Even if the problem was in fact parallelism, the solution here would be to mark these tests to run in serial, not strip away code that gives us confidence in these tests.
Hi @anik120
Sorry, but can you please take a look at the error: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560 and the description of this PR? The only check that we need to do here is: "Does an active pod exist with the new image?" If we have more than one pod (as happened downstream), the test should not fail as long as that check passes.
Description
Fix flaky e2e test [CatalogSource] image update that fails when terminating pods are present during catalog source rollouts.

Problem
The test was failing with:
During catalog source image updates, there can temporarily be 2 pods:
- the old pod, which is terminating (DeletionTimestamp set)
- the new, active pod

The test was counting the terminating pod and failing on pod count instead of focusing on the actual requirement: verifying that the catalog image was updated.
Solution
Skip pods with DeletionTimestamp != nil when checking for the updated image. This makes the test resilient to transient states during pod rollouts while still verifying the core requirement.
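As an illustration of the shape of the fix only (not the exact diff): terminating pods are skipped inside the pod check, so only active pods are compared against the updated image. The helper name hasActivePodWithUpdatedImage and the updatedImage parameter are hypothetical.

```go
package e2e

import corev1 "k8s.io/api/core/v1"

// hasActivePodWithUpdatedImage returns true once a non-terminating registry pod
// running the updated catalog image exists.
func hasActivePodWithUpdatedImage(podList *corev1.PodList, updatedImage string) bool {
	for _, p := range podList.Items {
		if p.DeletionTimestamp != nil {
			continue // old pod still terminating during the rollout; skip it
		}
		for _, c := range p.Spec.Containers {
			if c.Image == updatedImage {
				return true
			}
		}
	}
	return false
}
```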
Changes
- podCheckFunc (3 lines)

Testing