
Conversation

@camilamacedo86
Contributor

Description

Fix flaky e2e test [CatalogSource] image update that fails when terminating pods are present during catalog source rollouts.

Problem

The test was failing with:

unexpected number of registry pods found
Expected <[]v1.Pod | len:2> to have length 1

During catalog source image updates, there can be 2 pods temporarily:

  • 1 old pod being deleted (with DeletionTimestamp set)
  • 1 new pod starting up

The test was counting terminating pods and failing instead of focusing on the actual requirement: verifying the catalog image was updated.

Solution

Skip pods with DeletionTimestamp != nil when checking for the updated image.

This makes the test resilient to transient states during pod rollouts while still verifying the core requirement.

Changes

  • Add check to skip terminating pods in podCheckFunc (3 lines; see the sketch below)
  • Test now focuses on: "Does an active pod exist with the new image?"
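
A minimal sketch of the skip described above, assuming a podCheckFunc-style helper. The name podCheck, the parameters, and the standalone shape are illustrative, not the exact upstream signature; the actual change is only the DeletionTimestamp skip inside the existing function.

package e2e

import (
	corev1 "k8s.io/api/core/v1"
)

// podCheck is a sketch only: it ignores pods that are terminating and
// reports success when an active pod is running the updated catalog image.
func podCheck(pods []corev1.Pod, updatedImage string) bool {
	for _, pod := range pods {
		if pod.DeletionTimestamp != nil {
			// Old registry pod being torn down during the rollout; skip it.
			continue
		}
		for _, c := range pod.Spec.Containers {
			if c.Image == updatedImage {
				return true
			}
		}
	}
	return false
}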

Testing

Filter out pods with DeletionTimestamp to avoid false failures
during pod rollouts when old pods are being deleted.

Fixes a flaky test failure where the test found 2 pods (1 terminating,
1 active) and incorrectly failed on pod count instead of verifying
the actual requirement: that the catalog image was updated.

Assisted-by: Cursor
@openshift-ci openshift-ci bot requested review from joelanford and tmshort November 10, 2025 07:25
@openshift-ci

openshift-ci bot commented Nov 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@camilamacedo86 camilamacedo86 changed the title (e2e): test: skip terminating pods in catalog image update test 🌱 (e2e): test: skip terminating pods in catalog image update test Nov 10, 2025
Comment on lines -1199 to -1200
Expect(registryPods).ShouldNot(BeNil(), "nil registry pods")
Expect(registryPods.Items).To(HaveLen(1), "unexpected number of registry pods found")
Collaborator

Do we want to remove these checks? Would it be reasonable to expect that there's at least one registry pod?

Contributor Author

@camilamacedo86 camilamacedo86 Nov 10, 2025

Yes, because they are not part of what this test is meant to check.
And downstream we do not execute those in serial, so we can have more than 1 pod.

The check:

Expect(err).ShouldNot(HaveOccurred(), "error awaiting registry pod")

It is enough to validate what we want to validate.
Looking at the failure, it seems that I introduced a flake (downstream) with this test.
That was my motivation for the changes here.

Member

Extra checks never hurt.

And downstream we do not execute those in serial, so we can have more than 1 pod.

These are namespace-scoped, right?

My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".

My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read).

Collaborator

Maybe it's ok to remove since the checks are embedded in the podCheck function. For awaitPodsWithInterval to succeed, podCheck needs to return true, which only happens when there's at least one pod with the new image.

Wdyt about using something like slices.DeleteFunc to filter out the terminating pods from the podList.Items and returning true when len == 1 and the pod image != registry image?

Then maybe just adding a comment that says the podCheck function checks the requirements. If it succeeds, so does the test.
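
A rough sketch of that suggestion, with illustrative names rather than the real test code; it checks for the expected new image, which carries the same intent as comparing against the old registry image.

package e2e

import (
	"slices"

	corev1 "k8s.io/api/core/v1"
)

// Sketch only: drop terminating pods with slices.DeleteFunc, then require
// exactly one remaining pod running the expected (updated) image.
func onlyActivePodHasImage(podList *corev1.PodList, updatedImage string) bool {
	// Clone first so the caller's podList.Items is not mutated.
	active := slices.DeleteFunc(slices.Clone(podList.Items), func(p corev1.Pod) bool {
		return p.DeletionTimestamp != nil // filter out terminating pods
	})
	if len(active) != 1 {
		return false
	}
	for _, c := range active[0].Spec.Containers {
		if c.Image == updatedImage {
			return true
		}
	}
	return false
}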


@camilamacedo86
Contributor Author

camilamacedo86 commented Nov 10, 2025

Extra checks never hurt.

That is not true.
See the downstream failure this fixes: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560

We are checking that there is only 1 pod.
However, the downstream tests run in parallel, which is why we can have more than 1 pod.

My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".

By looking at the failure we already know the error.

My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read).

What we want to check with this test is still checked.

@anik120
Member

anik120 commented Nov 10, 2025

However, the downstream tests run in parallel, which is why we can have more than 1 pod.

This argument is really confusing me. These tests have always run in parallel. By this logic we should never have seen this test passing.

Do we have any evidence to back up the claim that running these tests in parallel causes pod count to be more than 1, rendering the check useless?

Even if the problem was in fact parallelism, the solution here would be to mark these tests to run in serial, not strip away code that gives us confidence on these tests.

@camilamacedo86
Contributor Author

Hi @anik120

This argument is really confusing me.

Sorry, but can you please take a look at the error: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560 and the description of this PR?

The only check that we need to do here is: "Does an active pod exist with the new image?"

If we have more than one pod (as happened downstream), the test should not fail as long as that check passes.
