
Conversation

@camilamacedo86
Contributor

Description

Fix flaky e2e test [CatalogSource] image update that fails when terminating pods are present during catalog source rollouts.

Problem

The test was failing with:

unexpected number of registry pods found
Expected <[]v1.Pod | len:2> to have length 1

During catalog source image updates, there can be 2 pods temporarily:

  • 1 old pod being deleted (with DeletionTimestamp set)
  • 1 new pod starting up

The test was counting terminating pods and failing instead of focusing on the actual requirement: verifying the catalog image was updated.

Solution

Skip pods with DeletionTimestamp != nil when checking for the updated image.

This makes the test resilient to transient states during pod rollouts while still verifying the core requirement.

Changes

  • Add check to skip terminating pods in podCheckFunc (3 lines; see the sketch below)
  • Test now focuses on: "Does an active pod exist with the new image?"
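
A minimal sketch of the skip described above, assuming a podCheckFunc-style helper. The name podCheck, the parameters, and the standalone shape are illustrative, not the exact upstream signature; the actual change is only the DeletionTimestamp skip inside the existing function.

package e2e

import (
	corev1 "k8s.io/api/core/v1"
)

// podCheck is a sketch only: it ignores pods that are terminating and
// reports success when an active pod is running the updated catalog image.
func podCheck(pods []corev1.Pod, updatedImage string) bool {
	for _, pod := range pods {
		if pod.DeletionTimestamp != nil {
			// Old registry pod being torn down during the rollout; skip it.
			continue
		}
		for _, c := range pod.Spec.Containers {
			if c.Image == updatedImage {
				return true
			}
		}
	}
	return false
}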

Testing

Filter out pods with DeletionTimestamp to avoid false failures
during pod rollouts when old pods are being deleted.

Fixes a flaky test failure where the test found 2 pods (1 terminating,
1 active) and incorrectly failed on pod count instead of verifying
the actual requirement: that the catalog image was updated.

Assisted-by: Cursor
@openshift-ci openshift-ci bot requested review from joelanford and tmshort November 10, 2025 07:25
@openshift-ci

openshift-ci bot commented Nov 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@camilamacedo86 camilamacedo86 changed the title (e2e): test: skip terminating pods in catalog image update test 🌱 (e2e): test: skip terminating pods in catalog image update test Nov 10, 2025
Comment on lines -1199 to -1200
Expect(registryPods).ShouldNot(BeNil(), "nil registry pods")
Expect(registryPods.Items).To(HaveLen(1), "unexpected number of registry pods found")
Collaborator

Do we want to remove these checks? Would it be reasonable to expect that there's at least one registry pod?

Contributor Author

@camilamacedo86 camilamacedo86 Nov 10, 2025

Yes, because they are not part of what this test is meant to check.
And downstream we do not execute those in serial, so we can have more than 1 pod.

The check:

Expect(err).ShouldNot(HaveOccurred(), "error awaiting registry pod")

It is enough to validate what we want to validate.
Looking at the failure, it seems that I introduced a flake (downstream) with this test.
That was my motivation for the changes here.

Member

Extra checks never hurt.

And downstream we do not execute those in serial, so we can have more than 1 pod.

These are namespace-scoped, right?

My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".

My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read).

Collaborator

Maybe it's ok to remove since the checks are embedded in the podCheck function. For awaitPodsWithInterval to succeed, podCheck needs to return true, which only happens when there's at least one pod with the new image.

Wdyt about using something like slices.DeleteFunc to filter out the terminating pods from the podList.Items and returning true when len == 1 and the pod image != registry image?

Then maybe just adding a comment that says the podCheck function checks the requirements. If it succeeds, so does the test.
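
A rough sketch of that suggestion, with illustrative names rather than the real test code; it checks for the expected new image, which carries the same intent as comparing against the old registry image.

package e2e

import (
	"slices"

	corev1 "k8s.io/api/core/v1"
)

// Sketch only: drop terminating pods with slices.DeleteFunc, then require
// exactly one remaining pod running the expected (updated) image.
func onlyActivePodHasImage(podList *corev1.PodList, updatedImage string) bool {
	// Clone first so the caller's podList.Items is not mutated.
	active := slices.DeleteFunc(slices.Clone(podList.Items), func(p corev1.Pod) bool {
		return p.DeletionTimestamp != nil // filter out terminating pods
	})
	if len(active) != 1 {
		return false
	}
	for _, c := range active[0].Spec.Containers {
		if c.Image == updatedImage {
			return true
		}
	}
	return false
}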


@camilamacedo86
Contributor Author

camilamacedo86 commented Nov 10, 2025

Extra checks never hurt.

That is not true.
See the downstream failure this fixes: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560

We are checking that there is only 1 pod.
However, the downstream tests run in parallel, which is why we can have more than 1 pod.

My ideal solution here would be to not remove these checks in this PR, observe if we're still getting flaky results, and then look into why we are (possibly) "still failing".

By looking at the failure we already know the error.

My main concern is that removing these checks reduces the sanctity of this test, at least the way it reads serially (and was intended to be read).

What we want to check with this test is still checked.

@anik120
Member

anik120 commented Nov 10, 2025

However, the downstream tests run in parallel, which is why we can have more than 1 pod.

This argument is really confusing me. These tests have always run in parallel. By this logic we should never have seen this test passing.

Do we have any evidence to back up the claim that running these tests in parallel causes pod count to be more than 1, rendering the check useless?

Even if the problem was in fact parallelism, the solution here would be to mark these tests to run in serial, not strip away code that gives us confidence on these tests.

@camilamacedo86
Contributor Author

Hi @anik120

This argument is really confusing me.

Sorry, but can you please take a look at the error: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_operator-framework-olm/1142/pull-ci-openshift-operator-framework-olm-main-e2e-gcp-olm/1987692171796418560 and the description of this PR?

The only check that we need to do here is: "Does an active pod exist with the new image?"

If we have more than one pod (as happened downstream), the test should not fail as long as that check passes.
