[FLINK-38403][tests] Fix the unexpected test that the second job does not restore from checkpoint #27254

1996fanrui · 2025-11-19T17:54:15Z

What is the purpose of the change

Finding the latest checkpoint should also happen in the catch block.

See https://github.com/apache/flink/pull/27119/files#r2542189639 for more details.

The second job is expected to fail over once and then finish after the source has generated all records, so this change removes @ThrowableAnnotation(ThrowableType.NonRecoverableError) from TestException.

Also, I introduced ExpectedFinalJobStatus in UnalignedSettings to check the final JobStatus.

See #27254 (comment) for more details.

Brief change log

[FLINK-38403][tests] Fix the unexpected test that the second job does not restore from checkpoint

flinkbot · 2025-11-19T17:57:55Z

CI report:

8938499 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

1996fanrui · 2025-11-19T18:00:24Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java

-                conf.set(StateRecoveryOptions.SAVEPOINT_PATH, restoreCheckpoint.toURI().toString());
+                conf.set(StateRecoveryOptions.SAVEPOINT_PATH, restoreCheckpoint);


Calling toURI() adds the wrong prefix, so I changed restoreCheckpoint to a String.
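A minimal sketch of the prefix problem, assuming the checkpoint path is already a URI-style string such as file:/tmp/checkpoints/chk-5 (an illustrative value, not taken from the test). The class and method names are hypothetical. Wrapping such a string in a File treats it as a relative path, so toURI() resolves it against the working directory and the result no longer matches the original pointer.

```java
import java.io.File;

public class CheckpointPathPrefixDemo {

    // Mimics the old code path: wrap the string in a File, then convert it to a URI string.
    static String viaFile(String checkpointPointer) {
        return new File(checkpointPointer).toURI().toString();
    }

    public static void main(String[] args) {
        // Assumed example value; in the test the path comes from the completed checkpoint.
        String pointer = "file:/tmp/checkpoints/chk-5";

        // Passing the String through unchanged keeps the pointer intact.
        System.out.println("as String: " + pointer);

        // Going through File resolves the whole string against the working directory,
        // so an extra prefix ends up in front of the original pointer.
        System.out.println("via File:  " + viaFile(pointer));
    }
}
```

Passing the String straight to conf.set(...) avoids the round trip entirely.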

AHeise

Thank you very much for fixing this test! Left two remarks that need to be addressed before approval.

AHeise · 2025-11-19T20:24:47Z

...ests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointRescaleITCase.java

+        assertNotNull(
+                "First job must generate a checkpoint for rescale test to be valid.",
+                checkpointDir);


Please use assertj (assertThat(checkpointDir).as("First job must generate a checkpoint for rescale test to be valid.").isNotNull())

AHeise · 2025-11-19T20:26:53Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java

                return CommonTestUtils.getLatestCompletedCheckpointPath(
                                jobID, miniCluster.getMiniCluster())
-                        .map(File::new)
                        .orElseThrow(() -> new AssertionError("Could not generate checkpoint"));


Should we Fail.fail("Expected exception") here?

1996fanrui

Thanks @AHeise for the quick review, all comments are addressed.

snuyanzin · 2025-11-20T21:14:34Z

fyi: to have green ci, rebase to the latest master
e2e was fixed at FLINK-38700

1996fanrui · 2025-11-21T12:12:50Z

fyi: to have green ci, rebase to the latest master e2e was fixed at FLINK-38700

Thanks @snuyanzin for the reminder. I am still changing the PR, and I will rebase onto the master branch before the next push.

1996fanrui · 2025-11-26T13:20:41Z

...ests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointRescaleITCase.java

+     * <p>Postscale phase: Job restores from checkpoint with different parallelism, failovers once,
+     * and finishes after source generates all records.


The second job is expected to fail over once and then finish after the source has generated all records, so this change removes @ThrowableAnnotation(ThrowableType.NonRecoverableError) from TestException.

Also, I introduced ExpectedFinalJobStatus in UnalignedSettings to check the final JobStatus.

pnowojski · 2025-12-04T12:19:46Z

flink-tests/src/test/java/org/apache/flink/test/checkpointing/UnalignedCheckpointTestBase.java

+            RestartStrategyUtils.configureFixedDelayRestartStrategy(
+                    env, generateCheckpoint ? expectedFailures / 2 : expectedFailures, 100L);


What's the reasoning behind the calculated number of expected failures?

1996fanrui

It reverts the change in https://github.com/apache/flink/pull/27119/files#diff-ace775e80e66d4f4001bdaea6bcbaae1975bd5e9a5497532d8d7152e4090069aL752

The original intention is:

generateCheckpoint controls the operational phase: true is for the job before rescaling, and false is for the new job after rescaling.

The value expectedFailures / 2 is the failure threshold for the first job. This setup ensures that the first job fails once half of the expected failures have occurred, which lets the second job recover from the generated checkpoint and continue consumption.
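The calculation can be sketched in isolation as follows. This is only an illustration of the ternary from the diff: the class name, the method name, and the value 5 are hypothetical, and the real test passes the result to RestartStrategyUtils.configureFixedDelayRestartStrategy. Note that the division is integer division, so an odd expectedFailures rounds down.

```java
public class RestartAttemptsSketch {

    // Same expression as in the diff: the first job (generateCheckpoint == true)
    // is allowed only half of the expected failures as restart attempts.
    static int restartAttempts(boolean generateCheckpoint, int expectedFailures) {
        return generateCheckpoint ? expectedFailures / 2 : expectedFailures;
    }

    public static void main(String[] args) {
        int expectedFailures = 5; // assumed value, for illustration only

        // Job before rescaling: runs out of restart attempts early and fails,
        // leaving a checkpoint behind for the next job.
        System.out.println("first job attempts:  " + restartAttempts(true, expectedFailures));

        // Job after rescaling: may restart for every expected failure.
        System.out.println("second job attempts: " + restartAttempts(false, expectedFailures));
    }
}
```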

1996fanrui requested a review from AHeise November 19, 2025 17:56

1996fanrui force-pushed the 38403/ITCase-rescale-not-restore branch from 80692f1 to 4e675d2 Compare November 20, 2025 11:23

1996fanrui force-pushed the 38403/ITCase-rescale-not-restore branch from 4e675d2 to 736ad29 Compare November 20, 2025 14:23

1996fanrui added 2 commits November 26, 2025 14:04

9828258 [FLINK-38403][tests] Fix the unexpected test that the second job does not restore from checkpoint

8938499 Keep the TestException is Recoverable to ensure the second job is finished eventually. Add comments to help understand the UnalignedCheckpointRescaleITCase

1996fanrui force-pushed the 38403/ITCase-rescale-not-restore branch from 736ad29 to 8938499 Compare November 26, 2025 13:05

1996fanrui requested a review from AHeise November 28, 2025 14:22

pnowojski approved these changes Dec 4, 2025

1996fanrui merged commit c4d6344 into apache:master Dec 4, 2025
