Prevent panic when retry attempt value in map task gets larger than bitarray was allocated for #6802

Merged

Sovietaced merged 3 commits into master from fg91/fix-node-array-retry-panic on Feb 10, 2026
Conversation

@fg91 fg91 (Member) commented Dec 12, 2025

Why are the changes needed?

With a default configuration of max-node-retries-system-failures=3 and default-max-attempts=1, the following workflow leads to a panic in the array node handler:

from flytekit import Resources, map_task, task, workflow


@task(
    requests=Resources(cpu="50m", mem="250Mi"),
    limits=Resources(cpu="100m", mem="250Mi"),
    retries=8,
)
def test(inp: int) -> None:
    if inp == 2:
        # Go OOM on purpose (or trigger any other retriable user error)
        a = " " * (300 * 1024 * 1024)


@workflow
def wf1() -> None:
    map_task(test)(inp=[i for i in range(3)])

The panic occurs here, when validating whether the current attempt value can be stored in the bitarray that was allocated to track it.

The following goes wrong:

Currently, the retries=8 set on the task decorator is not taken into account when determining the maximum value of the number of attempts: here in the array node handler, for the example workflow above, nCtx.Node().GetRetryStrategy() is nil, so maxAttemptsValue falls back to max-node-retries-system-failures=3 + default-max-attempts=1 = 4. When the bit array that tracks the retry attempts is created here, it is sized at 3 bits per entry, on the assumption that the stored values never exceed 4.

However, the task is actually retried 8 times. Up to attempt 7 this still works, because 3 bits can represent values 0 through 7, but attempt 8 no longer fits and panics here.
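For illustration, the sizing arithmetic works out as follows. This is a minimal, self-contained sketch of the math only; the variable names are made up and this is not the actual flytestdlib allocation code:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// Defaults taken from the description above.
	maxSystemFailures := 3  // max-node-retries-system-failures
	defaultMaxAttempts := 1 // default-max-attempts

	// retries=8 on the task is ignored, so the bitarray is sized for this value only:
	maxAttemptsValue := maxSystemFailures + defaultMaxAttempts // 4

	bitsPerItem := bits.Len(uint(maxAttemptsValue)) // 3 bits suffice to store 4 (binary 100)
	maxStorable := (1 << bitsPerItem) - 1           // 7 is the largest value 3 bits can hold

	fmt.Printf("bits per entry: %d, largest storable attempt: %d\n", bitsPerItem, maxStorable)
	// Attempt 8 (binary 1000) needs 4 bits and fails the bitarray's bounds
	// check, which is the panic described above.
}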

What changes were proposed in this pull request?

When determining the maximum value that can be stored in the bit array, we need to account not only for nCtx.Node().GetRetryStrategy() but also for nCtx.Node().GetArrayNode().GetSubNodeSpec().GetRetryStrategy(), which does contain the desired retries=8.

During my debugging, nCtx.Node().GetRetryStrategy() was always nil, but I cannot say for sure that this holds in every case. I therefore propose allocating the bit array with space for the maximum of both values, to be on the safe side; see the sketch below.
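In rough pseudocode, the fix sizes the bit array using the larger of the two retry strategies before adding the system-failure budget. The sketch below uses simplified stand-in parameters; the real handler reads these values from nCtx.Node().GetRetryStrategy() and nCtx.Node().GetArrayNode().GetSubNodeSpec().GetRetryStrategy(), and the actual config plumbing differs:

package main

import "fmt"

// computeMaxAttemptsValue mirrors the proposed logic: prefer the node-level
// retry strategy, but never size the bitarray below what the sub-node spec
// (which carries retries=8 from the task decorator) requires. The parameters
// are simplified stand-ins for the real nCtx accessors and config values.
func computeMaxAttemptsValue(nodeRetries, subNodeRetries *int, defaultMaxAttempts, maxSystemFailures int) int {
	maxAttempts := defaultMaxAttempts
	if nodeRetries != nil {
		maxAttempts = *nodeRetries
	}
	if subNodeRetries != nil && *subNodeRetries > maxAttempts {
		maxAttempts = *subNodeRetries
	}
	return maxAttempts + maxSystemFailures
}

func main() {
	subNodeRetries := 8
	// With the node-level strategy nil (as observed while debugging), the
	// bitarray is now sized for 8+3=11 instead of 1+3=4.
	fmt.Println(computeMaxAttemptsValue(nil, &subNodeRetries, 1, 3))
}

Taking the maximum of both values keeps the allocation safe even if nCtx.Node().GetRetryStrategy() turns out to be non-nil in some configurations.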

How was this patch tested?

The example above panics on latest master but works with the changes in this PR.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Summary by Bito

  • Addresses a critical bug in the node array handler that could lead to a panic when the retry attempt value exceeds the allocated bit array size.
  • Ensures that both the node's retry strategy and the sub-node's retry strategy are considered when determining the maximum number of attempts.
  • Enhances the stability of workflows that involve retries, particularly in scenarios where the retry count is high.

@fg91 fg91 added the "fixed" (For any bug fixes) label on Dec 12, 2025
@fg91 fg91 requested a review from hamersaw December 12, 2025 16:14
codecov bot commented Dec 12, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.96%. Comparing base (14b6bbf) to head (797064a).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
...ytepropeller/pkg/controller/nodes/array/handler.go 66.66% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6802      +/-   ##
==========================================
+ Coverage   56.95%   56.96%   +0.01%     
==========================================
  Files         929      929              
  Lines       58152    58156       +4     
==========================================
+ Hits        33121    33131      +10     
+ Misses      21989    21983       -6     
  Partials     3042     3042              
Flag Coverage Δ
unittests-datacatalog 53.51% <ø> (ø)
unittests-flyteadmin 53.14% <ø> (+0.03%) ⬆️
unittests-flytecopilot 43.06% <ø> (ø)
unittests-flytectl 64.02% <ø> (ø)
unittests-flyteidl 75.71% <ø> (ø)
unittests-flyteplugins 60.14% <ø> (ø)
unittests-flytepropeller 53.64% <66.66%> (+0.01%) ⬆️
unittests-flytestdlib 63.29% <ø> (ø)

Flags with carried forward coverage won't be shown.


@fg91 fg91 requested a review from Sovietaced December 12, 2025 16:37
@Sovietaced Sovietaced (Member) previously approved these changes Dec 15, 2025 and left a comment:

Is unit test coverage possible?

pingsutw previously approved these changes Dec 16, 2025
@fg91 fg91 (Member, Author) commented Dec 16, 2025

I'll try to add a unit test; let's not merge yet.

@fg91 fg91 dismissed stale reviews from Sovietaced and pingsutw via 797064a February 10, 2026 10:27
@fg91 fg91 (Member, Author) commented Feb 10, 2026

> Is unit test coverage possible?

@Sovietaced I added a unit test that follows the pattern of TestHandleArrayNodePhaseExecutingSubNodeFailures and which would panic without the fix.

@fg91 fg91 requested a review from Sovietaced February 10, 2026 10:28
@Sovietaced Sovietaced merged commit 6d92fb0 into master Feb 10, 2026
49 checks passed
@Sovietaced Sovietaced deleted the fg91/fix-node-array-retry-panic branch February 10, 2026 22:38