# WIP: Fix stuck workflow nodes that are starved by map tasks exceeding the max parallelism budget #6809
## Conversation
Commit: … the max parallelism budget (Signed-off-by: Fabio Grätz <fabio@cusp.ai>)
Bito Automatic Review Skipped - Draft PR
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #6809    +/-   ##
=========================================
  Coverage   56.93%   56.94%
=========================================
  Files         929      929
  Lines       58139    58146      +7
=========================================
+ Hits        33102    33110      +8
+ Misses      21996    21993      -3
- Partials     3041     3043      +2
```

Flags with carried forward coverage won't be shown.
@fg91 apologies for the significant delay. I'll take a look at this issue today.
Thank you!!
Closing as fixed by @pvditt in #6929 (review). |
## Why are the changes needed?
In this minimal reproducible example ...

... I observe the following, in my opinion undesirable, behaviour:
- Both the map task (`n0`) and the node after it (`n1`) start running.
- The `n1` pod succeeds quickly.
- The `n1` node stays in `Queued` state in the Flyte UI despite the underlying pod already having succeeded.
- The `n1` node is updated in the Flyte UI from `Queued` to `Succeeded` only after the map task (or at least enough tasks within it) completes as well, which could be hours later.

### Reason for this behaviour
In the `RecursiveNodeHandler`, which traverses the workflow graph, we check whether the current degree of parallelism has exceeded the max parallelism. For `n1`, `IsMaxParallelismAchieved` gives `true`: by the time the traversal reaches `n1`, the current degree of parallelism has the value 31, which is bigger than the default max parallelism of 25. `n1` will therefore be evaluated only once fewer than 25 tasks are still running within the map task `n0`. A simplified sketch of this gating logic follows below.
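For illustration, here is a minimal, runnable Go sketch of this gating behaviour. It is not the actual flytepropeller source: the `Node` struct, `recursiveNodeHandler`, and the `1 + RunningSubTasks` accounting are simplified assumptions that model the traversal described above.

```go
// Simplified model of the parallelism gate during graph traversal.
// Identifiers are illustrative; this is not the flytepropeller code.
package main

import "fmt"

type Node struct {
	ID              string
	Running         bool
	RunningSubTasks int // > 0 only for an ArrayNode / map task
}

// isMaxParallelismAchieved: once the running-task count reaches the
// budget, the traversal stops evaluating further nodes this round.
func isMaxParallelismAchieved(current, max int) bool {
	return max > 0 && current >= max
}

func recursiveNodeHandler(nodes []*Node, maxParallelism int) {
	current := 0
	for _, n := range nodes {
		if isMaxParallelismAchieved(current, maxParallelism) {
			// n1 lands here: its pod has already succeeded, but the
			// node is skipped and stays Queued in the UI until enough
			// of n0's subtasks finish.
			fmt.Printf("skip %s: parallelism %d >= max %d\n", n.ID, current, maxParallelism)
			continue
		}
		if n.Running {
			// A running map task charges 1 for itself plus one per
			// running subtask against the budget.
			current += 1 + n.RunningSubTasks
		}
		// ... evaluate n: check pod status, transition phase, etc. ...
	}
}

func main() {
	n0 := &Node{ID: "n0", Running: true, RunningSubTasks: 30}
	n1 := &Node{ID: "n1", Running: true} // underlying pod already succeeded
	recursiveNodeHandler([]*Node{n0, n1}, 25) // default max parallelism
}
```

Running this prints `skip n1: parallelism 31 >= max 25`: the sketch's `n1`, whose pod has already succeeded, is never re-evaluated while `n0`'s subtasks hold the budget.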
## What changes were proposed in this pull request?

This PR allows `n1` to be marked as succeeded immediately after the pod completes and not hours later when enough of the array node tasks complete. If max parallelism had instead prevented the launch of `n1`, I would expect `n1` to not start at all until `n0` is done. But as a user I wouldn't expect this "mixed" behaviour with a node that is seemingly stuck for hours despite having completed.

## Discussion
The behaviour can be avoided by modifying the parallelism tracking logic to count the map task as `1` and not as `1 + 30` (in this example). I would like to discuss which of the two is the intended behaviour; a sketch of the two accounting options follows below.
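To make the two options concrete, here is a small sketch continuing the simplified model above (again with illustrative names, not the actual ArrayNode handler code):

```go
package main

import "fmt"

// Option A: the behaviour observed above. A map task with 30 running
// subtasks consumes 31 parallelism slots, starving downstream nodes
// such as n1 until the subtasks drain below the budget.
func costPerSubtask(runningSubTasks int) int { return 1 + runningSubTasks }

// Option B: the whole map task counts as a single unit. n1 would be
// re-evaluated immediately, but max parallelism would no longer bound
// the total number of concurrently running pods.
func costFlat(runningSubTasks int) int { return 1 }

func main() {
	fmt.Println("per-subtask cost:", costPerSubtask(30)) // 31 >= 25: n1 starved
	fmt.Println("flat cost:", costFlat(30))              // 1 < 25: n1 evaluated
}
```

Option B keeps downstream nodes responsive, but it changes what the budget bounds; Option A preserves the pod-level bound at the cost of the starvation described above.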
## How was this patch tested?

## Check all the applicable boxes

## Related PRs

## Docs link