# WIP: Fix stuck workflow nodes that are starved by map tasks exceeding the max parallelism budget #6809
## Conversation
Commit: … the max parallelism budget (Signed-off-by: Fabio Grätz <fabio@cusp.ai>)
Bito Automatic Review Skipped - Draft PR
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #6809    +/-   ##
=========================================
  Coverage   56.93%   56.94%
=========================================
  Files         929      929
  Lines       58139    58146      +7
=========================================
+ Hits        33102    33110      +8
+ Misses      21996    21993      -3
- Partials     3041     3043      +2
```

Flags with carried forward coverage won't be shown.
@fg91 apologies for the significant delay. I'll take a look at this issue today.
Thank you!!
Closing as fixed by @pvditt in #6929 (review). |
## Why are the changes needed?
In this minimal reproducible example ...

... I observe the following, in my opinion undesirable, behaviour:
- Both the map task (`n0`) and the node after it (`n1`) start running.
- The `n1` pod succeeds quickly.
- The `n1` node stays in `Queued` state in the Flyte UI despite the underlying pod already having succeeded.
- The `n1` node is updated in the Flyte UI from `Queued` to `Succeeded` only after the map task (or at least enough tasks within it) completes as well, which could be hours later.

### Reason for this behaviour
In the `RecursiveNodeHandler`, which traverses the workflow graph, we check whether the current degree of parallelism has exceeded the max parallelism. For `n1`, `IsMaxParallelismAchieved` gives `true`: by the time the traversal reaches `n1`, the current degree of parallelism has the value 31, which is bigger than the default max parallelism of 25. `n1` will therefore be evaluated only once fewer than 25 tasks are still running within the map task `n0`. A simplified sketch of this gating logic follows below.
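For illustration, here is a minimal, runnable Go sketch of this gating behaviour. It is not the actual flytepropeller source: the `Node` struct, `recursiveNodeHandler`, and the `1 + RunningSubTasks` accounting are simplified assumptions that model the traversal described above.

```go
// Simplified model of the parallelism gate during graph traversal.
// Identifiers are illustrative; this is not the flytepropeller code.
package main

import "fmt"

type Node struct {
	ID              string
	Running         bool
	RunningSubTasks int // > 0 only for an ArrayNode / map task
}

// isMaxParallelismAchieved: once the running-task count reaches the
// budget, the traversal stops evaluating further nodes this round.
func isMaxParallelismAchieved(current, max int) bool {
	return max > 0 && current >= max
}

func recursiveNodeHandler(nodes []*Node, maxParallelism int) {
	current := 0
	for _, n := range nodes {
		if isMaxParallelismAchieved(current, maxParallelism) {
			// n1 lands here: its pod has already succeeded, but the
			// node is skipped and stays Queued in the UI until enough
			// of n0's subtasks finish.
			fmt.Printf("skip %s: parallelism %d >= max %d\n", n.ID, current, maxParallelism)
			continue
		}
		if n.Running {
			// A running map task charges 1 for itself plus one per
			// running subtask against the budget.
			current += 1 + n.RunningSubTasks
		}
		// ... evaluate n: check pod status, transition phase, etc. ...
	}
}

func main() {
	n0 := &Node{ID: "n0", Running: true, RunningSubTasks: 30}
	n1 := &Node{ID: "n1", Running: true} // underlying pod already succeeded
	recursiveNodeHandler([]*Node{n0, n1}, 25) // default max parallelism
}
```

Running this prints `skip n1: parallelism 31 >= max 25`: the sketch's `n1`, whose pod has already succeeded, is never re-evaluated while `n0`'s subtasks hold the budget.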
## What changes were proposed in this pull request?

This PR allows `n1` to be marked as succeeded immediately after the pod completes and not hours later when enough of the array node tasks complete. If max parallelism had instead prevented the launch of `n1`, I would expect `n1` to not start at all until `n0` is done. But as a user I wouldn't expect this "mixed" behaviour with a node that is seemingly stuck for hours despite having completed.

## Discussion
The behaviour can be avoided by modifying the parallelism tracking logic to count the map task as `1` and not as `1 + 30` (in this example). I would like to discuss which of the two is the intended behaviour; a sketch of the two accounting options follows below.
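To make the two options concrete, here is a small sketch continuing the simplified model above (again with illustrative names, not the actual ArrayNode handler code):

```go
package main

import "fmt"

// Option A: the behaviour observed above. A map task with 30 running
// subtasks consumes 31 parallelism slots, starving downstream nodes
// such as n1 until the subtasks drain below the budget.
func costPerSubtask(runningSubTasks int) int { return 1 + runningSubTasks }

// Option B: the whole map task counts as a single unit. n1 would be
// re-evaluated immediately, but max parallelism would no longer bound
// the total number of concurrently running pods.
func costFlat(runningSubTasks int) int { return 1 }

func main() {
	fmt.Println("per-subtask cost:", costPerSubtask(30)) // 31 >= 25: n1 starved
	fmt.Println("flat cost:", costFlat(30))              // 1 < 25: n1 evaluated
}
```

Option B keeps downstream nodes responsive, but it changes what the budget bounds; Option A preserves the pod-level bound at the cost of the starvation described above.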
## How was this patch tested?

## Check all the applicable boxes

## Related PRs

## Docs link