feat(pipeline): add zero-bubble-heuristic scheduling algorithm #618
ChengYao-amd wants to merge 2 commits into main
I did the comparison for Qwen3.5-235B (64 GPUs, MI355X, PP=4, VPP=1, EP=8, SeqLen=4096). Megatron ILP by Sea AI Lab still performs better than the Primus pipeline, though the difference is much smaller now.

| Config | zerobubble (ms) | zerobubble Tok/s/GPU | zerobubble Bubble% | zb-heuristic (ms) | zb-heuristic Tok/s/GPU | zb-heuristic Bubble% | megatron-ilp (ms) | megatron-ilp Tok/s/GPU | megatron-ilp Bubble% | Best | Final Tok/s/GPU |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |

Summary
- Adds the `zero-bubble-heuristic` pipeline parallelism scheduling algorithm, which uses a graph-based heuristic to explore 8 candidate schedules (combinations of `allow_bubble_before_first_b`, `prioritize_b`, and `no_bubble_greedy`) and selects the one with the lowest bubble time.
- Adds configuration parameters (`pp_max_mem`, `pp_cost_f`, `pp_cost_b`, `pp_cost_w`) to control the memory budget and the F/B/W cost model, enabling the scheduler to produce memory-aware schedules with realistic cost ratios.
- Improves the visualization script (`vis.py`) with a per-rank F/B/W time breakdown, correct cross-rank iteration time calculation, and detailed console output for easier performance analysis.

Changes
Core Algorithm
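The best-of-8 search described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in `zerobubble_heuristic.py`: `best_of_eight` and `toy_bubble_time` are hypothetical names, and the toy cost function only stands in for building a schedule and measuring its bubble time. The three flag names are the only parts taken from this PR.

```python
from itertools import product
from typing import Callable, Dict, Optional, Tuple

# Flag names as listed in the PR summary.
FLAGS = ("allow_bubble_before_first_b", "prioritize_b", "no_bubble_greedy")


def best_of_eight(evaluate: Callable[..., float]) -> Tuple[Optional[Dict[str, bool]], float]:
    """Try all 2**3 flag combinations and keep the one with the least bubble time."""
    best_cfg: Optional[Dict[str, bool]] = None
    best_bubble = float("inf")
    for values in product((False, True), repeat=len(FLAGS)):
        cfg = dict(zip(FLAGS, values))
        bubble = evaluate(**cfg)
        if bubble < best_bubble:
            best_cfg, best_bubble = cfg, bubble
    return best_cfg, best_bubble


def toy_bubble_time(allow_bubble_before_first_b: bool,
                    prioritize_b: bool,
                    no_bubble_greedy: bool) -> float:
    """Hypothetical stand-in: a real evaluator would build and time a schedule."""
    return (10.0
            - 3.0 * prioritize_b
            - 2.0 * no_bubble_greedy
            + 1.0 * allow_bubble_before_first_b)


best_cfg, best_bubble = best_of_eight(toy_bubble_time)
print(best_cfg, best_bubble)
```

Exhaustive enumeration is cheap here because there are only eight candidates; the cost is dominated by whatever `evaluate` does for each one.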
- `zerobubble_heuristic.py` (new): Self-contained implementation of the zero-bubble-heuristic scheduler, ported from the internal Megatron ZB module into the Primus scheduler framework. Implements `_Graph` (DAG-based scheduling), `_initial_solution` (best-of-8 heuristic search), and `ScheduleZeroBubbleHeuristic` (the `PipelineScheduleAlgo` subclass that generates the schedule table with proper send/recv communication pairs).

Integration
- `pipeline_launcher.py`: Registers `zero-bubble-heuristic` as a valid algorithm, passes `max_mem`/`cost_f`/`cost_b`/`cost_w` kwargs to the schedule factory, and adds `dump_pp_data` support via `schedule_wrapper`.
- `primus_turbo.py`: Enables split W-grad operations for the new algorithm.
- `schedule_table_factory.py`: Registers `ScheduleZeroBubbleHeuristic` in the algorithm map; replaces `@lru_cache` with a manual dict cache to support unhashable kwargs (lists).
- `primus_pipeline.yaml`: Adds config entries for `pp_max_mem`, `pp_cost_f`, `pp_cost_b`, `pp_cost_w`.
- `megatron_pretrain_trainer.py`: Adds post-training PP data dump for visualization/analysis.

Visualization & Analysis
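The `iter_time` fix in `vis.py` matters because an iteration only finishes when the slowest pipeline rank finishes. The sketch below illustrates the idea with made-up numbers. The function names, the record layout, and the definition of idle time as "iteration time minus F, B, and W" are assumptions for illustration, not the code in `vis.py`.

```python
from typing import Dict, List

# Hypothetical per-rank records: time spent in F/B/W and the rank's finish time (ms).
per_rank: List[Dict[str, float]] = [
    {"F": 40.0, "B": 40.0, "W": 20.0, "end": 120.0},
    {"F": 40.0, "B": 40.0, "W": 20.0, "end": 125.0},
]


def iteration_time(ranks: List[Dict[str, float]]) -> float:
    """Iteration time is the latest finish time across all ranks, not rank 0's."""
    return max(r["end"] for r in ranks)


def breakdown(rank: Dict[str, float], total: float) -> Dict[str, float]:
    """Percentage of the iteration spent in F, B, W, and (by assumption) idle."""
    pct = {k: 100.0 * rank[k] / total for k in ("F", "B", "W")}
    pct["idle"] = 100.0 - sum(pct.values())
    return pct


total = iteration_time(per_rank)
for i, rank in enumerate(per_rank):
    pct = breakdown(rank, total)
    print(f"rank {i}: " + ", ".join(f"{k}={v:.1f}%" for k, v in pct.items()))
```

Using only rank 0's finish time here would report 120 ms instead of 125 ms and understate the idle share on every rank.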
- `vis.py`: Extracts a `get_fbw_times()` helper; fixes `iter_time` to use the max across all ranks (not just rank 0); adds per-rank F/B/W time and percentage breakdown to the console output.
- `pp_simulation.yaml`: Adds two example simulation configs (`zb-heuristic-mem8`, `zb-heuristic-mem10`).

Algorithm Visualization