feat(pipeline): add zero-bubble-heuristic scheduling algorithm by ChengYao-amd · Pull Request #618 · AMD-AGI/Primus

ChengYao-amd · 2026-03-19T10:47:36Z

Summary

Adds a new zero-bubble-heuristic pipeline parallelism scheduling algorithm that uses a graph-based heuristic to explore 8 candidate schedules (combinations of allow_bubble_before_first_b, prioritize_b, no_bubble_greedy) and selects the one with the lowest bubble time.
Exposes configurable parameters (pp_max_mem, pp_cost_f, pp_cost_b, pp_cost_w) to control the memory budget and F/B/W cost model, enabling the scheduler to produce memory-aware schedules with realistic cost ratios.
Enhances the PP visualization tool (vis.py) with per-rank F/B/W time breakdown, correct cross-rank iteration time calculation, and detailed console output for easier performance analysis.
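
The best-of-8 selection described above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes the three options are boolean flags (which is what yields 8 combinations), and `best_of_candidates` and `toy_bubble_time` are hypothetical names with made-up costs standing in for the real scheduler.

```python
from itertools import product
from typing import Callable, Dict, Tuple

# Flag names taken from the PR summary; treating them as booleans is an assumption.
FLAG_NAMES = ("allow_bubble_before_first_b", "prioritize_b", "no_bubble_greedy")


def best_of_candidates(
    bubble_time: Callable[..., float],
) -> Tuple[Dict[str, bool], float]:
    """Try every flag combination and keep the one with the lowest bubble time."""
    best_flags: Dict[str, bool] = {}
    best_bubble = float("inf")
    for values in product((False, True), repeat=len(FLAG_NAMES)):
        flags = dict(zip(FLAG_NAMES, values))
        bubble = bubble_time(**flags)
        if bubble < best_bubble:
            best_flags, best_bubble = flags, bubble
    return best_flags, best_bubble


def toy_bubble_time(allow_bubble_before_first_b, prioritize_b, no_bubble_greedy):
    # Hypothetical stand-in for "build a schedule and measure its bubble time".
    return (
        10.0
        - 3.0 * prioritize_b
        - 2.0 * no_bubble_greedy
        + 1.5 * allow_bubble_before_first_b
    )


flags, bubble = best_of_candidates(toy_bubble_time)
print(flags, bubble)
```

In the real scheduler each candidate would be a full schedule built under the memory budget; only the exhaustive enumeration and the minimum selection are shown here.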

Changes

Core Algorithm

zerobubble_heuristic.py (new): Self-contained implementation of the zero-bubble-heuristic scheduler, ported from the internal Megatron ZB module into the Primus scheduler framework. Implements _Graph (DAG-based scheduling), _initial_solution (best-of-8 heuristic search), and ScheduleZeroBubbleHeuristic (the PipelineScheduleAlgo subclass that generates the schedule table with proper send/recv communication pairs).
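
For readers unfamiliar with the F/B/W split, the sketch below shows the kind of dependency DAG such a scheduler reasons about. It assumes the usual zero-bubble formulation (F follows F on the previous stage, B follows F on the same stage and B on the next stage, W follows B on the same stage) and is not the PR's `_Graph`; `build_deps` and `critical_path` are hypothetical helpers.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, int, int]  # (kind, stage, microbatch)


def build_deps(num_stages: int, num_microbatches: int) -> Dict[Node, List[Node]]:
    """Dependency DAG for F/B/W passes (assumed standard zero-bubble rules)."""
    deps: Dict[Node, List[Node]] = {}
    last = num_stages - 1
    for m in range(num_microbatches):
        for s in range(num_stages):
            deps[("F", s, m)] = [("F", s - 1, m)] if s > 0 else []
            deps[("B", s, m)] = [("F", s, m)] + ([("B", s + 1, m)] if s < last else [])
            deps[("W", s, m)] = [("B", s, m)]
    return deps


def critical_path(deps: Dict[Node, List[Node]], cost: Dict[str, float]) -> float:
    """Longest dependency chain, ignoring that a stage runs one pass at a time."""
    finish: Dict[Node, float] = {}

    def finish_time(node: Node) -> float:
        if node not in finish:
            start = max((finish_time(d) for d in deps[node]), default=0.0)
            finish[node] = start + cost[node[0]]
        return finish[node]

    return max(finish_time(node) for node in deps)


deps = build_deps(num_stages=4, num_microbatches=2)
lower_bound = critical_path(deps, {"F": 1.0, "B": 1.0, "W": 1.0})
print(lower_bound)
```

The result is only a lower bound on iteration time; a real schedule must also serialize the passes on each stage and respect the memory budget, which is where the heuristic choices come in.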

Integration

pipeline_launcher.py: Registers zero-bubble-heuristic as a valid algorithm, passes max_mem/cost_f/cost_b/cost_w kwargs to the schedule factory, and adds dump_pp_data support via schedule_wrapper.
primus_turbo.py: Enables split W-grad operations for the new algorithm.
schedule_table_factory.py: Registers ScheduleZeroBubbleHeuristic in the algorithm map; replaces @lru_cache with a manual dict cache to support unhashable kwargs (lists).
primus_pipeline.yaml: Adds config entries for pp_max_mem, pp_cost_f, pp_cost_b, pp_cost_w.
megatron_pretrain_trainer.py: Adds post-training PP data dump for visualization/analysis.

Visualization & Analysis

vis.py: Extracts get_fbw_times() helper; fixes iter_time to use max across all ranks (not just rank-0); adds per-rank F/B/W time and percentage breakdown in console output.
pp_simulation.yaml: Adds two example simulation configs (zb-heuristic-mem8, zb-heuristic-mem10).
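
The iter_time fix and the per-rank breakdown can be sketched as below. The data layout (a dict of per-rank totals plus an `end` timestamp) and the function name `summarize_ranks` are assumptions made for illustration; vis.py's real structures and its exact definition of the percentages are not shown in this PR description.

```python
from typing import Dict, Tuple

RankStats = Dict[str, float]  # assumed keys: "F", "B", "W" (busy ms) and "end" (ms)


def summarize_ranks(
    per_rank: Dict[int, RankStats],
) -> Tuple[float, Dict[int, Dict[str, Tuple[float, float]]]]:
    """Return iteration time and per-rank (time, percent of iteration) for F/B/W."""
    # The iteration ends when the slowest rank finishes, not when rank 0 does.
    iter_time = max(stats["end"] for stats in per_rank.values())
    breakdown = {
        rank: {
            kind: (stats[kind], 100.0 * stats[kind] / iter_time)
            for kind in ("F", "B", "W")
        }
        for rank, stats in per_rank.items()
    }
    return iter_time, breakdown


# Made-up numbers: rank 1 finishes later than rank 0.
per_rank = {
    0: {"F": 30.0, "B": 30.0, "W": 20.0, "end": 90.0},
    1: {"F": 30.0, "B": 30.0, "W": 20.0, "end": 100.0},
}
iter_time, breakdown = summarize_ranks(per_rank)
for rank in sorted(breakdown):
    parts = ", ".join(f"{k}={t:.1f} ms ({p:.1f}%)" for k, (t, p) in breakdown[rank].items())
    print(f"rank {rank}: {parts}")
```

With these numbers a rank-0-only calculation would report 90 ms, while the cross-rank maximum gives 100 ms.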

Algorithm Visualization

tools/visualization/pp_vis/vis.py

araina-amd · 2026-03-19T22:06:30Z

I ran the comparison for Qwen3.5-235B (64 GPUs, MI355X, PP=4, VPP=1, EP=8, SeqLen=4096). Megatron ILP by Sea AI Lab still performs better than the Primus pipeline, though the difference is much smaller now.
For zero bubble, I will keep pointing the projection model to Megatron ILP by Sea AI Lab for now and fall back to the Primus pipeline for all other cases.

| # | Config | zerobubble (ms) | zerobubble Tok/s/GPU | zerobubble Bubble% | zb-heuristic (ms) | zb-heuristic Tok/s/GPU | zb-heuristic Bubble% | megatron-ilp (ms) | megatron-ilp Tok/s/GPU | megatron-ilp Bubble% | Best | Final Tok/s/GPU |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 1 | BF16, GBS=1024, MBS=2, RC=5 | 12,385.12 | 5,292 | 7.22% | 12,296.10 | 5,330 | 6.55% | 11,996.55 | 5,463 | 1.18% | megatron-ilp | 5,093 |
| 2 | FP8, GBS=1024, MBS=2, RC=5 | 11,788.54 | 5,559 | 7.18% | 11,743.55 | 5,581 | 6.83% | 11,425.97 | 5,736 | 1.15% | megatron-ilp | 5,329 |
| 3 | BF16, GBS=2048, MBS=2, RC=5 | 24,323.48 | 5,389 | 5.77% | 24,200.74 | 5,416 | 5.29% | 23,930.79 | 5,477 | 1.18% | megatron-ilp | 5,285 |
| 4 | FP8, GBS=2048, MBS=2, RC=5 | 23,584.90 | 5,557 | 5.73% | 23,519.74 | 5,573 | 5.47% | 23,212.43 | 5,647 | 1.15% | megatron-ilp | 5,442 |
| 5 | BF16, GBS=2048, MBS=4, RC=10 | 24,483.61 | 5,353 | 6.95% | 24,346.47 | 5,384 | 6.42% | 23,743.03 | 5,520 | 1.15% | megatron-ilp | 5,325 |
| 6 | FP8, GBS=2048, MBS=4, RC=10 | 22,176.38 | 5,910 | 6.95% | 22,176.38 | 5,910 | 6.95% | 21,598.20 | 6,069 | 1.48% | megatron-ilp | 5,833 |
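
As a sanity check on the table, the per-scheduler Tok/s/GPU columns are consistent with tokens per iteration (GBS × SeqLen) divided by GPU count and iteration time. The formula is an assumption inferred from the numbers rather than something stated in the comment, and `tokens_per_sec_per_gpu` is a hypothetical helper. It does not cover the Final Tok/s/GPU column.

```python
def tokens_per_sec_per_gpu(gbs: int, seq_len: int, num_gpus: int, iter_ms: float) -> float:
    """Throughput per GPU, assuming every sample has exactly seq_len tokens."""
    tokens_per_iter = gbs * seq_len
    return tokens_per_iter / num_gpus / (iter_ms / 1000.0)


# Row 1 of the table: BF16, GBS=1024, SeqLen=4096, 64 GPUs.
row1_ms = {
    "zerobubble": 12385.12,
    "zb-heuristic": 12296.10,
    "megatron-ilp": 11996.55,
}
row1_tok = {
    name: round(tokens_per_sec_per_gpu(1024, 4096, 64, ms))
    for name, ms in row1_ms.items()
}
print(row1_tok)  # matches the 5,292 / 5,330 / 5,463 entries in row 1

# Relative iteration-time gap between zb-heuristic and megatron-ilp for row 1.
gap = row1_ms["zb-heuristic"] / row1_ms["megatron-ilp"] - 1.0
print(f"megatron-ilp is {gap:.1%} faster per iteration")
```

For row 1 the remaining gap between zb-heuristic and megatron-ilp is about 2.5% in iteration time.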

github-code-quality bot found potential problems Mar 19, 2026

tools/visualization/pp_vis/vis.py (marked Fixed)

ChengYao-amd force-pushed the dev/yc/add-zero-bubble-heuristic branch from bfc8922 to 139ce7a Compare March 19, 2026 10:53

ChengYao-amd requested review from Xiaoming-AMD, limou102 and wenxie-amd as code owners March 19, 2026 10:53

feat(pipeline): add zero-bubble-heuristic scheduling algorithm

d430dbb

ChengYao-amd force-pushed the dev/yc/add-zero-bubble-heuristic branch from 139ce7a to d430dbb Compare March 19, 2026 11:07

wenxie-amd approved these changes Mar 19, 2026


araina-amd self-requested a review March 19, 2026 21:56

Merge branch 'main' into dev/yc/add-zero-bubble-heuristic

66110c8

ChengYao-amd wants to merge 2 commits into main from dev/yc/add-zero-bubble-heuristic
