Joyce collective for MI355 & B200 cluster #1

Open
joyjie818 wants to merge 13 commits into Z-Y00:main from joyjie818:joyce_collective

@joyjie818

Scripts for three items:
1. Node-to-node submission on the MI355 cluster -> submit_pairs.sh (for the DCPT cluster)
2. Slurm script for B200 converted to an enroot + srun based launch for collective ops
3. Node-to-node submission on the B200 cluster -> submit_b200.sh (GPUA818/GPUA7DD nodes)

jasainio and others added 13 commits February 25, 2026 17:46
Adds a patch to fix Megatron FSDP compatibility with PyTorch 2.10+. The
patch updates get_mesh_names to use the new DeviceMesh API
(_get_root_mesh() and _flatten_mapping) instead of the deprecated
_mesh_resources.child_to_root_mapping removed in PyTorch 2.10. The patch
is automatically applied when use_megatron_fsdp is enabled.

Co-authored-by: WangLingxun <linxwang@amd.com>
Adds support for CPU initialization in Primus Turbo linear layers
(RowParallelLinear, ColumnParallelLinear, and LayerNormLinear). When
use_cpu_initialization is enabled, the patch disables custom init
methods by passing a no-op lambda, allowing Megatron's CPU
initialization to work correctly with Primus Turbo's custom layer
implementations.

Co-authored-by: WangLingxun <linxwang@amd.com>
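
The no-op init pattern from the CPU-initialization commit above can be sketched in plain Python. This is only an illustration: `ToyLinear`, `custom_init`, and `build_layer` are hypothetical stand-ins, not the Primus Turbo or Megatron API, and the real layers take many more arguments.

```python
class ToyLinear:
    """Hypothetical stand-in for a parallel linear layer (not the real API)."""

    def __init__(self, in_features, out_features, init_method):
        # Weights start at zero; the layer then applies whatever init it was given.
        self.weight = [[0.0] * in_features for _ in range(out_features)]
        init_method(self.weight)


def custom_init(weight):
    """Hypothetical custom init: fill every weight with 1.0 in place."""
    for row in weight:
        for i in range(len(row)):
            row[i] = 1.0


def build_layer(in_features, out_features, use_cpu_initialization):
    # With CPU initialization enabled, pass a no-op so the layer's own init
    # does not touch the weights and the framework's CPU path can set them.
    init_method = (lambda w: None) if use_cpu_initialization else custom_init
    return ToyLinear(in_features, out_features, init_method)


cpu_layer = build_layer(2, 1, use_cpu_initialization=True)
gpu_layer = build_layer(2, 1, use_cpu_initialization=False)
print(cpu_layer.weight, gpu_layer.weight)
```

With the flag set, the layer leaves its weights untouched, which is the behavior the commit message describes; what fills them in afterwards is outside this sketch.
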
Previously, the evaluation loss was computed per iteration and
overwritten, leading to incorrect averaging when multiple eval
iterations are used.
This fix accumulates the numerator and denominator separately across all
eval iterations and computes the final average at the end.
Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
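
A minimal sketch of the accumulation described in the eval-loss fix above. It assumes each eval iteration yields a summed loss (numerator) and a token count (denominator); the commit message does not say what the two quantities are, and `average_eval_loss` is a hypothetical name.

```python
def average_eval_loss(batches):
    """Average loss over all eval iterations.

    `batches` is an iterable of (loss_sum, num_tokens) pairs, one per
    eval iteration (an assumed shape, not taken from the patch).
    """
    total_num = 0.0
    total_den = 0
    for loss_sum, num_tokens in batches:
        # Accumulate instead of overwriting, so every iteration contributes.
        total_num += loss_sum
        total_den += num_tokens
    # Divide once at the end; guard against an empty evaluation.
    return total_num / total_den if total_den else float("nan")


# Two eval iterations with different token counts.
batches = [(20.0, 10), (90.0, 30)]
print(average_eval_loss(batches))  # 110 / 40 = 2.75
```

Overwriting per iteration would report only the last iteration's mean (3.0 here), and averaging the per-iteration means (2.0 and 3.0) would give 2.5; dividing the accumulated totals weights each iteration by its size.
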
…odes (AMD-AGI#554)

### Changes:
Only flag imbalance if the COUNT of GPUs on each node differs.
Example:
4 on Node 0, 4 on Node 1 -> counts=[4,4] -> set={4} -> len=1 -> NOT imbalanced.
7 on Node 0, 1 on Node 1 -> counts=[7,1] -> set={7,1} -> len=2 -> Imbalanced.

### Reason for changes:
The previous logic would issue a NUMA imbalance warning if not all GPUs
were connected to the same node, resulting in a false positive when
using a multi-socket CPU.
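
The count-based check can be sketched as follows. This is an illustration under assumptions, not the patched code: `is_numa_imbalanced` is a hypothetical name, and the input is assumed to be a list giving the NUMA node of each GPU.

```python
from collections import Counter


def is_numa_imbalanced(gpu_numa_nodes):
    """Return True only if NUMA nodes host differing numbers of GPUs.

    `gpu_numa_nodes[i]` is the NUMA node that GPU i is attached to
    (an assumed input format).
    """
    counts = Counter(gpu_numa_nodes).values()
    # One distinct count means every node that has GPUs has the same number.
    return len(set(counts)) > 1


balanced = [0, 0, 0, 0, 1, 1, 1, 1]  # counts=[4,4] -> set={4}
skewed = [0] * 7 + [1]               # counts=[7,1] -> set={7,1}
print(is_numa_imbalanced(balanced))  # False
print(is_numa_imbalanced(skewed))    # True
```

A NUMA node with no GPUs at all never appears in the counts, so this sketch only compares nodes that have at least one GPU attached.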

---------

Co-authored-by: Xiaoming-AMD <Xiaoming.Peng@amd.com>
Created by Joyce for their DCGPU cluster

6 participants