Skip to content

[Fix] Added HSDP functionality for Megatron FSDP#113

Open
AllenFarcas wants to merge 4 commits intorocm_devfrom
alfarcas/issue11723-fix
Open

[Fix] Added HSDP functionality for Megatron FSDP#113
AllenFarcas wants to merge 4 commits intorocm_devfrom
alfarcas/issue11723-fix

Conversation

@AllenFarcas
Copy link
Collaborator

Motivation

Added HSDP support for Megatron FSDP

Technical Details

Fixes issue https://github.com/ROCm/frameworks-internal/issues/11723

Test Plan

Ran Llama3 and Llama2 examples with the new changes.

Test Result

All tests pass.

Submission Checklist

- **MEGATRON_FSDP:**
`1` to enable Megatron-LM's custom FSDP with DTensor checkpointing (default: 0). It adds automatically `--use-megatron-fsdp --ckpt-format fsdp_dtensor` in the script. Of note, this disables `TP>1` automatically.

- **ENABLE_HSDP:**

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ENABLE_HSDP uses for anything else except guarding around HSDP_NUM_DIST_OPT_INSTANCES? Otherwise the latter parameter can be used independently, if set to something except 1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of just having the HSDP_NUM_DIST_OPT_INSTANCES parameter since it can be used independently, but would have liked to have some easy functionality like ENABLE_HSDP. I will remove the ENABLE_HSDP since it is redundant and leave only the HSDP_NUM_DIST_OPT_INSTANCES parameter.

@sudhu2k
Copy link
Collaborator

sudhu2k commented Feb 20, 2026

Hi @AllenFarcas, could you rebase this branch on top of the latest rocm_dev? IFU was completed yesterday.

@AllenFarcas AllenFarcas force-pushed the alfarcas/issue11723-fix branch from 878c824 to 4f5c06d Compare February 25, 2026 17:04
@AllenFarcas AllenFarcas requested a review from ipanfilo February 25, 2026 17:07
fi
fi

if [ "$ENABLE_HSDP" -eq 1 ] && [ "$MEGATRON_FSDP" -ne 1 ]; then

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does ENABLE_HSDP come from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it before doing a rebase but didn't save the changes. It should be fixed right now, with the HSDP_NUM_DIST_OPT_INSTANCES parameter.

@ipanfilo
Copy link

I do not see original changes that I commented in the PR anymore

@AllenFarcas
Copy link
Collaborator Author

@ipanfilo The original changes should still be there. I just removed the ENABLE_HSDP flag since it was indeed redundant and only kept the HSDP_NUM_DIST_OPT_INSTANCES

@AllenFarcas AllenFarcas requested a review from ipanfilo February 26, 2026 17:52
@ipanfilo
Copy link

@ipanfilo The original changes should still be there. I just removed the ENABLE_HSDP flag since it was indeed redundant and only kept the HSDP_NUM_DIST_OPT_INSTANCES

I see only commits from starting Feb 25 but there are older conversations so there were commits before that that are not seen anymore. Please avoid force commits on PRs that are being reivewed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright date

fi

if [ "$HSDP_NUM_DIST_OPT_INSTANCES" -gt 1 ] && [ "$MEGATRON_FSDP" -ne 1 ]; then
echo "Error: HSDP_NUM_DIST_OPT_INSTANCES>1 requires MEGATRON_FSDP=1"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: HSDP enablement requires.... ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, HSDP only works with Megatron FSDP enabled. We prompt the user to set it explicitly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright date

@AllenFarcas AllenFarcas requested a review from ipanfilo March 2, 2026 16:18
@AllenFarcas
Copy link
Collaborator Author

Streamlined the HSDP enablement with ENABLE_HSDP and HSDP_NUM_REPLICAS variables in the training scripts for better user understanding of the variables and their purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants