[Fix] Added HSDP functionality for Megatron FSDP #113
AllenFarcas wants to merge 4 commits into rocm_dev
- **MEGATRON_FSDP:**
`1` to enable Megatron-LM's custom FSDP with DTensor checkpointing (default: 0). It adds automatically `--use-megatron-fsdp --ckpt-format fsdp_dtensor` in the script. Of note, this disables `TP>1` automatically.

- **ENABLE_HSDP:**
Is ENABLE_HSDP used for anything other than guarding HSDP_NUM_DIST_OPT_INSTANCES? If not, the latter parameter can be used on its own whenever it is set to something other than 1.
I considered having only the HSDP_NUM_DIST_OPT_INSTANCES parameter, since it can be used independently, but I wanted a simple switch like ENABLE_HSDP. I will remove ENABLE_HSDP since it is redundant and keep only the HSDP_NUM_DIST_OPT_INSTANCES parameter.
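For illustration, a minimal sketch of how a single HSDP_NUM_DIST_OPT_INSTANCES variable could drive HSDP without a separate ENABLE_HSDP switch. The helper name `hsdp_args` and the `EXTRA_ARGS` variable are hypothetical, and the forwarded flag `--num-distributed-optimizer-instances` is an assumption about what the training script passes to Megatron-LM; the actual change in this PR may differ.

```shell
#!/usr/bin/env bash

# Default of 1 means "no HSDP"; any value greater than 1 turns it on.
HSDP_NUM_DIST_OPT_INSTANCES="${HSDP_NUM_DIST_OPT_INSTANCES:-1}"

# Hypothetical helper: print the extra training arguments for HSDP,
# or nothing when the instance count is 1.
hsdp_args() {
    local instances="$1"
    if [ "$instances" -gt 1 ]; then
        # Assumption: this is the flag the script forwards to Megatron-LM.
        echo "--num-distributed-optimizer-instances $instances"
    fi
}

# Hypothetical variable that would be appended to the launch command.
EXTRA_ARGS="$(hsdp_args "$HSDP_NUM_DIST_OPT_INSTANCES")"
```

With this shape, leaving the variable unset keeps the existing behavior, and `HSDP_NUM_DIST_OPT_INSTANCES=2` is the only setting a user needs to change.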
Hi @AllenFarcas, could you rebase this branch on top of the latest rocm_dev? IFU was completed yesterday.
Force-pushed from 878c824 to 4f5c06d.
examples/llama/train_llama2.sh
Outdated
fi
fi

if [ "$ENABLE_HSDP" -eq 1 ] && [ "$MEGATRON_FSDP" -ne 1 ]; then
I removed it before rebasing but didn't save the changes. It should be fixed now, using the HSDP_NUM_DIST_OPT_INSTANCES parameter.
I no longer see the original changes that I commented on in the PR.
@ipanfilo The original changes should still be there. I just removed the
I only see commits starting from Feb 25, but there are older conversations, so there were earlier commits that are no longer visible. Please avoid force-pushing to PRs that are being reviewed.
examples/llama/train_llama2.sh
Outdated
fi

if [ "$HSDP_NUM_DIST_OPT_INSTANCES" -gt 1 ] && [ "$MEGATRON_FSDP" -ne 1 ]; then
    echo "Error: HSDP_NUM_DIST_OPT_INSTANCES>1 requires MEGATRON_FSDP=1"
Yes, HSDP only works with Megatron FSDP enabled. We prompt the user to set it explicitly.
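The guard discussed here can be sketched as a small standalone function, which makes the three cases easy to check by hand. The function name `validate_hsdp` is hypothetical, and the sketch assumes the script should stop with a non-zero status when the check fails; the diff above only shows the condition and the error message.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed unless HSDP is requested without Megatron FSDP.
#   $1 = HSDP_NUM_DIST_OPT_INSTANCES, $2 = MEGATRON_FSDP
validate_hsdp() {
    local instances="$1"
    local megatron_fsdp="$2"
    if [ "$instances" -gt 1 ] && [ "$megatron_fsdp" -ne 1 ]; then
        # Same message as in the diff, sent to stderr.
        echo "Error: HSDP_NUM_DIST_OPT_INSTANCES>1 requires MEGATRON_FSDP=1" >&2
        return 1
    fi
    return 0
}

# Possible call site (assumed defaults: 1 instance, FSDP off):
# validate_hsdp "${HSDP_NUM_DIST_OPT_INSTANCES:-1}" "${MEGATRON_FSDP:-0}" || exit 1
```

Only the combination of more than one instance with `MEGATRON_FSDP` unset or 0 is rejected; the default configuration and the fully enabled one both pass.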
Streamlined the HSDP enablement with
Motivation
Added HSDP support for Megatron FSDP
Technical Details
Fixes issue https://github.com/ROCm/frameworks-internal/issues/11723
Test Plan
Ran Llama3 and Llama2 examples with the new changes.
Test Result
All tests pass.
Submission Checklist