Description
I encounter the error below when using Megatron to adapt my customized transformer to support model parallelism, based on the CodeGeeX repo. Even when I launch with torchrun using parameters such as "--standalone --nproc-per-node=1", the error still occurs. The most confusing part is that _PIPELINE_MODEL_PARALLEL_GROUP has already been initialized by initialize_model_parallel in megatron\mpu\initialize.py, yet the assertion still fails. If you know how to solve this problem, please give me some advice; I would really appreciate it! (A small diagnostic sketch follows the traceback below.)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/wangrui/animatediff/test_parallel_attention.py", line 189, in
[rank0]: pretrain(train_valid_test_datasets_provider, model_provider, forward_step)
[rank0]: File "/home/wangrui/miniforge3/envs/canvas/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
[rank0]: return f(*args, **kwargs)
[rank0]: File "/home/wangrui/animatediff/animatediff/megatron/training.py", line 144, in pretrain
[rank0]: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
[rank0]: File "/home/wangrui/animatediff/animatediff/megatron/training.py", line 402, in setup_model_and_optimizer
[rank0]: model = get_model(model_provider_func)
[rank0]: File "/home/wangrui/animatediff/animatediff/megatron/training.py", line 274, in get_model
[rank0]: model = model_provider_func(pre_process=pre_process, post_process=post_process)
[rank0]: File "/home/wangrui/animatediff/test_parallel_attention.py", line 67, in model_provider
[rank0]: model = ParallelAttention(init_method_normal(args.init_method_std),None,args.num_layers)
[rank0]: File "/home/wangrui/animatediff/animatediff/diffusers/models/attention_processor.py", line 933, in init
[rank0]: world_size = mpu.get_model_parallel_world_size()
[rank0]: File "/home/wangrui/animatediff/animatediff/parallel_animatediff/mpu/initialize.py", line 251, in get_model_parallel_world_size
[rank0]: get_pipeline_model_parallel_world_size() == 1
[rank0]: File "/home/wangrui/animatediff/animatediff/parallel_animatediff/mpu/initialize.py", line 261, in get_pipeline_model_parallel_world_size
[rank0]: return torch.distributed.get_world_size(group=get_pipeline_model_parallel_group())
[rank0]: File "/home/wangrui/animatediff/animatediff/parallel_animatediff/mpu/initialize.py", line 211, in get_pipeline_model_parallel_group
[rank0]: assert (
[rank0]: AssertionError: pipeline_model parallel group is not initialized
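For context, here is a minimal diagnostic sketch of the kind of check I have in mind: it verifies whether the mpu module that training.py initializes is the same module object that ParallelAttention in attention_processor.py reads from, since the two paths in the traceback (animatediff/megatron and animatediff/parallel_animatediff/mpu) look like separate copies. The import paths, the exact initialize_model_parallel signature, and the model_parallel_is_initialized() helper are assumptions based on the traceback and the standard Megatron mpu API; my actual layout may differ.

```python
# Diagnostic sketch (assumptions: import paths guessed from the traceback,
# standard Megatron-style mpu API with model_parallel_is_initialized()).
import os
import torch

# Hypothetical imports matching the two paths that appear in the traceback.
from animatediff.megatron import mpu as training_mpu               # initialized via training.py
from animatediff.parallel_animatediff import mpu as attention_mpu  # queried by ParallelAttention


def check_mpu_consistency():
    # If these are two different module objects, initialize_model_parallel()
    # only fills in the globals (_PIPELINE_MODEL_PARALLEL_GROUP, ...) of one copy,
    # and the other copy raises "pipeline_model parallel group is not initialized".
    print("same mpu module object:", training_mpu is attention_mpu)
    print("training mpu initialized:", training_mpu.model_parallel_is_initialized())
    print("attention mpu initialized:", attention_mpu.model_parallel_is_initialized())


if __name__ == "__main__":
    # Minimal single-process setup, mirroring torchrun --standalone --nproc-per-node=1.
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    # Tensor parallel size 1, pipeline parallel size 1.
    training_mpu.initialize_model_parallel(1, 1)
    check_mpu_consistency()
```

If the two mpu objects turn out to be different, that would explain why the group looks initialized in one file while the assertion still fires in the other.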