Description
Is there an existing issue for this bug?
- I have searched the existing issues
The bug has not been fixed in the latest main branch
- I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
No, I prefer not to share.
🐛 Describe the bug
I want to fine-tune a Llama 8B model on 4 H800 GPUs with SP = 4 in HybridParallelPlugin (a rough sketch of my setup is included after the traceback), and I hit the following error:
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 878, in
[rank3]: main()
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 874, in main
[rank3]: train_model(model, train_loader, optimizer, scheduler, booster, tokenizer, args)
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 763, in train_model
[rank3]: outputs = model(**batch)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 222, in forward
[rank3]: return super().forward(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/interface/model.py", line 127, in forward
[rank3]: return self.module(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 329, in llama_for_causal_lm_forward
[rank3]: outputs = LlamaPipelineForwards.llama_model_forward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 161, in llama_model_forward
[rank3]: hidden_states = split_forward_gather_backward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1363, in split_forward_gather_backward
[rank3]: return SplitForwardGatherBackward.apply(input, dim, process_group, grad_scale, fp8_communication)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 999, in forward
[rank3]: return split(input, dim, process_group)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1199, in _split
[rank3]: assert dim_size % world_size == 0, (
[rank3]: AssertionError: The dimension to split (1775) is not a multiple of world size (2), cannot split tensor evenly
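For reference, here is a minimal sketch of my booster setup (not my actual script, which I prefer not to share; the tp/pp values and the sequence-parallelism mode below are approximations of what I actually pass):

```python
# Minimal sketch of my setup; tp/pp values and the SP mode are approximations.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()

plugin = HybridParallelPlugin(
    tp_size=1,                               # placeholder value
    pp_size=1,                               # placeholder value
    sp_size=4,                               # 4 x H800, SP = 4
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # not certain this is the mode I should be using
    precision="bf16",
)
booster = Booster(plugin=plugin)

# model, optimizer, train_loader, scheduler are built as usual, then boosted:
# model, optimizer, _, train_loader, scheduler = booster.boost(
#     model, optimizer, dataloader=train_loader, lr_scheduler=scheduler
# )
```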
I have padded all sequences to a length of 4096, but it seems the sequences are still being split at uneven lengths (the dimension to split is 1775, which is not a multiple of the world size 2). How can I fix this problem?
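My padding is roughly the following (a sketch of the collate function with placeholder names, not my real code):

```python
# Sketch of my collate function: every sample is padded to a fixed length of 4096.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder model id
tokenizer.pad_token = tokenizer.eos_token

def collate_fn(examples):
    batch = tokenizer(
        [ex["text"] for ex in examples],
        padding="max_length",   # pad everything to max_length
        max_length=4096,        # 4096 is divisible by sp_size = 4
        truncation=True,
        return_tensors="pt",
    )
    batch["labels"] = batch["input_ids"].clone()
    return batch
```

With this I expected every batch to reach the model with sequence length 4096, so I don't understand where the 1775 in the assertion comes from.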
Environment
No response