
[BUG]: colossal cannot split tensor evenly when using Sequence Parallelism in HybridParallelPlugin #6381

@Hugo-cell111

Description


Is there an existing issue for this bug?

  • I have searched the existing issues

The bug has not been fixed in the latest main branch

  • I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

No, I prefer not to share.

🐛 Describe the bug

I want to finetune a LLaMA-8B model on 4 H800 GPUs with SP = 4 in HybridParallelPlugin, and I hit the following bug:
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 878, in
[rank3]: main()
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 874, in main
[rank3]: train_model(model, train_loader, optimizer, scheduler, booster, tokenizer, args)
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 763, in train_model
[rank3]: outputs = model(**batch)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 222, in forward
[rank3]: return super().forward(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/interface/model.py", line 127, in forward
[rank3]: return self.module(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 329, in llama_for_causal_lm_forward
[rank3]: outputs = LlamaPipelineForwards.llama_model_forward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 161, in llama_model_forward
[rank3]: hidden_states = split_forward_gather_backward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1363, in split_forward_gather_backward
[rank3]: return SplitForwardGatherBackward.apply(input, dim, process_group, grad_scale, fp8_communication)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 999, in forward
[rank3]: return split(input, dim, process_group)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1199, in _split
[rank3]: assert dim_size % world_size == 0, (
[rank3]: AssertionError: The dimension to split (1775) is not a multiple of world size (2), cannot split tensor evenly
I have padded the sequences to a length of 4096, but it seems the plugin splits my sequences at different lengths. How can I fix this problem?
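The assertion fires because the sequence dimension reaching `split_forward_gather_backward` (1775 here) is not divisible by the sequence-parallel group size, so the tensor cannot be sharded evenly. A common workaround is to pad every batch up to a multiple of the SP size in the collate function. The sketch below is a hypothetical helper, not part of the original script; `sp_size` and `pad_token_id` are assumed values:

```python
import math

def pad_to_multiple(batch, sp_size=4, pad_token_id=0):
    """Pad a batch of token-id lists so the sequence length is a
    multiple of sp_size, allowing an even split across SP ranks.

    batch: list of token-id sequences (lists of ints).
    Returns dict with padded input_ids and matching attention_mask.
    """
    seq_len = max(len(ids) for ids in batch)
    # Round the longest sequence up to the next multiple of sp_size.
    target = math.ceil(seq_len / sp_size) * sp_size
    input_ids, attention_mask = [], []
    for ids in batch:
        pad = target - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

If the model is also pipelined, the effective split may need divisibility by a larger factor (e.g. sp_size times 2 for ring-attention variants), so padding to the least common multiple of those factors is the safer choice.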

Environment

No response

Metadata

Labels: bug