Description
Is there an existing issue for this bug?
- I have searched the existing issues
The bug has not been fixed in the latest main branch
- I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
No, I prefer not to share.
🐛 Describe the bug
I want to fine-tune a Llama 8B model on 4 H800 GPUs with SP = 4 in HybridParallelPlugin (a rough sketch of my setup is included after the traceback), and I hit the following error:
[rank3]: Traceback (most recent call last):
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 878, in
[rank3]: main()
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 874, in main
[rank3]: train_model(model, train_loader, optimizer, scheduler, booster, tokenizer, args)
[rank3]: File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 763, in train_model
[rank3]: outputs = model(**batch)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 222, in forward
[rank3]: return super().forward(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/interface/model.py", line 127, in forward
[rank3]: return self.module(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 329, in llama_for_causal_lm_forward
[rank3]: outputs = LlamaPipelineForwards.llama_model_forward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 161, in llama_model_forward
[rank3]: hidden_states = split_forward_gather_backward(
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1363, in split_forward_gather_backward
[rank3]: return SplitForwardGatherBackward.apply(input, dim, process_group, grad_scale, fp8_communication)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 999, in forward
[rank3]: return split(input, dim, process_group)
[rank3]: File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1199, in _split
[rank3]: assert dim_size % world_size == 0, (
[rank3]: AssertionError: The dimension to split (1775) is not a multiple of world size (2), cannot split tensor evenly
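For reference, here is a minimal sketch of my booster setup (not my actual script, which I prefer not to share; the tp/pp values and the sequence-parallelism mode below are approximations of what I actually pass):

```python
# Minimal sketch of my setup; tp/pp values and the SP mode are approximations.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

colossalai.launch_from_torch()

plugin = HybridParallelPlugin(
    tp_size=1,                               # placeholder value
    pp_size=1,                               # placeholder value
    sp_size=4,                               # 4 x H800, SP = 4
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",  # not certain this is the mode I should be using
    precision="bf16",
)
booster = Booster(plugin=plugin)

# model, optimizer, train_loader, scheduler are built as usual, then boosted:
# model, optimizer, _, train_loader, scheduler = booster.boost(
#     model, optimizer, dataloader=train_loader, lr_scheduler=scheduler
# )
```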
I have padded all sequences to a length of 4096, but it seems the sequences are still being split at uneven lengths (the dimension to split is 1775, which is not a multiple of the world size 2). How can I fix this problem?
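My padding is roughly the following (a sketch of the collate function with placeholder names, not my real code):

```python
# Sketch of my collate function: every sample is padded to a fixed length of 4096.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder model id
tokenizer.pad_token = tokenizer.eos_token

def collate_fn(examples):
    batch = tokenizer(
        [ex["text"] for ex in examples],
        padding="max_length",   # pad everything to max_length
        max_length=4096,        # 4096 is divisible by sp_size = 4
        truncation=True,
        return_tensors="pt",
    )
    batch["labels"] = batch["input_ids"].clone()
    return batch
```

With this I expected every batch to reach the model with sequence length 4096, so I don't understand where the 1775 in the assertion comes from.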
Environment
No response