[OPT] Reduce update_weight peak memory in RL #1306
Conversation
xtuner/v1/model/base.py
Outdated
```python
# Concatenate the tensors along the FSDP shard dim
for tensors, size in zip(_fsdp_unsharded_tensor_list, origin_fsdp_size):
    # special case for partition of tensors are contiguous
```
The comment should describe why rather than how
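To illustrate the "why" the reviewer is asking for: the special case exists because shards that are back-to-back views of one storage can be reassembled without a copy, which is what lowers peak memory. The sketch below is a hypothetical helper (`concat_fsdp_shards` is not the PR's actual function) assuming sharding along dim 0:

```python
import torch


def concat_fsdp_shards(shards, shard_dim=0):
    # Fast path rationale: when per-rank shards are consecutive views of a
    # single untyped storage, the full parameter can be rebuilt as a view
    # (as_strided) instead of a torch.cat copy, keeping peak memory near
    # 1x the parameter size instead of 2x.
    storage = shards[0].untyped_storage()
    offset = shards[0].storage_offset()
    zero_copy = shard_dim == 0  # the view trick only works for dim-0 shards
    expected = offset
    for s in shards:
        if (not s.is_contiguous()
                or s.untyped_storage().data_ptr() != storage.data_ptr()
                or s.storage_offset() != expected):
            zero_copy = False
            break
        expected += s.numel()
    if not zero_copy:
        # Slow path: allocates a fresh full-size buffer.
        return torch.cat(shards, dim=shard_dim)
    full_shape = list(shards[0].shape)
    full_shape[shard_dim] = sum(s.shape[shard_dim] for s in shards)
    # Contiguous strides for the full shape.
    stride = [1] * len(full_shape)
    for i in range(len(full_shape) - 2, -1, -1):
        stride[i] = stride[i + 1] * full_shape[i + 1]
    return shards[0].as_strided(full_shape, stride, offset)
```

A comment in the real code stating this 1x-vs-2x memory argument would answer the review.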
xtuner/v1/model/base.py
Outdated
```python
if (
    t.untyped_storage().data_ptr() != storage.data_ptr()
    or t.dtype != dtype
    or t.device != device
    or t.stride()[1:] != inner_stride
):
    return None
```
If we make it a private function, we should remove the unnecessary checks
xtuner/v1/rl/base/worker.py
Outdated
```python
ep_mesh: DeviceMesh = model.ep_mesh
ep_size = ep_mesh.size()
ep_group = ep_mesh.get_group()
ep_rank = dist.get_rank(group=ep_group)
```
Add comments describing what happens here.
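The subtlety worth a comment in that hunk is that `dist.get_rank(group=ep_group)` returns a group-local index, not a global rank. A pure-Python illustration, assuming a contiguous EP grouping (both helpers are hypothetical, not xtuner or torch API):

```python
def ep_groups(world_size, ep_size):
    # With a contiguous EP mesh, global ranks [g*ep_size, (g+1)*ep_size)
    # form one expert-parallel group.
    assert world_size % ep_size == 0
    return [list(range(g * ep_size, (g + 1) * ep_size))
            for g in range(world_size // ep_size)]


def ep_rank_of(global_rank, ep_size):
    # dist.get_rank(group=ep_group) returns this group-local index; code
    # that needs a global peer rank must map it back through the group's
    # rank list rather than use it directly.
    return global_rank % ep_size
```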
Key Changes:
- `xtuner/v1/model/base.py`: `_get_fused_hf_param` / `_get_hf_params` in RL update weights.
- Add `update_weight_bucket_size_in_gb` to configure the packed tensor size in RL update weights.
- `test_update_weight` defaults to `ep_size=4`.
- When `skip_load_weights == True` in `RolloutConfig`, lmdeploy loads weights from the training workers, which saves time on large-scale clusters.
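To show how a cap like `update_weight_bucket_size_in_gb` bounds peak memory: smaller buckets mean smaller transient gather/pack buffers at the cost of more communication rounds. A greedy packing sketch (not the PR's actual implementation), taking each tensor's byte size:

```python
def pack_buckets(tensor_nbytes, bucket_size_in_gb):
    # Greedily pack tensor indices into buckets whose total byte size stays
    # under the configured cap; each bucket becomes one packed transfer, so
    # the cap bounds the transient buffer (and thus peak memory) per round.
    cap = int(bucket_size_in_gb * (1 << 30))
    buckets, cur, cur_bytes = [], [], 0
    for i, nbytes in enumerate(tensor_nbytes):
        if cur and cur_bytes + nbytes > cap:
            buckets.append(cur)
            cur, cur_bytes = [], 0
        # A single tensor larger than the cap still gets its own bucket.
        cur.append(i)
        cur_bytes += nbytes
    if cur:
        buckets.append(cur)
    return buckets
```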