
ruisizhang123
Contributor

@ruisizhang123 ruisizhang123 commented Oct 15, 2025

This PR adds support for aten-level manual bucketing in the SimpleFSDP + aot_eager backend. Dependent on PyTorch PR

TODO List:

  • We should have a better way of handling region info than a list of string FQNs in the current manual_bucketed_modules. It is very easy to miss some of the model's modules. (cc. @xmfan @SherlockNoMad )
  • Currently, the reordering happens under the hood and overlaps with the last/next compute. We should allow users to specify which modules they want to reorder.
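To illustrate the first TODO item, here is a minimal sketch of a check that flags modules missing from an FQN list. The helper name `find_unbucketed_modules` and the prefix-matching rule are assumptions made for illustration; they are not part of this PR or of SimpleFSDP.

```python
from typing import Iterable, List


def find_unbucketed_modules(
    all_module_fqns: Iterable[str],
    manual_bucketed_modules: Iterable[str],
) -> List[str]:
    """Return module FQNs that no entry in the bucket list covers.

    Hypothetical helper, not part of the PR. Assumptions:
    - a module is covered if its FQN equals a bucket entry or sits under one
      (e.g. "layers.0.attention" is covered by "layers.0");
    - a module that is a parent of a bucket entry (e.g. "layers") is treated
      as a plain container and is not reported;
    - the root module (empty FQN) is ignored.
    """
    buckets = list(manual_bucketed_modules)

    def is_covered(fqn: str) -> bool:
        return any(fqn == b or fqn.startswith(b + ".") for b in buckets)

    def is_container_of_bucket(fqn: str) -> bool:
        return any(b.startswith(fqn + ".") for b in buckets)

    return [
        fqn
        for fqn in all_module_fqns
        if fqn and not is_covered(fqn) and not is_container_of_bucket(fqn)
    ]


# Example module names loosely modeled on a Llama-style layout (illustrative only).
example_fqns = [
    "", "tok_embeddings", "layers", "layers.0",
    "layers.0.attention", "layers.1", "norm", "output",
]
missing = find_unbucketed_modules(
    example_fqns, ["tok_embeddings", "layers.0", "layers.1", "output"]
)
print(missing)
```

In practice the FQN list could be collected from the names yielded by `model.named_modules()`; a non-empty result would indicate modules the bucket list forgot.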
1. Performance (FSDP2 runs in eager mode; SimpleFSDP uses the aot_eager backend)

Llama 3-8B

  • Single node, 8 H100 GPUs, performance. (The slower TPS on a single node is roughly as expected, since FSDP2 handles copy-in/copy-out in two different streams, whereas SimpleFSDP handles copy-in/copy-out in the same stream.)

| Method | Parallelism | Memory | TPS | Trace |
|---|---|---|---|---|
| SimpleFSDP | FSDP=8 | 40.96GiB (43.12%) | 7,227 | LINK |
| FSDP2-eager | FSDP=8 | 47.82GiB (50.35%) | 7,380 | LINK |
| FSDP2-aot_eager | FSDP=8 | | | |
| SimpleFSDP | FSDP=4 TP=2 | | | |
| FSDP2 | FSDP=4 TP=2 | | | |
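For a quick read of the two completed FSDP=8 rows, the relative differences can be computed directly from the reported numbers. This is only arithmetic on the table values, with FSDP2-eager taken as the baseline (an assumption about which direction to compare); it suggests roughly 2% lower TPS and roughly 14% lower peak memory for SimpleFSDP.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Values copied from the FSDP=8 rows of the Llama 3-8B table.
fsdp2_tps, simplefsdp_tps = 7380, 7227
fsdp2_mem_gib, simplefsdp_mem_gib = 47.82, 40.96

tps_change = relative_change(fsdp2_tps, simplefsdp_tps)
mem_change = relative_change(fsdp2_mem_gib, simplefsdp_mem_gib)

print(f"SimpleFSDP vs FSDP2-eager TPS:    {tps_change:+.1%}")
print(f"SimpleFSDP vs FSDP2-eager memory: {mem_change:+.1%}")
```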

Example SimpleFSDP 1D overlapping trace:

Screenshot 2025-10-16 at 10 49 55 AM
  • Bitwise Loss:

FSDP-only:
Screenshot 2025-10-17 at 10 41 56 AM

FSDP+TP:

DeepSeekV3-16B

@meta-cla meta-cla bot added the CLA Signed label Oct 15, 2025
@ruisizhang123 ruisizhang123 marked this pull request as draft October 15, 2025 17:41
@ruisizhang123 ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from 5c035a9 to 0d66a82 Compare October 17, 2025 15:43
@ruisizhang123 ruisizhang123 force-pushed the ruisi/manual_bucket_pass branch from 0d66a82 to c20775e Compare October 17, 2025 15:57
