-
Notifications
You must be signed in to change notification settings - Fork 24
Open
Description
Environment setup:
NGC torch 25.03 base image
latest arctic training (default deps)
llama 70b
zero 3 offload
single 8xH200 node
Observation, during the zero 3 offload init the cpu memory explodes up to 2TB and runs out of memory. Not happening on AT 591ddf5.
Metadata
Metadata
Assignees
Labels
No labels