PyTorch Lightning + DeepSpeed: training “hangs” and OOMs when data loads — how to debug? (PL 2.5.4, CUDA 12.8, 5× Lovelace 46 GB) #21225
Unanswered
42elenz asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
Hi @42elenz, sorry you didn’t get much feedback earlier. A quick question: are you initializing your model on the meta device before handing it off to DeepSpeed?
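One related option in Lightning 2.x is to defer building the layers to the LightningModule's `configure_model()` hook, so that with ZeRO-3 the weights can be created already sharded instead of being materialized in full on every rank first. A rough sketch (the layer sizes and names are placeholders, not your actual model):

```python
import torch
from torch import nn
import lightning.pytorch as pl


class ShardedInitModel(pl.LightningModule):
    """Sketch of deferring layer creation to configure_model()."""

    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.hidden = hidden
        self.net = None  # created later, inside configure_model()

    def configure_model(self) -> None:
        # The Trainer calls this after the strategy is set up, so with
        # DeepSpeed ZeRO-3 the parameters can be sharded as they are
        # created rather than materialized in full on every rank first.
        if self.net is not None:  # guard against being called twice
            return
        self.net = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, self.hidden),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```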
Hi all. I hope someone can help and has some ideas :) I’m hitting a wall trying to get PyTorch Lightning + DeepSpeed to run. My model initializes fine on one GPU, so the parameters themselves seem to fit; I only hit an OOM because my input data is too big. So I tried DeepSpeed ZeRO stage 2 and stage 3 (even though I know stage 3 is probably overkill). But then it starts two processes and hangs with no forward progress. Maybe someone can point me in a helpful direction?
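Roughly, the plain-string setup looks like this (a stripped-down sketch with a placeholder model and placeholder data, not my real code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class TinyModule(pl.LightningModule):
    """Stand-in LightningModule; the real model is much larger."""

    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


if __name__ == "__main__":
    # Placeholder data standing in for the real (much larger) inputs.
    dataset = TensorDataset(torch.randn(64, 4096), torch.randn(64, 4096))
    loader = DataLoader(dataset, batch_size=1, num_workers=2)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,                     # two ranks, matching "MEMBER: 2/2" below
        strategy="deepspeed_stage_2",  # also tried "deepspeed_stage_3"
        precision="16-mixed",
        accumulate_grad_batches=8,     # keep the micro-batch tiny
        max_epochs=1,
    )
    trainer.fit(TinyModule(), loader)
```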
The run always stops at this log line:
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/2
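In case it is useful, this is the kind of extra logging I can enable to see where the ranks stall (just the standard PyTorch/NCCL debug switches, set before anything distributed is initialized):

```python
import os

# Verbose collective/NCCL logging; must be set before the Trainer creates
# the process group, i.e. before trainer.fit() is called.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
```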
Environment
GPUs: 5× Lovelace (46 GB each)
CUDA: 12.8
PyTorch Lightning: 2.5.4
Precision: 16-mixed
Strategy: DeepSpeed (tried ZeRO-2 and ZeRO-3)
Specifications: custom DataLoader; custom logic in on_validation_step etc.
System: VM; I have to "module load" cuda to get CUDA_HOME set, for example (could that lead to errors? See the check after this list.)
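For that last point, this is the sanity check I can run to confirm that the environment and the CUDA toolkit torch sees actually line up (nothing here is specific to my setup):

```python
import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("CUDA_HOME (env):       ", os.environ.get("CUDA_HOME"))
print("CUDA_HOME (torch sees):", CUDA_HOME)
print("torch built with CUDA: ", torch.version.cuda)
print("CUDA available:        ", torch.cuda.is_available())
print("visible GPUs:          ", torch.cuda.device_count())
```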
What I tried
DeepSpeed ZeRO stage 2 and stage 3 with CPU offload.
A custom PL strategy vs. the plain "deepspeed" string.
Reducing the global batch (via gradient accumulation) to keep the micro-batch tiny.
A custom definition of the strategy, roughly like the sketch below:
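A minimal sketch of that kind of custom strategy definition (the exact arguments here are illustrative, not my production config):

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

# Illustrative ZeRO-3 configuration with CPU offload; I also tried stage 2
# and the plain strategy="deepspeed_stage_2" string.
strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,   # push optimizer states to CPU
    offload_parameters=True,  # push (sharded) parameters to CPU
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=strategy,
    precision="16-mixed",
    accumulate_grad_batches=8,
)
```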