PyTorch Lightning + DeepSpeed: training “hangs” and OOMs when data loads — how to debug? (PL 2.5.4, CUDA 12.8, 5× Lovelace 46 GB) #21225
Unanswered
42elenz asked this question in DDP / multi-GPU / multi-node
Replies: 1 comment
Hi @42elenz, sorry you didn’t get much feedback earlier. A quick question: are you initializing your model on the meta device before handing it off to DeepSpeed?
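One related option in Lightning 2.x is to defer building the layers to the LightningModule's `configure_model()` hook, so that with ZeRO-3 the weights can be created already sharded instead of being materialized in full on every rank first. A rough sketch (the layer sizes and names are placeholders, not your actual model):

```python
import torch
from torch import nn
import lightning.pytorch as pl


class ShardedInitModel(pl.LightningModule):
    """Sketch of deferring layer creation to configure_model()."""

    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.hidden = hidden
        self.net = None  # created later, inside configure_model()

    def configure_model(self) -> None:
        # The Trainer calls this after the strategy is set up, so with
        # DeepSpeed ZeRO-3 the parameters can be sharded as they are
        # created rather than materialized in full on every rank first.
        if self.net is not None:  # guard against being called twice
            return
        self.net = nn.Sequential(
            nn.Linear(self.hidden, self.hidden),
            nn.GELU(),
            nn.Linear(self.hidden, self.hidden),
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)
```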
Hi all. I hope someone can help and has some ideas :) I’m hitting a wall trying to get PyTorch Lightning + DeepSpeed to run. My model initializes fine on one GPU, so the parameters themselves seem to fit; I only hit an OOM because my input data is too big. So I tried DeepSpeed ZeRO stage 2 and stage 3 (even though I know stage 3 is probably overkill). But then it starts two processes and hangs with no forward progress. Maybe someone can point me in a helpful direction?
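Roughly, the plain-string setup looks like this (a stripped-down sketch with a placeholder model and placeholder data, not my real code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class TinyModule(pl.LightningModule):
    """Stand-in LightningModule; the real model is much larger."""

    def __init__(self, hidden: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-4)


if __name__ == "__main__":
    # Placeholder data standing in for the real (much larger) inputs.
    dataset = TensorDataset(torch.randn(64, 4096), torch.randn(64, 4096))
    loader = DataLoader(dataset, batch_size=1, num_workers=2)

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,                     # two ranks, matching "MEMBER: 2/2" below
        strategy="deepspeed_stage_2",  # also tried "deepspeed_stage_3"
        precision="16-mixed",
        accumulate_grad_batches=8,     # keep the micro-batch tiny
        max_epochs=1,
    )
    trainer.fit(TinyModule(), loader)
```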
The run always stops at this log line:
initializing deepspeed distributed: GLOBAL_RANK: 1, MEMBER: 2/2
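In case it is useful, this is the kind of extra logging I can enable to see where the ranks stall (just the standard PyTorch/NCCL debug switches, set before anything distributed is initialized):

```python
import os

# Verbose collective/NCCL logging; must be set before the Trainer creates
# the process group, i.e. before trainer.fit() is called.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
```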
Environment
GPUs: 5× Lovelace (46 GB each)
CUDA: 12.8
PyTorch Lightning: 2.5.4
Precision: 16-mixed
Strategy: DeepSpeed (tried ZeRO-2 and ZeRO-3)
Specifications: custom DataLoader; custom logic in on_validation_step etc.
System: VM; I have to "module load" cuda to get CUDA_HOME set, for example (could that lead to errors? See the check after this list.)
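For that last point, this is the sanity check I can run to confirm that the environment and the CUDA toolkit torch sees actually line up (nothing here is specific to my setup):

```python
import os
import torch
from torch.utils.cpp_extension import CUDA_HOME

print("CUDA_HOME (env):       ", os.environ.get("CUDA_HOME"))
print("CUDA_HOME (torch sees):", CUDA_HOME)
print("torch built with CUDA: ", torch.version.cuda)
print("CUDA available:        ", torch.cuda.is_available())
print("visible GPUs:          ", torch.cuda.device_count())
```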
What I tried
DeepSpeed ZeRO stage 2 and stage 3 with CPU offload.
A custom PL strategy vs. the plain "deepspeed" string.
Reducing the global batch (via gradient accumulation) to keep the micro-batch tiny.
A custom definition of the strategy, roughly like the sketch below:
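A minimal sketch of that kind of custom strategy definition (the exact arguments here are illustrative, not my production config):

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import DeepSpeedStrategy

# Illustrative ZeRO-3 configuration with CPU offload; I also tried stage 2
# and the plain strategy="deepspeed_stage_2" string.
strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,   # push optimizer states to CPU
    offload_parameters=True,  # push (sharded) parameters to CPU
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=strategy,
    precision="16-mixed",
    accumulate_grad_batches=8,
)
```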