init_process_group not called when training on multiple-GPUs
#8517
Hi, I'm trying to train a model on 2 GPUs. I do this by specifying `Trainer(..., gpus=2)`. ddp_spawn should automatically be selected as the distributed mode, but I instead get the following message and error:

I looked at the source code of ddp_spawn and it looks like it should print a message when initializing DDP, but it didn't. Could I please have advice on how to correct this error? Thank you!
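For reference, a minimal, self-contained sketch of the kind of setup being described, assuming the 1.3/1.4-era Trainer API (ToyModule and the explicit `accelerator="ddp_spawn"` argument are illustrative only, not the poster's actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


# Hypothetical toy module, only to make the Trainer call runnable end to end.
class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        return DataLoader(
            TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8
        )


# gpus=2 should auto-select the ddp_spawn plugin; spelling it out explicitly
# makes it easy to confirm which distributed strategy the Trainer picks.
trainer = pl.Trainer(gpus=2, accelerator="ddp_spawn", max_epochs=1)
trainer.fit(ToyModule())
```

With the explicit form, the "initializing ddp" message the poster mentions should appear for each process if the process group is actually being set up.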
Replies: 2 comments
Found a relevant discussion, but I don't think it's applicable here because I construct the dataloader with:

```python
class CustomDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        train_dataset = CustomDataset(
            params=self.params,
            data_params=self.data_params,
            num_workers=self.num_workers,
        )
        return DataLoader(
            train_dataset,
            timeout=self.data_loader_timeout,
            num_workers=self.num_workers,
            batch_size=self.batch_size,
            worker_init_fn=worker_init_fn,
        )
```

and here's what the order of operations looks like:

```python
data_module = CustomDataModule(...)
model = CustomLightningModule(...)
tb_logger = TensorBoardLogger(...)
checkpoint_callback = ModelCheckpoint(...)
trainer = Trainer.from_argparse_args(
    args,
    logger=tb_logger,
    default_root_dir=args.output_dir,
    profiler="pytorch",  # tried removing this and it doesn't make a difference
    callbacks=[checkpoint_callback],
    gpus=args.gpus,
)
trainer.fit(model, data_module)
```
The issue comes from the line

```
File "train.py", line 173, in main
    print(f"Logs for this experiment are being saved to {trainer.log_dir}")
```

which tries to access `trainer.log_dir` outside of the trainer scope:

```
File ".../pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/properties.py", line 137, in log_dir
    dirpath = self.accelerator.broadcast(dirpath)
```

`trainer.log_dir` tries to `broadcast` the directory but fails as DDP hasn't been initialized yet. This is fixed in the 1.4 release, as `broadcast` becomes a no-op in that case.
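For anyone stuck on a pre-1.4 version, a possible workaround (a sketch only; the callback name and hook choice are assumptions, not something suggested in this thread) is to read `trainer.log_dir` from a hook that runs inside `fit()`, after the distributed backend has been set up, rather than from the launching script:

```python
import pytorch_lightning as pl


class LogDirPrinter(pl.Callback):
    """Hypothetical helper: print the log dir from inside fit(), where DDP exists."""

    def on_train_start(self, trainer, pl_module):
        # By the time training starts the process group has been initialized,
        # so the broadcast inside trainer.log_dir can succeed; only rank 0 prints.
        if trainer.is_global_zero:
            print(f"Logs for this experiment are being saved to {trainer.log_dir}")
```

Usage sketch: pass it alongside the existing callbacks, e.g. `callbacks=[checkpoint_callback, LogDirPrinter()]`, and drop the `print` in `main` that triggers the failing broadcast.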