Currently, Trainer requires num_nodes and devices, but these may differ across nodes. For instance, SLURM may provide 1 node with 6 GPUs and 2 other nodes with 1 GPU each, for a total of 8 GPUs across 3 nodes. Right now, it gives the following error:
..../python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 118, in root_device
return self.parallel_devices[self.local_rank]
IndexError: list index out of range
srun: error: <node-name>: tasks 6-7: Exited with exit code 1
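For context, here is a minimal sketch of the kind of configuration that runs into this. The exact arguments from the original script are not preserved in this report, so the values below are illustrative only:

```python
from pytorch_lightning import Trainer

# Illustrative values only: the SLURM allocation is 1 node with 6 GPUs plus
# 2 nodes with 1 GPU each (8 GPUs across 3 nodes). Trainer accepts a single
# per-node device count, so no one value of `devices` fits every node here.
trainer = Trainer(
    accelerator="gpu",
    devices=6,      # applied uniformly to all nodes; only matches the first one
    num_nodes=3,
    strategy="ddp",
)
```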
Ideally, the world size should be provided by the cluster environment, and the Trainer should create subprocesses based only on the number of GPUs available on the current node.
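A rough sketch of that idea, using pieces that already exist (SLURMEnvironment and torch.cuda.device_count()). This only illustrates where the information would come from when run inside a SLURM job; it is not current Trainer behaviour:

```python
import torch
from pytorch_lightning.plugins.environments import SLURMEnvironment

# Inside a SLURM job: the cluster environment already knows the global world
# size, and each node can see how many GPUs it actually has. What is missing
# is Trainer wiring that combines the two automatically.
env = SLURMEnvironment()
world_size = env.world_size()            # e.g. 8 for the allocation above (from SLURM_NTASKS)
local_gpus = torch.cuda.device_count()   # 6 on the large node, 1 on the small ones
print(f"world_size={world_size}, gpus_on_this_node={local_gpus}")
```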
Yes, this is a known limitation at the moment. It was a true technical limitation in the past, but today the restriction is somewhat artificial.
I opened proposal #14078, which should pave the way to removing this limitation eventually.
After #14078, you would simply set devices="auto" or devices=-1, and the actual number of devices could then differ per node.
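A sketch of what that usage could look like once the proposal lands (hypothetical, since #14078 is only a proposal at this point):

```python
from pytorch_lightning import Trainer

# Hypothetical usage after #14078: each node picks up however many GPUs it
# actually has, instead of a single hard-coded per-node count.
trainer = Trainer(accelerator="gpu", devices="auto")   # or devices=-1
```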
I'm removing the bug label because this can't really be delivered as a bug fix, and it depends on the decision in #14078.
🐛 Bug
Currently, Trainer requires num_nodes and devices, but these may differ across nodes; see the description and error trace above.

To Reproduce
Note: SL_NUM_NODES is set externally.
And here is the slurm script (need to add , ,
Expected behavior
As stated above: the world size should come from the cluster environment, and the Trainer should create subprocesses based only on the GPUs available on the current node.
Environment
cc @awaelchli @tchaton @rohitgr7 @justusschock @kaushikb11 @akihironitta