Possible race condition when trying to save and access checkpoints with 12 GPUs across 2 nodes #6644
Unanswered. Asked by amorehead in DDP / multi-GPU / multi-node.
Replies: 1 comment
-
Hey @amorehead, any updates on how you fixed this? I had a similar problem with filesystem race conditions. This doc is a pretty good resource for understanding the race conditions, and even for fixing them with the low-level Fabric API.
Does anyone know how to access the underlying fabric?
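For readers who have not seen the pattern the reply refers to, here is a minimal sketch of "one rank writes, everyone waits at a barrier, then everyone reads". It uses only the standard library, with threads standing in for DDP processes, so `WORLD_SIZE`, `worker`, and `run_demo` are hypothetical names and the "checkpoint" is just a text file. In Lightning Fabric the call that plays the role of `barrier.wait()` here is `fabric.barrier()`.

```python
import tempfile
import threading
from pathlib import Path

WORLD_SIZE = 4  # assumed number of simulated ranks (threads stand in for processes)
barrier = threading.Barrier(WORLD_SIZE)
results = {}


def worker(rank: int, ckpt_path: Path) -> None:
    # Only rank 0 writes the (fake) checkpoint.
    if rank == 0:
        ckpt_path.write_text("weights")
    # No rank passes this point until every rank has arrived,
    # so rank 0 has necessarily finished writing by then.
    barrier.wait()
    # Now it is safe for every rank to read the file.
    results[rank] = ckpt_path.read_text()


def run_demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        ckpt_path = Path(tmp) / "demo.ckpt"
        threads = [
            threading.Thread(target=worker, args=(rank, ckpt_path))
            for rank in range(WORLD_SIZE)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return dict(results)


loaded = run_demo()
```

Without the barrier, a non-zero rank could reach `read_text()` before rank 0 has created the file, which is the same failure shape as a `FileNotFoundError` on a checkpoint path.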
-
Hello.
I have recently been trying to get a simple MNIST image classifier up and running with Lightning and the Weights & Biases WandbLogger for Lightning, and I have run into an odd issue. I launch my Lightning script on my GPU computing cluster using 2 GPU nodes with 6 GPUs each (i.e. DDP, 12 GPUs total, with 42 CPU cores allocated to each node as dataloader workers). The model completes a handful of epochs, say 3, before early stopping kicks in and it begins to save "several" checkpoints. (As an aside, I am also curious why duplicate checkpoints appear to be created for each GPU, but we can come back to this later.)
When my job then calls Trainer.test(), it crashes with a FileNotFoundError: it expects a particular checkpoint to already be on disk so that it can load it for testing, but the OS does not see that file. Unless I am simply misunderstanding how checkpointing followed by testing is supposed to work with multiple GPUs and nodes, this looks to me like a race condition between the job's processes. Any thoughts or ideas as to what may be going on here? Below are links to my entire image classifier training script, which lives in a fork of the PL project template, followed by the stack trace for the missing file error.
Also, thank you in advance for your time and help!
GitHub Fork of PL Project Template:
https://github.com/amorehead/deep-learning-hpc-project-template/blob/master/project/lit_image_classifier.py # training script
https://github.com/amorehead/deep-learning-hpc-project-template/blob/master/train.bash # script for executing lit_image_classifier.py on cluster
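As a stopgap for the "file not visible yet" case described above, one option is to poll for the checkpoint before loading it. This is only a sketch under assumptions: `wait_for_file` is a hypothetical helper, not part of Lightning, and the timeout and polling interval are arbitrary. It helps only if some process does eventually write that exact path; if the name this process asks for is never written by anyone, it merely turns a crash into a timeout.

```python
import os
import time


def wait_for_file(path: str, timeout: float = 30.0, poll_interval: float = 0.5) -> bool:
    """Return True once `path` exists and is non-empty, False if `timeout` elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # getsize raises OSError while the file does not exist yet.
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller would check the return value before testing, for example `if not wait_for_file(ckpt_path): raise RuntimeError(...)`, where `ckpt_path` is whatever checkpoint path the script is about to load.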
Stack Trace:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/checkpoints/LitClassifier-epoch=03-val_auroc=0.89-v6.ckpt'
Traceback (most recent call last):
File "/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/lit_image_classifier.py", line 189, in <module>
cli_main()
File "/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/lit_image_classifier.py", line 179, in cli_main
result = trainer.test(test_dataloaders=test_loader)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 916, in test
results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 946, in __test_using_best_weights
ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
with fs.open(path_or_url, "rb") as f:
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/spec.py", line 943, in open
**kwargs,
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 118, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 200, in __init__
self._open()
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 205, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/checkpoints/LitClassifier-epoch=03-val_auroc=0.90-v6.ckpt'
Global seed set to 42
Traceback (most recent call last):
File "/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/lit_image_classifier.py", line 189, in <module>
cli_main()
File "/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Personal_Repositories/deep-learning-hpc-project-template/project/lit_image_classifier.py", line 179, in cli_main
result = trainer.test(test_dataloaders=test_loader)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 916, in test
results = self.__test_using_best_weights(ckpt_path, test_dataloaders)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 946, in __test_using_best_weights
ckpt = pl_load(ckpt_path, map_location=lambda storage, loc: storage)
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Personal_Repositories/deep-learning-hpc-project-template/venv/lib/python3.6/site-packages/pytorch_lightning/utilities/cloud_io.py", line 31, in load
with fs.open(path_or_url, "rb") as f:
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/spec.py", line 943, in open
**kwargs,
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 118, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 200, in __init__
self._open()
File "/ccs/home/acmwhb/.local/lib/python3.6/site-packages/fsspec/implementations/local.py", line 205, in _open
self.f = open(self.path, mode=self.mode)
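One detail in the log above is worth spelling out: the two FileNotFoundError messages name different files (val_auroc=0.89 and val_auroc=0.90, both with epoch=03 and -v6). One possible reading, not confirmed anywhere in this thread, is that each process built the checkpoint name from its own locally computed val_auroc, so a process can end up looking for a file that no process wrote. The sketch below reproduces that effect in plain Python. The helper `checkpoint_name`, the per-rank metric values, and the assumption that only rank 0 writes are all made up for illustration.

```python
from statistics import mean


def checkpoint_name(epoch: int, val_auroc: float, version: int) -> str:
    # Hypothetical helper mimicking the filename pattern seen in the log.
    return f"LitClassifier-epoch={epoch:02d}-val_auroc={val_auroc:.2f}-v{version}.ckpt"


# Assumed, made-up metric values: each rank evaluated a different data shard.
per_rank_auroc = {0: 0.894, 1: 0.901}

# Each rank formats the name from its own local value.
local_names = {
    rank: checkpoint_name(3, auroc, 6) for rank, auroc in per_rank_auroc.items()
}

# Assumption: only rank 0 actually writes a file.
written = {local_names[0]}

# Ranks whose expected filename was never written by anyone.
missing = {rank: name for rank, name in local_names.items() if name not in written}

# If the metric is reduced across ranks first, every rank derives the same name.
synced_auroc = mean(per_rank_auroc.values())
synced_names = {rank: checkpoint_name(3, synced_auroc, 6) for rank in per_rank_auroc}
```

If this is what is happening, no amount of waiting for the file would help, because the name differs rather than the timing. Reducing the metric across processes before it is used in the filename would make every rank agree on the name; in Lightning, `self.log(..., sync_dist=True)` is the option intended for that.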