Solved. The cause is indeed that on my SLURM cluster there is no interval between the signal being sent and the program being killed, so PyTorch Lightning simply has no time to write a checkpoint.
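For anyone hitting the same thing, here is a rough sketch of why that interval matters. It is plain Python, not PyTorch Lightning's actual implementation, and `PreemptionFlag`, `train`, and the JSON "checkpoint" are hypothetical names made up for illustration. The handler only sets a flag, and the loop does the slow write afterwards, so the write can only finish if the scheduler waits between the warning signal and the kill. SLURM's `sbatch --signal` option (for example `#SBATCH --signal=SIGUSR1@90`) is the usual way to ask for such a window, assuming the cluster honors it.

```python
import json
import os
import signal
import tempfile


class PreemptionFlag:
    """Records that a warning signal arrived; the training loop polls it."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: set a flag and let the loop do the slow work.
        self.requested = True


def train(total_steps, ckpt_path, flag, on_step=None):
    """Toy loop (a stand-in for real training) that checkpoints once warned."""
    step = 0
    for step in range(1, total_steps + 1):
        if on_step is not None:
            on_step(step)  # hypothetical hook, used here to simulate the scheduler
        if flag.requested:
            # This write needs time. If the kill follows the warning signal
            # immediately, the process dies before reaching this point.
            with open(ckpt_path, "w") as f:
                json.dump({"step": step}, f)
            return step
    return step


if __name__ == "__main__":
    flag = PreemptionFlag()
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    # Simulate the scheduler sending SIGUSR1 to this process at step 3.
    send = lambda s: os.kill(os.getpid(), signal.SIGUSR1) if s == 3 else None
    print("stopped at step", train(10, path, flag, on_step=send))
```

In the sketch the checkpoint gets written only because nothing kills the process right after SIGUSR1. With zero grace time, as on my cluster, the kill lands before the loop ever sees the flag.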

@tchaton
Answer selected by Wuziyi616