Questions about Fault-tolerant Training #10380
Hi! I'm working on a SLURM cluster with preemption, so I'm really excited to see support for Fault-tolerant Training in 1.5.0. However, when I upgrade the package and try

I looked into the code of

However, when I put some print statements in the code and use Ctrl+C to interrupt it, it does indeed go to this

Is there anything wrong with my code? I assume interruptions like

EDIT: I've looked into the code a little more, and it seems that when I do

I think that's related to the signal that SLURM sends to my program, and I already see a
Replies: 1 comment 1 reply
Solved. The cause was indeed that on my SLURM cluster there is no time interval between the signal being sent and the program being killed, so PyTorch-Lightning simply doesn't have time to write a checkpoint.
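For anyone hitting the same thing, here is a rough sketch of why the grace period matters. It is plain Python, not Lightning's actual implementation, and it assumes a POSIX system. `PreemptionFlag`, `train`, and `interrupt_at` are hypothetical names made up for this illustration. The handler only sets a flag, and the checkpoint is written at the next step boundary, which can only happen if the process is still alive at that point.

```python
import json
import os
import signal
import tempfile


class PreemptionFlag:
    """Remembers that a termination signal arrived; the handler does no I/O."""

    def __init__(self, signums=(signal.SIGTERM, signal.SIGUSR1)):
        self.requested = False
        for signum in signums:
            signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: just record the request.
        self.requested = True


def train(num_steps, ckpt_path, flag, interrupt_at=None):
    """Toy loop that checkpoints at the first step boundary after a signal."""
    state = {"step": 0}
    for step in range(1, num_steps + 1):
        state["step"] = step  # stand-in for one real training step
        if interrupt_at == step:
            # Simulate the scheduler signalling this process (assumption:
            # the real signal depends on how the cluster is configured).
            os.kill(os.getpid(), signal.SIGUSR1)
        if flag.requested:
            # This write needs time. If SIGKILL follows the first signal
            # immediately, the process dies before reaching this point.
            with open(ckpt_path, "w") as f:
                json.dump(state, f)
            return state
    return state


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    result = train(100, path, PreemptionFlag(), interrupt_at=5)
    print("stopped at step", result["step"], "checkpoint written:", os.path.exists(path))
```

Ctrl+C works in a test like this because nothing kills the process afterwards. On a cluster that sends the kill right after the first signal, the flag may be set but the step-boundary check is never reached, which matches what I saw.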