Questions about Fault-tolerant Training #10380

Wuziyi616 · 2021-11-05T20:43:27Z

Hi! I'm working on a SLURM cluster with preemption, so I'm really excited to see support for Fault-tolerant Training in 1.5.0. However, when I upgrade the package and run PL_FAULT_TOLERANT_TRAINING=1 python train.py xxx on the cluster, it doesn't seem to work.

I looked into the Trainer code, and it seems that the code responsible for fault tolerance is here. I assume preemption raises a BaseException, so execution goes to here and finally here, so that a checkpoint is saved?

However, when I add some print statements and use Ctrl+C to interrupt the program, it does reach this KeyboardInterrupt branch. But if I use scontrol requeue to simulate a preemption, the code doesn't reach the BaseException branch. That's why it doesn't save a checkpoint for Fault-tolerant Training.

Is there anything wrong with my code? I assume interruptions like scancel or scontrol requeue are covered in this case. Can anyone help me? Thank you in advance!

EDIT: I've looked into the code a little more. It seems that when I run scancel or scontrol requeue, the program exits directly without throwing an exception, which is why it never reaches the except _on_exception section. Is this expected behavior? Or is there any way to solve it?

I think that's related to the signal that SLURM sends to my program, and I see there is already a SignalConnector dealing with SLURM in pytorch-lightning here. I also see this answer about SLURM signals. Maybe I should set it in the sbatch script? Any suggestions?
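
A note on why Ctrl+C and a scheduler signal can behave differently: Python installs a default SIGINT handler that raises KeyboardInterrupt, but for most other signals, such as SIGTERM, no Python-level handler exists by default, so the process is terminated without any exception reaching an except block. The sketch below is not Lightning's SignalConnector. It is a minimal standalone illustration, assuming a POSIX system and assuming the scheduler sends a catchable signal (SIGTERM is used as a stand-in). The names Preempted, _raise_preempted, and run_with_handler are hypothetical.

```python
import os
import signal
import time


class Preempted(BaseException):
    """Hypothetical exception used only in this sketch."""


def _raise_preempted(signum, frame):
    # Convert the signal into a Python exception so `except` blocks can see it.
    raise Preempted(f"received signal {signum}")


def run_with_handler():
    # Install the handler; without it, SIGTERM would end the process directly.
    previous = signal.signal(signal.SIGTERM, _raise_preempted)
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # simulate the scheduler's signal
        time.sleep(1.0)  # stand-in for the training loop
        return "finished"
    except BaseException as exc:
        # This is where a checkpoint could be written.
        return f"caught {type(exc).__name__}"
    finally:
        # Restore whatever handler was active before the sketch ran.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print(run_with_handler())
```

Signal handlers can only be installed from the main thread, and a handler like this only helps if the process is given some time between the signal and the final kill.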

Answered by Wuziyi616

Wuziyi616 (Author) · 2021-11-06T05:24:03Z

Solved. It was indeed because on my SLURM cluster there is no time interval between sending the signal and killing the program, so PyTorch Lightning just doesn't have time to save a checkpoint.
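
To illustrate why the interval matters, here is a minimal sketch of a deferred checkpoint: the handler only sets a flag, and the loop saves at the next step boundary. This needs at least one step's worth of time after the signal arrives, and with no grace period that window does not exist. It is not how Lightning implements it. The names train and _request_stop, the JSON checkpoint format, and the choice of SIGUSR1 as the warning signal are all assumptions for illustration. SLURM's sbatch does document a --signal=<sig>@<seconds> option for requesting advance notice, but whether it applies to preemption on a given cluster is something to confirm with the cluster admins.

```python
import json
import os
import signal
import tempfile
import time

stop_requested = False


def _request_stop(signum, frame):
    # Keep the handler tiny: just record that a stop was requested.
    global stop_requested
    stop_requested = True


def train(num_steps, ckpt_path, signal_at_step=None):
    """Run a dummy loop; on a warning signal, checkpoint at the next step boundary."""
    global stop_requested
    stop_requested = False
    previous = signal.signal(signal.SIGUSR1, _request_stop)
    try:
        for step in range(num_steps):
            if step == signal_at_step:
                os.kill(os.getpid(), signal.SIGUSR1)  # simulated warning signal
            time.sleep(0.001)  # stand-in for one training step
            if stop_requested:
                # Saving takes time; this is the window the grace period must cover.
                with open(ckpt_path, "w") as f:
                    json.dump({"step": step}, f)
                return step
        return num_steps
    finally:
        if previous is not None:
            signal.signal(signal.SIGUSR1, previous)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    print("stopped at step", train(50, path, signal_at_step=3))
```

If the process is killed immediately after the signal, the loop never reaches the flag check, which matches the behaviour described above.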

1 reply

tchaton Nov 8, 2021
Maintainer

I don't have access to a SLURM cluster, so I can't test it, but in theory, if you use PL_FAULT_TOLERANT_TRAINING=1, the previous behaviour (see here) should work as expected.
