`TrainingEpochLoop._should_check_val_fx` discrepancy between continued run <> restore from ckpt #14579
Comments
I'd be happy to help fix this if it's agreed that this is a bug, but I'd like the opinion and take on a solution from folks more involved with PyTorch Lightning.
Hi @Anner-deJong, thank you for creating this issue and for helping with all the context around it. Probably @awaelchli or @carmocca can help you on this one.
Getting back to this issue, as it came up recently in a different context. This is definitely a behavior that needs fixing. My take would be to not add an extra hook, but to invoke `on_train_batch_end` after `self.batch_progress.increment_completed()`. The sequence would go from:

```
self.batch_progress.increment_ready()
on_train_batch_start
self.batch_progress.increment_started()
...
self.batch_progress.increment_processed()
...
on_train_batch_end
self.batch_progress.increment_completed()
```

to:

```
self.batch_progress.increment_ready()
on_train_batch_start
self.batch_progress.increment_started()
...
self.batch_progress.increment_processed()
...
self.batch_progress.increment_completed()
on_train_batch_end
```

Not sure what to think about the …
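For illustration, here is a self-contained editorial sketch of the proposed ordering. The `BatchProgress` dataclass and `run_batch` function below are simplified stand-ins, not the actual Lightning loop code; they only mirror the sequence above.

```python
# Editorial sketch: `increment_completed()` runs before the `on_train_batch_end`
# hook, so anything saved inside that hook (e.g. a checkpoint) already sees the
# up-to-date `completed` count. Not Lightning source code.
from dataclasses import dataclass


@dataclass
class BatchProgress:
    ready: int = 0
    started: int = 0
    processed: int = 0
    completed: int = 0

    def increment_ready(self) -> None:
        self.ready += 1

    def increment_started(self) -> None:
        self.started += 1

    def increment_processed(self) -> None:
        self.processed += 1

    def increment_completed(self) -> None:
        self.completed += 1


def run_batch(progress: BatchProgress, on_train_batch_end) -> None:
    progress.increment_ready()
    # on_train_batch_start would fire here
    progress.increment_started()
    # ... forward / backward / optimizer step ...
    progress.increment_processed()
    # Proposed order: mark the batch completed *before* the hook fires.
    progress.increment_completed()
    on_train_batch_end(progress)  # may trigger checkpoint saving


if __name__ == "__main__":
    progress = BatchProgress()
    run_batch(progress, lambda p: print(f"checkpoint would see completed={p.completed}"))
    # With the current (old) ordering, the hook would still see completed=0 here.
```

With the current ordering, the hook observes `completed` one behind `ready`, which is exactly the off-by-one that survives the restart.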
On further thought, I'll take a more conservative approach since the implications are pretty wide.
`TrainingEpochLoop._should_check_val_fx` discrepancy between continued run <> restore from ckpt
🐛 Bug
Found a discrepancy between a run continued after checkpointing and a run restored from that checkpoint.
Observation:
The training_batch / val_loop ordering upon checkpoint restoration is not the same as in the original run after checkpoint saving.
There is still the same number of train steps, but the validation loops are interleaved one step later, which can cause the restored run to end up with one fewer validation loop (see colab).
Assumption / expectation:
Zero difference between a training run that continues after a checkpoint and a run restored from said checkpoint.
Investigation so far:
I'm new to some of this Lightning code, but IIUC:
Key:

1. `TrainingEpochLoop`'s `self.batch_progress.increment_completed()` is called after the `"on_train_batch_end"` hooks, the latter kicking off checkpoint saving.
2. `TrainingEpochLoop.batch_progress.current.reset_on_restart()` will reset the `ready` count back to `completed`.
3. `global_step`, which refers to `TrainingEpochLoop.batch_loop.optimizer_loop.optim_progress.optimizer.step.total.completed`, has already been incremented via `increment_completed()` (called within `TrainingEpochLoop.batch_loop.run`), and thus upon restoring, `..optimizer.step.total.ready` is set to an up-to-date `optimizer.step.total.completed`, out of sync with the above.
4. In "`val_check_interval` mode", validation is triggered when `TrainingEpochLoop.batch_progress.current.ready % val_check_interval == 0` (through `TrainingEpochLoop.on_advance_end` -> `TrainingEpochLoop._should_check_val_fx`).
5. The `batch_progress.current` `ready`/`completed` counters for the continued and restored runs therefore end up aligned with different `global_step`s, and hence validation triggers at different `global_step`s (see the toy sketch after this list).
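To make points 4 and 5 concrete, here is a toy numeric sketch in plain Python (editorial; `validation_global_steps` is a hypothetical helper that only mimics the counters described above, assuming `val_check_interval = 3` and a checkpoint written in `on_train_batch_end` of the third batch, where `ready == 3`, `completed == 2` and `global_step == 3`):

```python
# Toy model of the counters described above (not Lightning code).
val_check_interval = 3


def validation_global_steps(start_ready: int, start_global_step: int, n_batches: int) -> list:
    """Global steps at which `ready % val_check_interval == 0` triggers validation."""
    triggered = []
    ready, global_step = start_ready, start_global_step
    for _ in range(n_batches):
        ready += 1        # batch_progress.current.ready
        global_step += 1  # optimizer.step.total.completed
        if ready % val_check_interval == 0:
            triggered.append(global_step)
    return triggered


# Continued run: after the checkpointed batch, ready == 3 and global_step == 3.
print(validation_global_steps(start_ready=3, start_global_step=3, n_batches=6))  # [6, 9]

# Restored run: reset_on_restart() rolls ready back to completed == 2,
# while global_step was already saved as 3.
print(validation_global_steps(start_ready=2, start_global_step=3, n_batches=6))  # [4, 7]
```

Both runs take the same number of training steps, but validation fires at different `global_step`s, matching the observation above.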
Another observation:
The following if statement seems to allow for a zero-difference restart, except that, just like point 4 above, `_should_check_val_fx` wouldn't trigger where it did in the original run on the checkpointing step (although there it is called in `on_advance_end`). Not sure if the original intention of this snippet included the current scope.

PRs relevant to this line:
Potential impact:
Assuming not too worrisome for the more default Lightning use cases: `val_check_interval` >> 3 (colab example = 3), or `val_check_interval` turned off, relying instead on `check_val_every_n_epoch`.
However, in theory it can influence all of the following:
To Reproduce
Customized Google Colab `bug_report_model.ipynb` with the same observation on `BoringModel`.
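For reference, a rough, self-contained reproduction sketch in the spirit of the linked colab (editorial; the `RandomDataset`/`BoringModel` definitions, hyperparameters, and checkpoint path below are assumptions, not the notebook's exact contents):

```python
# Run once to produce checkpoints, then resume from one of them and compare
# at which global_steps validation runs in the continued vs. restored run.
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


class RandomDataset(Dataset):
    def __init__(self, size: int = 32, length: int = 64):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        # Print at which global_step each validation loop runs.
        print(f"validation at global_step={self.global_step}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def make_trainer(max_steps: int) -> Trainer:
    return Trainer(
        max_steps=max_steps,
        val_check_interval=3,  # validate every 3 training batches
        limit_val_batches=1,
        num_sanity_val_steps=0,
        enable_progress_bar=False,
        callbacks=[ModelCheckpoint(every_n_train_steps=3, save_top_k=-1)],
    )


if __name__ == "__main__":
    train = DataLoader(RandomDataset(), batch_size=2)
    val = DataLoader(RandomDataset(), batch_size=2)

    # Original run: note the global_steps at which validation triggers.
    make_trainer(max_steps=12).fit(BoringModel(), train, val)

    # Restored run: resume from an intermediate checkpoint and compare.
    # Adjust ckpt_path to whatever the first run actually wrote.
    make_trainer(max_steps=12).fit(
        BoringModel(),
        train,
        val,
        ckpt_path="lightning_logs/version_0/checkpoints/epoch=0-step=3.ckpt",
    )
```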
Expected behavior
Zero difference between a training run continued after a checkpoint and a run restored from said checkpoint.
Environment
Note: reproduced on `v1.6.4` and `v1.7.4`; also checked the `master` branch over the last weeks and the relevant code seems unchanged.

Details
CUDA:
Lightning:
Packages:
System:
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock