Why do we have the on_{validation/test/predict}_model_{train/eval} hooks? #8760
Unanswered
ananthsub
asked this question in
Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Replies: 1 comment 5 replies
-
As a note, some people override them to avoid changing the model state. One example:
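For illustration, an override of that kind might look roughly like the following. This is a hypothetical sketch (the Monte Carlo dropout use case and the class name are assumptions, not taken from the linked example):

```python
import torch
from pytorch_lightning import LightningModule


class MCDropoutModel(LightningModule):
    """Hypothetical module that keeps dropout active during validation."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(32, 32),
            torch.nn.Dropout(p=0.2),
            torch.nn.Linear(32, 1),
        )

    def on_validation_model_eval(self) -> None:
        # Intentionally skip the default behaviour (and super()):
        # dropout stays in train mode, so validation predictions are stochastic.
        # Skipping super() also skips the default handling of the wrapped
        # self.trainer.model, which is part of the fragility raised below.
        pass

    def on_validation_model_train(self) -> None:
        # Nothing to restore, since eval mode was never entered.
        pass
```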
-
Why do we have the `on_{validation/test/predict}_model_{train/eval}` hooks?
https://github.com/PyTorchLightning/pytorch-lightning/blob/8473cf44ec0177ace2b17cf304a960e71b4d5fa6/pytorch_lightning/core/hooks.py#L192-L220
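For context, the linked default implementations are roughly of the following shape. This is a paraphrased sketch, not the exact source at that revision:

```python
class ModelHooks:
    """Paraphrased sketch of the relevant hooks (validation shown; test/predict are analogous)."""

    trainer: "pl.Trainer"  # note: set at runtime via the LightningModule, not by ModelHooks itself

    def on_validation_model_eval(self) -> None:
        # Puts the wrapped model (provided by the training type plugin) into eval mode
        # before the validation loop runs.
        self.trainer.model.eval()

    def on_validation_model_train(self) -> None:
        # Restores the wrapped model to train mode after validation.
        self.trainer.model.train()
```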
These were added back in #3858
#2551 (comment) contains the original design idea.
However, since the initial PR, these hooks now have a default implementation that handles the wrapped model provided by the training type plugin. This implementation references `self.trainer.model`. This now leaks abstractions for a number of reasons:

- `self.trainer` doesn't exist on ModelHooks. It is set in the LightningModule, which subclasses ModelHooks. This means ModelHooks is not self-contained.
- If users override these hooks without calling `self.eval()` on the wrapped model, they will run into bugs. Users are in essence forced to call `super()` to stay correct. This is very fragile and hard to enforce.

Given the complexity that's grown in the Trainer, I think it'd be desirable to deprecate these hooks and ask users to check the train/eval status of layers during the epoch start or end hooks.
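For comparison, that alternative could look roughly like the following hypothetical sketch, with per-layer modes handled in the epoch-level hooks:

```python
import torch
from pytorch_lightning import LightningModule


class EpochHookModel(LightningModule):
    """Hypothetical module that adjusts layer modes from epoch hooks instead of
    overriding on_validation_model_{eval,train}."""

    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 32)
        self.dropout = torch.nn.Dropout(p=0.2)

    def on_validation_epoch_start(self) -> None:
        # The loop has already put the model in eval mode; re-enable just the
        # layers that should stay stochastic (or simply assert the expected modes).
        self.dropout.train()

    def on_validation_epoch_end(self) -> None:
        # Nothing to restore here; the loop switches the model back to train
        # mode when fitting resumes.
        pass
```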
@PyTorchLightning/core-contributors