Hi there!
I followed the example for training a T5 model with FSDP on SageMaker: https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py
I noticed that checkpointing is disabled there with `save_strategy="no"`. Is that intentional? (See line 93: https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93.) In my training I changed it to `save_strategy="steps"` and noticed two issues:
- The best checkpoint based on minimum validation loss is not saved. If I set `save_total_limit` to 2, for example, only the last 2 checkpoints are kept (the exact arguments I used are sketched after this list).
- I was not able to load the trained model from a checkpoint and got the error mentioned elsewhere in the issues: `RuntimeError: Trying to resize storage that is not resizable`. This does not happen when I load the final model, but it makes training hard, since I need to know when to stop training so that the final saved model is the one with the minimum loss.
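For reference, here is a minimal sketch of the change I made to the `TrainingArguments` in the script; the `output_dir` and `save_steps` values are illustrative placeholders, and everything else is unchanged from run_clm.py:

```python
from transformers import TrainingArguments

# Sketch of the arguments changed relative to the example script;
# output_dir and save_steps here are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="/opt/ml/checkpoints",
    save_strategy="steps",   # changed from save_strategy="no" in the example
    save_steps=500,
    save_total_limit=2,      # only the 2 most recent checkpoints are kept
    # ... remaining arguments as in run_clm.py
)
```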
I tried different version combinations, PyTorch 1.13 with Transformers 4.26 and PyTorch 2.0.0 with Transformers 4.28.1, and I see the same issue when loading a model from a checkpoint with both.
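For completeness, this is roughly how I try to load the model from a checkpoint directory (the path is an illustrative placeholder):

```python
from transformers import AutoModelForSeq2SeqLM

# Checkpoint directory written by the Trainer during training;
# the path is an illustrative placeholder.
checkpoint_dir = "/opt/ml/checkpoints/checkpoint-500"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir)
# -> RuntimeError: Trying to resize storage that is not resizable
```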
Would appreciate any pointers.
Thank you!