FSDP training: best checkpoint is not saved and checkpoints fail to load #472

Hi there!

I trained a T5 model with FSDP on SageMaker, following the example script `https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py`.

I noticed that checkpointing is disabled in that script with `save_strategy="no"`. Is that intentional (see `https://github.com/huggingface/notebooks/blob/main/sagemaker/25_pytorch_fsdp_model_parallelism/scripts/run_clm.py#L93`)? In my training I changed it to `save_strategy="steps"` and ran into two issues:

1. The best checkpoint (the one with the lowest validation loss) is not kept. If I set the checkpoint limit to 2, for example, only the last 2 checkpoints are saved.
2. I was not able to load the trained model from a checkpoint. It fails with `RuntimeError: Trying to resize storage that is not resizable`, an error that is also mentioned in other issues. This does not happen when I load the final model. But it makes training hard, since I need to know in advance when to stop so that the final saved model is the one with the minimum loss. I tried two different version combinations:
```
PyTorch 1.13
Transformers 4.26
```

and 

```
PyTorch 2.0.0
Transformers 4.28.1
```

and see the same issue with loading a model from checkpoint.
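
For reference on issue 1, here is a minimal sketch of the `TrainingArguments` settings that normally control best-checkpoint retention in the Hugging Face `Trainer`. It is not taken from the example script. The step counts and the limit are illustrative assumptions, and `check_best_checkpoint_config` and `build_training_args` are hypothetical helpers written for this post. As far as I understand the documentation, `save_total_limit` on its own only keeps the most recent checkpoints, and the best one is retained only when `load_best_model_at_end=True` is also set. I have not verified that this behaves the same way under FSDP.

```python
# Illustrative values only: the step counts and limit are assumptions,
# not settings taken from the training run described above.
best_ckpt_kwargs = {
    "evaluation_strategy": "steps",   # evaluate periodically so eval_loss is logged
    "eval_steps": 500,
    "save_strategy": "steps",         # should match evaluation_strategy
    "save_steps": 500,                # should be a multiple of eval_steps
    "save_total_limit": 2,            # cap on the number of checkpoints kept on disk
    "load_best_model_at_end": True,   # makes the Trainer track the best checkpoint
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,       # lower validation loss is better
}


def check_best_checkpoint_config(kwargs):
    """Hypothetical helper: list reasons the best checkpoint may not be kept."""
    problems = []
    if not kwargs.get("load_best_model_at_end"):
        problems.append("load_best_model_at_end is not enabled")
    if kwargs.get("save_strategy") != kwargs.get("evaluation_strategy"):
        problems.append("save_strategy and evaluation_strategy differ")
    if kwargs.get("save_strategy") == "steps":
        save_steps = kwargs.get("save_steps")
        eval_steps = kwargs.get("eval_steps")
        if not save_steps or not eval_steps or save_steps % eval_steps != 0:
            problems.append("save_steps is not a multiple of eval_steps")
    return problems


def build_training_args(output_dir):
    """Hypothetical helper: build TrainingArguments from the settings above."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **best_ckpt_kwargs)
```

Calling `check_best_checkpoint_config(best_ckpt_kwargs)` returns an empty list for the settings above. I believe newer Transformers releases rename `evaluation_strategy` to `eval_strategy`, so the key may need adjusting outside the versions listed here.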

I would appreciate any pointers.

Thank you!

