
[Bug]: Loading Model with own TrainerPlugin #3312

Closed
stefan-it opened this issue Sep 7, 2023 · 3 comments · Fixed by #3325

Labels: bug (Something isn't working)

@stefan-it (Member)

stefan-it commented Sep 7, 2023

Describe the bug

Hi,

I was using the awesome new TrainerPlugin functionality and wrote my own plugin that reports GPU usage.

For that, I created a plugins folder and placed a gpu_stats.py in it with the following content:

import nvidia_smi
import logging

from flair.trainers.plugins.base import TrainerPlugin

logger = logging.getLogger("flair")


class GpuStatsPlugin(TrainerPlugin):
    def __init__(self) -> None:
        super().__init__()
        nvidia_smi.nvmlInit()

        # Always use first GPU
        self.handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

    @TrainerPlugin.hook
    def after_training_epoch(self, epoch, **kw):
        # Query fresh memory stats each epoch (a snapshot taken in __init__ would never update)
        memory = nvidia_smi.nvmlDeviceGetMemoryInfo(self.handle)
        gpu_memory_used_mb = memory.used // 1024**2
        gpu_total_memory_mb = memory.total // 1024**2
        logger.info("GPU Memory Stats: {}MB / {}MB used".format(gpu_memory_used_mb, gpu_total_memory_mb))

Then I included it in my training code:

plugins = []

from plugins.gpu_stats import GpuStatsPlugin
plugins.append(GpuStatsPlugin())

Later in the code I called fine_tune function with:

trainer.fine_tune(
        output_path,
        learning_rate=learning_rate,
        mini_batch_size=batch_size,
        max_epochs=epoch,
        shuffle=True,
        embeddings_storage_mode='none',
        weight_decay=0.,
        use_final_model_for_eval=False,
        plugins=plugins,
    )

But then there's a problem when, for example, running inference on the Model Hub:

https://huggingface.co/hmteams/flair-hipe-2022-hipe2020-fr

When you try to run inference on the example sentence, the model loads but then an error message is thrown:

(screenshot of the error message shown on the Model Hub)

Then I tried to load the model manually:

In [1]: from flair.models import SequenceTagger
2023-09-07 18:51:26.148340: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
In [2]: tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr")

File ~/.venvs/dev/lib/python3.11/site-packages/flair/models/sequence_tagger_model.py:1027, in SequenceTagger.load(cls, model_path)
   1023 @classmethod
   1024 def load(cls, model_path: Union[str, Path, Dict[str, Any]]) -> "SequenceTagger":
   1025     from typing import cast
-> 1027     return cast("SequenceTagger", super().load(model_path=model_path))

File ~/.venvs/dev/lib/python3.11/site-packages/flair/nn/model.py:537, in Classifier.load(cls, model_path)
    533 @classmethod
    534 def load(cls, model_path: Union[str, Path, Dict[str, Any]]) -> "Classifier":
    535     from typing import cast
--> 537     return cast("Classifier", super().load(model_path=model_path))

File ~/.venvs/dev/lib/python3.11/site-packages/flair/nn/model.py:163, in Model.load(cls, model_path)
    161 if not isinstance(model_path, dict):
    162     model_file = cls._fetch_model(str(model_path))
--> 163     state = load_torch_state(model_file)
    164 else:
    165     state = model_path

File ~/.venvs/dev/lib/python3.11/site-packages/flair/file_utils.py:352, in load_torch_state(model_file)
    348 # load_big_file is a workaround byhttps://github.com/highway11git
    349 # to load models on some Mac/Windows setups
    350 # see https://github.com/zalandoresearch/flair/issues/351
    351 f = load_big_file(model_file)
--> 352 return torch.load(f, map_location="cpu")

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:809, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    807             except RuntimeError as e:
    808                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
--> 809         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    810 if weights_only:
    811     try:

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:1172, in _load(zip_file, map_location, pickle_module, pickle_file, **pickle_load_args)
   1170 unpickler = UnpicklerWrapper(data_file, **pickle_load_args)
   1171 unpickler.persistent_load = persistent_load
-> 1172 result = unpickler.load()
   1174 torch._utils._validate_loaded_sparse_tensors()
   1176 return result

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:1165, in _load.<locals>.UnpicklerWrapper.find_class(self, mod_name, name)
   1163         pass
   1164 mod_name = load_module_mapping.get(mod_name, mod_name)
-> 1165 return super().find_class(mod_name, name)

ModuleNotFoundError: No module named 'plugins'

It seems that the unpickling logic expects the same plugins folder structure. When I load the model from within the folder structure that I used for training, it works perfectly.

So I think the passed plugins should not be saved/pickled with the model, because loading the model would then require the plugin code to be importable...
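To illustrate why this fails: pickle serializes class instances by reference (module name plus class name), so unpickling requires importing that exact module. A minimal stdlib-only sketch (the plugins_demo module name is made up for illustration):

```python
import os
import pickle
import sys
import tempfile

# Create a throwaway module defining a class, mimicking plugins/gpu_stats.py
tmpdir = tempfile.mkdtemp()
module_path = os.path.join(tmpdir, "plugins_demo.py")
with open(module_path, "w") as f:
    f.write("class GpuStatsPlugin:\n    pass\n")

sys.path.insert(0, tmpdir)
import plugins_demo

# Pickle stores only a reference (module name + class name), not the class code
blob = pickle.dumps(plugins_demo.GpuStatsPlugin())

# Simulate loading the model on a machine without the plugins folder
sys.path.remove(tmpdir)
del sys.modules["plugins_demo"]
os.remove(module_path)

err = None
try:
    pickle.loads(blob)
except ModuleNotFoundError as exc:
    err = exc
print(err)  # No module named 'plugins_demo'
```

This is exactly the situation on the Model Hub: the checkpoint references the plugins module, but the inference environment has no such module on its path.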

To Reproduce

from flair.models import SequenceTagger

tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")

Expected behavior

Loading should work with latest Flair master version.


Environment

#### Versions:
##### Flair
0.12.2
##### Pytorch
2.0.1+cu118
##### Transformers
4.33.0
#### GPU
True
@stefan-it stefan-it added the bug Something isn't working label Sep 7, 2023

stefan-it commented Sep 7, 2023

I could temporarily fix the error on the Model Hub by loading the model (in the original folder structure that was used for training) and manually adjusting the plugins array:

tagger.model_card["training_parameters"]["plugins"] = []

Then I saved the model again and uploaded it.

For reproducibility, you can still use the old commit to trigger the behavior:

from flair.models import SequenceTagger

tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")

@helpmefindaname (Collaborator)

Hi @stefan-it,

as discussed with @plonerma and @alanakbik, I suggest we implement a state for plugins and store string references instead of classes, so that the classes only need to be present when calling trainer.resume().

That means that hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad will stay in its broken state, but new models won't be created that way.
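The string-reference idea can be sketched roughly as follows (names are illustrative, not the actual Flair implementation): instead of pickling the plugin objects, store each plugin's qualified class name and re-import it only when it is actually needed.

```python
import importlib


def plugin_to_ref(plugin) -> str:
    """Serialize a plugin as 'module.ClassName' instead of pickling the class."""
    cls = type(plugin)
    return f"{cls.__module__}.{cls.__qualname__}"


def ref_to_plugin_class(ref: str):
    """Resolve the reference back to a class; only now must the module be importable."""
    module_name, _, class_name = ref.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Example with a stdlib class standing in for a TrainerPlugin:
from collections import OrderedDict

ref = plugin_to_ref(OrderedDict())
print(ref)  # collections.OrderedDict
assert ref_to_plugin_class(ref) is OrderedDict
```

With this scheme, loading a model never triggers an import of the plugin module; a missing plugins package would only matter if someone explicitly resolved the reference again.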

@helpmefindaname helpmefindaname self-assigned this Sep 11, 2023
@helpmefindaname helpmefindaname linked a pull request Oct 2, 2023 that will close this issue
@helpmefindaname (Collaborator)

Slight update: trainer.resume() no longer exists since the plugin system was introduced, so the fix reduces to simply not pickling the classes. That slightly simplifies the implementation. It can already be tested on the linked PR.
