
[Bug]: Loading Model with own TrainerPlugin #3312

Closed
stefan-it opened this issue Sep 7, 2023 · 3 comments · Fixed by #3325

Labels: bug (Something isn't working)

@stefan-it (Member)

stefan-it commented Sep 7, 2023

Describe the bug

Hi,

I was using the awesome new TrainerPlugin functionality and wrote my own plugin that reports GPU usage.

For that, I created a plugins folder and placed a gpu_stats.py in it with the following content:

import nvidia_smi
import logging

from flair.trainers.plugins.base import TrainerPlugin

logger = logging.getLogger("flair")


class GpuStatsPlugin(TrainerPlugin):
    def __init__(self) -> None:
        super().__init__()
        nvidia_smi.nvmlInit()

        # Always use first GPU
        self.handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)

    @TrainerPlugin.hook
    def after_training_epoch(self, epoch, **kw):
        # Query fresh memory stats each epoch (a snapshot taken in __init__ would never update)
        memory = nvidia_smi.nvmlDeviceGetMemoryInfo(self.handle)
        gpu_memory_used_mb = memory.used // 1024**2
        gpu_total_memory_mb = memory.total // 1024**2
        logger.info("GPU Memory Stats: {}MB / {}MB used".format(gpu_memory_used_mb, gpu_total_memory_mb))

Then I included it in my training code:

plugins = []

from plugins.gpu_stats import GpuStatsPlugin
plugins.append(GpuStatsPlugin())

Later in the code I called fine_tune function with:

trainer.fine_tune(
        output_path,
        learning_rate=learning_rate,
        mini_batch_size=batch_size,
        max_epochs=epoch,
        shuffle=True,
        embeddings_storage_mode='none',
        weight_decay=0.,
        use_final_model_for_eval=False,
        plugins=plugins,
    )

But then there's a problem when, for example, running inference on the Model Hub:

https://huggingface.co/hmteams/flair-hipe-2022-hipe2020-fr

When you try to run inference on the example sentence, the model loads but then an error message is thrown:

(screenshot of the error message shown on the Model Hub)

Then I tried to load the model manually:

In [1]: from flair.models import SequenceTagger
2023-09-07 18:51:26.148340: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
In [2]: tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 1
----> 1 tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr")

File ~/.venvs/dev/lib/python3.11/site-packages/flair/models/sequence_tagger_model.py:1027, in SequenceTagger.load(cls, model_path)
   1023 @classmethod
   1024 def load(cls, model_path: Union[str, Path, Dict[str, Any]]) -> "SequenceTagger":
   1025     from typing import cast
-> 1027     return cast("SequenceTagger", super().load(model_path=model_path))

File ~/.venvs/dev/lib/python3.11/site-packages/flair/nn/model.py:537, in Classifier.load(cls, model_path)
    533 @classmethod
    534 def load(cls, model_path: Union[str, Path, Dict[str, Any]]) -> "Classifier":
    535     from typing import cast
--> 537     return cast("Classifier", super().load(model_path=model_path))

File ~/.venvs/dev/lib/python3.11/site-packages/flair/nn/model.py:163, in Model.load(cls, model_path)
    161 if not isinstance(model_path, dict):
    162     model_file = cls._fetch_model(str(model_path))
--> 163     state = load_torch_state(model_file)
    164 else:
    165     state = model_path

File ~/.venvs/dev/lib/python3.11/site-packages/flair/file_utils.py:352, in load_torch_state(model_file)
    348 # load_big_file is a workaround byhttps://github.com/highway11git
    349 # to load models on some Mac/Windows setups
    350 # see https://github.com/zalandoresearch/flair/issues/351
    351 f = load_big_file(model_file)
--> 352 return torch.load(f, map_location="cpu")

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:809, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    807             except RuntimeError as e:
    808                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
--> 809         return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    810 if weights_only:
    811     try:

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:1172, in _load(zip_file, map_location, pickle_module, pickle_file, **pickle_load_args)
   1170 unpickler = UnpicklerWrapper(data_file, **pickle_load_args)
   1171 unpickler.persistent_load = persistent_load
-> 1172 result = unpickler.load()
   1174 torch._utils._validate_loaded_sparse_tensors()
   1176 return result

File ~/.venvs/dev/lib/python3.11/site-packages/torch/serialization.py:1165, in _load.<locals>.UnpicklerWrapper.find_class(self, mod_name, name)
   1163         pass
   1164 mod_name = load_module_mapping.get(mod_name, mod_name)
-> 1165 return super().find_class(mod_name, name)

ModuleNotFoundError: No module named 'plugins'

It seems that the unpickling logic expects the same plugins folder structure. When I load the model from within the folder structure that I used for training, it works perfectly.

So I think the passed plugins should not be saved/pickled with the model, because loading the model would then require the plugin code to be importable...
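To illustrate why this fails: pickle serializes class instances by reference (module name plus class name), so unpickling requires importing that exact module. A minimal stdlib-only sketch (the plugins_demo module name is made up for illustration):

```python
import os
import pickle
import sys
import tempfile

# Create a throwaway module defining a class, mimicking plugins/gpu_stats.py
tmpdir = tempfile.mkdtemp()
module_path = os.path.join(tmpdir, "plugins_demo.py")
with open(module_path, "w") as f:
    f.write("class GpuStatsPlugin:\n    pass\n")

sys.path.insert(0, tmpdir)
import plugins_demo

# Pickle stores only a reference (module name + class name), not the class code
blob = pickle.dumps(plugins_demo.GpuStatsPlugin())

# Simulate loading the model on a machine without the plugins folder
sys.path.remove(tmpdir)
del sys.modules["plugins_demo"]
os.remove(module_path)

err = None
try:
    pickle.loads(blob)
except ModuleNotFoundError as exc:
    err = exc
print(err)  # No module named 'plugins_demo'
```

This is exactly the situation on the Model Hub: the checkpoint references the plugins module, but the inference environment has no such module on its path.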

To Reproduce

from flair.models import SequenceTagger

tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")

Expected behavior

Loading should work with latest Flair master version.


Environment

#### Versions:
##### Flair
0.12.2
##### Pytorch
2.0.1+cu118
##### Transformers
4.33.0
#### GPU
True
@stefan-it stefan-it added the bug Something isn't working label Sep 7, 2023

stefan-it commented Sep 7, 2023

I could temporarily fix the error on the Model Hub by loading the model (in the original folder structure that was used for training) and manually adjusting the plugins array:

tagger.model_card["training_parameters"]["plugins"] = []

Then I saved the model again and uploaded it.

For reproducibility, you can still use the old commit to trigger the behavior:

from flair.models import SequenceTagger

tagger = SequenceTagger.load("hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad")

@helpmefindaname (Collaborator)

Hi @stefan-it,

as discussed with @plonerma and @alanakbik, I suggest we implement a state for plugins and store string references instead of classes, so that the classes only need to be present when calling trainer.resume().

That means that hmteams/flair-hipe-2022-hipe2020-fr@eec764df9ac6ec5d7c573d510281a93aa4cf17ad will stay in its broken state, but new models won't be created that way.
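The string-reference idea can be sketched roughly as follows (names are illustrative, not the actual Flair implementation): instead of pickling the plugin objects, store each plugin's qualified class name and re-import it only when it is actually needed.

```python
import importlib


def plugin_to_ref(plugin) -> str:
    """Serialize a plugin as 'module.ClassName' instead of pickling the class."""
    cls = type(plugin)
    return f"{cls.__module__}.{cls.__qualname__}"


def ref_to_plugin_class(ref: str):
    """Resolve the reference back to a class; only now must the module be importable."""
    module_name, _, class_name = ref.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Example with a stdlib class standing in for a TrainerPlugin:
from collections import OrderedDict

ref = plugin_to_ref(OrderedDict())
print(ref)  # collections.OrderedDict
assert ref_to_plugin_class(ref) is OrderedDict
```

With this scheme, loading a model never triggers an import of the plugin module; a missing plugins package would only matter if someone explicitly resolved the reference again.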

@helpmefindaname helpmefindaname self-assigned this Sep 11, 2023
@helpmefindaname helpmefindaname linked a pull request Oct 2, 2023 that will close this issue
@helpmefindaname (Collaborator)

Slight update: trainer.resume() no longer exists since the plugin system was introduced, so the fix reduces to simply not pickling the classes. That slightly simplifies the implementation. It can already be tested on the linked PR.
