bug(medcat): CU-869b2hpam Fix issue loading tokenizers off disk #213
Tokenizers were not being loaded off disk correctly. This underlying issue is the cause of #211.
The problem was in how the tokenizer path was saved. Previously, the saved tokenizer internals path included the path to the model pack as well. That meant that unless the folder structure was identical at load time, the library would fail to load the spaCy model from the saved location. However, due to a fallback, it normally instead downloaded and/or used the locally installed copy of the same spaCy model.
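The path problem described above can be illustrated with a minimal sketch. The folder names and the two helper functions are hypothetical and are not MedCAT's actual API; the sketch only assumes that the tokenizer internals live in a subfolder of the model pack.

```python
from pathlib import PurePosixPath


def internals_path_to_save(model_pack_dir: PurePosixPath,
                           internals_dir: PurePosixPath) -> str:
    """Hypothetical helper: store the path relative to the model pack."""
    return internals_dir.relative_to(model_pack_dir).as_posix()


def resolve_internals_path(model_pack_dir: PurePosixPath,
                           saved: str) -> PurePosixPath:
    """Hypothetical helper: rebuild the path from wherever the pack is now."""
    return model_pack_dir / saved


# Location of the model pack at save time (illustrative paths only).
old_pack = PurePosixPath("/home/alice/models/pack_v1")
internals = old_pack / "tokenizer_internals"

# Saving the full path bakes the original location into the saved value.
saved_with_pack_path = str(internals)

# Saving a relative path keeps only the part inside the model pack.
saved_relative = internals_path_to_save(old_pack, internals)

# Same pack, unpacked somewhere else at load time.
new_pack = PurePosixPath("/srv/deploy/pack_v1")
resolved = resolve_internals_path(new_pack, saved_relative)

print(saved_with_pack_path)  # still points into the old location
print(resolved)              # points into the new location
```

With the full path saved, loading only works if the model pack sits in exactly the same place as when it was saved. With the relative path, the location is resolved against the model pack folder at load time.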
However, with the Dutch model, there was the additional inclusion of `config.preprocessing.stopwords`, which is something that gets added to the spaCy language itself. That is where it became clear that the above was happening because the path was incorrect.
So this PR does 3 things:
The PR (currently) also adds a logged warning about this, though this could potentially be better off as an `INFO` message instead.
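As a rough sketch of what such a logged message could look like, assuming the warning refers to the fallback from the saved path to the model name: the function, the logger name, and the stand-in loader below are all hypothetical and are not taken from the MedCAT code base. The sketch also assumes the loader raises `OSError` when nothing is found at the given location.

```python
import logging
from typing import Callable

logger = logging.getLogger("tokenizer_loading_sketch")  # hypothetical logger name


def load_with_fallback(saved_path: str, model_name: str,
                       loader: Callable[[str], object]) -> object:
    """Try the saved path first; on failure, log and fall back to the name."""
    try:
        return loader(saved_path)
    except OSError as err:
        # Switch this to logger.info(...) if the fallback is considered routine.
        logger.warning("Could not load tokenizer from %r (%s); falling back to %r",
                       saved_path, err, model_name)
        return loader(model_name)


def fake_loader(target: str) -> str:
    """Stand-in for a real model loader: any filesystem-style path 'fails'."""
    if target.startswith("/"):
        raise OSError(f"nothing found at {target}")
    return f"loaded:{target}"


result = load_with_fallback("/old/location/tokenizer_internals",
                            "some_model", fake_loader)
```

Whether `WARNING` or `INFO` is the right level depends on whether falling back is treated as a sign of a broken model pack or as normal behaviour.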