Conversation

@mart-r (Collaborator) commented Nov 5, 2025

There was an issue with loading tokenizers off disk correctly. The underlying issue is the cause of #211.

The issue at hand had to do with how the tokenizer path was saved (incorrectly). The way it was previously saved, the tokenizer internals path included the path to the model pack as well. That meant that unless the folder structure was identical, the library would fail to load the spacy model. However, due to a fallback, it normally downloaded and/or used the locally installed spacy model instead.

However, the Dutch model additionally includes config.preprocessing.stopwords, which is something that gets added to the spacy language itself. That is where it became clear that the above was happening because the path was incorrect.
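
A minimal sketch of the failure mode described above (all paths are hypothetical, and the actual config keys in MedCAT may differ):

```python
from pathlib import Path

# Hypothetical illustration of the bug: the stored tokenizer path included the
# original model pack directory, i.e. an absolute path from the machine that
# saved the model.
saved_path = Path("/home/alice/models/my_model_pack/spacy_model")

# When the pack is unpacked somewhere else, that stored path no longer exists,
# so the bundled spacy model cannot be found and the fallback kicks in
# (downloading and/or reusing the locally installed spacy model instead).
new_pack_root = Path("/data/deploy/my_model_pack")
print(saved_path.exists())  # typically False on the new machine

# What should be stored is only the path relative to the model pack root,
# which resolves correctly wherever the pack is unpacked.
relative_path = Path("spacy_model")
print(new_pack_root / relative_path)
```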

So this PR does 3 things (sketched in code after the list):

  • Makes sure the tokenizer path is correct at model save time
    • I.e. only includes the path relative to the model pack root
    • This ensures future models save this in the correct format
  • Reconfigures the tokenizer internals path at model load time
    • I.e. uses only the base name (the last folder name) in combination with the model load path
    • This ensures models saved in the incorrect format are loaded correctly
  • Fixes the extraction of the language string from the spacy internals path
    • This didn't consider the prefix used for the saved internal state
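
A rough sketch of the three changes, with hypothetical helper names and a hypothetical `SPACY_PREFIX` value (the real implementation lives in the tokenizer save/load code):

```python
import os

SPACY_PREFIX = "spacy_"  # assumed prefix used when saving the internal state


def tokenizer_path_for_save(internals_path: str, model_pack_root: str) -> str:
    # Fix 1: store only the path relative to the model pack root,
    # so the pack can be unpacked anywhere.
    return os.path.relpath(internals_path, start=model_pack_root)


def tokenizer_path_for_load(saved_path: str, model_load_path: str) -> str:
    # Fix 2: ignore whatever directory structure was baked into older packs and
    # rebuild the path from the base name plus the current load location.
    return os.path.join(model_load_path, os.path.basename(saved_path))


def language_from_internals_path(saved_path: str) -> str:
    # Fix 3: strip the save-time prefix before treating the folder name as the
    # spacy language string (e.g. "spacy_nl" -> "nl").
    name = os.path.basename(saved_path)
    return name[len(SPACY_PREFIX):] if name.startswith(SPACY_PREFIX) else name
```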

The PR (currently) also adds a logged warning about this, though it could potentially be better off as an INFO message instead.

@tomolopolis (Member)

@mart-r merged commit b7711c8 into main on Nov 10, 2025
21 of 22 checks passed
@mart-r deleted the bug/medcat/CU-869b2hpam-fix-issue-loading-tokenizers-off-disk branch on November 10, 2025 at 16:20