Conversation

@mart-r (Collaborator) commented Nov 5, 2025

There was an issue with loading tokenizers off disk correctly. The underlying issue is the cause of #211.

The issue at hand had to do with how the tokenizer path was saved (incorrectly). The way it was previously saved, the tokenizer internals path included the path to the model pack as well. That meant that unless the folder structure was identical, the library would fail to load the spacy model. However, due to a fallback, it normally downloaded and/or used the locally installed spacy model instead.

However, the Dutch model additionally includes config.preprocessing.stopwords, which is something that gets added to the spacy language itself. That is where it became clear that the above was happening because the path was incorrect.
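
A minimal sketch of the failure mode described above (all paths are hypothetical, and the actual config keys in MedCAT may differ):

```python
from pathlib import Path

# Hypothetical illustration of the bug: the stored tokenizer path included the
# original model pack directory, i.e. an absolute path from the machine that
# saved the model.
saved_path = Path("/home/alice/models/my_model_pack/spacy_model")

# When the pack is unpacked somewhere else, that stored path no longer exists,
# so the bundled spacy model cannot be found and the fallback kicks in
# (downloading and/or reusing the locally installed spacy model instead).
new_pack_root = Path("/data/deploy/my_model_pack")
print(saved_path.exists())  # typically False on the new machine

# What should be stored is only the path relative to the model pack root,
# which resolves correctly wherever the pack is unpacked.
relative_path = Path("spacy_model")
print(new_pack_root / relative_path)
```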

So this PR does 3 things (sketched in code after the list):

  • Makes sure the tokenizer path is correct at model save time
    • I.e. only includes the path relative to the model pack root
    • This ensures future models save this in the correct format
  • Reconfigures the tokenizer internals path at model load time
    • I.e. uses only the base name (the last folder name) in combination with the model load path
    • This ensures models saved in the incorrect format are loaded correctly
  • Fixes the extraction of the language string from the spacy internals path
    • This didn't consider the prefix used for the saved internal state
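
A rough sketch of the three changes, with hypothetical helper names and a hypothetical `SPACY_PREFIX` value (the real implementation lives in the tokenizer save/load code):

```python
import os

SPACY_PREFIX = "spacy_"  # assumed prefix used when saving the internal state


def tokenizer_path_for_save(internals_path: str, model_pack_root: str) -> str:
    # Fix 1: store only the path relative to the model pack root,
    # so the pack can be unpacked anywhere.
    return os.path.relpath(internals_path, start=model_pack_root)


def tokenizer_path_for_load(saved_path: str, model_load_path: str) -> str:
    # Fix 2: ignore whatever directory structure was baked into older packs and
    # rebuild the path from the base name plus the current load location.
    return os.path.join(model_load_path, os.path.basename(saved_path))


def language_from_internals_path(saved_path: str) -> str:
    # Fix 3: strip the save-time prefix before treating the folder name as the
    # spacy language string (e.g. "spacy_nl" -> "nl").
    name = os.path.basename(saved_path)
    return name[len(SPACY_PREFIX):] if name.startswith(SPACY_PREFIX) else name
```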

The PR (currently) also adds a logged warning about this, though it could potentially be better off as an INFO message instead.

@tomolopolis (Member)

@mart-r merged commit b7711c8 into main on Nov 10, 2025
21 of 22 checks passed
@mart-r deleted the bug/medcat/CU-869b2hpam-fix-issue-loading-tokenizers-off-disk branch on November 10, 2025 at 16:20