
Conversation

Contributor

@adam-sutton-1992 adam-sutton-1992 commented Aug 4, 2025

This is a WIP showing the initial commit of the embedding linker, and how I've tested / instantiated it.

WHAT IT DOES:

This is a linker that requires no training. However, it does have higher computational overhead.

Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or abhinand/MedEmbed-small-v0.1) we embed all names in cdb.name2info. This will result in an N-dimensional vector for each name.

When provided with entities from the NER step, we compare their contexts to predict the appropriate CUI.
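A minimal, self-contained sketch of this idea follows. The toy hash-based `embed` function stands in for the sentence-transformers model, and the names, CUIs and dimensions are made up; it only illustrates the "embed all names once, then pick the most similar CUI for a context" flow:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy stand-in for model.encode(text): a deterministic unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

# One N-dimensional vector per name in (a toy version of) cdb.name2info.
name2cui = {"myocardial infarction": "C0027051", "diabetes mellitus": "C0011849"}
name_vecs = {name: embed(name) for name in name2cui}

def predict_cui(context: str) -> str:
    """Pick the CUI whose name embedding is most similar to the context."""
    ctx = embed(context)
    best_name = max(name_vecs, key=lambda n: float(ctx @ name_vecs[n]))
    return name2cui[best_name]
```

With a real embedding model the cosine similarities would be meaningful; here the stub only keeps the sketch runnable.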

HOW IT'S BEEN TESTED:

I am currently replacing the linker in an existing model:

cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2")
register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)

cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.comp_name = EmbeddingLinker.name

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config
)

cat._pipeline._components[-1] = embedding_linker

cat._recreate_pipe()

embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1")

cat.save_model_pack(target_folder="/data/adam/models/",
                    pack_name="kch_gstt_v2_MedEmbed_RAW",
                    add_hash_to_pack_name=False,
                    make_archive=False
)

I then run this to test performance:

from medcat.cat import CAT
from medcat.components.types import register_core_component
from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.stats.stats import get_stats
from medcat.data.mctexport import MedCATTrainerExport

register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)
cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2_embedding_linker")

# ... load the distemist and snomed_ct test sets here ...

all_projects = MedCATTrainerExport(projects=[])
all_projects["projects"].append(distemist)
all_projects["projects"].append(snomed_ct)
get_stats(cat, all_projects)

I've tested this on the original model, and twice with the new embedding linker using "abhinand/MedEmbed-small-v0.1" as the embedding model. While the linker can use link_candidates provided by the vocab-based NER step, it can also generate its own (via a config option using short context windows) for a slight increase in performance.

Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955

Using embedding linker:
Epoch: 0, Prec: 0.08745465862505634, Rec: 0.34906193780519146, F1: 0.13986681312645888

The only really important stat here is recall. While these numbers look low, I'm led to believe they are usually reported with filters applied, which I've avoided for a fairer comparison.

@mart-r mart-r marked this pull request as draft August 4, 2025 15:22
Collaborator

@mart-r mart-r left a comment


Overall, this sounds great!

I've just left "a few" comments on things I'm getting confused about.

@adam-sutton-1992
Contributor Author

TODO or discuss, that I currently know of:

  • Initialising for inference: I think having it embed upon init might not be the best way, as it can be time-consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.
  • Testing with removal of stop words / punctuation.
  • Best way to embed names with separators gracefully, either by reverse-engineering the method or via a suitable compromise.
  • Optimal settings in the config as a base.
  • Optimal similarity_threshold for precision / recall / f1.
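For the last point, one way to look for an optimal similarity_threshold is a simple sweep over candidate thresholds, scoring precision / recall / F1 at each. The scores and labels below are made-up toy data, purely to show the mechanics:

```python
def prf1(scores, labels, threshold):
    """Precision, recall and F1 when linking everything scored >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

scores = [0.91, 0.84, 0.62, 0.55, 0.40]   # toy cosine similarities
labels = [True, True, False, True, False]  # whether the link was correct
# pick the threshold with the best F1 over a coarse grid
best = max((t / 100 for t in range(30, 95, 5)),
           key=lambda t: prf1(scores, labels, t)[2])
```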

Otherwise the changes I've submitted have done a few things:

  • Added generating link candidates via names. If you choose to ignore candidates provided by the NER step, or have a detected entity with no candidates, some are generated. This slightly improves performance in both cases, and is even better when combined with filtering.
  • Removed infer-by-CUI. With the generation of link candidates via names it feels quite obsolete. We only infer via CUI names when disambiguating based on the longest name OR the preferred name (which could do with a bit more testing).
  • Various bug fixes and improvements as discussed here.
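The candidate-generation idea in the first bullet can be sketched roughly as below. Here `difflib` string similarity stands in for embedding similarity, and the names, CUIs and threshold are toy values, not MedCAT's actual API:

```python
import difflib

# toy stand-in for cdb name-to-CUI data
name2cui = {"heart attack": "C1", "myocardial infarction": "C1", "migraine": "C2"}

def generate_candidates(entity_text: str, threshold: float = 0.6) -> set[str]:
    """Return CUIs whose names are sufficiently similar to the detected text.

    Used when the NER step supplied no link_candidates (or we ignore them).
    """
    cuis = set()
    for name, cui in name2cui.items():
        sim = difflib.SequenceMatcher(None, entity_text.lower(), name).ratio()
        if sim >= threshold:
            cuis.add(cui)
    return cuis
```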

Collaborator

@mart-r mart-r left a comment


Some small comments on the code

@mart-r
Collaborator

mart-r commented Aug 22, 2025

Initialising for inference: I think having it embed upon init might not be the best way, as it can be time-consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.

You could also init at __call__ time when it's first needed. But that's really quite unintuitive since it would mean the first inference call would take "forever" - even if it was for one small piece of text. Which could cause the user to think it's broken.
Like I said before though - as long as you provide a tutorial alongside that shows what the initial step(s) are to get this saved on disk, it'll probably be fine. So if the initial model creation involves an additional step (after setting config and stuff) to do the embedding, and you save the model after that, I'd say that would be good enough.
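The lazy-init alternative mentioned above could look roughly like this; the class and attribute names are hypothetical, and the "expensive" step is a toy stand-in for embedding all CDB names:

```python
class LazyEmbeddingLinker:
    """Sketch: defer the expensive embedding step until first use."""

    def __init__(self, names):
        self.names = names
        self._embeddings = None  # deferred: computed on first __call__

    def _ensure_embeddings(self):
        if self._embeddings is None:
            # in reality: model.encode() over all CDB names (slow)
            self._embeddings = {n: len(n) for n in self.names}  # toy stand-in

    def __call__(self, doc):
        self._ensure_embeddings()  # first call pays the cost; later calls don't
        return doc
```

This is exactly the trade-off described: the first inference call would absorb the full embedding cost.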

As for is_dirty, I think the best you can do is to check for the flag and log a warning. Because you could recalculate at __call__ time if you find the CDB to be dirty, but this may - again - take so long that the user thinks the whole thing is broken. Though if the change is additive (in terms of CUIs or names) then you should be able to only embed the bits that were added. But things may not be as easy if there's stuff removed (though probably still technically doable).
The other option would be to overwrite some of the CDB's methods (to add / remove cuis / names) and either log a warning at that time or even raise an exception (this could also be a configuration option).
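That "overwrite the CDB's mutators" option could be sketched like this; the class, method names and config flag are hypothetical, not MedCAT's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def guard_mutator(method, raise_on_change=False):
    """Wrap a CDB mutator so dirtying changes either warn or raise."""
    def wrapped(*args, **kwargs):
        msg = ("CDB changed after embeddings were computed; "
               "re-run create_embeddings() before linking.")
        if raise_on_change:  # could come from a config option
            raise RuntimeError(msg)
        logger.warning(msg)
        return method(*args, **kwargs)
    return wrapped

class ToyCDB:
    def __init__(self):
        self.names = set()
    def add_name(self, name):
        self.names.add(name)

cdb = ToyCDB()
cdb.add_name = guard_mutator(cdb.add_name, raise_on_change=False)
```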

Best way to embed names with separators gracefully, either by reverse-engineering the method or via a suitable compromise.

I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there.
The only real way I could see this working is if you had the SNOMED release (and the UMLS bits that were used for enrichment) and then just mapped all the processed names to their originals. You could do this as a one time thing on a per model basis, but it'd be tedious to do in a general manner.

Overall, you might also want to add some tests to this. Probably with a very low-performance but small model. But something that shows the entire thing works in the technical sense.

@adam-sutton-1992
Contributor Author

I've just added this in the __call__ method:

if self.cdb.is_dirty:
    logging.warning("CDB has been modified since last save/load. This might significantly affect linking performance.")
    logging.warning("If you have added new concepts or changes, please re-embed the CDB names and CUIs before linking.")

I think there are tons of permutations I could do around being smart with embedding names and CUIs. But I think it might just not be worth it.

@mart-r
Collaborator

mart-r commented Aug 22, 2025

I think that's fair.

Though maybe also tell them the method to use for the re-embedding, i.e.:

linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name)

PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

@adam-sutton-1992
Contributor Author

Hihi,

Additional changes:

  • first bits of unit testing.
  • Fixes based on feedback from said unit testing :)
  • Optimal hyper-params for the linker (including stop words, context window length, etc.).

Regarding:

  • PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

I was curious what the best way is to access the embedding linker, as it needs to be called when embedding names and CUIs?

Currently I access the component via: cat._pipeline._components[-1], then call embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1").

Which isn't ideal I suppose.

I think the only TODO in terms of development is deciding what to do with similarity thresholds. I'm thinking that if you generate your own link candidates, a higher threshold is appropriate, but when you get to larger contexts, that should be a smaller threshold.
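That threshold policy could be sketched as below; the numbers are illustrative guesses, not tuned values, and the function name is hypothetical:

```python
def pick_threshold(self_generated: bool, context_tokens: int) -> float:
    """Stricter for self-generated candidates, looser for longer contexts."""
    base = 0.7 if self_generated else 0.5
    if context_tokens > 20:  # larger contexts dilute the similarity signal
        base -= 0.1
    return base
```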

@mart-r
Collaborator

mart-r commented Aug 29, 2025

The API to access the component would be:

from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking)

EDIT: You may want to resync with master - otherwise you're stuck with using cat._pipeline to access the pipe.

@adam-sutton-1992
Contributor Author

Yep. That works nicely.

@mart-r
Collaborator

mart-r commented Sep 2, 2025

The test failure should be fixed by #125

@adam-sutton-1992 adam-sutton-1992 marked this pull request as ready for review September 17, 2025 20:23
@adam-sutton-1992 adam-sutton-1992 changed the title from "initial commit for embedding linker" to "Embedding Linker using MLM based embeddings" Sep 18, 2025
Collaborator

@mart-r mart-r left a comment


Looking good overall!

A few things that need to be addressed.
A few things that should be addressed.
And a few things potentially up for discussion.

PS:
I think this also needs a bit of a tutorial. However, I'm happy for that to be in a separate PR.

@adam-sutton-1992
Contributor Author

I made the change regarding max_length; otherwise I think everything is as it should be. :D

Collaborator

@mart-r mart-r left a comment


There's one small thing with docstrings, and a kind of comment on defaulting to filtering before disambiguation.

Collaborator

@mart-r mart-r left a comment


Looks good to me!
