
Conversation

Contributor

@adam-sutton-1992 adam-sutton-1992 commented Aug 4, 2025

This is a WIP showing the initial commit of the embedding linker, and how I've tested / instantiated it.

WHAT IT DOES:

This is a linker that requires no training. However, it does have higher computational overhead.

Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or abhinand/MedEmbed-small-v0.1) we embed all names in cdb.name2info. This will result in an N-dimensional vector for each name.

When provided with entities from the NER step, we compare their contexts to predict the appropriate CUI.
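A minimal, self-contained sketch of this idea follows. The toy hash-based `embed` function stands in for the sentence-transformers model, and the names, CUIs and dimensions are made up; it only illustrates the "embed all names once, then pick the most similar CUI for a context" flow:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 32) -> np.ndarray:
    """Toy stand-in for model.encode(text): a deterministic unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

# One N-dimensional vector per name in (a toy version of) cdb.name2info.
name2cui = {"myocardial infarction": "C0027051", "diabetes mellitus": "C0011849"}
name_vecs = {name: embed(name) for name in name2cui}

def predict_cui(context: str) -> str:
    """Pick the CUI whose name embedding is most similar to the context."""
    ctx = embed(context)
    best_name = max(name_vecs, key=lambda n: float(ctx @ name_vecs[n]))
    return name2cui[best_name]
```

With a real embedding model the cosine similarities would be meaningful; here the stub only keeps the sketch runnable.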

HOW IT'S BEEN TESTED:

I am currently replacing the linker in an existing model:

cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2")
register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)

cat.config.components.linking = EmbeddingLinking()
cat.config.components.linking.comp_name = EmbeddingLinker.name

embedding_linker = EmbeddingLinker(
    cdb=cat.cdb,
    config=cat.config
)

cat._pipeline._components[-1] = embedding_linker

cat._recreate_pipe()

embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1")

cat.save_model_pack(target_folder="/data/adam/models/",
                    pack_name="kch_gstt_v2_MedEmbed_RAW",
                    add_hash_to_pack_name=False,
                    make_archive=False
)

I then run this to test performance:

from medcat.cat import CAT
from medcat.components.types import register_core_component
from medcat.components.types import CoreComponentType
from medcat.components.linking.embedding_linker import Linker as EmbeddingLinker
from medcat.stats.stats import get_stats
from medcat.data.mctexport import MedCATTrainerExport

register_core_component(CoreComponentType.linking, EmbeddingLinker.name, EmbeddingLinker.create_new_component)
cat = CAT.load_model_pack("/data/adam/models/kch_gstt_v2_embedding_linker")

# ... load the distemist and snomed_ct test sets here ...

all_projects = MedCATTrainerExport(projects=[])
all_projects["projects"].append(distemist)
all_projects["projects"].append(snomed_ct)
get_stats(cat, all_projects)

I've tested this on the original model, and twice with the new embedding linker using "abhinand/MedEmbed-small-v0.1" as the embedding model. While the linker can use link_candidates provided by the vocab-based NER step, it can also generate its own (via a config option using short context windows) for a slight increase in performance.

Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955

Using embedding linker:
Epoch: 0, Prec: 0.08745465862505634, Rec: 0.34906193780519146, F1: 0.13986681312645888

The only really important stat here is recall. While these numbers look low, I'm led to believe they are usually reported with filters applied, which I've avoided for a fairer comparison.

@mart-r mart-r marked this pull request as draft August 4, 2025 15:22
Collaborator

@mart-r mart-r left a comment


Overall, this sounds great!

I've just left "a few" comments on things I'm getting confused about.

@adam-sutton-1992
Contributor Author

TODO or discuss, that I currently know of:

  • Initialising for inference: I think having it embed upon init might not be the best way, as it can be time-consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.
  • Testing with removal of stop words / punctuation.
  • Best way to embed names with separators gracefully, either by reverse-engineering the method or via a suitable compromise.
  • Optimal settings in the config as a base.
  • Optimal similarity_threshold for precision / recall / f1.
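For the last point, one way to look for an optimal similarity_threshold is a simple sweep over candidate thresholds, scoring precision / recall / F1 at each. The scores and labels below are made-up toy data, purely to show the mechanics:

```python
def prf1(scores, labels, threshold):
    """Precision, recall and F1 when linking everything scored >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

scores = [0.91, 0.84, 0.62, 0.55, 0.40]   # toy cosine similarities
labels = [True, True, False, True, False]  # whether the link was correct
# pick the threshold with the best F1 over a coarse grid
best = max((t / 100 for t in range(30, 95, 5)),
           key=lambda t: prf1(scores, labels, t)[2])
```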

Otherwise the changes I've submitted have done a few things:

  • Added generating link candidates via names. If you choose to ignore candidates provided by the NER step, or have a detected entity with no candidates, some are generated. This slightly improves performance in both cases, and is even better when combined with filtering.
  • Removed infer-by-CUI. With the generation of link candidates via names it feels quite obsolete. We only infer via CUI names when disambiguating based on the longest name OR the preferred name (which could do with a bit more testing).
  • Various bug fixes and improvements as discussed here.
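The candidate-generation idea in the first bullet can be sketched roughly as below. Here `difflib` string similarity stands in for embedding similarity, and the names, CUIs and threshold are toy values, not MedCAT's actual API:

```python
import difflib

# toy stand-in for cdb name-to-CUI data
name2cui = {"heart attack": "C1", "myocardial infarction": "C1", "migraine": "C2"}

def generate_candidates(entity_text: str, threshold: float = 0.6) -> set[str]:
    """Return CUIs whose names are sufficiently similar to the detected text.

    Used when the NER step supplied no link_candidates (or we ignore them).
    """
    cuis = set()
    for name, cui in name2cui.items():
        sim = difflib.SequenceMatcher(None, entity_text.lower(), name).ratio()
        if sim >= threshold:
            cuis.add(cui)
    return cuis
```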

Collaborator

@mart-r mart-r left a comment


Some small comments on the code

@mart-r
Collaborator

mart-r commented Aug 22, 2025

Initialising for inference: I think having it embed upon init might not be the best way, as it can be time-consuming, especially if you're planning to change the embedding model. Although handling is_dirty gracefully is also required.

You could also init at __call__ time when it's first needed. But that's really quite unintuitive since it would mean the first inference call would take "forever" - even if it was for one small piece of text. Which could cause the user to think it's broken.
Like I said before though - as long as you provide a tutorial alongside that shows what the initial step(s) are to get this saved on disk, it'll probably be fine. So if the initial model creation involves an additional step (after setting config and stuff) to do the embedding, and you save the model after that, I'd say that would be good enough.
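The lazy-init alternative mentioned above could look roughly like this; the class and attribute names are hypothetical, and the "expensive" step is a toy stand-in for embedding all CDB names:

```python
class LazyEmbeddingLinker:
    """Sketch: defer the expensive embedding step until first use."""

    def __init__(self, names):
        self.names = names
        self._embeddings = None  # deferred: computed on first __call__

    def _ensure_embeddings(self):
        if self._embeddings is None:
            # in reality: model.encode() over all CDB names (slow)
            self._embeddings = {n: len(n) for n in self.names}  # toy stand-in

    def __call__(self, doc):
        self._ensure_embeddings()  # first call pays the cost; later calls don't
        return doc
```

This is exactly the trade-off described: the first inference call would absorb the full embedding cost.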

As for is_dirty, I think the best you can do is to check for the flag and log a warning. Because you could recalculate at __call__ time if you find the CDB to be dirty, but this may - again - take so long that the user thinks the whole thing is broken. Though if the change is additive (in terms of CUIs or names) then you should be able to only embed the bits that were added. But things may not be as easy if there's stuff removed (though probably still technically doable).
The other option would be to overwrite some of the CDB's methods (to add / remove cuis / names) and either log a warning at that time or even raise an exception (this could also be a configuration option).
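That "overwrite the CDB's mutators" option could be sketched like this; the class, method names and config flag are hypothetical, not MedCAT's actual API:

```python
import logging

logger = logging.getLogger(__name__)

def guard_mutator(method, raise_on_change=False):
    """Wrap a CDB mutator so dirtying changes either warn or raise."""
    def wrapped(*args, **kwargs):
        msg = ("CDB changed after embeddings were computed; "
               "re-run create_embeddings() before linking.")
        if raise_on_change:  # could come from a config option
            raise RuntimeError(msg)
        logger.warning(msg)
        return method(*args, **kwargs)
    return wrapped

class ToyCDB:
    def __init__(self):
        self.names = set()
    def add_name(self, name):
        self.names.add(name)

cdb = ToyCDB()
cdb.add_name = guard_mutator(cdb.add_name, raise_on_change=False)
```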

Best way to embed names with separators gracefully, either by reverse-engineering the method or via a suitable compromise.

I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there.
The only real way I could see this working is if you had the SNOMED release (and the UMLS bits that were used for enrichment) and then just mapped all the processed names to their originals. You could do this as a one time thing on a per model basis, but it'd be tedious to do in a general manner.

Overall, you might also want to add some tests to this. Probably with a very low-performance but small model. But something that shows the entire thing works in the technical sense.

@adam-sutton-1992
Contributor Author

I've just added this in the __call__ method:

if self.cdb.is_dirty:
    logging.warning("CDB has been modified since last save/load. This might significantly affect linking performance.")
    logging.warning("If you have added new concepts or changes, please re-embed the CDB names and CUIs before linking.")

I think there are tons of permutations I could do around being smart with embedding names and CUIs. But I think it might just not be worth it.

@mart-r
Collaborator

mart-r commented Aug 22, 2025

I think that's fair.

Though maybe also tell them the method to use for the re-embedding, i.e.:

linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name)

PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

@adam-sutton-1992
Contributor Author

Hihi,

Additional changes:

  • first bits of unit testing.
  • Fixes based on feedback from said unit testing :)
  • Optimal hyper-params for the linker (including stop words, context window length, etc.).

Regarding:

  • PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.

I was curious what the best way is to access the embedding linker, as it needs to be called when embedding names and CUIs?

Currently I access the component via: cat._pipeline._components[-1], then call embedding_linker.create_embeddings(embedding_model_name="abhinand/MedEmbed-small-v0.1").

Which isn't ideal I suppose.

I think the only TODO in terms of development is deciding what to do with similarity thresholds. I'm thinking that if you generate your own link candidates, a higher threshold is appropriate, but when you get to larger contexts, that should be a smaller threshold.
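That threshold policy could be sketched as below; the numbers are illustrative guesses, not tuned values, and the function name is hypothetical:

```python
def pick_threshold(self_generated: bool, context_tokens: int) -> float:
    """Stricter for self-generated candidates, looser for longer contexts."""
    base = 0.7 if self_generated else 0.5
    if context_tokens > 20:  # larger contexts dilute the similarity signal
        base -= 0.1
    return base
```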

@mart-r
Collaborator

mart-r commented Aug 29, 2025

The API to access the component would be:

from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking)

EDIT: You may want to resync with master - otherwise you're stuck with using cat._pipeline to access the pipe.

@adam-sutton-1992
Contributor Author

Yep. That works nicely.

@mart-r
Collaborator

mart-r commented Sep 2, 2025

The test failure should be fixed by #125

@adam-sutton-1992 adam-sutton-1992 marked this pull request as ready for review September 17, 2025 20:23
@adam-sutton-1992 adam-sutton-1992 changed the title from "initial commit for embedding linker" to "Embedding Linker using MLM based embeddings" Sep 18, 2025
Collaborator

@mart-r mart-r left a comment


Looking good overall!

A few things that need to be addressed.
A few things that should be addressed.
And a few things potentially up for discussion.

PS:
I think this also needs a bit of a tutorial. However, I'm happy for that to be in a separate PR.

@adam-sutton-1992
Contributor Author

I made the change regarding max_length; otherwise I think everything is as it should be. :D

Collaborator

@mart-r mart-r left a comment


There's one small thing with docstrings, and a kind of comment on defaulting to filtering before disambiguation.

Collaborator

@mart-r mart-r left a comment


Looks good to me!
