Embedding Linker using MLM based embeddings #65
base: main
Conversation
Overall, this sounds great!
I've just left "a few" comments on things I'm getting confused about.
TODOs or points to discuss that I currently know of:
Otherwise the changes I've submitted have done a few things:
Some small comments on the code
You could also init at As for
I don't think there's really a way to do that gracefully. This would be a generative process - you're wanting to generate information that isn't there. Overall, you might also want to add some tests for this. Probably with a small, low-performance model, but something that shows the entire thing works in the technical sense.
I've just added this in the call method:

```python
if self.cdb.is_dirty:
```

I think there are tons of permutations I could do around being smart with embedding names and CUIs. But I think it might just not be worth it.
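For illustration, the dirty-flag check above could drive lazy re-embedding along these lines. This is a minimal sketch: `FakeCDB`, the `embed_calls` counter, and the class shape are hypothetical stand-ins, not MedCAT's actual API.

```python
# Hypothetical sketch of lazy re-embedding driven by a dirty flag.
# FakeCDB and embed_calls are illustrative stand-ins, not MedCAT APIs.
class FakeCDB:
    def __init__(self, names):
        self.name2info = {n: {} for n in names}
        self.is_dirty = True  # freshly built, nothing embedded yet

class EmbeddingLinker:
    def __init__(self, cdb, embedding_model_name):
        self.cdb = cdb
        self.embedding_model_name = embedding_model_name
        self.embed_calls = 0  # just to demonstrate when re-embedding fires

    def embed_cui_names(self, model_name):
        # Real code would run the sentence-transformer over all names here.
        self.name_vecs = {n: None for n in self.cdb.name2info}
        self.embed_calls += 1

    def __call__(self, doc):
        if self.cdb.is_dirty:
            self.embed_cui_names(self.embedding_model_name)
            self.cdb.is_dirty = False
        return doc

cdb = FakeCDB(["heart attack", "diabetes"])
linker = EmbeddingLinker(cdb, "abhinand/MedEmbed-small-v0.1")
linker("doc 1")
linker("doc 2")       # no re-embedding: CDB unchanged since last call
cdb.is_dirty = True   # e.g. a name was added to the CDB
linker("doc 3")       # triggers exactly one more embedding pass
print(linker.embed_calls)  # → 2
```

The point of the pattern is that embedding every name is expensive, so it only happens on first use and after the CDB actually changes.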
I think that's fair. Though maybe also tell them the method to use for the re-embedding, i.e.:

```python
linker = self.cat._pipeline.get_component(CoreComponentType.linking)
linker.embed_cui_names(linker.embedding_model_name)
```

PS: I'd probably need to improve getting the pipe from the CAT object instead of using the protected attribute.
Hihi, Additional changes:
Regarding:
I was curious what the best way is to access the embedding linker, as it needs to be called when embedding names and CUIs? Currently I access the component via: Which isn't ideal, I suppose. I think the only TODO in terms of development is deciding what to do with similarity thresholds. I'm thinking that if you generate your own link candidates a higher threshold is appropriate, but when you get to larger contexts then that should be a smaller threshold.
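The threshold idea in the last sentence could be sketched as a simple rule: self-generated candidates from short context windows get a stricter similarity cut-off, wider contexts a looser one. The window cut-off and threshold values below are placeholders, not tuned numbers from this PR.

```python
# Hypothetical sketch: stricter similarity threshold for short context
# windows (self-generated candidates), looser for larger contexts.
# All numeric values are placeholders, not tuned values.
def similarity_threshold(context_window: int,
                         short_window: int = 5,
                         strict: float = 0.7,
                         loose: float = 0.4) -> float:
    return strict if context_window <= short_window else loose

print(similarity_threshold(3))   # short window  → 0.7
print(similarity_threshold(20))  # large context → 0.4
```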
The API to access the component would be:

```python
from medcat.components.types import CoreComponentType
comp = cat.pipe.get_component(CoreComponentType.linking)
```

EDIT: You may want to resync with master - otherwise you're stuck with using
Yep. That works nicely.
The test failure should be fixed by #125
Looking good overall!
A few things that need to be addressed.
A few things that should be addressed.
And a few things potentially up for discussion.
PS:
I think this also needs a bit of a tutorial. However, I'm happy for that to be in a separate PR.
I did the change regarding
There's one small thing with docstrings, and a comment of sorts on defaulting to filtering before disambiguation.
Looks good to me!
This is WIP, just showing the initial commit of the embedding linker and how I've tested / instantiated it.
WHAT IT DOES:
This is a linker that requires no training. However, it does have higher computational overheads.
Given an embedding model name (such as sentence-transformers/all-MiniLM-L6-v2 or abhinand/MedEmbed-small-v0.1), we embed all names in cdb.name2info. This results in an N-dimensional vector for each name.
When provided with a set of entities from the NER step, we compare their contexts against these name embeddings to predict the appropriate CUI.
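The two steps above can be sketched with a toy bag-of-words encoder standing in for the sentence-transformer. Everything here is illustrative: the vocabulary, the example names, and the CUI mapping are made up for the sketch, and the real linker would encode with the configured embedding model instead.

```python
import numpy as np

# Toy bag-of-words encoder standing in for a sentence-transformer such as
# abhinand/MedEmbed-small-v0.1. VOCAB, names and CUIs here are made up.
VOCAB = sorted({"heart", "attack", "diabetes", "mellitus", "type",
                "patient", "presented", "with", "a", "last", "night"})

def embed(text: str) -> np.ndarray:
    tokens = text.lower().split()
    vec = np.array([tokens.count(t) for t in VOCAB], dtype=float)
    n = np.linalg.norm(vec)
    return vec / n if n else vec  # unit-normalise for cosine similarity

# Step 1: embed every name (the real linker walks cdb.name2info).
name2cui = {"heart attack": "C0027051", "diabetes mellitus": "C0011849"}
name_vecs = {name: embed(name) for name in name2cui}

# Step 2: embed an entity's context and link it to the CUI whose name
# embedding is most similar (dot product of unit vectors = cosine sim).
def link(context: str) -> str:
    ctx = embed(context)
    best = max(name_vecs, key=lambda n: float(ctx @ name_vecs[n]))
    return name2cui[best]

print(link("patient presented with a heart attack last night"))  # → C0027051
```

A real implementation would also apply a similarity threshold before accepting the best candidate, rather than always linking.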
HOW IT'S BEEN TESTED:
I am currently changing the linker in an already existing model:
I then run this to test performance:
I've tested this on the original model, and twice with the new embedding linker using "abhinand/MedEmbed-small-v0.1" as the embedding model. While the model can use link_candidates provided by the vocab-based NER step, it can also generate its own (via a config) from short context windows, for a slight increase in performance.
Using context_based_linker:
Epoch: 0, Prec: 0.09023910064860668, Rec: 0.3306035738148675, F1: 0.14177920150470955
Using embedding linker:
Epoch: 0, Prec: 0.08745465862505634, Rec: 0.34906193780519146, F1: 0.13986681312645888
The only really important stat here is recall. While these numbers look low, I'm led to believe they usually come with filters applied, which I've avoided here for a fairer comparison.