Add Model2Vec support #110
I would like to benchmark this myself first, but I agree that the idea mentioned here might be nice for a livestream on the probabl channel. I could explore it there, and if it works out I can always choose to add it here.

Sounds good! Happy to answer any questions about the library.

@Pringled I guess this is the simplest integration path?
@koaning Either option (via Sentence Transformers or directly with Model2Vec) should be easy to integrate. I think using Model2Vec directly is slightly more flexible, since you can call `encode` to get a mean output and `encode_as_sequence` to get a sequence output (if you want to support multiple output types):

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

vs

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
```

If you want to use any functionality from Sentence Transformers, though, then that's definitely the way to go.
I will explore both during the probabl livestream next week and decide afterwards which approach is best. I am also annotating some datasets now so that I have a benchmark. I will also make another comparison: can scikit-learn pipelines with these embeddings beat an LLM?
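As an aside, a pipeline along those lines could be sketched as follows. This is a hypothetical wrapper for illustration only (`StaticEmbedder` is not an existing Embetter class); it assumes nothing beyond the embedder exposing an `.encode(texts) -> array` method, as model2vec's `StaticModel` does:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class StaticEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical Embetter-style transformer wrapping any object that
    exposes `.encode(list_of_texts) -> array`, e.g. model2vec's StaticModel."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # Static embeddings require no fitting.
        return self

    def transform(self, X, y=None):
        # Encode the input texts into a 2D array of embeddings.
        return np.asarray(self.model.encode(list(X)))
```

Such a transformer could then slot into an ordinary pipeline, e.g. `make_pipeline(StaticEmbedder(StaticModel.from_pretrained("minishlab/M2V_base_output")), LogisticRegression())`, for the "embeddings vs. LLM" benchmark mentioned above.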
Cool! Very curious about the results. I'll try to tune in for the livestream.

I will add the livestream link after the current one, which will also be a fun one by the way.

Great, thanks for the link, I'll check it out!

Aaaand it will go live this Friday!

Also @Pringled, are there any Discords that you hang out in? If you have any feedback on cool features to demonstrate I am all ears, but I may also have some questions as I dive into this rabbit hole.

Ah! Yeah, if you are in a Discord that would be cool, because I may have found a way to make this a bunch lighter.

Heya @koaning, very cool! I created a channel in our Discord server that we can use here: https://discord.gg/kvGKzc8t. Alternatively, my Discord name is "pringled." (including the dot), so you can also PM me directly if that's easier.
Hi,

I think https://github.com/MinishLab/model2vec might be a good fit for Embetter. It's a static subword embedder that outperforms both GloVe (300d) and BPEmb (50k, 300d) while being much smaller and faster (results are in the repo).

It can be used like this: