
Add Model2Vec support #110

Open
Pringled opened this issue Oct 4, 2024 · 12 comments
Pringled commented Oct 4, 2024

Hi,

I think https://github.com/MinishLab/model2vec might be a good fit for Embetter. It's a static subword embedder that outperforms both GloVe (300d) and BPEmb (50k, 300d) while being much smaller and faster (benchmark results are in the repo).

It can be used like this:

from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
koaning commented Oct 4, 2024

I would like to benchmark this myself first, but I agree that the idea mentioned here might be a nice fit for a livestream on the probabl channel. I could explore it there, and if it works out I can always add it here.

Pringled commented Oct 4, 2024

Sounds good! Happy to answer any questions about the library.

@Pringled

@koaning Either option (via Sentence Transformers or directly with Model2Vec) should be easy to integrate. I think using Model2Vec directly is slightly more flexible: you can call encode to get a mean-pooled output and encode_as_sequence to get a per-token sequence output (useful if you want to support multiple aggregation methods, as the other supported embedders do), and it requires a few fewer lines of code, e.g.:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

vs

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

If you want to use any functionality from Sentence Transformers though then that's definitely the way to go.
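To make the encode vs encode_as_sequence distinction concrete: the sequence output is one vector per token, while the mean output pools those vectors into a single sentence vector. A minimal numpy sketch of that aggregation (the token embeddings below are made-up numbers for illustration, not real Model2Vec internals or output):

```python
import numpy as np

# Hypothetical per-token embeddings for one sentence: 4 tokens, 3 dimensions.
token_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])

# "encode_as_sequence"-style output: one vector per token.
sequence_output = token_embeddings              # shape (4, 3)

# "encode"-style output: mean-pool the tokens into one sentence vector.
mean_output = token_embeddings.mean(axis=0)     # shape (3,)

print(sequence_output.shape)  # (4, 3)
print(mean_output.shape)      # (3,)
```

Supporting both outputs is what makes multiple agg strategies (mean, max, etc.) possible downstream.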

koaning commented Oct 11, 2024

I will explore both during the probabl livestream next week and decide afterwards which approach is best. I am also annotating some datasets now so that I have a benchmark.

I will also make another comparison; can scikit-learn pipelines with these embeddings beat an LLM?
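That kind of pipeline could be sketched roughly as below: embed the texts, then fit a plain scikit-learn classifier on the vectors. The embed function, texts, and labels here are all stand-ins for illustration; in practice embed would call model.encode from Model2Vec:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Stand-in embedder: a deterministic toy feature vector per text.
# In a real pipeline this would be model.encode(texts) from model2vec.
def embed(texts):
    return np.array(
        [[sum(map(ord, t)) % 97, len(t), t.count(" ")] for t in texts],
        dtype=float,
    )

# Toy two-class dataset (made up for illustration).
texts = ["great product", "loved it", "terrible service", "awful experience"]
labels = [1, 1, 0, 0]

# Text goes in, embeddings are computed, a linear model does the classifying.
pipe = make_pipeline(FunctionTransformer(embed), LogisticRegression())
pipe.fit(texts, labels)
preds = pipe.predict(texts)
```

The appeal of the comparison is that this whole pipeline is tiny and CPU-only, so even coming close to an LLM on a classification task would be a strong result.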

@Pringled

Cool! Very curious about the results. I'll try to tune in for the livestream.

koaning commented Oct 11, 2024

I will add the livestream link once the current one is done, which will also be a fun one, by the way.

@Pringled

Great, thanks for the link, I'll check it out!

koaning commented Oct 15, 2024

Aaaand it will go live this Friday!

https://www.youtube.com/live/Ymn5RVaKQA0

koaning commented Oct 15, 2024

Also @Pringled, are there any Discords that you hang out in? If you have feedback on cool features to demonstrate I am all ears, but I may also have some questions as I dive into this rabbit hole.

koaning commented Oct 15, 2024

Ah! Yeah, if you are in a Discord that would be cool because I may have found a way to make this a bunch lighter.

@Pringled

Heya @koaning, very cool! I created a channel in our Discord server that we can use for this: https://discord.gg/kvGKzc8t. Alternatively, my Discord name is "pringled." (including the dot), so you can also PM me directly if that's easier.
