
Add Model2Vec support #110

Open
Pringled opened this issue Oct 4, 2024 · 12 comments
Pringled commented Oct 4, 2024

Hi,

I think https://github.com/MinishLab/model2vec might be a good fit for Embetter. It's a static subword embedder that outperforms both GloVe (300d) and BPEmb (50k, 300d) while being much smaller and faster (benchmark results are in the repo).

It can be used like this:

from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
koaning commented Oct 4, 2024

I would like to benchmark this myself first, but I agree that the idea mentioned here might be a nice fit for a livestream on the probabl channel. I could explore it there, and if it works out I can always add it here.

Pringled commented Oct 4, 2024

Sounds good! Happy to answer any questions about the library.

@Pringled

@koaning Either option (via Sentence Transformers or directly with Model2Vec) should be easy to integrate. I think using Model2Vec directly is slightly more flexible: you can call encode to get a mean-pooled output and encode_as_sequence to get a per-token sequence output (useful if you want to support multiple aggregation methods, as the other supported embedders do), and it requires a few fewer lines of code, e.g.:

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/M2V_base_output")
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

vs

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])

If you want to use any functionality from Sentence Transformers though then that's definitely the way to go.
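To make the encode vs encode_as_sequence distinction concrete: the sequence output is one vector per token, while the mean output pools those vectors into a single sentence vector. A minimal numpy sketch of that aggregation (the token embeddings below are made-up numbers for illustration, not real Model2Vec internals or output):

```python
import numpy as np

# Hypothetical per-token embeddings for one sentence: 4 tokens, 3 dimensions.
token_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])

# "encode_as_sequence"-style output: one vector per token.
sequence_output = token_embeddings              # shape (4, 3)

# "encode"-style output: mean-pool the tokens into one sentence vector.
mean_output = token_embeddings.mean(axis=0)     # shape (3,)

print(sequence_output.shape)  # (4, 3)
print(mean_output.shape)      # (3,)
```

Supporting both outputs is what makes multiple agg strategies (mean, max, etc.) possible downstream.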

koaning commented Oct 11, 2024

I will explore both during the probabl livestream next week and decide afterwards which approach is best. I am also annotating some datasets now so that I have a benchmark.

I will also make another comparison; can scikit-learn pipelines with these embeddings beat an LLM?
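That kind of pipeline could be sketched roughly as below: embed the texts, then fit a plain scikit-learn classifier on the vectors. The embed function, texts, and labels here are all stand-ins for illustration; in practice embed would call model.encode from Model2Vec:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Stand-in embedder: a deterministic toy feature vector per text.
# In a real pipeline this would be model.encode(texts) from model2vec.
def embed(texts):
    return np.array(
        [[sum(map(ord, t)) % 97, len(t), t.count(" ")] for t in texts],
        dtype=float,
    )

# Toy two-class dataset (made up for illustration).
texts = ["great product", "loved it", "terrible service", "awful experience"]
labels = [1, 1, 0, 0]

# Text goes in, embeddings are computed, a linear model does the classifying.
pipe = make_pipeline(FunctionTransformer(embed), LogisticRegression())
pipe.fit(texts, labels)
preds = pipe.predict(texts)
```

The appeal of the comparison is that this whole pipeline is tiny and CPU-only, so even coming close to an LLM on a classification task would be a strong result.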

@Pringled

Cool! Very curious about the results. I'll try to tune in for the livestream.

koaning commented Oct 11, 2024

I will add the livestream link once the current one is done, which will also be a fun one, by the way.

@Pringled

Great, thanks for the link, I'll check it out!

koaning commented Oct 15, 2024

Aaaand it will go live this Friday!

https://www.youtube.com/live/Ymn5RVaKQA0

koaning commented Oct 15, 2024

Also @Pringled, are there any Discords that you hang out in? If you have feedback on cool features to demonstrate I am all ears, but I may also have some questions as I dive into this rabbit hole.

koaning commented Oct 15, 2024

Ah! Yeah, if you are in a Discord that would be cool because I may have found a way to make this a bunch lighter.

@Pringled

Heya @koaning, very cool! I created a channel in our Discord server that we can use for this: https://discord.gg/kvGKzc8t. Alternatively, my Discord name is "pringled." (including the dot), so you can also PM me directly if that's easier.
