
Add a get_feature_names_out method in BaseEmbetter #105

Open
Vincent-Maladiere opened this issue Aug 22, 2024 · 6 comments

Comments

@Vincent-Maladiere

This will allow the parent _SetOutputMixin of TransformerMixin to enable set_output(transform="pandas") and set_output(transform="polars").

@koaning
Owner

koaning commented Aug 23, 2024

It can't hurt to add. When I started embetter this set_output stuff wasn't really out yet. I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

@koaning
Owner

koaning commented Aug 23, 2024

The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

@Vincent-Maladiere
Author

> I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

Agreed, but sometimes having tensors in a dataframe can be useful for some operations (I'm saying that with skrub in mind in particular). Even if you can't interpret the embeddings themselves, when you use them within a heterogeneous dataframe it's nice to keep the context of each column name and its specific dtype.

> The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

Right, so caching the number of output dimensions during transform would make sense here.
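A minimal sketch of that idea, caching the output width the first time transform actually runs. The class name, the fake 4-dimensional encoder, and the n_features_out_ attribute are all assumptions for illustration, not embetter's actual implementation:

```python
# Hypothetical sketch, not embetter's real code: the output width is
# unknown until inference, so we record it on the first transform and
# derive feature names from the cached value afterwards.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SketchEmbetter(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Stand-in for real inference: every input becomes a 4-dim vector.
        out = np.random.default_rng(0).random((len(X), 4))
        self.n_features_out_ = out.shape[1]  # cache the width we just saw
        return out

    def get_feature_names_out(self, input_features=None):
        return np.array(
            [f"embedding_{i}" for i in range(self.n_features_out_)]
        )
```

With something like this in place, calling set_output(transform="pandas") on the transformer should yield DataFrames with columns embedding_0 through embedding_3, since scikit-learn looks up get_feature_names_out after each transform call.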

@glebzhelezov
Contributor

Hi, I'm a (relatively...) long-time user, first-time commenter.

> it is also pretty hard to interpret a single dimension of an embedding.

I think this feature would make it more convenient to work with matryoshka embeddings, since it would let you reduce the dimensionality of the embeddings by dropping some columns from the data frame. If there are multiple matryoshka embeddings per row, you could reduce all of their dimensions with one regex-defined column selection. (I know this is a stretch, but I have Jupyter notebooks where I've done worse.)
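A toy version of that regex-selection trick. Once embedding columns have names, truncating to the first k matryoshka dimensions is a one-liner; the column names ("emb_0", ...) are an assumption here, not embetter's actual output:

```python
# Matryoshka-style truncation via a regex column filter. The frame and
# its "emb_i" column names are made up for the demonstration.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    columns=[f"emb_{i}" for i in range(6)],
)
# Keep only the first three dimensions of each embedding.
truncated = df.filter(regex=r"^emb_[0-2]$")
```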

Is anyone working on this? If not, I am happy to work on the issue.

@koaning
Owner

koaning commented Sep 13, 2024

> I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame.

Would this use-case maybe be better served by having an estimator that you can add to the pipeline as a follow-up? Something that can limit the number of columns going out?
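Such a follow-up estimator could be tiny. A hedged sketch of what it might look like (the class name and parameter are invented for illustration; nothing like this currently exists in embetter):

```python
# Hypothetical pipeline step that keeps only the first n_dims columns
# of an embedding matrix, for use after an embetter transformer.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TruncateEmbedding(TransformerMixin, BaseEstimator):
    def __init__(self, n_dims=128):
        self.n_dims = n_dims

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Slice off everything past the first n_dims dimensions.
        return np.asarray(X)[:, : self.n_dims]
```

The upside of this design is that the embedding transformer stays untouched and the truncation width is an ordinary, grid-searchable hyperparameter.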

Also, could you elaborate on the use-case here where you might have multiple features that you want to embed?

@koaning
Owner

koaning commented Sep 13, 2024

> Is anyone working on this? If not, I am happy to work on the issue.

I am open to it, sure, go for it :)
