
Add a get_feature_names_out method in BaseEmbetter #105

Open
Vincent-Maladiere opened this issue Aug 22, 2024 · 6 comments

Comments

@Vincent-Maladiere

This will allow the parent _SetOutputMixin of TransformerMixin to enable set_output(transform="pandas") and set_output(transform="polars").

@koaning
Owner

koaning commented Aug 23, 2024

It can't hurt to add. When I started embetter this set_output stuff wasn't really out yet. I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

@koaning
Owner

koaning commented Aug 23, 2024

The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

@Vincent-Maladiere
Author

> I do wonder about the use-case though. Pandas and Polars aren't really tensor libraries and it is also pretty hard to interpret a single dimension of an embedding.

Agreed, but sometimes having tensors in a dataframe can be useful for some operations (I'm saying that with skrub in mind in particular). Even if you can't interpret the embeddings themselves, when you use them within a heterogeneous dataframe it's nice to keep the context of each column name and its specific dtype.

> The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once.

Right, so caching the number of output dimensions during transform would make sense here.
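A minimal sketch of that idea, caching the output width the first time transform actually runs. The class name, the fake 4-dimensional encoder, and the n_features_out_ attribute are all assumptions for illustration, not embetter's actual implementation:

```python
# Hypothetical sketch, not embetter's real code: the output width is
# unknown until inference, so we record it on the first transform and
# derive feature names from the cached value afterwards.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SketchEmbetter(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Stand-in for real inference: every input becomes a 4-dim vector.
        out = np.random.default_rng(0).random((len(X), 4))
        self.n_features_out_ = out.shape[1]  # cache the width we just saw
        return out

    def get_feature_names_out(self, input_features=None):
        return np.array(
            [f"embedding_{i}" for i in range(self.n_features_out_)]
        )
```

With something like this in place, calling set_output(transform="pandas") on the transformer should yield DataFrames with columns embedding_0 through embedding_3, since scikit-learn looks up get_feature_names_out after each transform call.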

@glebzhelezov
Contributor

Hi, I'm a (relatively...) long-time user, first-time commenter.

> it is also pretty hard to interpret a single dimension of an embedding.

I think this feature would make it more convenient to work with matryoshka embeddings, since it would let you reduce the dimensionality of the embeddings by dropping some columns from the data frame. If there are multiple matryoshka embeddings per row, you could reduce all of their dimensions with one regex-defined column selection. (I know this is a stretch, but I have Jupyter notebooks where I've done worse.)
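A toy version of that regex-selection trick. Once embedding columns have names, truncating to the first k matryoshka dimensions is a one-liner; the column names ("emb_0", ...) are an assumption here, not embetter's actual output:

```python
# Matryoshka-style truncation via a regex column filter. The frame and
# its "emb_i" column names are made up for the demonstration.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(2, 6),
    columns=[f"emb_{i}" for i in range(6)],
)
# Keep only the first three dimensions of each embedding.
truncated = df.filter(regex=r"^emb_[0-2]$")
```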

Is anyone working on this? If not, I am happy to work on the issue.

@koaning
Owner

koaning commented Sep 13, 2024

> I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame.

Would this use-case maybe be better served by having an estimator that you can add to the pipeline as a follow-up? Something that can limit the number of columns going out?
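Such a follow-up estimator could be tiny. A hedged sketch of what it might look like (the class name and parameter are invented for illustration; nothing like this currently exists in embetter):

```python
# Hypothetical pipeline step that keeps only the first n_dims columns
# of an embedding matrix, for use after an embetter transformer.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TruncateEmbedding(TransformerMixin, BaseEstimator):
    def __init__(self, n_dims=128):
        self.n_dims = n_dims

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Slice off everything past the first n_dims dimensions.
        return np.asarray(X)[:, : self.n_dims]
```

The upside of this design is that the embedding transformer stays untouched and the truncation width is an ordinary, grid-searchable hyperparameter.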

Also, could you elaborate on the use-case here where you might have multiple features that you want to embed?

@koaning
Owner

koaning commented Sep 13, 2024

> Is anyone working on this? If not, I am happy to work on the issue.

I am open to it, sure, go for it :)
