Add a get_feature_names_out method in BaseEmbetter #105
Comments
It can't hurt to add. When I started embetter this |
The one thing that is slightly tricky is that I do not know the number of dimensions upfront. I only know them when I actually infer. That said, this feels like a cached property, so probably fine if it is calculated only once. |
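A minimal sketch of the "calculate it only once" idea, assuming a hypothetical `ToyEmbedder` class with a made-up `_encode` method standing in for the real model (neither is embetter's actual code). The output width is unknown until the first inference, so it is cached the first time it is observed:

```python
import numpy as np


class ToyEmbedder:
    """Hypothetical stand-in for an embetter-style encoder (illustration only)."""

    def __init__(self, n_dims=4):
        self.n_dims = n_dims         # pretend this is hidden inside the model
        self._n_features_out = None  # unknown until the first inference

    def _encode(self, texts):
        # Placeholder "model": deterministic vectors derived from text length.
        return np.array(
            [[len(t) * (i + 1) for i in range(self.n_dims)] for t in texts],
            dtype=float,
        )

    def transform(self, X):
        out = self._encode(list(X))
        self._n_features_out = out.shape[1]  # cache the width as a side effect
        return out

    def get_feature_names_out(self, input_features=None):
        if self._n_features_out is None:
            # Nothing inferred yet: run one probe inference, at most once.
            self._n_features_out = self._encode(["probe"]).shape[1]
        prefix = type(self).__name__.lower()
        return np.asarray(
            [f"{prefix}_{i}" for i in range(self._n_features_out)], dtype=object
        )
```

The probe call is the only extra cost, and it is skipped entirely if `transform` has already run.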
Agreed, but sometimes having tensors in a dataframe can be useful for some operations (I'm saying that with skrub in mind in particular). Even if you can't interpret embeddings, when you use them within a heterogeneous dataframe, it's nice to keep the context of each column name and specific dtypes.
Right, so caching during |
Hi, I'm a (relatively...) long-time user, first time commenter.
I think this feature would make it more convenient to work with matryoshka embeddings, since it would allow you to drop the dimension of the embeddings by dropping some columns from the data frame. If there are multiple matryoshka embeddings per row, you could reduce all of their dimensions with one regex-defined column selection. (I know this is a stretch, but I have Jupyter notebooks where I've done worse.) Is anyone working on this? If not, I am happy to work on the issue. |
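To make the regex idea concrete, here is a rough sketch with pandas. The column naming scheme (`title_emb_0`, `body_emb_0`, ...) and the 8-dimensional blocks are assumptions for illustration, not what embetter would necessarily emit:

```python
import numpy as np
import pandas as pd

# Two hypothetical embedding blocks per row, 8 dimensions each.
cols = [f"{name}_{i}" for name in ("title_emb", "body_emb") for i in range(8)]
df = pd.DataFrame(np.arange(48.0).reshape(3, 16), columns=cols)

# Keep only dimensions 0-3 of every block with a single regex selection.
truncated = df.filter(regex=r"_emb_[0-3]$")
```

`DataFrame.filter(regex=...)` matches against the column labels, so one pattern shortens every embedding block at once.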
Would this use-case maybe be better served by having an estimator that you can add to the pipeline as a follow up? Something that can limit the number of columns going out? Also, could you elaborate on the use-case here where you might have multiple features that you want to embed? |
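For comparison, a follow-up estimator along those lines could look roughly like this. `KeepFirstDims` is a hypothetical name, not an existing embetter or scikit-learn class, and the sketch assumes a single embedding block per input:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class KeepFirstDims(BaseEstimator, TransformerMixin):
    """Hypothetical pipeline step: keep only the first `n_dims` columns."""

    def __init__(self, n_dims=64):
        self.n_dims = n_dims

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, : self.n_dims]


# Stand-in for the output of an embedder: 3 rows, 8 dimensions.
embeddings = np.arange(24.0).reshape(3, 8)
shortened = KeepFirstDims(n_dims=4).fit_transform(embeddings)
```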
I am open to it, sure, go for it :) |
This will allow the parent _SetOutputMixin of TransformerMixin to enable set_output(transform={"pandas", "polars"}).