Description
Oversampling modules sometimes return a truncated array in the multi-class case. Apologies if this is a user error. The example below feeds in a multi-label matrix; I'm unsure whether this has implications for the algorithm (if so, feel free to correct my understanding! :)).
Steps/Code to Reproduce
import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

bl = BorderlineSMOTE(random_state=0, n_jobs=8, k_neighbors=1)
# 1000 samples with 5 integer features each
x = np.random.randint(5, size=5000).reshape(1000, 5)
# 1000 x 10 binary matrix used as the target (multi-label)
y = np.random.randint(2, size=10000).reshape(1000, 10)
bl_x, bl_y = bl.fit_resample(x, y)
bl_y.shape
Expected Results
An array with the same number of columns as the input:
(1000, 10)
Actual Results
One of the columns of y is randomly truncated during calls to fit_resample and fit_sample. I re-ran the cell in my notebook repeatedly to look for a pattern; there is none. The truncated result appears in roughly 1 in 4 runs, even after fixing the random state when creating the instance.
(1000, 9)
Versions
Linux-4.4.0-134-generic-x86_64-with-debian-stretch-sid
Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
Imbalanced-Learn 0.4.1
Activity
glemaitre commented on Oct 17, 2018
Yep, I can reproduce it. This is pretty bad. Let's see where it comes from.
glemaitre commented on Oct 17, 2018
Uhm, actually your y is not something that we are supporting. We are supporting three cases:
The case that you are giving is actually a multilabel case, which is not supported. I have to check if we can raise an error.
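To see how a target like the one in the report is classified, one option is scikit-learn's `type_of_target` utility. This is only a sketch: the array names and the dictionary below are illustrative, and it does not show how imbalanced-learn validates `y` internally.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.RandomState(0)

# Illustrative targets; only the first mirrors the shape used in the report.
targets = {
    "report_y": rng.randint(2, size=(1000, 10)),  # 2-D binary matrix
    "multiclass_1d": rng.randint(3, size=1000),   # 1-D labels, 3 classes
    "binary_1d": rng.randint(2, size=1000),       # 1-D labels, 2 classes
}

# Map each illustrative target to the type string scikit-learn assigns it.
kinds = {name: type_of_target(arr) for name, arr in targets.items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```

A 2-D binary matrix such as `report_y` is reported as `multilabel-indicator`, which matches the multilabel case described above.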
jsl303 commented on Apr 16, 2019
Sorry for opening again... Somewhat related...
Isn't multi label support implemented from here? #340
Sometimes all the terms can be confusing: multi-class, multi-label, multi-output... On top of that, one-hot encoding, multi-label binarizing, and so on...
glemaitre commented on Apr 16, 2019
Only when it corresponds to a one-hot encoding of a multiclass problem (a single 1 per row where the row corresponds to the class). Otherwise, there is no literature to do it in a multi-label setting.
Regarding the definition of those terms, you can refer to scikit-learn directly: https://scikit-learn.org/stable/modules/multiclass.html
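The distinction drawn above (a single 1 per row) can be checked before resampling. The following is a minimal sketch using NumPy only; `is_one_hot` and `one_hot_to_labels` are hypothetical helper names, not part of imbalanced-learn or scikit-learn.

```python
import numpy as np


def is_one_hot(y):
    """Return True if y is a 2-D 0/1 matrix with exactly one 1 per row."""
    y = np.asarray(y)
    if y.ndim != 2:
        return False
    only_zeros_and_ones = np.isin(y, (0, 1)).all()
    single_one_per_row = (y.sum(axis=1) == 1).all()
    return bool(only_zeros_and_ones and single_one_per_row)


def one_hot_to_labels(y):
    """Collapse a one-hot matrix to 1-D class indices (hypothetical helper)."""
    if not is_one_hot(y):
        raise ValueError("y is not a one-hot encoding of a multiclass target")
    return np.argmax(np.asarray(y), axis=1)


# One-hot encoding of a 3-class problem: exactly one 1 in every row.
labels = np.array([0, 2, 1, 2])
one_hot = np.eye(3, dtype=int)[labels]

# Multi-label matrix: the first row has two 1s, so it is not one-hot.
multilabel = np.array([[1, 1, 0], [0, 0, 1]])

print(is_one_hot(one_hot), is_one_hot(multilabel))
print(one_hot_to_labels(one_hot))
```

If the check fails, the target is a true multi-label matrix, which is the unsupported case discussed in this thread.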