Oversampling modules return a truncated array in the multi-class instance #489

Closed
@samhardyhey

Description

Oversampling modules sometimes return a truncated array in the multi-class case. Apologies if this is user error. The example below feeds in a multi-label matrix; I'm unsure whether this has implications for the algorithm (if so, feel free to correct my understanding! :)).

Steps/Code to Reproduce

import numpy as np
from imblearn.over_sampling import BorderlineSMOTE

bl = BorderlineSMOTE(random_state=0, n_jobs=8, k_neighbors=1)

# 1000 samples with 5 integer features
x = np.random.randint(5, size=5000).reshape(1000, 5)
# 1000 x 10 binary indicator matrix (a multi-label target)
y = np.random.randint(2, size=10000).reshape(1000, 10)

bl_x, bl_y = bl.fit_resample(x, y)
bl_y.shape

Expected Results

Some array which features the same number of columns as the input.

(1000, 10)

Actual Results

One of the columns is randomly dropped during calls to fit_resample and fit_sample. I re-ran the cell in my notebook repeatedly to discern a pattern; there is none. The truncated result appears in roughly 1 in 4 runs, even after fixing random_state when creating the instance.

(1000, 9)

Versions

Linux-4.4.0-134-generic-x86_64-with-debian-stretch-sid
Python 3.6.6 |Anaconda, Inc.| (default, Oct 9 2018, 12:34:16)
[GCC 7.3.0]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
Imbalanced-Learn 0.4.1

Activity

glemaitre commented on Oct 17, 2018

@glemaitre
Member

Yep, I can reproduce it. This is pretty bad. Let's see where it comes from.

Label added on Oct 17, 2018: Type: Bug (indicates an unexpected problem or unintended behavior)

glemaitre commented on Oct 17, 2018

@glemaitre
Member

Uhm, actually your y is not something that we support. We support three cases:

  • binary
  • multiclass (1D array with multiple values)
  • one-hot-encoded multiclass (2D array with a single 1 per row)

The case that you are giving is actually a multilabel case, which is not supported. I have to check whether we can raise an error.
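
To make the distinction between the last supported case and the unsupported multilabel case concrete, here is a minimal sketch using NumPy only. The helper name is_one_hot is hypothetical and is not part of imbalanced-learn; it only checks the "single 1 per row" condition described above.

```python
import numpy as np


def is_one_hot(y):
    """Return True if y is a 2D 0/1 array with exactly one 1 per row.

    Hypothetical helper for illustration; not an imbalanced-learn function.
    """
    y = np.asarray(y)
    if y.ndim != 2:
        return False
    only_zeros_and_ones = np.isin(y, (0, 1)).all()
    one_per_row = (y.sum(axis=1) == 1).all()
    return bool(only_zeros_and_ones and one_per_row)


rng = np.random.RandomState(0)

# Multilabel indicator matrix like the one in the report:
# each row may hold any number of 1s.
y_multilabel = rng.randint(2, size=10000).reshape(1000, 10)

# One-hot encoding of a 10-class problem: exactly one 1 per row.
y_one_hot = np.eye(10, dtype=int)[rng.randint(10, size=1000)]

print(is_one_hot(y_multilabel))  # False
print(is_one_hot(y_one_hot))     # True
```

A check like this could be run on y before calling fit_resample to see which of the cases above a 2D target falls into.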

jsl303 commented on Apr 16, 2019

@jsl303

Sorry for opening again... Somewhat related...
Isn't multi label support implemented from here? #340
Sometimes all the terms can be confusing: multi-class, multi-label, multi-output... On top of that, one-hot encoding, multi-label binarizing, and so on...

glemaitre commented on Apr 16, 2019

@glemaitre
Member

Isn't multi label support implemented from here?

Only when it corresponds to a one-hot encoding of a multiclass problem (a single 1 per row, where the column holding the 1 indicates the class). Otherwise, there is no literature on how to do it in a multi-label setting.
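
As a small illustration of why a one-hot target is equivalent to a multiclass one while a multilabel target is not, the sketch below converts between the two forms with plain NumPy. The arrays are made-up examples, and this is not a description of what imbalanced-learn does internally.

```python
import numpy as np

# One-hot encoding of a 3-class problem: column index == class label.
y_one_hot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# Collapse to the 1D multiclass form.
y_1d = y_one_hot.argmax(axis=1)
print(y_1d)  # [0 2 1 2]

# Rebuild the one-hot matrix from the 1D labels; nothing is lost.
n_classes = y_one_hot.shape[1]
y_back = np.eye(n_classes, dtype=int)[y_1d]
print(np.array_equal(y_back, y_one_hot))  # True

# A multilabel row has no such 1D equivalent:
# argmax keeps only the position of the first 1.
y_multilabel_row = np.array([[1, 1, 0]])
print(y_multilabel_row.argmax(axis=1))  # [0]
```

The round trip is lossless only for one-hot rows; any row with zero or several 1s cannot be recovered from a single class label.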

Regarding the definition of those terms, you can refer to scikit-learn directly: https://scikit-learn.org/stable/modules/multiclass.html

Oversampling modules return a truncated array in the multi-class instance · Issue #489 · scikit-learn-contrib/imbalanced-learn