
Development for multioutput regression #803


Merged: 100 commits into automl:development on Jul 3, 2020

Conversation

charlesfu4 (Contributor)

Multioutput regression based on #292, revised according to what mfeurer mentioned. It looks like some parts are still missing; I will try to figure them out.

@charlesfu4 (Contributor Author)

Just figured out there are more parts that need to be revised:

  • metrics.py
  • smbo.py
  • and so on

@charlesfu4 (Contributor Author) commented May 1, 2020

Long time no see. I have revised the parts of the pipeline that caused crashes last time. It now works for me on a multioutput regression problem with 96 outputs.

  • Note that ensemble_memory_limit should be increased.
    For example, when fitting train_X, train_y of shapes (35605, 112) and (35605, 96), ensemble_nbest could only be set to 12 with an ensemble memory limit of 2048 MB (see the sketch after the warning messages below).

  • There are minor issues during the fitting process.
    For example, it sometimes got stuck at the end when SMAC provided a challenger that was no better than the previous one. Not sure what caused this.

The warning messages are shown below:
[WARNING] [2020-05-02 00:38:08,675:smac.intensification.intensification.Intensifier] Challenger was the same as the current incumbent; Skipping challenger
[WARNING] [2020-05-02 00:38:08,675:smac.intensification.intensification.Intensifier] Challenger was the same as the current incumbent; Skipping challenger
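
A minimal sketch of the setup above (synthetic data stands in for the real (35605, 112)/(35605, 96) problem; the time budget is arbitrary, and the other parameter values mirror the numbers quoted above):

```python
# Hedged sketch: parameter names follow the auto-sklearn 0.7.x API.
import numpy as np
import autosklearn.regression

rng = np.random.RandomState(0)
train_X = rng.rand(1000, 112)   # features
train_y = rng.rand(1000, 96)    # 96 regression targets

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,
    ensemble_nbest=12,            # larger values exceeded the memory limit here
    ensemble_memory_limit=2048,   # in MB; increase this for many outputs
)
automl.fit(train_X, train_y)
print(automl.predict(train_X).shape)  # (1000, 96)
```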

The sklearn.multioutput wrapper is not yet supported, but regressors that intrinsically support multioutput will be considered for the ensemble combination.

The Travis CI tester does not seem to support my new MULTIOUTPUT_REGRESSION task label.

@charlesfu4 (Contributor Author)

I found out my revision works on the 0.6.0 version but not the development version.
In the development version, it only outputs the Dummy regressor after the same running time. I think the new budget functionality caused this problem. I will figure out how to fix it.

charlesfu4 and others added 11 commits May 4, 2020 11:52

* First version of 070 release notes
* Missed a bugfix
* Vim added unexpected space -- fix
* …ty01: Clip predict values to [0-1] in classification
* …automl#843): Currently the default value of 'score_func' for SelectPercentileRegression is "f_classif", which is an invalid value, and will surely be rejected and will not work
* More robust tmp file naming
* UUID approach
* Initial Commit
* Make worst result a function
* worst possible result in metric
* Fixing the name of the scorers
* Add exceptions to log file, not just stdout
* Removing dummy pred as trys is not needed
@mfeurer closed this Jun 18, 2020
@mfeurer reopened this Jun 18, 2020
@mfeurer (Contributor) left a comment

This looks great, thanks a lot for all your effort!

> Multioutput regression only picks regressors that natively support multioutput. Not sure whether to add sklearn's MultiOutput wrapper function for the others or not; should try it.

I guess this is fine for the beginning.
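
For context, a minimal illustration of the sklearn wrapper in question: it fits one clone of the base estimator per target, so any single-output regressor gains multioutput support at the cost of n_targets fits.

```python
# sklearn.multioutput.MultiOutputRegressor fits one estimator per target.
from sklearn.datasets import make_regression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR  # SVR has no native multioutput support

X, y = make_regression(n_samples=100, n_features=10, n_targets=3, random_state=0)
wrapped = MultiOutputRegressor(SVR()).fit(X, y)
print(wrapped.predict(X).shape)  # (100, 3)
```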

> Feature preprocessing has not yet implemented the multioutput part, but in my local test it still somehow works. This should be dealt with later on.

I guess it would be good to actually check this now. A simple way to do so is to add a new unit test in test/test_pipeline/test_regression.py following the pattern of test_configurations (or test_multiclass in the respective classification test). That'll randomly sample configurations and fail if they are invalid.
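
A rough sketch of such a test, hedged: the interface is assumed from the existing pipeline tests, and the real test_configurations pattern wraps the fit in extensive error handling.

```python
# Hedged sketch of the suggested test; 20 features / 4 targets as used
# later in this PR. Assumes the 0.7.x SimpleRegressionPipeline interface.
import numpy as np
from autosklearn.pipeline.regression import SimpleRegressionPipeline

def test_multioutput():
    rng = np.random.RandomState(1)
    X, Y = rng.rand(50, 20), rng.rand(50, 4)
    dataset_properties = {'multioutput': True}
    cs = SimpleRegressionPipeline(
        dataset_properties=dataset_properties,
    ).get_hyperparameter_search_space()
    for _ in range(10):                 # randomly sample configurations
        config = cs.sample_configuration()
        pipeline = SimpleRegressionPipeline(
            config=config, dataset_properties=dataset_properties)
        pipeline.fit(X, Y)              # fails here if the config is invalid
        assert pipeline.predict(X).shape == (50, 4)
```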

Also, could it be that you forgot to change the properties of the gradient boosting classifier?

Also, it would probably be good if you test the AutoSklearnRegressor with multilabel output in test/test_automl/test_estimators to ensure that it will continue to work in the future.

BTW: we're currently looking into improving the memory usage of the ensemble part, so it should be able to handle more models with less RAM usage in the future.

@charlesfu4 (Contributor Author)

Thanks for the advice and corrections!

> I guess it would be good to actually check this now. A simple way to do so is to add a new unit test in test/test_pipeline/test_regression.py following the pattern of test_configurations (or test_multiclass in the respective classification test). That'll randomly sample configurations and fail if they are invalid.

Sure, I will do that. Would taking datasets.load_linnerud as the multioutput regression test data be good enough, or would a randomly generated multioutput regression target be better?
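
(For scale: load_linnerud is a real multioutput regression dataset, but a tiny one.)

```python
from sklearn.datasets import load_linnerud

X, y = load_linnerud(return_X_y=True)
print(X.shape, y.shape)  # (20, 3) (20, 3): 20 samples, 3 features, 3 targets
```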

> Also, it would probably be good if you test the AutoSklearnRegressor with multilabel output in test/test_automl/test_estimators to ensure that it will continue to work in the future.

Do you mean multilabel-indicator? From my understanding, shouldn't multioutput regression include only continuous-multioutput, or at most also multiclass-multioutput?
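
(These target types can be distinguished with scikit-learn's own taxonomy, e.g.:)

```python
from sklearn.utils.multiclass import type_of_target

print(type_of_target([[0, 1], [1, 0]]))          # 'multilabel-indicator'
print(type_of_target([[1, 2], [3, 1]]))          # 'multiclass-multioutput'
print(type_of_target([[1.5, 2.0], [0.3, 1.1]]))  # 'continuous-multioutput'
```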

@charlesfu4 (Contributor Author) commented Jun 22, 2020

  • Added the handles_multioutput label:
    To all the models under feature_preprocessing, data_preprocessing, and classification (set to False for all classification models). Whether each feature- or data-preprocessing model supports multioutput was determined from the sklearn documentation. (See the property sketch after this list.)

  • Added a multioutput unit test and fixed bugs:

    • Found a few bugs: pipeline/regression.py includes a duplicate of the function get_available_components, which also appears in pipeline/components/regression/__init__.py.

    • In pipeline/components/regression/__init__.py, the redundant name data_prop was replaced with dataset_properties.

    • Added 'handles_multioutput' to the ThirdPartyComponents should_be_there list in pipeline/components/base.py.

    • Added 'handles_multioutput' to DummyClassifier and DummyPreprocessor in test/test_pipeline/test_classification.py.

    • Added test_multioutput in test/test_pipeline/test_regression.py, with randomly generated data (20 features and 4 targets).

  • Problem with Kernel_PCA:
    I tried to dodge the error several times by changing train_size_maximum, but it's better to solve it on the scikit-learn back end. scikit-learn/scikit-learn#16718 ("BUG Fixes kernel PCA raising 'invalid value encountered in mul…'") solved it, and I think that fix is only available from scikit-learn 0.23 on. test_kernel_pca almost always failed; I think this could be solved after making this version compatible with scikit-learn 0.23.
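
A hedged sketch of the new flag on a component's get_properties (the surrounding keys follow the existing component pattern; the real dictionaries live under autosklearn/pipeline/components/, and the class here is an illustrative stand-in):

```python
from autosklearn.pipeline.constants import DENSE, SPARSE, UNSIGNED_DATA, PREDICTIONS

class SomeRegressor:  # illustrative stand-in for a real component class
    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'SR',
            'name': 'Some Regressor',
            'handles_regression': True,
            'handles_classification': False,
            'handles_multiclass': False,
            'handles_multilabel': False,
            'handles_multioutput': True,  # the new label; set to False for
                                          # all classification components
            'is_deterministic': True,
            'input': (DENSE, SPARSE, UNSIGNED_DATA),
            'output': (PREDICTIONS,),
        }
```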

@charlesfu4 (Contributor Author)

Hi @mfeurer, would you review my code? Thanks!
Also, I found that if I fed the resampling strategy an sklearn BaseCrossValidator such as a KFold or TimeSeriesSplit object, it always returned the Dummy classifier and Dummy regressor. I also tested this on the 0.7.0 master version without my multioutput regression modification, and the results were the same. I think I can try to look into it and fix it; should I open an issue for it? Thank you!
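
A sketch of the failing call pattern described here (hedged: the arguments are illustrative; per the comment, this setup produced only the Dummy models):

```python
# Passing an sklearn cross-validator object as the resampling strategy.
from sklearn.model_selection import KFold
import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=300,
    resampling_strategy=KFold(n_splits=5, shuffle=True, random_state=1),
)
```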

@mfeurer (Contributor) left a comment

Hey @charlesfu4 this looks really great, thanks for your work!

I have two minor questions, and then we're good to merge this, I believe. We'll do a release today (0.7.1) and will then include your changes in the next release (0.7.2).

Regarding your comments:

  • yes, we're aware of the kernelPCA problem and unfortunately it'll only be resolved once we upgrade scikit-learn to 0.23
  • yes, it would be great if you could have a look into why it fails with the KFold resampling strategy. I don't think you need to look into the time series split as that one is not really working well yet anyway.

@charlesfu4 (Contributor Author)

> Hey @charlesfu4 this looks really great, thanks for your work!
>
> I have two minor questions, and then we're good to merge this, I believe. We'll do a release today (0.7.1) and will then include your changes in the next release (0.7.2).
>
> Regarding your comments:
>
>   • yes, we're aware of the kernelPCA problem and unfortunately it'll only be resolved once we upgrade scikit-learn to 0.23
>   • yes, it would be great if you could have a look into why it fails with the KFold resampling strategy. I don't think you need to look into the time series split as that one is not really working well yet anyway.

It's my pleasure to participate in the development; thank you for the code reviews and advice.

@mfeurer merged commit 9a8ba56 into automl:development Jul 3, 2020
franchuterivera added a commit to franchuterivera/auto-sklearn that referenced this pull request Aug 21, 2020
* PEP8 (automl#718)

* multioutput_regression

* multioutput_regression

* multioutput_regression

* multioutput regression

* multioutput regression

* multioutput regression

* multioutput regression

* multioutput regression

* automl#782 showcase pipeline components iteration

* Fixed flake-8 violations

* multi_output regression v1

* fix y_shape in multioutput regression

* fix xy_data_manager change due to merge

* automl.py missing import

* Release note 070 (automl#842)

* First version of 070 release notes

* Missed a bugfix

* Vim added unexpected space -- fix

* prepare new release (automl#846)

* Clip predict values to [0-1] in classification

* Fix for 3.5 python!

* Sensible default value of 'score_func' for SelectPercentileRegression (automl#843)

Currently default value of 'score_func' for SelectPercentileRegression
is "f_classif", which is an invalid value, and will surely be rejected and
will not work

* More robust tmp file naming (automl#854)

* More robust tmp file naming

* UUID approach

* 771 worst possible result (automl#845)

* Initial Commit

* Make worst result a function

* worst possible result in metric

* Fixing the name of the scorers

* Add exceptions to log file, not just stdout (automl#863)

* Add exceptions to log file, not just stdout

* Removing dummy pred as trys is not needed

* Add prediction with models trained with cross-validation (automl#864)

* add the possibility to predict with cross-validation

* fix unit tests

* test new feature, too

* 715 ml memory (automl#865)

* automl#715 Support for no ml memory limit

* API update

* Docs enhancement (automl#862)

* Improved docs

* Fixed example typos

* Beautify examples

* cleanup examples

* fixed rsa equal

* Move to minmax scaler (automl#866)

* Do not read predictions in memory, only after score (automl#870)

* Do not read predictions in memory, only after score

* Precission support for string/int

* Removal of competition manager (automl#869)

* Removal of competition manager

* Removed additional unused methods/files and moved metrics to estimator

* Fix meta data generation

* Make sure pytest is older newer than 4.6

* Unit tst fixing

* flake8 fixes in examples

* Fix metadata gen metrics

* Fix dataprocessing get params (automl#877)

* Fix dataprocessing get params

* Add clone-test to regression pipeline

* Allow 1-D threshold binary predictions (automl#879)

* fix single output regression not working

* regression need no _enusre_prediction_array_size_prediction_array_sizess


* multioutput after rebased to 0.7.0

* Regressor target y shape index out of range

* Revision for make tester

* Revision: Cancel Multiclass-MultiOuput

* Resolve automl.py metrics(__init__) reg_gb reg_svm

* Fix Flake8 errors

* Fix automl.py flake8

* Preprocess w/ mulitout reg,automl self._n_outputs

* test_estimator.py changed back

* cancel multioutput multiclass for multi reg

* Fix automl self._n_output update placement

* fix flake8

* Kernel pca cancelled mulitout reg

* Kernel PCA test skip python <3.8

* Add test unit for multioutput reg and fix.

* Fix flake8 error

* Kernel PCA multioutput regression

* default kernel to cosine, dodge sklearn=0.22 error

* Kernel PCA should be updated to 0.23

* Kernel PCA uses rbf kernel

* Kernel Pca

* Modify labels in reg, class, perpro in examples

* Kernel PCA

* Add missing supports to mincoal and truncateSVD

Co-authored-by: Matthias Feurer <[email protected]>
Co-authored-by: chico <[email protected]>
Co-authored-by: Francisco Rivera Valverde <[email protected]>
Co-authored-by: Xiaodong DENG <[email protected]>