Large Datasets in terms of Number of Attributes #307
Comments
Hi @FiratIsmailoglu, thanks for using I am not familiar with gene microarray classification, so I couldn't say which algorithm exactly would be best, but indeed if your problem is classification you should use the supervised ones as you said http://contrib.scikit-learn.org/metric-learn/metric_learn.html#supervised-learning-algorithms, maybe you could try them all (except Regarding the preprocessing, you could try to simply center and normalize your data, and I would advise trying not to do a PCA pre-processing: indeed (It could happen however that doing a PCA preprocessing gives better results on the test set, because it could prevent overfitting in its own way, but I would still advise to do no PCA pre-processing and to regularize the problem by reducing the number of components of the metric-learning algorithm if needed. |
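The preprocessing advice above (center and normalize, skip PCA, and regularize through the number of components instead) could be sketched roughly as follows. This is only an illustration under assumptions: the data is synthetic, and scikit-learn's `NeighborhoodComponentsAnalysis` is used as a stand-in for a metric-learn estimator, on the assumption that the metric-learn supervised learners expose a comparable `n_components` argument.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a microarray-like problem: few samples, many features.
X, y = make_classification(
    n_samples=120, n_features=2000, n_informative=20, n_classes=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = make_pipeline(
    # Center and scale each feature; note that there is no PCA step.
    StandardScaler(),
    # Regularize by learning a low-rank (10 x 2000) linear map.
    NeighborhoodComponentsAnalysis(n_components=10, max_iter=50, random_state=0),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
learned_map = model.named_steps["neighborhoodcomponentsanalysis"].components_
print(f"test accuracy: {accuracy:.3f}, learned map shape: {learned_map.shape}")
```

The value `n_components=10` is arbitrary here; in practice it would be chosen by cross-validation.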
Hi @FiratIsmailoglu, to complement @wdevazelhes's reply, you could try SCML (more precisely in your case, its supervised version It allows you to learn a metric as a weighted combination of simple rank-one matrices, which can either be constructed automatically from data ( With other supervised metric learning algorithms, as advised by @wdevazelhes, you should not use PCA pre-processing but rather set Hope this helps |
@wdevazelhes Hi William, thank you so much for your support. After your post, I immediately gave up using PCA and tried reducing n_components instead. Sadly, I could not reduce the running time significantly for LMNN, NCA and NFDA in the supervised setting. Specifically, I reduced n_components from around 10K to 100, but did not observe a remarkable difference in running time. But thank you anyway :) @bellet Aurelien, thank you for your suggestion and for your time. Honestly, in my first implementations I only tried the metric learning methods in the given example (plot_metric_learning_examples.py), so I had no idea about SCML. However, after your post I tried SCML and had a look at your paper. It really works for high-dimensional datasets, and it is really fast. My observation is that examples from the same class are close to one another, while those from different classes are far apart in the transformed space created by SCML; so I was expecting to see some improvements for distance-based classifiers such as k-NN and LVQ in the transformed space, but interestingly did not witness such improvements. Maybe I need to play with the number of bases. Thank you once again. |
@FiratIsmailoglu I'm sorry that reducing the number of components didn't help, but I'm glad that SCML worked when observing the new distances! Yes, playing around with the different parameters may help, including the parameters of k-NN and LVQ: metric-learning algorithms sometimes work better for certain values of the number of neighbors in k-NN (though I don't know what the best way would be to improve the performance of a downstream classifier with SCML; @bellet, what do you think?). For tweaking the parameters, you may have already used this, but metric-learn respects the scikit-learn API so you can make your classification predictor (e.g. SCML_Supervised + KNN) as a |
Yes, tuning some of the hyperparameters of |
Thanks all for the comments and the related discussion. I am trying to better understand the basis and n_basis arguments and their role inside the SCML_Supervised algorithm. Are they a way of fixing the points that we use for computing the differences? Could you please provide some more specific information, or a relevant example on a toy dataset (e.g., the wine dataset)? Furthermore, it is somewhat strange that the documentation of the SCML_Supervised algorithm contains an example of the weakly supervised SCML variant; you may want to fix it. Thanks a lot again. |
Thanks @terry07 for your interest in In the supervised setting, in lack of specific knowledge about your task that could suggest the use of specific bases, we recommend using discriminative bases generated by LDA (default behavior with Thanks for pointing out the fact that we do not have a proper example with
|
Hi, thank you very much for making such a great library accessible. I'd like to get your advice regarding large datasets. I have some supervised gene microarray datasets with around 10K features, and my goal is classification. Which supervised metric learning algorithm would you recommend in this case, and what kind of preprocessing should I do before applying the metric learning library? Many thanks.
Note: Having applied PCA, the algorithms perform quite badly.