The documentation is available at https://phc-data-pipeline.readthedocs.io/en/latest/index.html
This project aims to develop a pipeline that systematically assesses the performance of different ML methods on high-dimensional datasets drawn from different domains.
The use case explored in this repository is that of large pharmacogenomic studies, specifically, GDSC, CTRP and CCLE.
First, we design a pipeline, as shown below.
The pipeline is implemented in the following files:

- `methods.py`: implements the normalization, preprocessing, feature selection, domain adaptation and drug resistance prediction methods. These methods work with sklearn methods and can be reused for training on any type of tabular data.
- `classes.py`: implements the `tuning` and `Drug` classes.
  - `tuning`: defines a randomized hyperparameter search for a given drug resistance prediction method.
  - `Drug`: the central class of the pipeline. It uses the methods defined in `methods.py` and stores the results and data used for modeling a specific drug.
- `runs.py`: instantiates a `Drug` and runs it through each step of the pipeline.
- `train.py`: serves as the entry point to the pipeline. It lets you configure a run and store the results, and it defines the hyperparameter search spaces for the different drug resistance prediction methods.
- `config.py`: must be created by the user and contain `dir = '<PATH-TO-PROCESSED-DATA>/data/Processed/'` and `guild = '<PATH-TO-GUILD>/venv/.guild/'`.
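
The combination of sklearn-compatible steps and a randomized hyperparameter search described above can be sketched as follows. This is a minimal illustration and not the repository's `tuning` class: the synthetic data, the step names (`scale`, `select`, `model`), the choice of Elastic Net and the search ranges are all assumptions made for the example.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a (samples x features) expression matrix and a
# drug response vector; real pharmacogenomic data would be loaded instead.
X, y = make_regression(
    n_samples=120, n_features=200, n_informative=10, noise=5.0, random_state=0
)

# Normalization -> feature selection -> prediction, chained as sklearn steps.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", ElasticNet(max_iter=10_000)),
])

# Hypothetical search space: each entry is a distribution to sample from.
param_distributions = {
    "select__k": randint(5, 51),             # integers from 5 to 50
    "model__alpha": loguniform(1e-3, 1e1),   # log-uniform on [0.001, 10]
    "model__l1_ratio": uniform(0.1, 0.9),    # uniform on [0.1, 1.0]
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled configurations
    cv=3,             # 3-fold cross-validation
    scoring="r2",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
best_r2 = search.best_score_
print(best_params, round(best_r2, 3))
```

Sampling a fixed number of configurations keeps the cost of the search independent of how many hyperparameters are tuned, which is why randomized search is often preferred to an exhaustive grid on high-dimensional data.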
Examples of how to use the pipeline can be found in the following Jupyter notebooks:

- `methods-example` shows an example of the use of the `methods.py` methods. It trains a model on pharmacogenomic data and displays the results. It is useful for understanding the input given to each method and the output it returns. Adding `%%time` at the beginning of a cell also makes it possible to analyze the time performance of each method.
- `drug-example` shows an example of the use of the `Drug` class from `classes.py`. Similar to `methods-example`, it runs through each step of the pipeline. It can also be used to test performance improvements or the successful implementation of new methods for the `Drug` class.
- `run-example` provides an example of the use of the `run` method.
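
Outside a notebook, where the `%%time` cell magic is not available, a similar per-step measurement can be approximated with the standard library. The sketch below is an assumption-laden illustration: the `timed` helper is hypothetical and the sorted list merely stands in for a pipeline step.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

# Placeholder workload; a call to one of the pipeline methods would go here.
with timed("placeholder-step", timings):
    sorted(range(100_000), reverse=True)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```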
Two notebooks explore the given pharmacogenomic data:

- `data-cleaning` cleans the names of the cancer cell lines (CCLs) from GDSC, CCLE and CTRP, ensuring that the same name format is used for all CCLs. It also brings together the CCLs found in Pozdeyev's drug resistance data for CTRP, GDSC and CCLE with those found in CellMinerCDB's gene expression data.
- `data-exploration` explores missing CCLs and the intersection between the given datasets.
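
To illustrate why a shared name format matters before intersecting datasets, here is a minimal sketch. The normalization rule (upper-casing and dropping every non-alphanumeric character) is an assumption and not necessarily the rule used in `data-cleaning`, and the three name lists are invented for the example rather than taken from the actual datasets.

```python
import re

def normalize_ccl_name(name: str) -> str:
    """Upper-case a cell line name and drop every non-alphanumeric character."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())

def shared_cell_lines(*datasets):
    """Return the normalized names that occur in every dataset."""
    normalized = [{normalize_ccl_name(n) for n in names} for names in datasets]
    return set.intersection(*normalized)

# Illustrative name lists only; they do not reflect real dataset contents.
gdsc_names = ["MCF-7", "HT-29", "A549"]
ccle_names = ["MCF7", "HT29", "NCI-H460"]
ctrp_names = ["mcf7", "ht 29", "a549"]

raw_overlap = set(gdsc_names) & set(ccle_names) & set(ctrp_names)
clean_overlap = shared_cell_lines(gdsc_names, ccle_names, ctrp_names)

print(sorted(raw_overlap))    # no exact string matches across the three lists
print(sorted(clean_overlap))  # names that match once formatting is unified
```

A rule this aggressive may conflate distinct cell lines whose names differ only in punctuation, so in practice the matches are worth checking by hand.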
The remaining notebooks are used to analyze our results:

- `results-append` shows how to add the results of the individual models to the run results data as given by Guild.
- `analysis-ic-quality` analyzes the impact of EC/IC quality on the $r^2$ scores of the models. A threshold is set below which models are disregarded due to the low quality of the data.
- `results-anova` provides an ANOVA of the results based on the different elements of the configuration. It also contains a statistical analysis of the importance of the different factors.
- `results-hyperparameters` provides an initial exploratory analysis of the impact of the hyperparameters of the three best-performing models (Random Forests, K Nearest Neighbours and Elastic Net) on the results.
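
The two analysis steps described above, discarding models below a quality threshold and then testing whether a configuration factor affects the $r^2$ scores, can be sketched in plain Python. Everything specific here is an assumption: the record fields (`model`, `r2`, `quality`), the threshold value and the numbers themselves are made up for illustration and are not results from this project.

```python
from collections import defaultdict
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    k, n = len(groups), len(values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each score around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up run records; "quality" is a hypothetical data quality score.
runs = [
    {"model": "elastic_net", "r2": 0.1, "quality": 0.9},
    {"model": "elastic_net", "r2": 0.2, "quality": 0.8},
    {"model": "elastic_net", "r2": 0.3, "quality": 0.7},
    {"model": "knn", "r2": 0.2, "quality": 0.9},
    {"model": "knn", "r2": 0.3, "quality": 0.8},
    {"model": "knn", "r2": 0.4, "quality": 0.6},
    {"model": "random_forest", "r2": 0.5, "quality": 0.9},
    {"model": "random_forest", "r2": 0.6, "quality": 0.7},
    {"model": "random_forest", "r2": 0.7, "quality": 0.8},
    {"model": "random_forest", "r2": -0.4, "quality": 0.2},  # low quality
]

QUALITY_THRESHOLD = 0.5  # hypothetical cut-off

# Step 1: disregard models trained on low-quality data.
kept = [r for r in runs if r["quality"] >= QUALITY_THRESHOLD]

# Step 2: group the remaining scores by one factor and compute F.
scores_by_model = defaultdict(list)
for record in kept:
    scores_by_model[record["model"]].append(record["r2"])

f_statistic = one_way_anova_f(list(scores_by_model.values()))
print(f"kept {len(kept)} of {len(runs)} runs, F = {f_statistic:.2f}")
```

A larger F means the factor explains more of the score variation relative to the noise within each group. If SciPy is available, `scipy.stats.f_oneway` computes the same statistic together with a p-value.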
