YMreyoud edited this page Apr 22, 2022 · 5 revisions

MegaD: Python-based machine learning software for metagenomic analysis that uses deep neural networks to identify and predict disease samples accurately.

Machine learning has been used in many applications, from biomedical imaging to business analytics. It is expected to become a powerful method for diagnostics, and even for selecting therapeutics, as medicine moves toward precision medicine. MegaD makes it possible to develop neural networks from publicly available metagenomic data and to classify data samples using the optimal model we developed.

The description below walks you through the analysis of the ___ project (https://pubs.broadinstitute.org/diabimmune)

The general workflow is described below.

Pre-requisites:

Installing MegaD:

Download the python file from...

Getting Started

The following packages are required to run MegaD:

  • NumPy
  • TensorFlow (>=1.15, <2.0)
  • pandas
  • scikit-learn (sklearn)

These packages can be installed by running the following commands in a console:

pip install numpy
pip install "tensorflow>=1.15,<2.0"
pip install pandas
pip install scikit-learn

Once all packages have been installed, import the MegaD Python file into your Python session as follows:

import MegaD

Data Input

MegaD accepts both OTU tables and BIOM files produced by popular metagenomic profiling tools such as Kraken2 and QIIME. MegaD also provides a set of pre-processed datasets for use in training.

With Kraken2 installed, follow the instructions at https://github.com/DerrickWood/kraken2/wiki/Manual to generate a taxonomic profile of your 16S or WGS data.
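To inspect a Kraken2 profile before passing it on, the standard six-column Kraken2 report (percentage, clade reads, directly assigned reads, rank code, NCBI taxonomy ID, indented name) can be read with a few lines of Python. This is a minimal sketch, not part of MegaD: the function name `parse_kraken2_report` and the example rows are made up for illustration, and it assumes the report was written without extra minimizer columns.

```python
import csv
import io


def parse_kraken2_report(handle):
    """Parse a standard six-column Kraken2 report into a list of dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 6:
            continue  # skip blank lines or reports with a different layout
        percent, clade_reads, taxon_reads, rank, taxid, name = fields
        rows.append({
            "percent": float(percent),
            "clade_reads": int(clade_reads),
            "taxon_reads": int(taxon_reads),
            "rank": rank.strip(),
            "taxid": int(taxid),
            "name": name.strip(),  # names are indented by depth in the report
        })
    return rows


# Tiny made-up report, for illustration only.
example = io.StringIO(
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t5\tR\t1\troot\n"
    " 60.00\t600\t20\tG\t816\t        Bacteroides\n"
    " 40.00\t400\t400\tS\t817\t          Bacteroides fragilis\n"
)
profile = parse_kraken2_report(example)
genus_rows = [row for row in profile if row["rank"] == "G"]
print(genus_rows[0]["name"], genus_rows[0]["clade_reads"])
```

For a real file, replace the `io.StringIO` object with `open("sample.kreport", newline="")`, where the file name is whatever you passed to Kraken2's `--report` option.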

Training the Model

To train the model, run the following command:

model = Train_model(GridSearch = True, Threshold = 0.05, Normalize = True, level = 'All')

The parameters are:

  • GridSearch: when True, uses randomized grid search for hyperparameter optimization of the model.
  • Threshold: prunes abundances that fall below the given value, which tends to increase the accuracy of the model.
  • Normalize: when True, normalizes the data.
  • level: determines which taxonomic levels to use for classification. Options are 'All', 'Species', and 'Genus'.

After entering the command, you will be prompted to enter the file names of your training data and metadata file.
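The idea behind randomized grid search is to score only a random subset of the hyperparameter combinations instead of all of them. The sketch below illustrates that idea with the standard library only; it is not MegaD's implementation. The hyperparameter names, the grid values, and the `toy_score` function (a stand-in for a validation accuracy) are all hypothetical.

```python
import itertools
import random


def randomized_search(param_grid, evaluate, n_iter=5, seed=0):
    """Score a random subset of the grid and return the best (params, score)."""
    keys = sorted(param_grid)
    # Enumerate every combination, then sample a subset without replacement.
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(param_grid[k] for k in keys))]
    rng = random.Random(seed)
    sampled = rng.sample(combos, min(n_iter, len(combos)))
    scored = [(params, evaluate(params)) for params in sampled]
    return max(scored, key=lambda item: item[1])


# Hypothetical hyperparameter grid, for illustration only.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_units": [32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}


def toy_score(params):
    # Stand-in for a validation score; highest at lr=0.01, 64 units, dropout=0.25.
    return -(abs(params["learning_rate"] - 0.01)
             + abs(params["hidden_units"] - 64) / 100
             + abs(params["dropout"] - 0.25))


best_params, best_score = randomized_search(grid, toy_score, n_iter=8, seed=42)
print(best_params, best_score)
```

In practice `evaluate` would train a model with the given settings and return a cross-validated score. scikit-learn's `RandomizedSearchCV` offers the same strategy with cross-validation built in.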

Criteria for feature selection

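The 'Genus' and 'Species' options use the genus-level and species-level taxa in the dataset as features, respectively. The 'All' option uses every level; for entries that are unclassified at a lower level, it traces back to the higher-order taxon level at which they are classified.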
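One way to picture level-based feature selection is as a filter on lineage strings. The sketch below assumes features are named with "|"-separated lineages and rank prefixes such as `g__` and `s__`; that naming scheme, and the helper names `deepest_taxon` and `select_level`, are assumptions made for illustration and may not match MegaD's internal representation.

```python
RANK_PREFIX = {"Genus": "g__", "Species": "s__"}


def deepest_taxon(lineage):
    """Return the last classified taxon in a '|'-separated lineage string."""
    # An entry such as 's__' with no name after the prefix counts as unclassified.
    taxa = [t for t in lineage.split("|") if t and not t.endswith("__")]
    return taxa[-1] if taxa else ""


def select_level(lineages, level="All"):
    """Keep lineages whose deepest classified taxon is at the requested level."""
    if level == "All":
        return list(lineages)
    prefix = RANK_PREFIX[level]
    return [lin for lin in lineages if deepest_taxon(lin).startswith(prefix)]


# Made-up lineages, for illustration only.
lineages = [
    "k__Bacteria|p__Bacteroidetes",
    "k__Bacteria|p__Bacteroidetes|g__Bacteroides",
    "k__Bacteria|p__Bacteroidetes|g__Bacteroides|s__Bacteroides_fragilis",
    "k__Bacteria|p__Firmicutes|g__Blautia|s__",  # species left unclassified
]
print(select_level(lineages, "Genus"))
```

Here the last lineage is treated as a genus-level feature because its species entry is empty, which mirrors the idea of tracing back to the nearest classified level.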

Threshold

This parameter takes a floating-point number. Profiles whose abundances fall below the threshold value are removed. The default value is 0.
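As an illustration of abundance thresholding, the sketch below drops every taxon whose abundance stays below the threshold in all samples. Whether MegaD applies the cut per taxon, per sample, or per entry is not stated here, so the rule used in this example is an assumption, as are the function name `filter_by_threshold` and the sample values.

```python
import numpy as np


def filter_by_threshold(abundance, taxa, threshold=0.0):
    """Drop taxa (columns) whose abundance is below `threshold` in every sample."""
    abundance = np.asarray(abundance, dtype=float)
    # A taxon is kept if at least one sample reaches the threshold.
    keep = (abundance >= threshold).any(axis=0)
    kept_taxa = [name for name, flag in zip(taxa, keep) if flag]
    return abundance[:, keep], kept_taxa


# Rows are samples, columns are taxa; made-up relative abundances.
abundance = [
    [0.70, 0.28, 0.02],
    [0.55, 0.44, 0.01],
]
taxa = ["Bacteroides", "Blautia", "Rare_taxon"]

filtered, kept = filter_by_threshold(abundance, taxa, threshold=0.05)
print(kept, filtered.shape)
```

With the default threshold of 0, every non-negative abundance passes the test, so no taxa are removed.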

Normalization

Normalization is optional. When enabled, the data are normalized using the cumulative sum scaling (CSS) method.
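CSS divides each sample's counts by the sum of that sample's counts up to a chosen quantile, then multiplies by a constant, so that a few highly abundant taxa do not dominate the scaling. The sketch below is a simplified version with a fixed quantile; full CSS implementations choose the quantile from the data, and MegaD's own code may differ. The function name `css_normalize`, the default quantile of 0.5, and the scale of 1000 are assumptions for illustration.

```python
import numpy as np


def css_normalize(counts, quantile=0.5, scale=1000.0):
    """Simplified CSS with a fixed quantile; rows are samples, columns are taxa."""
    counts = np.asarray(counts, dtype=float)
    normalized = np.zeros_like(counts)
    for j, sample in enumerate(counts):
        nonzero = sample[sample > 0]
        if nonzero.size == 0:
            continue  # an empty sample stays all zeros
        q = np.quantile(nonzero, quantile)
        # Scaling factor: sum of the nonzero counts at or below the quantile.
        factor = nonzero[nonzero <= q].sum()
        normalized[j] = sample / factor * scale
    return normalized


# Made-up counts: the first sample has one dominant taxon.
counts = [
    [10, 20, 30, 940],
    [1, 2, 3, 4],
]
print(css_normalize(counts))
```

Because the dominant count of 940 lies above the quantile, it does not enter the scaling factor, and the three low-abundance taxa end up with the same normalized values in both samples.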

Prediction with trained model

To predict an unknown profile using a trained model, run the following command:

Predict(model)

Then enter the path of the dataset for which you wish to predict disease status.

This returns a prediction based on the trained model. MegaD provides a set of pretrained models for quick analysis.