BC_MolSubtyping

This repository contains code for classifying H&E-stained breast histopathology images. It is a modified version of MultiStainDeepLearning by Foersch et al. (https://github.com/AGFoersch/MultiStainDeepLearning), adapted as needed for breast cancer molecular subtyping. The code can also be used for any other image classification task.

Our paper: Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images
Original paper: Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer

The code for optimal thresholding and the XGBoost implementation can be found here: https://github.com/TMasoud

Getting started

Dependencies

This project requires Python 3 (3.6.9) with the following additional packages:

PyTorch (torch==1.9.0, torchvision==0.10.0) with CUDA support
NumPy (1.18.1)
tqdm (4.64.0)
lifelines (0.25.7)
scikit-learn (0.24.2)
matplotlib (3.3.4)
pandas (0.24.2)
TensorBoard (2.1.0)
Pillow (7.1.2)
Captum (0.4.0)

General usage

For setup instructions and general usage of the code, see the original repository: https://github.com/AGFoersch/MultiStainDeepLearning.

To use your own data for the classification task, you will need to create your own .csv files describing the data and .json files describing the configuration as shown below.

Data description (.csv)

Each modality requires at least two .csv description files (one for training, one for validation) structured like so:

Patient_ID	Path	Label	Set
pid0	path/to/pid0_0.jpg	A	TRAIN
pid0	path/to/pid0_1.jpg	A	TRAIN
...	...	...	...
pid999	path/to/pid999_0.jpg	B	TRAIN
pid999	path/to/pid999_1.jpg	B	TRAIN
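A description file with this layout can be generated from a mapping of patients to image paths. The following is a minimal sketch, not part of the repository: the helper names `build_rows` and `write_description` are hypothetical, the label column is called `Label` as in the table, and the file is assumed to be comma-separated.

```python
import csv
import io

# Column order used in the table above. "Label" is only an example name
# for the label column (assumption for this sketch).
FIELDNAMES = ["Patient_ID", "Path", "Label", "Set"]


def build_rows(images_by_patient, labels_by_patient, set_name):
    """Return one row per image, repeating the patient's label on each row.

    images_by_patient: dict mapping Patient_ID -> list of image paths
    labels_by_patient: dict mapping Patient_ID -> label
    set_name: "TRAIN" or "VALID"
    """
    rows = []
    for patient_id, paths in images_by_patient.items():
        for path in paths:
            rows.append({
                "Patient_ID": patient_id,
                "Path": path,
                "Label": labels_by_patient[patient_id],
                "Set": set_name,
            })
    return rows


def write_description(rows, handle):
    """Write the rows as comma-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example data mirroring the table above.
images = {
    "pid0": ["path/to/pid0_0.jpg", "path/to/pid0_1.jpg"],
    "pid999": ["path/to/pid999_0.jpg"],
}
labels = {"pid0": "A", "pid999": "B"}

buffer = io.StringIO()
write_description(build_rows(images, labels, "TRAIN"), buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("TRAIN.csv", "w", newline="")` in place of the `io.StringIO` buffer.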

Some notes on the description files:

All entries using the same Patient_ID must share the same label.
The .csv file for the training data needs each entry to have the value TRAIN in the Set column. The files for validation or testing need to have the value VALID in the Set column.
The Patient_ID, Path and Set column names must exist exactly with this spelling. The Label column, on the other hand, can be named however you like and the labels themselves can have whatever names you like (just be consistent, of course).
If you specify the data_root value in your configuration files, the paths in the Path column will be treated as relative to that.
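The rules above can be checked before training starts. The sketch below is an illustration only and is not part of the repository: `check_description` is a hypothetical helper, the file is assumed to be comma-separated, and joining `data_root` with `os.path.join` is an assumption about how the prefix is applied.

```python
import csv
import io
import os

REQUIRED_COLUMNS = ("Patient_ID", "Path", "Set")


def check_description(handle, label_column, expected_set, data_root=""):
    """Return (problems, resolved_paths) for one description file.

    handle: open text file (or any iterable of lines) with a header row
    label_column: name of the label column, e.g. "Label"
    expected_set: "TRAIN" for training files, "VALID" for validation/testing
    data_root: optional prefix for the Path column (join is an assumption)
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in REQUIRED_COLUMNS + (label_column,)
                if name not in header]
    if problems:
        return problems, []

    label_of = {}        # Patient_ID -> first label seen for that patient
    resolved_paths = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Set"] != expected_set:
            problems.append(
                f"line {line_no}: Set is {row['Set']!r}, expected {expected_set!r}")
        first = label_of.setdefault(row["Patient_ID"], row[label_column])
        if first != row[label_column]:
            problems.append(
                f"line {line_no}: {row['Patient_ID']} has labels "
                f"{first!r} and {row[label_column]!r}")
        resolved_paths.append(os.path.join(data_root, row["Path"]))
    return problems, resolved_paths


# Hypothetical training file with two deliberate mistakes.
sample = io.StringIO(
    "Patient_ID,Path,Label,Set\n"
    "pid0,path/to/pid0_0.jpg,A,TRAIN\n"
    "pid0,path/to/pid0_1.jpg,B,TRAIN\n"   # conflicting label for pid0
    "pid1,path/to/pid1_0.jpg,B,VALID\n"   # wrong Set value for a training file
)
problems, paths = check_description(sample, "Label", "TRAIN", data_root="data/")
```

Here `problems` holds one message for the conflicting label and one for the wrong Set value, and `paths` holds the image paths with the prefix applied.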

Configuration (.json)

The configuration .json files that describe model training and evaluation are structured as follows (minus the comments):

{
    "name": "BC_Res18",                                                            // Experiment name, will be appended to save_dir (see trainer).
    "n_gpu": 1,                                                                                 // Number of GPUs, should stay at 1.

    "arch": {
        "type": "MultiModel",                                                                   // Model architecture class name. Both uni- and multimodal training uses this class.
        "args": {
          "num_classes": 2,                                                                     // Number of classes, should be >= 2.
          "lo_dims": [24],                                                                      // Number of output features for each unimodal model.
                                                                                                // --> Only one entry for unimodal training.
          "lo_pretrained": [                                                                    // Paths to the pretrained unimodal models. Omit this entry for unimodal training.
            "./saved/path/to/checkpoint/for/modality0.pth"
          ],
          "mmhid": 64,                                                                          // Input size for the final classification layer.
          "dropout_rate": 0.33                                                                  // Dropout rate only applies during multimodal training.
        }
    },
    "data_loader": {
        "type": "BasicMixDataLoader",                                                           // Data loader class name.
        "args":{
          "data_root": ".data/",                                                                // Prefix for all paths in the data description .csv files.
          "dataframes": [                                                                       // Paths to .csv files describing the training data, one for each modality in use.
            "./config/csv/TRAIN.csv"
          ],
          "labels": ["Tumor_Label"],                                                            // Name of the label column in your .csv files
          "dataframe_valid": "./config/csv/VALID.csv",                                 // Path to data description used for validation/early stopping.
                                                                                                // These are different files for the unimodal and multimodal cases, see the previous section.
          "valid_columns": ["modality0_Path"],                                                  // Name of the columns containing the paths for each modality.
                                                                                                // In the unimodal case, the list contains only "Path".
          "shuffle": true,                                                                      // data shuffling only applies to training data.
          "num_workers": 8,
          "batch_size": 128                                                                      // One dataset element contains one image from every modality in use
                                                                                                // --> Batches take up more memory during multimodal training.
        }
    },
    "transformations":{
      "type": "MyImageAugmentation",                                                              // See datahandler/transforms/data_transforms.py for possible options.
      "args": {
        "size": 512
      }
    },
    "optimizer": {                                                                              // Accepts any optimizer from torch.optim along with its arguments.
        "type": "Adam",
        "args":{
            "lr": 0.00001
        }
    },
    "loss": {                                                                                   
        "type": "CrossEntropyLoss",
        "args": {
        }
    },

    "metrics": {                                                                                // Metrics to track during training.
        "epoch": [
            {
                "type": "accuracy_epoch",
                "args": {}
            }
        ],
        "running": []
    },
    "trainer": {
        "type": "BasicMultiTrainer",
        "args": {},
        "epochs": 1000,                                                                         // Number of epochs to train for.
        "save_dir": "saved/",                                                                   // Prefix for the save directory (see "name" key)
        "save_period": 50,                                                                      // Save a model checkpoint every x epochs.
        "val_period": 1,                                                                        // Validate model every x epochs, save a checkpoint if best performance yet.
        "verbosity": 2,                                                                         // Between 0 and 2. 0 is least verbose, 2 most.
        "freeze": true,                                                                         // Freeze unimodal models during multimodal training?
        "unfreeze_after": 50,                                                                   // If freeze is true, unfreeze unimodal models after this many epochs.
        "monitor": ["max val_accuracy_epoch", "min val_loss_epoch"],                            // Metrics to monitor for determining best performance.
        "tensorboard": true,                                                                    // Track training with TensorBoard?
        "evaluation": true                                                                      // Evaluate model performance on validation data when finished with training?
    }
}
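Because standard JSON does not allow comments, an annotated configuration like the one above has to be stripped of its `//` comments before it can be loaded. The following sketch shows one way to do that and to check a few of the constraints stated in the comments. It is not part of the repository: `strip_line_comments` and `check_config` are hypothetical helpers, and both the rule that `lo_dims` and `dataframes` have one entry per modality and the reading of `monitor` entries as "mode metric" pairs are assumptions drawn from the example.

```python
import json


def strip_line_comments(text):
    """Remove // comments that are not inside a JSON string."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False            # character after a backslash
            elif ch == "\\" and in_string:
                escaped = True
            elif ch == '"':
                in_string = not in_string
            elif not in_string and line.startswith("//", i):
                cut = i                    # comment starts here
                break
        cleaned.append(line[:cut].rstrip())
    return "\n".join(cleaned)


def check_config(config):
    """Return a list of problems based on the constraints in the comments."""
    problems = []
    arch = config["arch"]["args"]
    loader = config["data_loader"]["args"]
    if config["n_gpu"] != 1:
        problems.append("n_gpu should stay at 1")
    if arch["num_classes"] < 2:
        problems.append("num_classes should be >= 2")
    # Assumption: one lo_dims entry and one training dataframe per modality.
    if len(arch["lo_dims"]) != len(loader["dataframes"]):
        problems.append("lo_dims and dataframes should have one entry per modality")
    return problems


# Shortened, hypothetical excerpt of the configuration above.
annotated = """
{
    "name": "BC_Res18",            // Experiment name
    "n_gpu": 1,                    // Number of GPUs
    "arch": {"type": "MultiModel", "args": {"num_classes": 2, "lo_dims": [24]}},
    "data_loader": {"type": "BasicMixDataLoader",
                    "args": {"data_root": ".data/",   // Prefix for all paths
                             "dataframes": ["./config/csv/TRAIN.csv"]}},
    "trainer": {"monitor": ["max val_accuracy_epoch", "min val_loss_epoch"]}
}
"""
config = json.loads(strip_line_comments(annotated))
problems = check_config(config)
# Assumption: each monitor entry reads as "<mode> <metric name>".
monitors = [tuple(entry.split()) for entry in config["trainer"]["monitor"]]
```

The comment stripper tracks whether it is inside a string, so slashes in values such as paths or URLs are left alone.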

Citation

If you use any part of this code, please cite the original authors' paper:

Foersch, S., Glasner, C., Woerl, AC. et al. Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer. Nat Med (2023). https://doi.org/10.1038/s41591-022-02134-1

License

This project is licensed under the GNU GPLv3 license.

Acknowledgements

This project's structure is based on the PyTorch Template Project by Victor Huang, licensed under the MIT License - see LICENSE-3RD-PARTY for details.
