Code for paper: Multimodal framework for the joint analysis of single-cell RNA and T cell receptor sequencing data predicts T cell response to cancer immunotherapy
The code has been developed and tested on a high-performance computing system running Ubuntu 20.04.4 LTS. The system is equipped with an AMD EPYC 7513 32-Core Processor, 2 TB of RAM, and eight NVIDIA RTX A6000 GPUs (each with 48 GB VRAM), using CUDA version 12.2 and NVIDIA driver version 535.104.12. The demo application of our pipeline can be run on significantly less powerful hardware, requiring:
- Linux-based system
- at least 20 GB of free disk space
- one GPU
- an internet connection
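The requirements above can be checked before installation with a short script. This is a minimal sketch and not part of the repository: the function name check_demo_requirements is hypothetical, the presence of nvidia-smi on the PATH is only a rough proxy for a usable GPU, and the internet connection is not tested.

```python
import platform
import shutil

MIN_FREE_GB = 20  # free disk space listed in the requirements above


def check_demo_requirements(path="."):
    """Return a dict mapping each checked requirement to True/False."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "linux": platform.system() == "Linux",
        "disk_20gb": free_gb >= MIN_FREE_GB,
        # Assumption: nvidia-smi on the PATH suggests an NVIDIA GPU and driver
        # are installed; it does not prove that CUDA works from Python.
        "gpu_tool_found": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_demo_requirements().items():
    print(f"{name}: {'ok' if ok else 'not satisfied'}")
```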
Create a conda environment using environment.yml (all dependencies are included; the whole process takes about 5 minutes):
conda env create -f environment.yml
conda activate trim
Install the current package in editable mode inside the conda environment:
pip install -e .
Before running TRIM, you need to prepare your data in the following format:
All data files should be saved as pickle files in your data directory:
- data_rna.pkl: NumPy array of shape (n_cells, n_genes) containing normalized RNA-seq expression data (see preprocessing example)
- data_labels.pkl: Pandas DataFrame with the following required columns:
  - Tissue: Binary indicator (0=blood, 1=tumor)
  - Treatment Stage: Binary indicator (0=pre-treatment, 1=post-treatment)
  - Patient: Patient ID (integer, 0-indexed)
  - CDR3(Beta1): TCR CDR3 sequence index (integer index into df_all_tcrs)
- data_labels_str.pkl: Pandas DataFrame with string versions of labels (same structure as data_labels.pkl)
- df_all_tcrs.pkl: Pandas DataFrame with all unique TCR sequences as index (here, we use CDR3 amino acid sequences from the beta chain)
  - Each row index should be a CDR3 amino acid sequence string
  - The CDR3(Beta1) column in data_labels.pkl should contain integer indices (0-indexed) that reference rows in this DataFrame
- data_tcr.pkl: NumPy array of shape (n_cells, dim_tcr) with learned numeric TCR sequence embeddings for each cell, in the same order as column CDR3(Beta1) in data_labels.pkl, as produced by running learn_tcr_embedding.py (see below).
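To make the layout concrete, the sketch below writes a tiny synthetic dataset with the file names, shapes, and column names listed above. It is an illustration only: the helper write_toy_dataset, the CDR3 strings, the string label values, and all numbers are made up, and saving with pickle.dump is an assumption about how the files are expected to be written. In the real pipeline, data_tcr.pkl comes from learn_tcr_embedding.py; here it is a random placeholder that only shows the expected shape.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def write_toy_dataset(data_dir, n_cells=6, n_genes=4, dim_tcr=3, seed=0):
    """Write toy versions of the five pickle files and return them as a dict."""
    rng = np.random.default_rng(seed)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    # Unique CDR3 beta-chain sequences as the index (made-up strings).
    df_all_tcrs = pd.DataFrame(index=["CASSLGTDTQYF", "CASSPGQGNEQFF", "CASSQDRGYEQYF"])

    # Integer-coded labels; CDR3(Beta1) holds 0-indexed row positions in df_all_tcrs.
    data_labels = pd.DataFrame({
        "Tissue": rng.integers(0, 2, n_cells),
        "Treatment Stage": rng.integers(0, 2, n_cells),
        "Patient": rng.integers(0, 2, n_cells),
        "CDR3(Beta1)": rng.integers(0, len(df_all_tcrs), n_cells),
    })

    # String versions of the same labels (the exact strings are an assumption).
    data_labels_str = pd.DataFrame({
        "Tissue": data_labels["Tissue"].map({0: "blood", 1: "tumor"}),
        "Treatment Stage": data_labels["Treatment Stage"].map(
            {0: "pre-treatment", 1: "post-treatment"}),
        "Patient": "patient_" + data_labels["Patient"].astype(str),
        "CDR3(Beta1)": df_all_tcrs.index.to_numpy()[data_labels["CDR3(Beta1)"].to_numpy()],
    })

    files = {
        "data_rna.pkl": rng.random((n_cells, n_genes)).astype(np.float32),
        "data_labels.pkl": data_labels,
        "data_labels_str.pkl": data_labels_str,
        "df_all_tcrs.pkl": df_all_tcrs,
        # Placeholder only: real embeddings come from learn_tcr_embedding.py.
        "data_tcr.pkl": rng.random((n_cells, dim_tcr)).astype(np.float32),
    }
    for name, obj in files.items():
        with open(data_dir / name, "wb") as fh:
            pickle.dump(obj, fh)
    return files


# Usage: write the toy files to a temporary directory and list them.
with tempfile.TemporaryDirectory() as tmp:
    write_toy_dataset(tmp)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```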
See ./analysis/HNSCC/data_preprocess/data_processing.py for a complete example of how to:
- Load and normalize RNA-seq data
- Parse TCR sequences from metadata
- Create the required label DataFrames
- Format data for TRIM
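As a rough illustration of two of these steps, the sketch below normalizes a raw count matrix and converts per-cell CDR3 strings into the integer indices and unique-sequence table described earlier. It is not taken from data_processing.py: the function names are hypothetical, and the normalization shown (scaling each cell to 10,000 counts followed by log1p) is a common choice assumed here, which may differ from what the repository's script does.

```python
import numpy as np
import pandas as pd


def normalize_counts(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log1p.

    Assumption: this is a generic normalization, not necessarily the one
    used in the repository's preprocessing script.
    """
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)


def index_tcrs(cdr3_per_cell):
    """Map per-cell CDR3 strings to 0-indexed codes and a unique-sequence table.

    Returns (codes, df_all_tcrs), where codes[i] is the row position of cell
    i's sequence in df_all_tcrs. Cells with a missing sequence get code -1
    and would need to be filtered out beforehand.
    """
    codes, uniques = pd.factorize(pd.Series(cdr3_per_cell))
    df_all_tcrs = pd.DataFrame(index=pd.Index(uniques))
    return codes, df_all_tcrs


# Usage with made-up values.
rna = normalize_counts([[1, 1], [0, 0]])
codes, df_all_tcrs = index_tcrs(["CASSL", "CASSP", "CASSL"])
print(rna.shape, codes.tolist(), list(df_all_tcrs.index))
```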
For users with preprocessed data, here's a minimal example:
# 1. Activate environment
conda activate trim
# 2. Learn TCR embeddings (first update data_path in the script)
python learn_tcr_embedding.py
# 3. Train TRIM model
python trim.py \
--data_parent_folder /path/to/your/data \
--heldout_patient 0 \
--device cuda:0
This section contains code to reproduce the figures from our paper. To run TRIM on a new dataset, please follow the instructions in Data Format Requirements and Quick Start.
Illustrative figures: made using PowerPoint
Code for non-illustrative figures can be found in ./analysis/