A comprehensive machine learning pipeline for predicting toxicity endpoints using the Tox21 dataset.
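This project implements machine learning models (Random Forest, XGBoost, neural networks, and graph neural networks) for predicting the 12 toxicity endpoints of the Tox21 challenge. The Tox21 dataset contains ~10,000 compounds with experimental toxicity data across multiple biological pathways. The endpoints comprise seven nuclear receptor (NR) assays and five stress response (SR) assays: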
- NR-Aromatase - Nuclear Receptor Aromatase
- NR-AR - Nuclear Receptor Androgen Receptor
- NR-AR-LBD - Nuclear Receptor Androgen Receptor Ligand Binding Domain
- NR-ER - Nuclear Receptor Estrogen Receptor
- NR-ER-LBD - Nuclear Receptor Estrogen Receptor Ligand Binding Domain
- NR-PPAR-gamma - Nuclear Receptor Peroxisome Proliferator-Activated Receptor Gamma
- NR-AhR - Nuclear Receptor Aryl Hydrocarbon Receptor
- SR-ARE - Stress Response Antioxidant Response Element
- SR-ATAD5 - Stress Response ATAD5
- SR-HSE - Stress Response Heat Shock Element
- SR-MMP - Stress Response Mitochondrial Membrane Potential
- SR-p53 - Stress Response p53
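The endpoint names encode their assay panel in the prefix, which is handy when reporting results per panel. The snippet below is a minimal sketch of that grouping; `TOX21_ENDPOINTS`, `PANEL_NAMES`, and `group_by_panel` are hypothetical names rather than part of this repository, and it is an assumption that these strings match the label names used in the data file.

```python
# Endpoint names as listed above; the panel is encoded in the prefix.
# Assumption: these strings match the label names in the data file.
TOX21_ENDPOINTS = [
    "NR-Aromatase", "NR-AR", "NR-AR-LBD", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "NR-AhR",
    "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]

PANEL_NAMES = {"NR": "Nuclear Receptor", "SR": "Stress Response"}


def group_by_panel(endpoints):
    """Group endpoint names by their two-letter panel prefix."""
    groups = {}
    for name in endpoints:
        prefix = name.split("-", 1)[0]  # "NR-AR-LBD" -> "NR"
        groups.setdefault(PANEL_NAMES[prefix], []).append(name)
    return groups


panels = group_by_panel(TOX21_ENDPOINTS)
for panel, names in panels.items():
    print(f"{panel} ({len(names)}): {', '.join(names)}")
```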
tox21_models/
├── data/ # Data files
│ └── tox21_10k_data_all.sdf # Original Tox21 dataset
├── src/ # Source code
│ ├── data_processing.py # Data loading and preprocessing
│ ├── feature_engineering.py # Molecular fingerprint generation
│ ├── models.py # ML model implementations
│ ├── evaluation.py # Model evaluation metrics
│ └── visualization.py # Plotting and visualization
├── notebooks/ # Jupyter notebooks
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ ├── 03_model_training.ipynb
│ └── 04_model_evaluation.ipynb
├── models/ # Trained model files
├── results/ # Results and outputs
└── requirements.txt # Python dependencies
- Clone the repository:
git clone <repository-url>
cd tox21_models
- Install dependencies:
pip install -r requirements.txt
- For RDKit installation issues on macOS:
conda install -c conda-forge rdkit

from src.data_processing import Tox21DataLoader
from src.feature_engineering import MolecularFeatureGenerator
from src.models import ToxicityPredictor
# Load data
loader = Tox21DataLoader('data/tox21_10k_data_all.sdf')
data = loader.load_data()
# Generate features
feature_gen = MolecularFeatureGenerator()
features = feature_gen.generate_features(data['smiles'])
# Train model
predictor = ToxicityPredictor()
predictor.train(features, data['targets'])
# Make predictions
new_smiles = ['CCO', 'c1ccccc1']  # example inputs: ethanol and benzene
predictions = predictor.predict(new_smiles)

- Data Exploration: Run notebooks/01_data_exploration.ipynb
- Feature Engineering: Run notebooks/02_feature_engineering.ipynb
- Model Training: Run notebooks/03_model_training.ipynb
- Model Evaluation: Run notebooks/04_model_evaluation.ipynb

- Multiple Molecular Fingerprints: Morgan, MACCS, RDKit, Mordred descriptors
- Advanced ML Models: Random Forest, XGBoost, Neural Networks, Graph Neural Networks
- Comprehensive Evaluation: ROC-AUC, PR-AUC, Balanced Accuracy, Confusion Matrices
- Interactive Visualizations: Compound structure viewing, performance plots
- Model Interpretability: SHAP values, feature importance analysis
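
Balanced accuracy is listed among the metrics because toxicity data sets such as Tox21 usually contain far fewer actives than inactives, so plain accuracy can look good for a model that never predicts toxicity. The following is a minimal pure-Python sketch of the confusion matrix counts and balanced accuracy for one binary endpoint. The function names are hypothetical and this is not necessarily how `src/evaluation.py` computes them.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on actives) and specificity."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one active and one inactive compound")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy example: 9 inactives, 1 active, and a model that predicts "inactive"
# for every compound.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

In the toy example the always-inactive model reaches 90% accuracy but only 0.5 balanced accuracy, which is the value a random guesser would get.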
The models achieve the following performance metrics (averaged across all endpoints):
- Random Forest: ROC-AUC = 0.78
- XGBoost: ROC-AUC = 0.81
- Neural Network: ROC-AUC = 0.79
- Graph Neural Network: ROC-AUC = 0.83
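
An average over endpoints like the one reported above can be computed as sketched below. The sketch makes two assumptions that the figures above do not confirm: the average is an unweighted mean over endpoints, and compounds not measured in a given assay carry a missing label (represented here as `None`) and are skipped for that endpoint. `roc_auc` and `mean_roc_auc` are hypothetical helpers, and the labels and scores are made-up toy values.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random active outscores a random
    inactive (ties count as one half). Quadratic time, fine for a sketch."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_roc_auc(labels, scores):
    """Unweighted mean of per-endpoint ROC-AUC.

    labels and scores map endpoint name -> list with one entry per compound.
    A label of None means "not measured" and is skipped for that endpoint.
    """
    per_endpoint = {}
    for endpoint, y in labels.items():
        known = [(t, s) for t, s in zip(y, scores[endpoint]) if t is not None]
        per_endpoint[endpoint] = roc_auc([t for t, _ in known],
                                         [s for _, s in known])
    mean = sum(per_endpoint.values()) / len(per_endpoint)
    return mean, per_endpoint


# Toy data for two endpoints and five compounds (not real Tox21 values).
labels = {"NR-AR": [1, 0, None, 0, 1], "SR-p53": [0, 1, 1, None, 0]}
scores = {"NR-AR": [0.9, 0.2, 0.5, 0.4, 0.3],
          "SR-p53": [0.1, 0.8, 0.7, 0.6, 0.2]}

mean_auc, per_endpoint = mean_roc_auc(labels, scores)
print(per_endpoint)  # {'NR-AR': 0.75, 'SR-p53': 1.0}
print(mean_auc)      # 0.875
```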
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
MIT License
- Tox21 Challenge: https://tripod.nih.gov/tox21/
- RDKit: https://www.rdkit.org/
- Mordred: https://github.com/mordred-descriptor/mordred