This document describes the modular structure of the Tox21 modeling pipeline.
The codebase has been refactored into the following components:
```
tox21_models/
├── src/
│   ├── pipeline_manager.py    # Main orchestrator
│   ├── model_trainer.py       # Model training and evaluation
│   ├── result_manager.py      # Result saving/loading
│   ├── data_preparation.py    # Data loading and preprocessing
│   ├── feature_selector.py    # Feature selection
│   └── ... (other modules)
├── config.py                  # Centralized configuration
├── run_pipeline.py            # Simple main script
├── utils.py                   # Utility functions
└── MODULAR_STRUCTURE.md       # This file
```

## Quick Start

```bash
python run_pipeline.py
python utils.py
```

## pipeline_manager.py

Main orchestrator that coordinates the entire workflow.
Key Features:
- Data loading and preprocessing
- Feature selection
- Model training and evaluation
- Progress tracking and logging
- Resume capability
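To make the orchestration idea concrete, here is a minimal, hypothetical sketch of how an orchestrator could sequence these steps with progress tracking and resume support. The class and method names below are illustrative only and do not come from the actual `pipeline_manager.py`:

```python
# Hypothetical orchestrator sketch; names are illustrative, not the real API.
class MiniPipeline:
    def __init__(self, config):
        self.config = config
        self.progress = {}  # target_name -> status; enables resume

    def run(self, targets):
        results = {}
        for name in targets:
            if self.progress.get(name) == "done":  # resume: skip finished targets
                continue
            X, y = self._load_data(name)           # data loading step
            X = self._select_features(X, y)        # feature selection step
            results[name] = self._train(X, y)      # training/evaluation step
            self.progress[name] = "done"           # progress tracking
        return results

    def _load_data(self, name):
        return [[0.0], [1.0]], [0, 1]              # stand-in for real loading

    def _select_features(self, X, y):
        return X                                   # stand-in for real selection

    def _train(self, X, y):
        return {"model": "stub", "score": 1.0}     # stand-in for real training
```

A second `run()` call with the same targets does no work, which is the essence of the resume capability.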
Usage:

```python
from src.pipeline_manager import PipelineManager
from config import PIPELINE_CONFIG

pipeline = PipelineManager(PIPELINE_CONFIG)
results = pipeline.run_pipeline(target_indices=[0, 7])
```

## model_trainer.py

Handles model training and evaluation with cross-validation.
Key Features:
- Multiple model types (RandomForest, LogisticRegression, SVM)
- Cross-validation training
- Performance evaluation
- Result aggregation
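As a hedged sketch of what cross-validated training over several model types can look like (this is not the project's `ModelTrainer`, just an illustration of the same idea with scikit-learn):

```python
# Illustrative cross-validated training over multiple model types.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_and_evaluate(X, y, cv_folds=5, random_state=42):
    """Score each candidate model with k-fold cross-validation."""
    models = {
        "RandomForest": RandomForestClassifier(n_estimators=50, random_state=random_state),
        "LogisticRegression": LogisticRegression(max_iter=1000, random_state=random_state),
    }
    results = {}
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv_folds, scoring="roc_auc")
        results[name] = {"mean_auc": scores.mean(), "std_auc": scores.std()}
    return results
```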
Usage:

```python
from src.model_trainer import ModelTrainer

trainer = ModelTrainer(
    models=['RandomForest', 'LogisticRegression'],
    cv_folds=5,
    random_state=42
)
results = trainer.train_and_evaluate(X, y, target_name)
```

## result_manager.py

Manages saving and loading of results, models, and reports.
Key Features:
- Save/load trained models
- Save/load feature selectors
- Generate comparison reports
- Export results for sharing
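The save/load round trip can be sketched with the standard library's `pickle`; this is an assumption about the mechanism (the real `ResultManager` may differ), and the file-naming pattern mirrors the output structure shown later:

```python
# Sketch of pickle-based result persistence; class and paths are assumptions.
import pickle
from pathlib import Path

class MiniResultManager:
    def __init__(self, results_dir):
        self.results_dir = Path(results_dir)
        self.results_dir.mkdir(parents=True, exist_ok=True)

    def save(self, target_name, payload):
        path = self.results_dir / f"{target_name}_results.pkl"
        with path.open("wb") as f:
            pickle.dump(payload, f)   # serialize the results dict
        return path

    def load(self, target_name):
        path = self.results_dir / f"{target_name}_results.pkl"
        with path.open("rb") as f:
            return pickle.load(f)
```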
Usage:

```python
from src.result_manager import ResultManager

result_manager = ResultManager('results')
result_manager.save_target_results(target_name, results, feature_selector, selected_features)
```

## utils.py

Utility functions for common operations.
Key Features:
- Load trained models
- Make predictions on new compounds
- Compare target performance
- Generate prediction reports
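One subtlety when predicting on new compounds is aligning descriptor columns with the features the model was trained on. A hypothetical helper (not the actual `utils.py` code) might look like:

```python
# Sketch: subset/reorder descriptor columns to match training-time features.
import numpy as np

def predict_with_selected_features(model, descriptors, all_names, selected_names):
    """Align descriptor columns to the selected feature order, then predict."""
    idx = [all_names.index(n) for n in selected_names]  # column positions
    X = np.asarray(descriptors)[:, idx]
    return model.predict_proba(X)[:, 1]                 # toxicity probability
```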
Usage:

```python
from utils import Tox21Utils

utils = Tox21Utils()
model_data = utils.load_model_and_features('NR-AR')
predictions = utils.predict_toxicity('NR-AR', descriptors, feature_names)
```

## Configuration

All settings are centralized in `config.py`:
```python
PIPELINE_CONFIG = {
    'random_state': 42,
    'cv_folds': 5,
    'models': ['RandomForest', 'LogisticRegression', 'SVM'],
    'feature_selection': {
        'correlation_threshold': 0.90,
        'univariate_k': 500,
        'top_n_model': 150
    },
    # ... more settings
}
```

## Output Structure

```
results/
├── {target}_results.pkl           # Detailed training results
├── {target}_best_model.pkl        # Best trained model
├── {target}_feature_selector.pkl  # Feature selector
├── {target}_selected_features.npy # Selected feature names
├── {target}_model_comparison.csv  # Model comparison report
└── pipeline_summary.csv           # Overall summary

logs/
├── pipeline_{timestamp}.log       # Detailed execution log
└── pipeline_progress.json         # Progress tracking
```
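To illustrate how `feature_selection` settings like `correlation_threshold` and `univariate_k` might be applied, here is a hedged sketch (the real `FeatureSelector` may implement these stages differently; the univariate score here is a plain label correlation, used only for illustration):

```python
# Sketch: drop one of each highly correlated feature pair, then keep the
# k features most associated with the label. Not the actual FeatureSelector.
import numpy as np

def select_features(X, y, correlation_threshold=0.90, univariate_k=500):
    X = np.asarray(X, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        # keep column j only if it is not near-duplicate of a kept column
        if all(corr[j, i] < correlation_threshold for i in keep):
            keep.append(j)
    # simple univariate score: |correlation with the label|
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in keep]
    ranked = [j for _, j in sorted(zip(scores, keep), reverse=True)]
    return sorted(ranked[:univariate_k])
```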
## Benefits

### Maintainability
- Each class has a single responsibility
- Easy to modify individual components
- Clear interfaces between modules

### Debuggability
- Comprehensive logging at each step
- Isolated components for testing
- Clear error messages and stack traces

### Configurability
- All settings in one place
- Easy to experiment with different parameters
- Environment-specific configurations

### Reusability
- Components can be used independently
- Easy to extend with new models/features
- Clean APIs for integration

### Reliability
- Automatic progress saving
- Resume capability after interruption
- Detailed execution logs
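The progress-saving and resume idea can be sketched with a JSON checkpoint file. The file name mirrors `logs/pipeline_progress.json` from the output structure; the helper names and status values are assumptions for illustration:

```python
# Sketch: persist per-target status to JSON so a rerun skips completed work.
import json
from pathlib import Path

def load_progress(path):
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

def run_with_resume(targets, progress_path, work_fn):
    progress = load_progress(progress_path)
    for target in targets:
        if progress.get(target) == "completed":
            continue                                 # already done: skip
        work_fn(target)                              # the expensive step
        progress[target] = "completed"
        Path(progress_path).write_text(json.dumps(progress))  # checkpoint
    return progress
```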
## Customization

### Adding a New Model
- Add the model configuration to `config.py`
- Update `ModelTrainer._get_model_configs()`
- No changes needed in other components

### Changing Feature Selection
- Modify the settings in `config.py`
- Or update `PIPELINE_CONFIG['feature_selection']`
- The pipeline automatically uses the new settings
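A registry in the style of `_get_model_configs()` is presumably what keeps this a one-place change; a hypothetical sketch (hyperparameters and the `build_models` helper are assumptions, not the project's code):

```python
# Sketch of a model-config registry: adding a model is one extra dict entry.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def get_model_configs(random_state=42):
    return {
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=random_state),
        "LogisticRegression": LogisticRegression(max_iter=1000, random_state=random_state),
        # a new model is one extra line:
        "GradientBoosting": GradientBoostingClassifier(random_state=random_state),
    }

def build_models(config):
    """Instantiate only the models named in the pipeline config."""
    configs = get_model_configs(config.get("random_state", 42))
    return {name: configs[name] for name in config["models"]}
```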
## Usage Examples

### Running the Pipeline

```python
# Process specific targets
results = pipeline.run_pipeline(target_indices=[0, 1, 2])

# Process all targets
results = pipeline.run_pipeline()
```

### Making Predictions

```python
from utils import Tox21Utils

utils = Tox21Utils()
model_data = utils.load_model_and_features('NR-AR')

# Make predictions
predictions = utils.predict_toxicity('NR-AR', new_descriptors, feature_names)
```

### Monitoring Progress

```bash
tail -f logs/pipeline_*.log
python utils.py          # Lists available targets and performance
```

### Resuming After Interruption

```bash
python run_pipeline.py   # Automatically resumes from last saved state
```

## Performance Tracking

The pipeline provides comprehensive performance tracking:
- Real-time logging with timestamps
- Progress tracking with resume capability
- Performance metrics for each model
- Summary reports for easy comparison
- Export functionality for sharing results
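A summary report like `pipeline_summary.csv` can be produced with the standard library's `csv` module; the column names below are illustrative, not necessarily those of the real report:

```python
# Sketch: render per-target summary rows as CSV text (column names assumed).
import csv
import io

def write_summary(rows):
    """rows: list of dicts like {'target': ..., 'best_model': ..., 'auc': ...}."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["target", "best_model", "auc"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```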
## Extending the Pipeline

### New Model Types
- Update `ModelTrainer._get_model_configs()`
- Add the model to `PIPELINE_CONFIG['models']`

### New Metrics
- Update `ModelTrainer._calculate_metrics()`
- Add the metric to `EVALUATION_METRICS` in `config.py`

### New Feature Selection Methods
- Extend the `FeatureSelector` class
- Update the configuration as needed
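The metric registry pattern implied by `EVALUATION_METRICS` can be sketched as a name-to-function dict that `_calculate_metrics()` iterates over; the exact contents of the real registry are assumptions here:

```python
# Sketch of an EVALUATION_METRICS-style registry: a new metric is one entry.
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

EVALUATION_METRICS = {
    "roc_auc": roc_auc_score,
    "f1": f1_score,
    # adding a metric: one extra (name, function) pair
    "balanced_accuracy": balanced_accuracy_score,
}

def calculate_metrics(y_true, y_pred):
    """Apply every registered metric to the predictions."""
    return {name: fn(y_true, y_pred) for name, fn in EVALUATION_METRICS.items()}
```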
This modular structure makes the codebase maintainable, debuggable, and extensible while providing a clean, professional interface for Tox21 modeling.