Tox21 Modeling Pipeline - Modular Structure

This document describes the clean, modular structure of the Tox21 modeling pipeline.

🏗️ Architecture Overview

The pipeline is organized into the following modules:

tox21_models/
├── src/
│   ├── pipeline_manager.py    # Main orchestrator
│   ├── model_trainer.py       # Model training and evaluation
│   ├── result_manager.py      # Result saving/loading
│   ├── data_preparation.py    # Data loading and preprocessing
│   ├── feature_selector.py    # Feature selection
│   └── ... (other modules)
├── config.py                  # Centralized configuration
├── run_pipeline.py            # Simple main script
├── utils.py                   # Utility functions
└── MODULAR_STRUCTURE.md       # This file

🚀 Quick Start

1. Run the Pipeline

python run_pipeline.py

2. Use Utility Functions

python utils.py

📋 Core Components

1. PipelineManager (`src/pipeline_manager.py`)

Main orchestrator that coordinates the entire workflow.

Key Features:

Data loading and preprocessing
Feature selection
Model training and evaluation
Progress tracking and logging
Resume capability

Usage:

from src.pipeline_manager import PipelineManager
from config import PIPELINE_CONFIG

pipeline = PipelineManager(PIPELINE_CONFIG)
results = pipeline.run_pipeline(target_indices=[0, 7])

2. ModelTrainer (`src/model_trainer.py`)

Handles model training and evaluation with cross-validation.

Key Features:

Multiple model types (RandomForest, LogisticRegression, SVM)
Cross-validation training
Performance evaluation
Result aggregation

Usage:

from src.model_trainer import ModelTrainer

trainer = ModelTrainer(
    models=['RandomForest', 'LogisticRegression'],
    cv_folds=5,
    random_state=42
)
# X: feature matrix, y: labels for a single target, target_name: e.g. 'NR-AR'
results = trainer.train_and_evaluate(X, y, target_name)
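
How ModelTrainer builds its cross-validation folds is not shown here. As a rough illustration only, the sketch below shows one common choice for imbalanced toxicity labels: stratified folds that keep the class ratio similar in every split. The function name `stratified_folds` and the toy labels are assumptions made for this example, not part of the pipeline.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=42):
    """Return (train_idx, test_idx) pairs that preserve the class ratio per fold."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin across folds.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)

    # Each fold serves as the test set once; the rest form the training set.
    splits = []
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train_idx, test_idx))
    return splits


# Toy labels: 40 inactive (0) and 10 active (1) compounds.
labels = [0] * 40 + [1] * 10
splits = stratified_folds(labels, n_folds=5, seed=42)
```

With `cv_folds=5` and `random_state=42` as in the usage above, every test fold in this toy example contains 8 inactive and 2 active samples.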

3. ResultManager (`src/result_manager.py`)

Manages saving and loading of results, models, and reports.

Key Features:

Save/load trained models
Save/load feature selectors
Generate comparison reports
Export results for sharing

Usage:

from src.result_manager import ResultManager

result_manager = ResultManager('results')
result_manager.save_target_results(target_name, results, feature_selector, selected_features)
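
The internals of ResultManager are not reproduced in this document. The sketch below assumes the `.pkl` files are plain `pickle` dumps and shows a minimal save/load round trip. The helper names `save_pickle` and `load_pickle` and the placeholder score are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    results_path = Path(tmp_dir) / "results" / "NR-AR_results.pkl"
    # Placeholder value, not a real pipeline result.
    save_pickle({"RandomForest": {"roc_auc": 0.5}}, results_path)
    restored = load_pickle(results_path)
```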

4. Tox21Utils (`utils.py`)

Utility functions for common operations.

Key Features:

Load trained models
Make predictions on new compounds
Compare target performance
Generate prediction reports

Usage:

from utils import Tox21Utils

utils = Tox21Utils()
model_data = utils.load_model_and_features('NR-AR')
predictions = utils.predict_toxicity('NR-AR', descriptors, feature_names)

⚙️ Configuration

All settings are centralized in config.py:

PIPELINE_CONFIG = {
    'random_state': 42,
    'cv_folds': 5,
    'models': ['RandomForest', 'LogisticRegression', 'SVM'],
    'feature_selection': {
        'correlation_threshold': 0.90,
        'univariate_k': 500,
        'top_n_model': 150
    },
    # ... more settings
}
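
When experimenting, it can help to derive a variant of the configuration without editing config.py. The sketch below is one possible approach, not something the pipeline is known to ship: `with_overrides` is a hypothetical helper that merges nested overrides into a copy of the base settings. `BASE_CONFIG` mirrors the values shown above so the example is self-contained.

```python
import copy

# Mirrors the PIPELINE_CONFIG values shown above.
BASE_CONFIG = {
    "random_state": 42,
    "cv_folds": 5,
    "models": ["RandomForest", "LogisticRegression", "SVM"],
    "feature_selection": {
        "correlation_threshold": 0.90,
        "univariate_k": 500,
        "top_n_model": 150,
    },
}


def with_overrides(base, overrides):
    """Return a deep copy of base with overrides merged in, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Fewer folds and fewer univariate features; everything else is inherited.
experiment = with_overrides(
    BASE_CONFIG,
    {"cv_folds": 3, "feature_selection": {"univariate_k": 200}},
)
```

Because the base dictionary is copied rather than mutated, several experiment configurations can coexist in one session.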

📊 Output Structure

results/
├── {target}_results.pkl           # Detailed training results
├── {target}_best_model.pkl        # Best trained model
├── {target}_feature_selector.pkl  # Feature selector
├── {target}_selected_features.npy # Selected feature names
├── {target}_model_comparison.csv  # Model comparison report
└── pipeline_summary.csv           # Overall summary

logs/
├── pipeline_{timestamp}.log       # Detailed execution log
└── pipeline_progress.json         # Progress tracking
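
The naming scheme above makes it easy to locate every artifact for a target. The following sketch builds those paths and reports which ones are missing. The function names and the short dictionary keys are assumptions made for this example; only the file suffixes come from the layout above.

```python
from pathlib import Path

# Suffixes copied from the layout above; the short keys are hypothetical labels.
ARTIFACT_SUFFIXES = {
    "results": "_results.pkl",
    "best_model": "_best_model.pkl",
    "feature_selector": "_feature_selector.pkl",
    "selected_features": "_selected_features.npy",
    "model_comparison": "_model_comparison.csv",
}


def artifact_paths(target, results_dir="results"):
    """Map each artifact label to its expected path for one target."""
    root = Path(results_dir)
    return {key: root / f"{target}{suffix}" for key, suffix in ARTIFACT_SUFFIXES.items()}


def missing_artifacts(target, results_dir="results"):
    """List the artifact labels whose files do not exist yet."""
    paths = artifact_paths(target, results_dir)
    return sorted(key for key, path in paths.items() if not path.exists())


paths = artifact_paths("NR-AR")
```

A check like `missing_artifacts('NR-AR')` can be a quick sanity test after a run, before loading anything.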

🔧 Maintenance Benefits

1. Separation of Concerns

Each class has a single responsibility
Easy to modify individual components
Clear interfaces between modules

2. Easy Debugging

Comprehensive logging at each step
Isolated components for testing
Clear error messages and stack traces

3. Configuration Management

All settings in one place
Easy to experiment with different parameters
Environment-specific configurations

4. Reusability

Components can be used independently
Easy to extend with new models/features
Clean APIs for integration

5. Progress Tracking

Automatic progress saving
Resume capability after interruption
Detailed execution logs

🛠️ Common Operations

Add a New Model

Add model configuration to config.py
Update ModelTrainer._get_model_configs()
No changes needed in other components
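
The body of `ModelTrainer._get_model_configs()` is not shown in this document. One way such a method could be organized, purely as an assumption, is a mapping from the names listed in the config to factories that build estimators. In the sketch below, `MajorityClassifier`, `MODEL_FACTORIES`, and `build_models` are all hypothetical, and the toy classifier stands in for a real estimator.

```python
from collections import Counter


class MajorityClassifier:
    """Toy stand-in for a real estimator: always predicts the most common label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


# Hypothetical registry: config name -> zero-argument factory.
# Adding a model means adding one entry here and one name in the config.
MODEL_FACTORIES = {
    "Majority": MajorityClassifier,
}


def build_models(names):
    """Instantiate the requested models, failing early on unknown names."""
    unknown = [name for name in names if name not in MODEL_FACTORIES]
    if unknown:
        raise ValueError(f"Unknown model(s): {unknown}")
    return {name: MODEL_FACTORIES[name]() for name in names}


models = build_models(["Majority"])
predictions = models["Majority"].fit([[0.1], [0.2], [0.3]], [0, 0, 1]).predict([[0.4], [0.5]])
```

With this kind of lookup, the training loop only iterates over the built models, which is consistent with the note that other components need no changes.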

Change Feature Selection

Edit the 'feature_selection' block of PIPELINE_CONFIG in config.py
Or override PIPELINE_CONFIG['feature_selection'] in code before creating the PipelineManager
The pipeline picks up the new settings on the next run
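
To make the effect of `correlation_threshold` concrete, here is a small pure-Python sketch of a correlation filter. It assumes the threshold means "drop a feature whose absolute Pearson correlation with an already kept feature exceeds the value". The actual FeatureSelector may interpret the setting differently, and the feature names and values are invented for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(columns, threshold=0.90):
    """Keep features in order, skipping any that correlate too strongly with a kept one."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[other])) <= threshold for other in kept):
            kept.append(name)
    return kept


# Invented descriptor columns: the second is an exact multiple of the first.
columns = {
    "mol_weight": [1.0, 2.0, 3.0, 4.0],
    "mol_weight_x2": [2.0, 4.0, 6.0, 8.0],
    "logp": [1.0, -1.0, 1.0, -1.0],
}
kept = drop_correlated(columns, threshold=0.90)
```

Lowering the threshold removes more features; raising it toward 1.0 keeps nearly everything.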

Process Different Targets

# Process specific targets
results = pipeline.run_pipeline(target_indices=[0, 1, 2])

# Process all targets
results = pipeline.run_pipeline()

Load and Use Trained Models

from utils import Tox21Utils

utils = Tox21Utils()
model_data = utils.load_model_and_features('NR-AR')

# Make predictions
predictions = utils.predict_toxicity('NR-AR', new_descriptors, feature_names)
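
A trained model expects exactly the features chosen during selection, in the same order. Whether `predict_toxicity` performs this alignment internally is not stated here, so the sketch below shows how the step could look if done by hand. `align_features`, the descriptor names, and the values are hypothetical.

```python
def align_features(descriptors, feature_names, selected_features):
    """Reorder and subset descriptor columns to match the selected feature list."""
    column_of = {name: i for i, name in enumerate(feature_names)}
    missing = [name for name in selected_features if name not in column_of]
    if missing:
        raise KeyError(f"Descriptors are missing required features: {missing}")
    order = [column_of[name] for name in selected_features]
    return [[row[i] for i in order] for row in descriptors]


# Invented descriptor table: one row per compound, one column per feature.
feature_names = ["MolWt", "LogP", "TPSA"]
new_descriptors = [[180.2, 1.2, 63.6], [46.1, -0.3, 20.2]]
selected = ["TPSA", "MolWt"]

aligned = align_features(new_descriptors, feature_names, selected)
```

Failing loudly on missing features is a deliberate choice in this sketch: silently filling gaps would produce predictions that look valid but are not.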

🐛 Debugging

Check Logs

tail -f logs/pipeline_*.log

Verify Results

python utils.py  # Lists available targets and performance

Resume Interrupted Run

python run_pipeline.py  # Automatically resumes from last saved state
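
The format of `pipeline_progress.json` is not documented here. As a sketch of how resume logic can work in general, the example below assumes the file holds a list of completed targets under a hypothetical `completed_targets` key and skips those on the next run. All function names are illustrative.

```python
import json
import tempfile
from pathlib import Path


def load_completed(progress_path):
    """Return the set of targets already recorded as finished."""
    path = Path(progress_path)
    if not path.exists():
        return set()
    with path.open() as fh:
        return set(json.load(fh).get("completed_targets", []))


def mark_completed(progress_path, target):
    """Record a finished target, writing via a temp file and rename."""
    path = Path(progress_path)
    completed = load_completed(path)
    completed.add(target)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_suffix(".tmp")
    with tmp_path.open("w") as fh:
        json.dump({"completed_targets": sorted(completed)}, fh)
    # Replace in one step so a crash mid-write is less likely to corrupt the file.
    tmp_path.replace(path)


def pending_targets(all_targets, progress_path):
    """Targets that still need processing, in their original order."""
    done = load_completed(progress_path)
    return [target for target in all_targets if target not in done]


with tempfile.TemporaryDirectory() as tmp_dir:
    progress_file = Path(tmp_dir) / "logs" / "pipeline_progress.json"
    targets = ["NR-AR", "NR-AhR", "SR-p53"]
    mark_completed(progress_file, "NR-AR")
    remaining = pending_targets(targets, progress_file)
```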

📈 Performance Monitoring

The pipeline provides comprehensive performance tracking:

Real-time logging with timestamps
Progress tracking with resume capability
Performance metrics for each model
Summary reports for easy comparison
Export functionality for sharing results
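
For reference, timestamped file logging of this kind can be set up with the standard `logging` module. The sketch below is an assumption about how such a logger might be configured, not the pipeline's actual setup. The logger name, format string, and timestamp pattern are illustrative, while the `pipeline_{timestamp}.log` naming follows the output structure.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def setup_logger(log_dir, name="tox21_pipeline"):
    """Create a logger that writes timestamped lines to pipeline_{timestamp}.log."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"pipeline_{timestamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, log_path


with tempfile.TemporaryDirectory() as tmp_dir:
    logger, log_path = setup_logger(Path(tmp_dir) / "logs")
    logger.info("Finished target %s", "NR-AR")
    # Close the file before reading it back and before the directory is removed.
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    log_text = log_path.read_text()
```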

🔄 Extending the Pipeline

Add New Models

Update ModelTrainer._get_model_configs()
Add model to PIPELINE_CONFIG['models']

Add New Evaluation Metrics

Update ModelTrainer._calculate_metrics()
Add metric to EVALUATION_METRICS in config.py
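
The two steps above suggest that metrics are looked up by name. A minimal sketch of that idea follows, assuming `EVALUATION_METRICS` is a list of metric names. The `METRICS` mapping and `calculate_metrics` function are hypothetical stand-ins for whatever `ModelTrainer._calculate_metrics()` actually does.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, which is less flattering on imbalanced data."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical lookup table: a new metric is one function plus one entry.
METRICS = {
    "accuracy": accuracy,
    "balanced_accuracy": balanced_accuracy,
}

# Assumed shape of the config entry: a list of metric names.
EVALUATION_METRICS = ["accuracy", "balanced_accuracy"]


def calculate_metrics(y_true, y_pred, names):
    """Compute each requested metric by name."""
    return {name: METRICS[name](y_true, y_pred) for name in names}


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = calculate_metrics(y_true, y_pred, EVALUATION_METRICS)
```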

Add New Feature Selection Methods

Extend FeatureSelector class
Update configuration as needed

Keeping orchestration, training, result handling, and configuration in separate modules makes the codebase easier to maintain, debug, and extend.
