ClearSMILES is a data augmentation procedure for SMILES. Its first goal is to minimize the dimensionality of SMILES, i.e., to reduce the size of the vocabulary needed to describe a dataset. Its second goal is to reduce the attention effort a machine learning model has to make to process a SMILES.
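To make "size of the vocabulary" concrete, here is a minimal sketch that counts the distinct tokens needed to write a small set of SMILES. The regular expression is a simplified tokenizer written for this illustration, and `tokenize` and `vocabulary` are hypothetical helper names; none of this is taken from the ClearSMILES code base.

```python
import re

# Simplified SMILES tokenizer (an assumption for illustration, not exhaustive):
# bracket atoms, two-letter halogens, two-digit ring closures (%nn),
# then single letters, digits and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#$:/\\\-+().~*@]")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def vocabulary(dataset: list[str]) -> set[str]:
    """Return the distinct tokens needed to write every SMILES in the dataset."""
    return {token for smiles in dataset for token in tokenize(smiles)}


if __name__ == "__main__":
    dataset = ["CCO", "c1ccccc1", "C[NH3+]"]
    vocab = vocabulary(dataset)
    print(sorted(vocab))
    print(f"vocabulary size: {len(vocab)}")
```

Every distinct token, including each ring-closure digit and each bracket atom, adds one entry to the vocabulary a model has to learn.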
First, clone the repository:
To install the Python dependencies, you can create a virtual environment:
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
pip install -e .
# Test the installation
pytest src/test/download_test.py

ClearSMILES is a stochastic data augmentation procedure. Therefore, it is not possible to guarantee that it will always yield the same results. However, by generating a very large number of SMILES (default: 100k randomized SMILES per molecule), the results should be consistent. As random search is time-consuming, the ClearSMILES generation process is designed for cluster parallelization. Here's an example SLURM job:
#!/bin/sh
#SBATCH --time=01:00:00
#SBATCH --job-name=ClearSMILES_gen
#SBATCH --output=logs/ClearSMILES_gen/out_%A_%a.log
#SBATCH --error=logs/ClearSMILES_gen/err_%A_%a.log
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G

The MOSES dataset can be downloaded using:
# Download MOSES dataset
python src/data/download_MOSES.py
# Verify dataset integrity
pytest src/test/download_test.py

This will download the MOSES dataset to data/raw/MOSES.csv. The dataset contains around 2M molecules with their SMILES representation. The download script will also verify the integrity of the downloaded files.
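Because generation is parallelized over a SLURM job array, each array task presumably handles one slice of the downloaded dataset. The sketch below shows one way such a split could be computed. The function `task_slice` is hypothetical, the 2,000,000 rows and 2000 tasks are illustrative round figures, and the actual repository scripts may partition the data differently. `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets for each array task.

```python
import os


def task_slice(n_rows: int, n_tasks: int, task_id: int) -> range:
    """Return the row indices handled by a 1-based array task.

    Rows are split into n_tasks contiguous chunks whose sizes differ by at most one.
    """
    if not 1 <= task_id <= n_tasks:
        raise ValueError(f"task_id must be in 1..{n_tasks}, got {task_id}")
    base, extra = divmod(n_rows, n_tasks)
    index = task_id - 1
    # The first `extra` tasks each take one additional row.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return range(start, stop)


if __name__ == "__main__":
    # Falls back to task 1 when run outside of a SLURM job array.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    rows = task_slice(n_rows=2_000_000, n_tasks=2000, task_id=task_id)
    print(f"task {task_id} processes {len(rows)} rows starting at row {rows.start}")
```

Contiguous chunks keep each task's output file tied to a single task ID, which makes it straightforward to rerun only the tasks that failed.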
First, check if all jobs have completed successfully:
python src/features/get_failed_gen_tasks.py \
--search_pattern data/interim/ClearSMILES_MOSES_subset_*.parquet \
--output_filepath data/external/failed_task_id.txt \
--job_array_range 1-2000

Adjust the search pattern for custom datasets. If successful, you'll see: 'all tasks were successfully completed, no file will be written'. Otherwise, you'll get a success rate and failed task IDs in the output file.
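The check described above can be sketched as a comparison between the expected task IDs and the IDs found in the output filenames. This is a minimal illustration, not the logic of get_failed_gen_tasks.py: it assumes that the `*` in the search pattern is the integer task ID, and `parse_array_range`, `find_failed_tasks` and `format_array_spec` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumption: output files end in "_<task id>.parquet".
TASK_ID_PATTERN = re.compile(r"_(\d+)\.parquet$")


def parse_array_range(spec: str) -> list[int]:
    """Turn a range such as '1-2000' into the list of expected task IDs."""
    first, last = (int(part) for part in spec.split("-"))
    return list(range(first, last + 1))


def find_failed_tasks(filenames: list[str], expected_ids: list[int]) -> list[int]:
    """Return the expected task IDs that have no matching output file."""
    done = set()
    for name in filenames:
        match = TASK_ID_PATTERN.search(Path(name).name)
        if match:
            done.add(int(match.group(1)))
    return [task_id for task_id in expected_ids if task_id not in done]


def format_array_spec(task_ids: list[int]) -> str:
    """Format task IDs as a comma-separated value for '#SBATCH --array='."""
    return ",".join(str(task_id) for task_id in sorted(task_ids))


if __name__ == "__main__":
    expected = parse_array_range("1-5")
    found = [f"data/interim/ClearSMILES_MOSES_subset_{i}.parquet" for i in (1, 2, 4)]
    failed = find_failed_tasks(found, expected)
    print(f"success rate: {1 - len(failed) / len(expected):.0%}")
    print(f"#SBATCH --array={format_array_spec(failed)}")
```

In practice the filename list would come from expanding the search pattern, for example with `glob.glob`, and the formatted string can be pasted into the job script when relaunching.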
To relaunch failed jobs, modify the SLURM array range parameter:
# Example: Change from
#SBATCH --array=1-2000
# to
#SBATCH --array=3,15,42,1457

When complete, aggregate the data:
python src/features/concatenate2lib.py \
--search_pattern data/interim/ClearSMILES_MOSES_subset_*.parquet \
--output_filepath data/processed/whole_MOSES_ClearSMILES_results.parquet

You can use multiprocessing to read the files with the --use_multiprocessing flag.
To analyze ClearSMILES properties:
- Use ClearSMILES_analysis.ipynb for general analysis
- Use memory_score.ipynb to project semantic maps on 2D structures
Project based on the cookiecutter data science project template. #cookiecutterdatascience