NPstereo

This repository contains the code and resources for predicting the absolute stereochemistry of natural products (NPs) using a transformer-based model. The models are trained to accurately predict stereocenters from the absolute SMILES representation of a chemical compound.

Theory

The objective of our model is to predict the absolute configuration of natural products from their absolute SMILES representations. We use a transformer model implemented with OpenNMT to translate absolute SMILES into isomeric SMILES. Our dataset is derived from the 09-2024 version of the COCONUT database (the latest at the time of writing), which is the largest repository of natural products. The best model achieves a per-stereocenter accuracy above 80% on full assignments and above 85% on partial assignments. The repository includes code for data extraction, preprocessing, model training, and evaluation. It will be updated and improved in the near future.
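To make the per-stereocenter accuracy figure concrete, here is a minimal sketch of how such a metric could be computed at the string level. It is not the code in eval_functions.py. The function names are hypothetical, and it assumes that the predicted SMILES lists its atoms in the same order as the reference, so that the `@`/`@@` tags can be compared position by position. Double-bond stereo (`/`, `\`) is ignored.

```python
import re

# One atom token: a bracket atom, an organic-subset atom, or a wildcard.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\*")


def chiral_tags(smiles):
    """Return one tag per atom in string order: '@', '@@' or None."""
    tags = []
    for match in ATOM.finditer(smiles):
        token = match.group()
        if "@@" in token:
            tags.append("@@")
        elif "@" in token:
            tags.append("@")
        else:
            tags.append(None)
    return tags


def stereocenter_accuracy(pairs):
    """Fraction of reference stereocenters whose tag is reproduced.

    `pairs` is an iterable of (reference_smiles, predicted_smiles).
    Assumption: both strings enumerate atoms in the same order.
    """
    correct = 0
    total = 0
    for reference, predicted in pairs:
        ref_tags = chiral_tags(reference)
        pred_tags = chiral_tags(predicted)
        same_atom_count = len(ref_tags) == len(pred_tags)
        for index, tag in enumerate(ref_tags):
            if tag is None:
                continue
            total += 1
            # A prediction with a different atom count scores zero here.
            if same_atom_count and pred_tags[index] == tag:
                correct += 1
    return correct / total if total else float("nan")


example_pairs = [
    ("C[C@H](N)C(=O)O", "C[C@H](N)C(=O)O"),          # 1 of 1 correct
    ("C[C@H](O)[C@@H](N)C", "C[C@H](O)[C@H](N)C"),   # 1 of 2 correct
]
print(f"{stereocenter_accuracy(example_pairs):.2f}")  # 0.67
```

A toolkit-based comparison (for example, CIP labels computed with a cheminformatics library) would be more robust, because `@` and `@@` only have a fixed meaning relative to the neighbor order in the string.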

Getting started

To replicate the results of our model, follow the instructions below.

1. Clone the repository

git clone https://github.com/reymond-group/NPstereo.git

2. Download the data/ directory from the Zenodo repository.

Zenodo NPstereo repository

3. Install the required conda environments using the following commands:

conda env create -f npstereo.yml
conda env create -f npstereotmap.yml

Because the TMAP package does not support Python versions above 3.7, the second file creates a separate environment for the notebooks that generate the TMAP plots.

4. Run the code in the notebooks to reproduce the results.

5. Running on your own data.

You can run NPstereo on your own data by downloading the NPstereo model (partial_augmented_5x) from the Zenodo repository and placing it in the models directory. Then edit the literature-dataset.xlsx file so that it contains the structures you want to assign, and run the code in the 09-new-assignment notebook. For the examples in the provided dataset, prediction takes a few seconds.
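When predicting on your own structures there is no reference to score against, so it can be useful to check that each prediction differs from its input only by stereo descriptors. The sketch below does this at the string level. The helper names are hypothetical and not part of the repository, and it assumes the model reproduces the input SMILES character by character apart from the added `@`, `@@`, `/` and `\` symbols. A chemically equivalent output that is written differently would be flagged.

```python
import re

# Chiral bracket atoms without charge, isotope or atom map, e.g. [C@@H] or [C@].
SIMPLE_CHIRAL = re.compile(r"\[(B|C|N|O|P|S)@{1,2}H?\]")


def strip_stereo(smiles):
    """Remove stereo descriptors from a SMILES string (string-level sketch)."""
    flat = SIMPLE_CHIRAL.sub(r"\1", smiles)  # [C@@H] -> C
    flat = flat.replace("@", "")             # e.g. [N@+] -> [N+]
    return flat.replace("/", "").replace("\\", "")


def keeps_constitution(input_smiles, predicted_smiles):
    """True if the prediction matches the input once stereo is removed."""
    return strip_stereo(predicted_smiles) == strip_stereo(input_smiles)


print(keeps_constitution("CC(N)C(=O)O", "C[C@H](N)C(=O)O"))  # True
print(keeps_constitution("CC(N)C(=O)O", "C[C@H](O)C(=O)O"))  # False
```

Predictions that fail this check are worth inspecting by hand or re-checking with a cheminformatics toolkit before the assignment is trusted.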

Notebooks

The notebooks are organized as follows:

01-extract-dataset: Contains the SQL query to extract the dataset from the PostgreSQL dump and the preprocessing steps to clean up the dataset.
02-augment-data: Contains the code to augment the dataset via SMILES randomization.
03-prepare-datasets: Contains the code to prepare the dataset in the format required by OpenNMT for training the model.
04-train: Contains the code to train the transformer model using OpenNMT. This is a Python script (04-train.py), not a notebook.
05-predict: Contains the code to run the predictions on the test set.
06-evaluate: Contains the code to evaluate the model's performance.
07-analysis: Contains the code to generate the TMAP plots and the in-depth analysis of the model's performance.
08-partial-assignment: Contains the code to run the predictions on the set of incompletely assigned compounds in COCONUT.
09-new-assignment: Contains the code to run the predictions on a small set of manually curated compounds to validate the model's performance.
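As an illustration of the dataset preparation step, OpenNMT reads plain-text files with one example per line and tokens separated by spaces. The sketch below splits a SMILES string into tokens with a regular expression that is widely used for SMILES-based transformer models. Whether the preparation notebook uses this exact pattern is an assumption, and the function name is hypothetical.

```python
import re

# Bracket atoms are kept whole; two-letter halogens are matched before C and B.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br?|Cl?|[NOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)


def tokenize_smiles(smiles):
    """Return the SMILES as a space-separated token string.

    Raises ValueError if some characters are not covered by the pattern,
    so that malformed input is not silently truncated.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize SMILES: {smiles!r}")
    return " ".join(tokens)


# Source line (no stereo) and target line (with stereo) for one compound.
print(tokenize_smiles("CC(N)C(=O)O"))      # C C ( N ) C ( = O ) O
print(tokenize_smiles("C[C@H](N)C(=O)O"))  # C [C@H] ( N ) C ( = O ) O
```

Keeping each bracket atom as a single token means that a stereocenter such as `[C@H]` is predicted as one unit, not as separate characters.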

License

MIT

