tonyreina/chemistry

Chemistry with TensorFlow (and OpenVINO)

Using TensorFlow to model chemistry problems.

An example of predicting lipophilicity from a molecule's SMILES string.

This notebook is based on the excellent Kaggle tutorial from Vlad Kisin. In this example, you'll learn how to read a chemistry data file and build predictive models of lipophilicity.

Figure1

Lipophilicity is the ability of a chemical compound to dissolve in non-polar (fatty or oily) solvents. In simple terms, if you had a glass of oil and water (which separate into two layers, as in the figure above), lipophilicity describes how a chemical distributes between the two layers. It is quantified by the partition coefficient P, the ratio of the compound's concentration in the oil layer to its concentration in the water layer. In the figure there are 3 molecules in water to every 1 molecule in oil, so P is 1/3 and log P is $\log_{10}(1/3) \approx -0.477$.

Lipophilicity affects the absorption, distribution, metabolism, excretion, and toxicity of a pharmaceutical, as well as its potency and selectivity.
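As a quick numerical check of the log P arithmetic, here is a minimal Python sketch. It assumes the usual convention that P is the concentration in the oil (octanol) layer divided by the concentration in the water layer. The function name `log_p` and its arguments are illustrative and are not taken from the notebook.

```python
import math


def log_p(conc_oil, conc_water):
    """Return log10 of the partition coefficient P = conc_oil / conc_water."""
    if conc_oil <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_oil / conc_water)


# A compound that prefers the oil layer 3:1 has a positive log P.
favours_oil = log_p(3.0, 1.0)
# A compound that prefers the water layer 3:1 has a negative log P.
favours_water = log_p(1.0, 3.0)

print(f"{favours_oil:.3f} {favours_water:.3f}")
```

The two results have equal magnitude and opposite sign, so a positive log P indicates a lipophilic compound and a negative one indicates a hydrophilic compound.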

Figure2

I'll demonstrate how to load the raw data from a CSV file and use the RDKit and Mol2Vec packages to create features from a molecule's SMILES representation.

smiles
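To give a feel for what turning a SMILES string into features can look like, here is a minimal sketch that uses RDKit only. It is not the notebook's feature pipeline: the choice of descriptors and the function name `rdkit_features` are assumptions made for illustration, and the Mol2Vec embedding step is not shown.

```python
def rdkit_features(smiles):
    """Return a few RDKit descriptors for one SMILES string, or None if it cannot be parsed."""
    # Imported lazily so this sketch can be loaded even where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)  # returns None for an invalid SMILES string
    if mol is None:
        return None
    return {
        "num_atoms": mol.GetNumAtoms(),        # heavy atoms only; hydrogens are implicit
        "mol_weight": Descriptors.MolWt(mol),  # average molecular weight
        "h_donors": Descriptors.NumHDonors(mol),
        "h_acceptors": Descriptors.NumHAcceptors(mol),
    }
```

Inside the chem environment, `rdkit_features("CCO")` (ethanol) returns a dictionary with these four descriptors, which could then be placed alongside other feature columns.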

Installation

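I tested this on Ubuntu 18.04 with the Anaconda Python Distribution. To set up the conda environment (which I labeled chem):

# Create and activate the conda environment
conda create -n chem python=3.8 pip jupyter matplotlib seaborn
conda activate chem
# Install RDKit and Mol2Vec
conda install -c conda-forge rdkit
pip install git+https://github.com/samoturk/mol2vec
# Replace Mol2Vec's features.py with the version from the tonyreina/mol2vec fork (adjust the path if Anaconda is not in ~/anaconda3)
wget https://raw.githubusercontent.com/tonyreina/mol2vec/master/mol2vec/features.py -O ~/anaconda3/envs/chem/lib/python3.8/site-packages/mol2vec/features.py
# Download the Mol2Vec model file (the /raw/ URL fetches the file itself rather than the GitHub HTML page)
wget https://github.com/samoturk/mol2vec_notebooks/raw/master/Notebooks/model_300dim.pkl
# Install TensorFlow, the OpenVINO integration, and the remaining packages
pip install -U tensorflow==2.4.1
pip install openvino-tensorflow==0.5.0
conda install scikit-learn
pip install py3Dmol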
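
After the installation steps, a quick way to confirm the environment is complete is to check that each package can be found by Python. The sketch below uses only the standard library. The import names in `REQUIRED` are my assumption of how the installed packages are imported, and `find_missing` is a hypothetical helper, not part of the repository.

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED = ["rdkit", "mol2vec", "tensorflow", "openvino_tensorflow", "sklearn", "py3Dmol"]


def find_missing(module_names):
    """Return the subset of module_names that Python cannot locate."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it inside the activated chem environment; any name it reports points to the install command that needs to be repeated.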

Run

Run the Jupyter notebook chemistry_predict_logP_tensorflow.ipynb.

Dataset

The lipophilicity dataset is available on Kaggle and has been released into the public domain (CC0). The raw data is a CSV file with the SMILES notation of each chemical in the first column and its lipophilicity (logP) in the second column.
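
The two-column layout described above can be read with Python's standard `csv` module, as in the minimal sketch below. Whether the real file starts with a header row is not stated here, so the `has_header` flag is an assumption you may need to flip. The function name `read_lipophilicity` and the inline sample rows are placeholders, not values from the dataset.

```python
import csv
import io


def read_lipophilicity(handle, has_header=False):
    """Return a list of (smiles, logp) pairs from a two-column CSV file object."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    # SMILES is in the first column, logP in the second; blank lines are ignored.
    return [(row[0], float(row[1])) for row in reader if row]


# Tiny in-memory stand-in for the real file; the logP values are placeholders.
sample = io.StringIO("CCO,-0.3\nc1ccccc1,2.1\n")
records = read_lipophilicity(sample)
print(records)
```

For the real dataset, pass an open file instead of the in-memory sample, for example `open(path, newline="")` where `path` points to the downloaded CSV.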
