An example of predicting lipophilicity from a molecule's SMILES string.
This notebook is based on the excellent Kaggle tutorial from Vlad Kisin. In this example, you'll learn how to read a chemistry data file and build predictive models of lipophilicity.
Lipophilicity is the ability of a chemical compound to dissolve in non-polar (fatty or oily) solvents. In simple terms, if you had a glass of oil and water (which separate into two layers, one on top of the other, as in the figure above), lipophilicity describes how a chemical splits between the water layer and the oil layer. That split is reported as a ratio, P, or more commonly as its base-10 logarithm, log P. In the figure there are 3 molecules in water to every 1 molecule in oil, so P is 3 and log P is log10(3) ≈ 0.48.
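As a quick numerical check of the 3-to-1 example, here is a minimal sketch of the logarithm step. The helper name `log_p` is hypothetical and not part of the notebook. Note that the conventional octanol-water partition coefficient puts the concentration in the oil-like (octanol) phase in the numerator, so swapping which phase goes on top flips the sign of log P.

```python
import math


def log_p(conc_numerator: float, conc_denominator: float) -> float:
    """Return log10 of the ratio of two phase concentrations (hypothetical helper)."""
    if conc_numerator <= 0 or conc_denominator <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_numerator / conc_denominator)


# The 3-to-1 ratio from the text.
print(round(log_p(3, 1), 3))  # 0.477

# The same two phases with numerator and denominator swapped.
print(round(log_p(1, 3), 3))  # -0.477
```

Because of the logarithm, a compound that is ten times more concentrated in the numerator phase has log P = 1, and equal concentrations give log P = 0.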
Lipophilicity influences the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of a pharmaceutical, as well as a drug's potency and selectivity.
I'll demonstrate how to load the raw data from a CSV file and use the RDKit and Mol2Vec packages to create features based on the chemical structure of a molecule.
I tested this on Ubuntu 18.04 with the Anaconda Python distribution. To set up the conda environment (which I named chem):
conda create -n chem python=3.8 pip jupyter matplotlib seaborn
conda activate chem
conda install -c conda-forge rdkit
pip install git+https://github.com/samoturk/mol2vec
wget https://raw.githubusercontent.com/tonyreina/mol2vec/master/mol2vec/features.py -O ~/anaconda3/envs/chem/lib/python3.8/site-packages/mol2vec/features.py
wget https://github.com/samoturk/mol2vec_notebooks/raw/master/Notebooks/model_300dim.pkl
pip install -U tensorflow==2.4.1
pip install openvino-tensorflow==0.5.0
conda install scikit-learn
pip install py3Dmol
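Once the install commands have run, one way to confirm the environment is usable is to check that each package can be located from Python. This is a minimal sketch and not part of the original notebook. The import names in `PACKAGES` are assumptions about how the installed packages are exposed (for example, scikit-learn imports as `sklearn`), and `check_packages` is a hypothetical helper.

```python
import importlib.util

# Assumed top-level import names for the packages installed above.
PACKAGES = [
    "rdkit",
    "mol2vec",
    "tensorflow",
    "openvino_tensorflow",
    "sklearn",
    "py3Dmol",
]


def check_packages(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


report = check_packages(PACKAGES)
for name, found in report.items():
    print(f"{name:22s} {'ok' if found else 'MISSING'}")
```

Run it inside the activated chem environment. Any line marked MISSING points at an install step that needs to be repeated.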
Run the Jupyter notebook chemistry_predict_logP_tensorflow.ipynb.
The lipophilicity dataset is available on Kaggle and has been released into the public domain (CC0). The raw data is a CSV file with each chemical's SMILES notation in the first column and its lipophilicity (logP) in the second column.
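The two-column layout described above can be read with nothing more than the standard library. The following is a minimal sketch under stated assumptions: the file has a single header row, the sample rows and header names are illustrative placeholders rather than entries from the Kaggle file, and `read_lipophilicity` is a hypothetical helper name.

```python
import csv
import io

# Illustrative placeholder rows in the described layout (SMILES, logP).
# These are NOT taken from the Kaggle dataset.
SAMPLE_CSV = """smiles,logP
CCO,-0.31
c1ccccc1,2.13
"""


def read_lipophilicity(handle, has_header=True):
    """Return a list of (smiles, logp) tuples from an open CSV handle."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # assumption: the first row holds column names
    records = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        records.append((row[0].strip(), float(row[1])))
    return records


records = read_lipophilicity(io.StringIO(SAMPLE_CSV))
print(records)
```

For the real file, open it with `open(path, newline="")` and pass the handle to the same function. Each SMILES string can then be handed to RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse, so invalid rows can be filtered out before building features.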