The first day of the 10-day holiday is rainy, and my kid and I will go to a dodgeball tournament.
Two years ago, I tried to modify keras-molecules, which is code for a molecular encoder-decoder. The code is written for Python 2.x, so I wanted to run it on Python 3.6. I stopped the modification because I mainly use PyTorch and found molencoder, a PyTorch implementation. https://github.com/cxhernandez/molencoder
Today I tried to modify keras-molecules again.
I think it is almost done, and the code will run on Python 3.x with Keras 2.x and the TensorFlow backend.
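The port itself mostly involves the usual Python 2 to 3 fixes. An illustrative sketch of the typical changes (not the exact diffs in keras-molecules):

```python
# Common Python 2 -> 3 changes in a port like this (illustrative only):

# Python 2: print "hello"      -> print() is a function in Python 3
print("hello")

# Python 2: xrange(n)          -> removed; range() is already lazy
total = sum(range(5))
print(total)  # 10

# Python 2: d.keys() is a list -> now a view; wrap with list() when indexing
d = {'a': 1, 'b': 2}
keys = list(d.keys())
print(keys[0])
```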
I checked the code on Google Colab, because Colab provides a GPU environment, which is useful for deep learning. First, I installed conda and the packages that keras-molecules uses.
The following code was written on Google Colab.
```python
# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')

# install packages via conda command
!time conda install -y -c conda-forge rdkit
!conda install -y -c conda-forge h5py
!conda install -y -c conda-forge scikit-learn
!conda install -y -c conda-forge pytables
!conda install -y -c conda-forge progressbar
!conda install -y -c conda-forge pandas
!conda install -y -c conda-forge keras
```
In my case, the rdkit installation took a very long time, 20 minutes or more…. I waited patiently.
After installation, I checked rdkit. An old version was installed, but it does not matter.
```python
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
> 2018.09.1
```
Then I cloned the code from the GitHub repo, changed the current directory, and downloaded sample SMILES. keras-molecules uses git-lfs for data storage, so the files stored in the repo are not real .h5 files. I skipped installing git-lfs on Colab.
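Without git-lfs installed, cloning the repo fetches small text pointer files in place of the real data. A git-lfs pointer file looks roughly like this (values are illustrative, not the actual ones in the repo):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab... (illustrative hash)
size 33554432
```

This is why downloading a real .h5 file separately is necessary here.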
```python
!git clone https://github.com/iwatobipen/keras-molecules
os.chdir('keras-molecules/')
# original data is stored with git-lfs, so it can't be fetched directly with git.
# So I uploaded small sample data to my repo.
!wget https://github.com/iwatobipen/playground/raw/master/smiles_50k.h5
```
Now we are ready. Preprocessing is needed first, and it is easy to do.
After it finishes, a processed.h5 file will be stored in data/.
```python
!python preprocess.py smiles_50k.h5 data/processed.h5
```
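Roughly speaking, this preprocessing step turns each SMILES string into a fixed-width one-hot matrix over the character set. A minimal sketch of the idea (the real script also splits train/test and writes HDF5; the tiny charset below is hypothetical):

```python
import numpy as np

def one_hot_smiles(smiles, charset, width=120):
    """Pad a SMILES string to `width` and one-hot encode it over `charset`.
    Illustrative sketch of the kind of encoding preprocess.py produces."""
    index = {c: i for i, c in enumerate(charset)}
    padded = smiles.ljust(width)          # pad with spaces up to the fixed width
    mat = np.zeros((width, len(charset)), dtype=np.float32)
    for pos, char in enumerate(padded):
        mat[pos, index[char]] = 1.0       # one 1 per position
    return mat

charset = [' ', '#', '(', ')', '=', 'C', 'N', 'O', 'c', 'n', '1', '2']
x = one_hot_smiles('CC(=O)Nc1ccccc1', charset)
print(x.shape)  # (120, 12)
```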
Then load the data and train the model. I did it in a notebook, so the following code differs from the original README.
```python
import train

data_train, data_test, charset = train.load_dataset('data/processed.h5')
model = train.MoleculeVAE()
model.create(charset, latent_rep_size=292)
checkpointer = train.ModelCheckpoint(filepath='model.h5',
                                     verbose=1,
                                     save_best_only=True)
reduce_lr = train.ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.2,
                                    patience=3,
                                    min_lr=0.0001)
model.autoencoder.fit(
    data_train,
    data_train,
    shuffle=True,
    epochs=20,
    batch_size=600,
    callbacks=[checkpointer, reduce_lr],
    validation_data=(data_test, data_test)
)
```
Training took 20-30 minutes on Google Colab with a GPU. I don't recommend running the code on a PC without a GPU; training would take far too long.
After training, I ran interpolate. It samples molecules between SOURCE and DEST in the latent space. Ideally, this approach can change a molecule gradually, because the latent space is continuous. It sounds interesting.
```python
import interpolate

h5f = interpolate.h5py.File('data/processed.h5', 'r')
charset = list(h5f['charset'][:])
charset = [x.decode('utf-8') for x in charset]
h5f.close()
LATENT_DIM = 292
WIDTH = 120
model.load(charset, 'model.h5', latent_rep_size=LATENT_DIM)
```
I set two molecules as SOURCE and DEST.
```python
SOURCE = 'CC2CCN(C(=O)CC#N)CC2N(C)c3ncnc1[nH]ccc13'
DEST = 'CCS(=O)(=O)N1CC(C1)(CC#N)N2C=C(C=N2)C3=C4C=CNC4=NC=N3'
STEPS = 10
results = interpolate.interpolate(SOURCE, DEST, STEPS, charset, model, LATENT_DIM, WIDTH)
for result in results:
    print(result)
```
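Under the hood, interpolation like this amounts to encoding both endpoints and linearly mixing their latent vectors, then decoding each mixture. A minimal NumPy sketch of just the mixing step (toy 4-dimensional vectors stand in for the real 292-dimensional latents):

```python
import numpy as np

def lerp_latent(z_src, z_dst, steps):
    """Linear interpolation between two latent vectors.
    Returns `steps` points, including both endpoints."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * z_src + t * z_dst for t in ts]

# toy latent vectors for illustration
z_a = np.array([0.0, 0.0, 0.0, 0.0])
z_b = np.array([1.0, 2.0, 3.0, 4.0])
points = lerp_latent(z_a, z_b, 5)
print(points[2])  # the midpoint of the two vectors
```

In the real pipeline each of these points would be passed to the decoder to produce a SMILES string.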
The result was….
```
Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCCCcccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCCCCCccccccccccccc
Cr1ccccCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccc
Cr1CcccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCccccccc
Cr1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccc
CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
```
Oops! None of the outputs are valid molecules. They are almost all just C/c……
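RDKit makes it easy to quantify this: MolFromSmiles returns None for invalid strings, so the valid fraction of the generated SMILES can be computed directly. A small sketch (using a couple of example strings rather than the actual `results` list):

```python
from rdkit import Chem

def valid_fraction(smiles_list):
    """Fraction of strings that RDKit can parse into molecules.
    Chem.MolFromSmiles returns None for invalid SMILES."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return sum(m is not None for m in mols) / len(smiles_list)

# one string from the interpolation output above vs. a real molecule (ethanol)
samples = ['Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc', 'CCO']
print(valid_fraction(samples))  # 0.5
```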
The model would likely improve with more training epochs and more training data.
A naive VAE approach has difficulty generating valid molecules. Many approaches to molecule generation have now been reported: JT-VAE, Grammar-VAE, etc., etc…. What kind of generator do you like?
Today's code is redundant, but I uploaded it to my GitHub repo.