Molecular encoder/decoder (VAE) with python3.x (not new topic) #rdkit #chemoinformatics

The first day of the 10-day holiday is rainy, and my kid and I are going to a dodgeball tournament.

Two years ago, I tried to modify keras-molecules, which is an implementation of a molecular encoder/decoder. The code was written for Python 2.x, so I wanted to run it on Python 3.6. I stopped the modification because I mostly use PyTorch and found molencoder, a PyTorch implementation. https://github.com/cxhernandez/molencoder

Today I tried to modify keras-molecules again.
I think it is almost done, and the code now runs on Python 3.x with Keras 2.x (TensorFlow backend).

https://github.com/iwatobipen/keras-molecules

I checked the code on Google Colab, because Colab provides a GPU environment, which is useful for deep learning. At first, I installed conda and the several packages that keras-molecules uses.

The following code was written on Google Colab.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
# install packages via conda command
!time conda install -y -c conda-forge rdkit
!conda install -y -c conda-forge h5py
!conda install -y -c conda-forge scikit-learn
!conda install -y -c conda-forge pytables
!conda install -y -c conda-forge progressbar
!conda install -y -c conda-forge pandas
!conda install -y -c conda-forge keras

In my case, the rdkit installation took a very long time, 20 minutes or more. I waited patiently.

After the installation, I checked rdkit. An old version was installed, but it does not matter.

from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
> 2018.09.1

Then I cloned the code from the GitHub repo, changed the current directory, and downloaded sample SMILES. keras-molecules uses git-lfs for data storage, so the files fetched by a plain git clone are not real .h5 files, only LFS pointers. I skipped installing git-lfs on Colab.

!git clone https://github.com/iwatobipen/keras-molecules
os.chdir('keras-molecules/')
# the original data is stored with git-lfs, so it can't be fetched directly with git clone.
# Instead I uploaded a small sample data file to my repo.
!wget https://github.com/iwatobipen/playground/raw/master/smiles_50k.h5
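
Since git-lfs was skipped, it doesn't hurt to make sure the downloaded file is a real HDF5 file and not a tiny LFS pointer text file. This is just a rough sanity check of mine, not part of the original repo.

import os
# an LFS pointer is only a few hundred bytes; the real sample file should be several MB
print(os.path.getsize('smiles_50k.h5'))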

Now everything is ready. Preprocessing is needed first, and it is easy to do.
After it finishes, processed.h5 will be stored in data/.

!python preprocess.py smiles_50k.h5 data/processed.h5
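
If you want to see what preprocess.py produced, the file can be inspected with h5py. This is a small sketch of mine; I am assuming the file holds the data_train, data_test and charset datasets that train.load_dataset reads in the next step.

import h5py

# list the datasets and their shapes in the preprocessed file
with h5py.File('data/processed.h5', 'r') as h5f:
    for key in h5f.keys():
        print(key, h5f[key].shape)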

Then load the data and train. I did it in a notebook, so the following code differs from the original README.

import train

# load the preprocessed one-hot encoded SMILES and the character set
data_train, data_test, charset = train.load_dataset('data/processed.h5')

# build the VAE with a 292-dimensional latent space
model = train.MoleculeVAE()
model.create(charset, latent_rep_size=292)

# keep the best model and reduce the learning rate when validation loss plateaus
checkpointer = train.ModelCheckpoint(filepath='model.h5',
                                     verbose=1,
                                     save_best_only=True)

reduce_lr = train.ReduceLROnPlateau(monitor='val_loss',
                                    factor=0.2,
                                    patience=3,
                                    min_lr=0.0001)

# the autoencoder is trained to reconstruct its own input
model.autoencoder.fit(data_train,
                      data_train,
                      shuffle=True,
                      epochs=20,
                      batch_size=600,
                      callbacks=[checkpointer, reduce_lr],
                      validation_data=(data_test, data_test))

Training took 20-30 minutes on Google Colab with a GPU. I don't recommend running the code on a PC without a GPU; it would take too long.
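
By the way, on Colab it is easy to check whether a GPU is actually attached to the runtime. The snippet below is just a convenience check with TensorFlow, not part of keras-molecules.

import tensorflow as tf

# prints something like '/device:GPU:0' when a GPU is available, '' otherwise
print(tf.test.gpu_device_name())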

After training, I ran interpolate. Interpolation samples molecules between SOURCE and DEST in the latent space. Ideally the approach can change a molecule gradually, because the latent space is continuous. It sounds interesting.
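
Conceptually, the interpolation walks along a straight line between two latent vectors and decodes each point back to a character sequence. The sketch below is my rough illustration of that idea, not the repository's interpolate.py; source_z and dest_z are assumed to be latent vectors obtained from the encoder, and the argmax-per-position decoding is a simplification.

import numpy as np

def decode_to_string(z, model, charset):
    # decode one latent vector; the output has shape (1, WIDTH, len(charset))
    x = model.decoder.predict(z.reshape(1, -1))
    # pick the most likely character at each of the WIDTH positions
    indices = x[0].argmax(axis=-1)
    return ''.join(charset[i] for i in indices).strip()

def interpolate_latent(source_z, dest_z, steps, model, charset):
    strings = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * source_z + t * dest_z  # linear interpolation in latent space
        strings.append(decode_to_string(z, model, charset))
    return strings

The actual run below uses the repository's interpolate module.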

import interpolate
h5f = interpolate.h5py.File('data/processed.h5' ,'r')
charset = list(h5f['charset'][:])
charset = [ x.decode('utf-8') for x in charset ]
h5f.close()

LATENT_DIM = 292
WIDTH = 120
model.load(charset, 'model.h5', latent_rep_size = LATENT_DIM)

I set two molecules as SOURCE and DEST.

SOURCE = 'CC2CCN(C(=O)CC#N)CC2N(C)c3ncnc1[nH]ccc13'
DEST = 'CCS(=O)(=O)N1CC(C1)(CC#N)N2C=C(C=N2)C3=C4C=CNC4=NC=N3'
STEPS = 10
results = interpolate.interpolate(SOURCE, DEST, STEPS, charset, model, LATENT_DIM, WIDTH)
for result in results:
  print(result[2])

The result was….

Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCccccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCCCcccccccccccccc
Cr1cccccCCCCCCCCCCCCCCCCCCCCCccccccccccccc
Cr1ccccCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccc
Cr1CcccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCccccccc
Cr1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccc
CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Oops! None of the generated strings are valid molecules; they are just long runs of C/c characters.
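
A quick check with RDKit confirms it; this is just a small validity count I added, not part of interpolate.py.

from rdkit import Chem

# count how many of the interpolated strings parse as valid SMILES
generated = [result[2] for result in results]
valid = [smi for smi in generated if Chem.MolFromSmiles(smi) is not None]
print(len(valid), '/', len(generated), 'valid molecules')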

The model might improve with a larger number of epochs and more training data.

It is difficult for a naive VAE approach to generate valid molecules. There are now many reported approaches to molecule generation: JT-VAE, Grammar VAE, etc. What kind of generator do you like?

Today’s code is redundant, but I uploaded it to my GitHub repo.

https://github.com/iwatobipen/playground/blob/master/mol_vae.ipynb
