Run rdkit and deep learning on Google Colab! #RDKit

If you can't use a GPU on your PC, it is worth knowing that you can use a GPU and/or TPU on Google Colab. And you can use Google Colab for free.
So I would like to install RDKit on Google Colab and run deep learning there.
Today I tried it.
First I installed RDKit on the instance. It is not so difficult.
The default version of Python is currently 3.7. Fortunately, conda-forge provides an rdkit recipe built for py37!

The code is below.

!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

Then append the RDKit install path to the current Python sys.path.

%matplotlib inline
import matplotlib.pyplot as plt
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/')
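Note that the hard-coded python3.7 path will break whenever Colab updates its default Python version. A version-agnostic sketch (assuming the same Miniconda prefix /usr/local used above; find_site_packages is a hypothetical helper) could glob for the directory instead:

```python
import glob
import sys

def find_site_packages(prefix='/usr/local'):
    # Find the conda site-packages dir for whatever Python 3.x is installed.
    hits = glob.glob(f'{prefix}/lib/python3.*/site-packages')
    return hits[0] if hits else None

sp = find_site_packages()
if sp is not None and sp not in sys.path:
    sys.path.append(sp)
```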

Now I can import RDKit! Next, import the required modules and load the dataset.

import numpy as np
import pandas as pd

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import RDConfig
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole

trainsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf'))
testsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf'))
train_mols = [mol for mol in trainsdf if mol is not None]
test_mols = [mol for mol in testsdf if mol is not None]
sol_class = {"(A) low":0, "(B) medium":1, "(C) high": 2}

Next, define a mol-to-fingerprint function and build the dataset for deep learning.

# Convert a molecule to a 2048-bit Morgan fingerprint (radius 2)
# and return it as a NumPy array.
def mol2arr(mol):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
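The conversion above can be illustrated without RDKit: a Morgan fingerprint is a fixed-length bit vector, and ConvertToNumpyArray writes it into a dense float array. bits_to_arr below is a hypothetical stand-in that sets 1.0 at each on-bit position:

```python
import numpy as np

def bits_to_arr(on_bits, n_bits=2048):
    # Dense float vector with 1.0 at each set-bit position,
    # mimicking what ConvertToNumpyArray produces from a bit vector.
    arr = np.zeros((n_bits,))
    arr[list(on_bits)] = 1.0
    return arr

arr = bits_to_arr({1, 80, 2047})
print(arr.shape, int(arr.sum()))  # (2048,) 3
```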

In this blog post, I use Keras for convenience.

import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Activation, Dense, Dropout, Input
from tensorflow.keras.utils import to_categorical


trainX = np.array([mol2arr(mol) for mol in train_mols])
trainY = [sol_class[mol.GetProp("SOL_classification")] for mol in train_mols]
trainY = to_categorical(trainY)

testX = np.array([mol2arr(mol) for mol in test_mols])
testY = [sol_class[mol.GetProp("SOL_classification")] for mol in test_mols]
testY = to_categorical(testY)
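to_categorical one-hot encodes the integer class labels. Its effect can be sketched in plain NumPy (to_one_hot is a hypothetical stand-in):

```python
import numpy as np

def to_one_hot(labels, num_classes=3):
    # Integer labels -> one-hot rows, like keras to_categorical.
    return np.eye(num_classes)[np.asarray(labels)]

oh = to_one_hot([0, 2, 1])
print(oh)
```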

Define the model. I would also like to inspect an intermediate layer, so I define model_int as well.

inputs = Input(shape=(2048,), name='input')
x = Dense(100, activation='relu', name='Layer1')(inputs)
x = Dense(20, activation='relu', name='Layer2')(x)
x = Dense(2, activation='relu', name='Layer3')(x)
predictions = Dense(3, activation='softmax', name='output')(x)

model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=tf.keras.optimizers.RMSprop(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()


model_int = Model(inputs=inputs, outputs=model.get_layer(name='Layer3').output)
model_int.summary()
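model_int reuses the same layers (and, after training, the same weights) up to Layer3, so its output is the 2-unit ReLU activation for each input. The forward pass it computes can be sketched in NumPy with random stand-in weights (W1, W2, W3 are hypothetical, not the trained values; biases omitted for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the 2048 -> 100 -> 20 -> 2 weight stack.
W1 = rng.standard_normal((2048, 100)) * 0.01
W2 = rng.standard_normal((100, 20)) * 0.01
W3 = rng.standard_normal((20, 2)) * 0.01

x = rng.standard_normal((5, 2048))           # 5 fake fingerprints
layer3 = relu(relu(relu(x @ W1) @ W2) @ W3)  # what model_int.predict returns
print(layer3.shape)  # (5, 2)
```

Because Layer3 has only two units, these activations can be plotted directly as 2D coordinates, which is exactly what the scatter plot at the end does.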

OK, let's train the model!

epochs = 50
hist = model.fit(np.array(trainX), trainY, batch_size=32, epochs=epochs)
plt.plot(range(epochs), hist.history['accuracy'])
plt.plot(range(epochs), hist.history['loss'])

50 epochs is longer than necessary for this training run.
Blue: training accuracy; green: training loss.

Next, check the model's performance.

from sklearn.metrics import classification_report
predY = model.predict(testX)
predY = [np.argmax(y) for y in predY]
Y = [np.argmax(y) for y in testY]
print(classification_report(Y, predY))
"""
             precision    recall  f1-score   support

          0       0.67      0.80      0.73       102
          1       0.75      0.61      0.67       115
          2       0.73      0.75      0.74        40

avg / total       0.72      0.71      0.71       257
"""

Hmm, it is not so bad…

And I would like to visualize the relationship between the last layer and the labels.

intdata = model_int.predict(testX)
colormap = {0:"red", 1:"blue", 2:"green"}
c = [colormap[i] for i in Y]
plt.scatter(intdata[:,0], intdata[:,1], c=c)

The image is below.

It is not so clear. Am I doing something wrong? There is room for improvement.
But it is worth knowing that Google Colab is useful for testing your machine learning code.

The whole code can be checked at the following URL.
https://colab.research.google.com/drive/1_m15X_1UoEzPRkFi9vditQ9xM9hemzhv


Published by iwatobipen
