Read maestro format file from RDKit

RDKitter knows that Schrodinger contributes RDKit I think.
https://www.schrodinger.com/news/schr%C3%B6dinger-contributes-rdkit

Schrodinger provides many computational tools for drug discovery, that is not only GUI tool but also python API. Many tool can call from python and also RDKit. And RDKit can read maestro file vise versa.
It is easy to do it like reading SDFiles.
I am writing the blog post on my personal PC and I do not have schrodinger software’s license. So I got test files from schrodingers github repository.
https://github.com/schrodinger

On ipython notebook

!wget https://raw.githubusercontent.com/schrodinger/maeparser/master/test/test.mae
!wget https://github.com/schrodinger/maeparser/raw/master/test/test2.maegz
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import rdmolfiles
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import gzip
rdDepictor.SetPreferCoordGen(True)
rdBase.rdkitVersion
>'2018.09.1'

Read mols from mae format.

maemols = rdmolfiles.MaeMolSupplier("test.mae")
mols = [m for m in maemols]
Draw.MolsToGridImage(mols)

Read mols form maegz

maemols2 = rdmolfiles.MaeMolSupplier(gzip.open("test2.maegz"))
mols2 = [m for m in maemols2]
Draw.MolsToGridImage(mols2)

Both cases work fine.
If user want to integrate rdkit and schrodinger tools, the method will be useful.
Note book uploaded my github repo.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/maeparser.ipynb

Advertisements

Run rdkit and deep learning on Google Colab! #RDKit

If you can not use GPU on your PC, it is worth to know that you can use GPU and/or TPU on google colab. Now you can use google colab no fee.
So, I would like to use rdkit on google colab and run deep learning on the app.
Today I tried it.
At first I installed RDKit on the instance. It is not so difficult.
Now default version of python is 3.7. Fortunately conda-forge provides us rdkit-py37 recipe!

The code is below.

!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

Then append rdkit path to current python system path.

%matplotlib inline
import matplotlib.pyplot as plt
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/')

Now I can import rdkit! Next import rdkit and load dataset.

import numpy as np
import pandas as pd

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import RDConfig
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole

trainsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf'))
testsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf'))
train_mols = [mol for mol in trainsdf if mol != None]
test_mols = [mol for mol in testsdf if mol != None]
sol_class = {"(A) low":0, "(B) medium":1, "(C) high": 2}

Next, define mol to fingerprint function and make dataset for deep learning.

# 2048 bit vector
def mol2arr(mol):
  arr = np.zeros((1,))
  fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
  DataStructs.ConvertToNumpyArray(fp, arr)
  return arr

In this blog post, I use keras for convenience.

import tensorflow as tf
import tensorflow.keras as keras
from keras import Model
from keras.layers import Activation, Dense, Dropout, Input
from keras.utils import np_utils


trainX = np.array([mol2arr(mol) for mol in train_mols])
trainY = [sol_class[mol.GetProp("SOL_classification")] for mol in train_mols]
trainY = np_utils.to_categorical(trainY)

testX = np.array([mol2arr(mol) for mol in test_mols])
testY = [sol_class[mol.GetProp("SOL_classification")] for mol in test_mols]
testY = np_utils.to_categorical(testY)

Define model, and I would like to check intermediate layer, so I also defined model_int.

inputs = Input(shape=(2048,), name='input')
x = Dense(100, activation='relu', name='Layer1')(inputs)
x = Dense(20, activation='relu', name='Layer2')(x)
x = Dense(2, activation='relu', name='Layer3')(x)
predictions = Dense(3, activation='softmax', name='output')(x)

model = Model(inputs=inputs, outputs = predictions)
model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])
model.summary()


model_int = Model(inputs=inputs, outputs=model.get_layer(name='Layer3').output)
model_int.summary()

OK, let’s train model!

epochs = 50
hist = model.fit(np.array(trainX), trainY, batch_size=32, epochs=epochs)
plt.plot(range(epochs), hist.history['acc'])
plt.plot(range(epochs), hist.history['loss'])

50 epochs is too long for the training.
Blue: training_acc, green: training_loss

Next check the model performance.

from sklearn.metrics import classification_report
predY = model.predict(testX)
predY = [np.argmax(y) for y in predY]
Y = [np.argmax(y) for y in testY]
print(classification_report(Y, predY))
"""
             precision    recall  f1-score   support

          0       0.67      0.80      0.73       102
          1       0.75      0.61      0.67       115
          2       0.73      0.75      0.74        40

avg / total       0.72      0.71      0.71       257
"""

Hmm it is not so bad…

And I would like to visualize relation ship between last layer and label.

intdata = model_int.predict(testX)
colormap = {0:"red", 1:"blue", 2:"green"}
c = [colormap[i] for i in Y]
plt.scatter(intdata[:,0], intdata[:,1], c=c)

The image is below.

It is not so clear. Am I something wrong? There is a space for improve.
But it is worth to know that google colab is useful for test your code of machine learning.

Whole code can be checked from the following URL.
https://colab.research.google.com/drive/1_m15X_1UoEzPRkFi9vditQ9xM9hemzhv