Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
def show_png(data):
    bio = io.BytesIO(data)
    img =
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794), trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

Any comments suggestions and requests will be greatly appreciated. ;)