Visualize important features of machine learning #RDKit

As you may know, RDKit 2018.09.1 has very exciting methods named 'DrawMorganBit' and 'DrawMorganBits'. They can render the bit information of a fingerprint.
They are described in the following blog post.
And if you can read Japanese, excellent posts are also available.

What I want to do in this blog post is visualize the important features of a random forest classifier.
With a fingerprint-based predictor, it is sometimes difficult to understand the feature importance of each bit. I think that with the new methods I can visualize the important features of active molecules.

Let's try it.
First, import several packages.

import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import DataStructs
from rdkit import RDConfig
from rdkit import rdBase

Then load the dataset and define the mol2fp function.

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
solclass = {'(A) low':0, '(B) medium': 1, '(C) high': 2}
n_splits = 10
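def mol2fp(mol, nBits=1024):
    # bitInfo is filled in by RDKit: bit index -> ((atom index, radius), ...)
    bitInfo = {}
    # pass nBits explicitly so the fingerprint length matches the 1024 features used later
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=nBits, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo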
trainmols = [m for m in Chem.SDMolSupplier(trainpath)]
testmols = [m for m in Chem.SDMolSupplier(testpath)]
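To make the second return value of mol2fp easier to picture: bitInfo is a dictionary keyed by bit index, and each value is a tuple of (atom index, radius) pairs describing the atom environments that set that bit. The following pure-Python sketch uses a hand-written dictionary with made-up values (it is not output from RDKit) just to show how the structure can be read. The names example_bitinfo and describe_bits are hypothetical.

```python
# Hand-written stand-in for the bitInfo dictionary that RDKit fills in.
# The numbers are made up for illustration only.
example_bitinfo = {
    33: ((0, 0),),          # bit 33 is set by atom 0 at radius 0
    356: ((1, 1), (4, 1)),  # bit 356 is set by two radius-1 environments
    807: ((2, 2),),         # bit 807 is set by atom 2 at radius 2
}


def describe_bits(bitinfo):
    """Return one readable line per (bit, environment) pair."""
    lines = []
    for bit in sorted(bitinfo):
        for atom_idx, radius in bitinfo[bit]:
            lines.append(f"bit {bit}: center atom {atom_idx}, radius {radius}")
    return lines


for line in describe_bits(example_bitinfo):
    print(line)
```

One bit can list several environments, either because the same substructure occurs more than once in the molecule or because different substructures are folded onto the same bit.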

Then get X and Y.
In this post I used the solubility classification data.

trainX = np.array([mol2fp(m)[0] for m in trainmols])
trainY = np.array([solclass[m.GetProp('SOL_classification')] for m in trainmols])

testX = np.array([mol2fp(m)[0] for m in testmols])
testY = np.array([solclass[m.GetProp('SOL_classification')] for m in testmols])

Then train the RandomForestClassifier.
I used StratifiedKFold instead of KFold.
The difference between the two methods is well described at the following URL.

rfclf = RandomForestClassifier(n_estimators=50)
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(trainX, trainY):
    trainx = trainX[train]
    trainy = trainY[train]
    testx = trainX[test]
    testy = trainY[test]
    rfclf.fit(trainx, trainy)
    predy = rfclf.predict(testx)
    print(accuracy_score(testy, predy))
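To illustrate why stratification matters for class labels like these, here is a simplified pure-Python stand-in for the idea behind StratifiedKFold. It is only a sketch and not scikit-learn's implementation: the hypothetical helper stratified_folds deals out the samples of each class round-robin so that every fold keeps roughly the same class ratio. The toy labels are made up.

```python
from collections import Counter


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds class by class, round-robin.

    Simplified stand-in for the idea behind StratifiedKFold (no shuffling).
    """
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[pos % n_splits].append(idx)
    return folds


# Toy, imbalanced labels: six of class 0, three of class 1, three of class 2
labels = [0] * 6 + [1] * 3 + [2] * 3
for fold in stratified_folds(labels, n_splits=3):
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Every fold here contains two samples of class 0 and one each of classes 1 and 2. A plain unshuffled split into three contiguous blocks of these sorted labels would instead put only class 0 into the first block.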

Then get the important features. It is a little bit tricky. I also picked out the indices of the high-solubility molecules.

fimportance = rfclf.feature_importances_
fimportance_dict = dict(zip(range(1024), fimportance))
sorteddata = sorted(fimportance_dict.items(), key=lambda x:-x[1])
top50feat = [x[0] for x in sorteddata][:50]
solblemol_idx = []
for i, v in enumerate(testY):
    if v == 2:
        solblemol_idx.append(i)

Now we are ready.

testidx = solblemol_idx[-1] 


Get the fingerprint and bit information, then extract the important features from the on bits.

fp, bitinfo = mol2fp(testmols[testidx])
onbit = [bit for bit in bitinfo.keys()]
importantonbits = list(set(onbit) & set(top50feat))
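The set intersection above discards the importance ranking. As a small variation (not what the code above does), the on bits can be filtered in ranked order so that the most important bit comes first. The sketch below is pure Python with made-up importances and a made-up bit dictionary. The helper names rank_features and important_on_bits are hypothetical.

```python
def rank_features(importances, k):
    """Return the indices of the k largest importances, highest first."""
    order = sorted(range(len(importances)), key=lambda i: -importances[i])
    return order[:k]


def important_on_bits(top_features, bitinfo):
    """Keep the top features that are set in the molecule, in ranked order."""
    return [bit for bit in top_features if bit in bitinfo]


# Made-up importances for an 8-bit fingerprint and a made-up bitInfo dict
toy_importance = [0.05, 0.30, 0.00, 0.20, 0.10, 0.25, 0.02, 0.08]
toy_bitinfo = {3: ((0, 1),), 5: ((2, 0),), 6: ((1, 2),)}

top3 = rank_features(toy_importance, k=3)
print(top3)                                  # [1, 5, 3]
print(important_on_bits(top3, toy_bitinfo))  # [5, 3]
```

With the ranking preserved, the legends passed to the drawing function can follow the same order, which makes the most influential substructures appear first in the grid.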

Finally, visualize the important bit information together with the structures.

tpls = [(testmols[testidx], x, bitinfo) for x in importantonbits]
Draw.DrawMorganBits(tpls, legends = [str(x) for x in importantonbits])

For this molecule, the classifier picks up substructures containing an oxygen atom and sp3 carbons.
The example molecule has amino functional groups, but the predictor does not treat nitrogen as an important feature.
I think this is because the model is not well optimized.
Still, it is useful to be able to visualize the features of a molecule.
All code is uploaded to Google Colab.
Readers who are interested in the code can run it on Colab!