Compare the view point of different QSAR models #RDKit #visualize #chemoinformatics

Some days ago, I posted how to visualize SVG images horizontally and ngboost for QSAR problem. It worked well. And I found that different models showed different performance. So my question is that which point each model detects important for molecular properties.

Fortunately rdkit has GetSimilarityMapForModel method which can render the probe molecule with model’s feature importance.

Today I tried to combine previous method and this useful rdkit method to compare the models feature. At first import required packages.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
import numpy as np
import io
from PIL import Image
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from ngboost import NGBClassifier
from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.distns import Normal, LogNormal
from ngboost.distns import k_categorical
from sklearn.metrics import classification_report
from functools import partial
from rdkit.Chem import Draw
from IPython.display import SVG

Then read train and test data, calculate fingerprint and get target values.

train = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
train_mols = [m for m in Chem.SDMolSupplier(train)]
test = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
test_mols = [m for m in Chem.SDMolSupplier(test)]

val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

def mol2arr(mol, radi=2, nBits=1024):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radi, nBits)
    DataStructs.ConvertToNumpyArray(fp, arr)

    return arr

trainX = np.array([mol2arr(m) for m in train_mols])
trainY = np.array([val_dict[m.GetProp('SOL_classification')] for m in train_mols])

testX = np.array([mol2arr(m) for m in test_mols])
testY = np.array([val_dict[m.GetProp('SOL_classification')] for m in test_mols])

Next, define classifier models. And train them.

adbcls = AdaBoostClassifier()
svc = SVC(probability=True)
ngbc = NGBClassifier(Dist=k_categorical(3))
clsset = [adbcls, svc, ngbc]
for clsfier in clsset:, trainY)

Then I made subgroup which is grouped by solubility class.

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']

Then I defined drawmol function, the function makes SVG text of similaritymap for model. I passed MolDraw2DSVG to draw2d options, it means that GetSimilarityMapForModel draw the output to SVG drawer. And defined HorizontalDisplay class as same as previous post.

def drawmol(mols, idx, predictfunc):
    d = Draw.MolDraw2DSVG(200,250)
    fig, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, predictfunc.predict_proba),
                                                           draw2d = d,
                                                           colorMap='coolwarm', #this option wasn't applied why?? 
    txt = d.GetDrawingText()
    return txt

class HorizontalDisplay:
    def __init__(self, svgtxtlist):
        self.svgtxtlist = svgtxtlist

    def _repr_html_(self):
        template = '<div style="float:left;padding:10px;">{0}</div>'
        return "\n".join(template.format(svgtxt)
                         for svgtxt in self.svgtxtlist)

Now ready. Let’s display molecules with these methods.

Following example are selected from high/low test molecules.

#Adaboost, SVC, NGBOOST
svgset = []
for cls in clsset:
svgset.append(drawmol(high_test_mols, 3, cls))

Adaboost classifier strongly responded tert-methyl group compared to other models. Ngboost classifier says methyl groups are not good for solubility.

Next example, Adaboost couldn’t phenol OH groups are positive for solubility but SVC and NGBoost said OH is important for solubility. But these two modes showed different response to aromatic carbon.

By using the method, user can reveal that where is important point for the model. So it will be good material for discussion with chemists I think.

But the method is open to some debate. Coloring, normalize atom features, etc…

QSAR model should not be used as just a black box which return positive or negative, it should be used for discussion and good decision making.

Today’s code can be found following URL. Thanks for reading.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
view raw ngboost2.ipynb hosted with ❤ by GitHub

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: