SAR visualization with RDKit.

I’m interested in machine learning(ML) for QSAR.
RandomForest, Neural Network, SVM etc. are often useful tool for drug design. But it difficult to interpretation of results for chemist.
So, visualization / interpretation of SAR is critical for communication.
You know, Prof. J. Bajorath reported very nice solution of the problem.
http://www.ncbi.nlm.nih.gov/pubmed/25988274
The article was impressive for me and I want to do in RDKit.
Fortunately, RDKit provides cool function named ‘GetSimilarityMapsForModel’. ;-)
The function can produce figure based on prediction models.
Hmm, sounds nice. Let’s write code!
Sample snippet is following.

At first prepare dataset for play.

#
import numpy as np
from matplotlib import cm
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.Chem.Draw import SimilarityMaps
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

dataset = open( "data_cdk.txt", "r" )
# remove header
dataset = [ line.rstrip().split("\t") for line in dataset ][1:]
mols = [ Chem.MolFromSmiles( line[1] ) for line in dataset ]
Y = []
# label active / non active
for line in dataset:
    if float( line[2] ) > 7.0:
        Y.append( 1 )
    else:
        Y.append( -1 )
Y = np.asarray( Y )

trainmols, testmols, trainy, testy = train_test_split( mols, Y, test_size = 0.1, random_state=90  )
# calc fingerprint
trainfps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in trainmols ]
testfps =  [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in testmols ]

trainx = []
for fp in trainfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    trainx.append( arr )

testx = []
for fp in testfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    testx.append( arr )

Next, build SVC model.

cls = SVC( probability=True, C=100 )
cls.fit( trainx, trainy )

Finally, make helper function and do prediction.
I like bwr colormap in matplotlib.

def getProba( fp, probabilityFunc ):
    return probabilityFunc( fp )[0][1]

def mapperfunc( mol ):
    fig, weight = SimilarityMaps.GetSimilarityMapForModel( mol, SimilarityMaps.GetMorganFingerprint, lambda x: getProba( x, cls.predict_proba), colorMap=cm.bwr  )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp, arr )
    res = cls.predict( arr )
    smi = Chem.MolToSmiles( mol )
    print(res[0])
    if res[0] == 1:
        fig.savefig( "res/act_"+smi+"_.png", bbox_inches = "tight" )
    else:
        fig.savefig("res/nonact_"+smi+"_.png", bbox_inches = "tight" )
for mol in testmols:
    mapperfunc( mol )

res= cls.predict(testx)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(testy, res))

confusion matrix is following. Not so bad.
[[10 0]
[ 3 10]]

And I got some png images in res folder.
A molecule that is predicted nonactive .
nonact
A molecule that is predicted active .
act
These images indicated contribution each atoms to activity.
All code and images are uploaded github.
https://github.com/iwatobipen/chemo_info/tree/master/sarviztest

I think it’s easy to understand. And this function is useful because user can use not only SVC but also RF, NN, etc. User can make your own application.
I’ll build more user friendly web app ASAP.

If reader who is interested in my post, good document is provided by Greg Landrum.
http://www.rdkit.org/docs/Cookbook.html
Check it out!!

Advertisement

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

2 thoughts on “SAR visualization with RDKit.

  1. Hi, I know this post is really old – but I liked it when I saw it the other day and wanted to see if I could use it personally.
    I copied the code and tried to rerun it as is, but I get an error in the mapperfunc

    “ValueError: Expected 2D array, got 1D array instead:
    array=[0. 0. 0. … 0. 0. 0.].
    Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.”

    It is probably relatively simple error, but I seem to not be able to fix it.
    Thank you,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: