SAR visualization with RDKit.

I’m interested in machine learning(ML) for QSAR.
RandomForest, Neural Network, SVM etc. are often useful tool for drug design. But it difficult to interpretation of results for chemist.
So, visualization / interpretation of SAR is critical for communication.
You know, Prof. J. Bajorath reported very nice solution of the problem.
http://www.ncbi.nlm.nih.gov/pubmed/25988274
The article was impressive for me and I want to do in RDKit.
Fortunately, RDKit provides cool function named ‘GetSimilarityMapsForModel’. 😉
The function can produce figure based on prediction models.
Hmm, sounds nice. Let’s write code!
Sample snippet is following.

At first prepare dataset for play.

#
import numpy as np
from matplotlib import cm
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.Chem.Draw import SimilarityMaps
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

dataset = open( "data_cdk.txt", "r" )
# remove header
dataset = [ line.rstrip().split("\t") for line in dataset ][1:]
mols = [ Chem.MolFromSmiles( line[1] ) for line in dataset ]
Y = []
# label active / non active
for line in dataset:
    if float( line[2] ) > 7.0:
        Y.append( 1 )
    else:
        Y.append( -1 )
Y = np.asarray( Y )

trainmols, testmols, trainy, testy = train_test_split( mols, Y, test_size = 0.1, random_state=90  )
# calc fingerprint
trainfps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in trainmols ]
testfps =  [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in testmols ]

trainx = []
for fp in trainfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    trainx.append( arr )

testx = []
for fp in testfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    testx.append( arr )

Next, build SVC model.

cls = SVC( probability=True, C=100 )
cls.fit( trainx, trainy )

Finally, make helper function and do prediction.
I like bwr colormap in matplotlib.

def getProba( fp, probabilityFunc ):
    return probabilityFunc( fp )[0][1]

def mapperfunc( mol ):
    fig, weight = SimilarityMaps.GetSimilarityMapForModel( mol, SimilarityMaps.GetMorganFingerprint, lambda x: getProba( x, cls.predict_proba), colorMap=cm.bwr  )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp, arr )
    res = cls.predict( arr )
    smi = Chem.MolToSmiles( mol )
    print(res[0])
    if res[0] == 1:
        fig.savefig( "res/act_"+smi+"_.png", bbox_inches = "tight" )
    else:
        fig.savefig("res/nonact_"+smi+"_.png", bbox_inches = "tight" )
for mol in testmols:
    mapperfunc( mol )

res= cls.predict(testx)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(testy, res))

confusion matrix is following. Not so bad.
[[10 0]
[ 3 10]]

And I got some png images in res folder.
A molecule that is predicted nonactive .
nonact
A molecule that is predicted active .
act
These images indicated contribution each atoms to activity.
All code and images are uploaded github.
https://github.com/iwatobipen/chemo_info/tree/master/sarviztest

I think it’s easy to understand. And this function is useful because user can use not only SVC but also RF, NN, etc. User can make your own application.
I’ll build more user friendly web app ASAP.

If reader who is interested in my post, good document is provided by Greg Landrum.
http://www.rdkit.org/docs/Cookbook.html
Check it out!!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s