SAR visualization with RDKit.

I’m interested in machine learning(ML) for QSAR.
RandomForest, Neural Network, SVM etc. are often useful tool for drug design. But it difficult to interpretation of results for chemist.
So, visualization / interpretation of SAR is critical for communication.
You know, Prof. J. Bajorath reported very nice solution of the problem.
The article was impressive for me and I want to do in RDKit.
Fortunately, RDKit provides cool function named ‘GetSimilarityMapsForModel’. 😉
The function can produce figure based on prediction models.
Hmm, sounds nice. Let’s write code!
Sample snippet is following.

At first prepare dataset for play.

import numpy as np
from matplotlib import cm
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs
from rdkit.Chem.Draw import SimilarityMaps
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split

dataset = open( "data_cdk.txt", "r" )
# remove header
dataset = [ line.rstrip().split("\t") for line in dataset ][1:]
mols = [ Chem.MolFromSmiles( line[1] ) for line in dataset ]
Y = []
# label active / non active
for line in dataset:
    if float( line[2] ) > 7.0:
        Y.append( 1 )
        Y.append( -1 )
Y = np.asarray( Y )

trainmols, testmols, trainy, testy = train_test_split( mols, Y, test_size = 0.1, random_state=90  )
# calc fingerprint
trainfps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in trainmols ]
testfps =  [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in testmols ]

trainx = []
for fp in trainfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    trainx.append( arr )

testx = []
for fp in testfps:
    arr = np.zeros( (1,) )
    DataStructs.ConvertToNumpyArray( fp, arr )
    testx.append( arr )

Next, build SVC model.

cls = SVC( probability=True, C=100 ) trainx, trainy )

Finally, make helper function and do prediction.
I like bwr colormap in matplotlib.

def getProba( fp, probabilityFunc ):
    return probabilityFunc( fp )[0][1]

def mapperfunc( mol ):
    fig, weight = SimilarityMaps.GetSimilarityMapForModel( mol, SimilarityMaps.GetMorganFingerprint, lambda x: getProba( x, cls.predict_proba), colorMap=cm.bwr  )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 )
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray( fp, arr )
    res = cls.predict( arr )
    smi = Chem.MolToSmiles( mol )
    if res[0] == 1:
        fig.savefig( "res/act_"+smi+"_.png", bbox_inches = "tight" )
        fig.savefig("res/nonact_"+smi+"_.png", bbox_inches = "tight" )
for mol in testmols:
    mapperfunc( mol )

res= cls.predict(testx)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(testy, res))

confusion matrix is following. Not so bad.
[[10 0]
[ 3 10]]

And I got some png images in res folder.
A molecule that is predicted nonactive .
A molecule that is predicted active .
These images indicated contribution each atoms to activity.
All code and images are uploaded github.

I think it’s easy to understand. And this function is useful because user can use not only SVC but also RF, NN, etc. User can make your own application.
I’ll build more user friendly web app ASAP.

If reader who is interested in my post, good document is provided by Greg Landrum.
Check it out!!



以下に詳細を記入するか、アイコンをクリックしてログインしてください。 ロゴ アカウントを使ってコメントしています。 ログアウト / 変更 )

Twitter 画像

Twitter アカウントを使ってコメントしています。 ログアウト / 変更 )

Facebook の写真

Facebook アカウントを使ってコメントしています。 ログアウト / 変更 )

Google+ フォト

Google+ アカウントを使ってコメントしています。 ログアウト / 変更 )

%s と連携中