I’m interested in machine learning(ML) for QSAR.
RandomForest, Neural Network, SVM etc. are often useful tool for drug design. But it difficult to interpretation of results for chemist.
So, visualization / interpretation of SAR is critical for communication.
You know, Prof. J. Bajorath reported very nice solution of the problem.
http://www.ncbi.nlm.nih.gov/pubmed/25988274
The article was impressive for me and I want to do in RDKit.
Fortunately, RDKit provides cool function named ‘GetSimilarityMapsForModel’. ;-)
The function can produce figure based on prediction models.
Hmm, sounds nice. Let’s write code!
Sample snippet is following.
At first prepare dataset for play.
# import numpy as np from matplotlib import cm from rdkit import Chem from rdkit.Chem import AllChem, DataStructs from rdkit.Chem.Draw import SimilarityMaps from sklearn.svm import SVC from sklearn.cross_validation import train_test_split dataset = open( "data_cdk.txt", "r" ) # remove header dataset = [ line.rstrip().split("\t") for line in dataset ][1:] mols = [ Chem.MolFromSmiles( line[1] ) for line in dataset ] Y = [] # label active / non active for line in dataset: if float( line[2] ) > 7.0: Y.append( 1 ) else: Y.append( -1 ) Y = np.asarray( Y ) trainmols, testmols, trainy, testy = train_test_split( mols, Y, test_size = 0.1, random_state=90 ) # calc fingerprint trainfps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in trainmols ] testfps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in testmols ] trainx = [] for fp in trainfps: arr = np.zeros( (1,) ) DataStructs.ConvertToNumpyArray( fp, arr ) trainx.append( arr ) testx = [] for fp in testfps: arr = np.zeros( (1,) ) DataStructs.ConvertToNumpyArray( fp, arr ) testx.append( arr )
Next, build SVC model.
cls = SVC( probability=True, C=100 ) cls.fit( trainx, trainy )
Finally, make helper function and do prediction.
I like bwr colormap in matplotlib.
def getProba( fp, probabilityFunc ): return probabilityFunc( fp )[0][1] def mapperfunc( mol ): fig, weight = SimilarityMaps.GetSimilarityMapForModel( mol, SimilarityMaps.GetMorganFingerprint, lambda x: getProba( x, cls.predict_proba), colorMap=cm.bwr ) fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ) arr = np.zeros((1,)) DataStructs.ConvertToNumpyArray( fp, arr ) res = cls.predict( arr ) smi = Chem.MolToSmiles( mol ) print(res[0]) if res[0] == 1: fig.savefig( "res/act_"+smi+"_.png", bbox_inches = "tight" ) else: fig.savefig("res/nonact_"+smi+"_.png", bbox_inches = "tight" ) for mol in testmols: mapperfunc( mol ) res= cls.predict(testx) from sklearn.metrics import confusion_matrix print(confusion_matrix(testy, res))
confusion matrix is following. Not so bad.
[[10 0]
[ 3 10]]
And I got some png images in res folder.
A molecule that is predicted nonactive .
A molecule that is predicted active .
These images indicated contribution each atoms to activity.
All code and images are uploaded github.
https://github.com/iwatobipen/chemo_info/tree/master/sarviztest
I think it’s easy to understand. And this function is useful because user can use not only SVC but also RF, NN, etc. User can make your own application.
I’ll build more user friendly web app ASAP.
If reader who is interested in my post, good document is provided by Greg Landrum.
http://www.rdkit.org/docs/Cookbook.html
Check it out!!
Hi, I know this post is really old – but I liked it when I saw it the other day and wanted to see if I could use it personally.
I copied the code and tried to rerun it as is, but I get an error in the mapperfunc
“ValueError: Expected 2D array, got 1D array instead:
array=[0. 0. 0. … 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.”
It is probably relatively simple error, but I seem to not be able to fix it.
Thank you,
Hi,
Thanks for your comment.
I checked my code and could reproduce the error.
Now I fixed it.
Pls check the code.
https://github.com/iwatobipen/chemo_info/tree/master/sarviztest
Thanks,