I often discuss QSAR with other chemists.
And sometimes they tell me: “QSAR is a useful tool for drug discovery, but I don’t understand it, because with a QSAR model (i.e. ML) it is hard to understand why a compound is good.”
Hmm, I agree with that opinion.
SVM, NB, RF, etc. are very useful, but these models are black boxes. So it is difficult to understand the effect of substructures on the models.
Jürgen Bajorath et al. took on the challenge of closing this gap and published an interesting paper in J. Chem. Inf. Model.
http://pubs.acs.org/doi/abs/10.1021/ci500410g
They wrote in the paper:
“understanding why a compound has undesirable ADME characteristics is just as important as knowing that it (ADME prediction) does.”
I like this phrase.
They developed a Python library named nbviz that depends on scikit-learn and matplotlib.
The library can visualise the contribution of each feature in the feature vector.
I think the key point of the method is that the authors used MACCS keys to build the model, because MACCS keys are easy for chemists to understand.
I wrote demo code using RDKit.
https://github.com/iwatobipen/chemo_info/tree/master/modelviz
The sample data was downloaded from the following FTP site.
ftp://ftp.ics.uci.edu/pub/baldig/learning/Sutherland/
Then I added a Class property to each molecule. (I set the active flag as “IC50_uM < 0.1 is active”.)
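The labelling rule above can be sketched in a few lines of plain Python. This is only an illustration of the threshold logic, not the code I actually used: the function name activity_class, the parameter threshold_um and the example IC50 values are all made up for this sketch.

```python
def activity_class(ic50_um, threshold_um=0.1):
    """Return 1 (active) if the IC50 in micromolar is below the threshold, else 0."""
    return 1 if ic50_um < threshold_um else 0


# Hypothetical IC50 values in micromolar, for illustration only
example_ic50_um = [0.05, 0.1, 2.3]
labels = [activity_class(value) for value in example_ic50_um]
print(labels)
```

Note that a compound with an IC50 of exactly 0.1 is labelled inactive, because the rule uses a strict "<". With RDKit, a label like this could then be stored on each molecule with mol.SetProp("Class", str(label)) before writing the SD file.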
First, I set the arguments 'names' and 'groups'.
Then I wrote a sample script like the following.
import sys

import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.naive_bayes import BernoulliNB

import maccskey  # module that defines the 'names' and 'groups' used below
import nbviz


def calc_MACCS_fp(mol):
    # Convert the on-bits of the MACCS fingerprint into a 0/1 vector of length 167
    mol_fp = list(MACCSkeys.GenMACCSKeys(mol).GetOnBits())
    mol_fp_vec = np.zeros(167)
    mol_fp_vec[mol_fp] = 1
    return mol_fp_vec


def make_fp_array(mols):
    return [calc_MACCS_fp(mol) for mol in mols]


# SDMolSupplier yields None for records it cannot parse, so skip those
mols = [mol for mol in Chem.SDMolSupplier(sys.argv[1]) if mol is not None]
X = make_fp_array(mols)
Y = [float(mol.GetProp("Class")) for mol in mols]

# Train on everything except the first molecule, which is kept for the prediction plot
model = BernoulliNB(alpha=0.1)
model.fit(X[1:], Y[1:])

conditional_probs = np.exp(model.feature_log_prob_)
prior = np.exp(model.class_log_prior_[1])
print('conditional feature prob: {}'.format(conditional_probs))
print('class prior: {}'.format(prior))

nbviz.visualize_model(conditional_probs, prior,
                      names=maccskey.names, groups=maccskey.groups)
nbviz.visualize_prediction(X[0], conditional_probs, prior,
                           names=maccskey.names, groups=maccskey.groups)
Let's run the script!
modelviz iwatobipen$ python view_model_demo.py mol_viz_demo/cox2_test.sdf
Then two figures were generated.
The red and blue colours of the circles indicate the positive / negative influence of features, and the distance indicates the log odds ratio.
The approach is useful for discussion, because the figures give chemists information about why the model indicates that certain substructures are effective.
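To make "log odds ratio" concrete, here is a small sketch of how a Bernoulli naive Bayes prediction splits into one term per feature. It is the textbook decomposition, and I am only assuming that nbviz plots something similar; I have not checked its internals. The function names and the toy probabilities are invented for this example.

```python
import math


def feature_log_odds(p_active, p_inactive, present):
    """Log odds contribution of one binary feature (active vs. inactive class)."""
    if present:
        return math.log(p_active / p_inactive)
    return math.log((1.0 - p_active) / (1.0 - p_inactive))


def prediction_log_odds(x, probs_active, probs_inactive, prior_active):
    """Prior log odds plus the sum of the per-feature contributions."""
    total = math.log(prior_active / (1.0 - prior_active))
    for bit, pa, pi in zip(x, probs_active, probs_inactive):
        total += feature_log_odds(pa, pi, bool(bit))
    return total


# Toy values, for illustration only: P(bit on | active) and P(bit on | inactive)
probs_active = [0.8, 0.2, 0.5]
probs_inactive = [0.4, 0.6, 0.5]
x = [1, 0, 1]  # fingerprint bits of one hypothetical compound

contributions = [feature_log_odds(pa, pi, bool(bit))
                 for bit, pa, pi in zip(x, probs_active, probs_inactive)]
total = prediction_log_odds(x, probs_active, probs_inactive, prior_active=0.5)
print(contributions, total)
```

A positive contribution pushes the prediction towards "active" and a negative one towards "inactive". A feature that is equally frequent in both classes, like the third one here, contributes nothing.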
But it is hard for me to do this visualisation for each target….