I often discuss QSAR with other chemists.
And sometimes they tell me, “QSAR is a useful tool for drug discovery, but I don’t understand it, because with a QSAR model (i.e. ML) it is hard to understand why a compound is good.”
Hmm, I agree with that opinion.
SVM, NB, RF, etc. are very useful, but these models are black boxes. So it is difficult to understand the effect of substructures on the models.
Jürgen Bajorath et al. took on the challenge of closing this gap and published an interesting paper in J. Chem. Inf. Model.
They wrote in the paper…
understanding why a compound has undesirable ADME characteristics is just as important as knowing that it (ADME prediction) does.
I like this phrase.
They developed a Python library named nbvis that depends on scikit-learn and matplotlib.
The library can visualise the contribution of each feature of the vectors.
I think the key point of the method is that the authors used MACCS keys to build the model.
This is because MACCS keys are easy for chemists to understand.
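As a small illustration of why MACCS keys are easy to read, the sketch below lists the keys that are set for one molecule together with the SMARTS pattern RDKit stores for each key. Phenol is an arbitrary example of mine and the helper name describe_maccs is hypothetical; MACCSkeys.smartsPatts is, as far as I know, the dictionary of key definitions in RDKit's MACCSkeys module.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys


def describe_maccs(smiles):
    """Return (key number, SMARTS definition) for every MACCS key set in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("could not parse SMILES: %s" % smiles)
    on_bits = MACCSkeys.GenMACCSKeys(mol).GetOnBits()
    # smartsPatts maps key number -> (SMARTS string, count threshold)
    return [(bit, MACCSkeys.smartsPatts[bit][0]) for bit in on_bits]


# phenol is just an arbitrary example molecule
for key, smarts in describe_maccs("Oc1ccccc1"):
    print(key, smarts)
```

Each printed key corresponds to a substructure a chemist can recognise (for example an aromatic atom, a ring atom or an oxygen), which is what makes a per-feature visualisation readable.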
I wrote demo code using RDKit.
The sample data was downloaded via FTP.
Then I added a Class property to each molecule. (I set the active flag as “IC50_uM < 0.1 is active”.)
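The post does not show how the Class property was added, so here is one possible way to do it, as a minimal sketch. It assumes every record carries an IC50_uM property and that active/inactive is stored as "1"/"0"; the function names is_active, add_class_property and annotate_sdf are hypothetical.

```python
def is_active(ic50_um, threshold_um=0.1):
    """True when the IC50 (in uM) is below the activity threshold."""
    return float(ic50_um) < threshold_um


def add_class_property(mol, threshold_um=0.1):
    """Set Class to "1" (active) or "0" (inactive) from the IC50_uM property."""
    label = "1" if is_active(mol.GetProp("IC50_uM"), threshold_um) else "0"
    mol.SetProp("Class", label)
    return mol


def annotate_sdf(in_path, out_path, threshold_um=0.1):
    """Read an SD file, label every molecule, and write the result."""
    # RDKit is imported here so the helpers above work on any object
    # that offers GetProp / SetProp.
    from rdkit import Chem

    writer = Chem.SDWriter(out_path)
    for mol in Chem.SDMolSupplier(in_path):
        if mol is None:  # skip records RDKit could not parse
            continue
        writer.write(add_class_property(mol, threshold_um))
    writer.close()
```

Storing the label as a string that parses as a number keeps it compatible with a script that reads it back with float(mol.GetProp("Class")).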
First, I set the arguments 'names' and 'groups'.
Then I wrote a sample script like the following.
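import sys
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys
from sklearn.naive_bayes import BernoulliNB
# assumed module names, inferred from the calls at the bottom of the script
import nbviz
import maccskey

def calc_MACCS_fp( mol ):
    # MACCS keys as a 167-element 0/1 vector (bit 0 is never set)
    mol_fp = list( MACCSkeys.GenMACCSKeys( mol ).GetOnBits() )
    mol_fp_vec = np.zeros( 167, )
    mol_fp_vec[ mol_fp ] = 1
    return mol_fp_vec

def make_fp_array( mols ):
    fp_array = [ calc_MACCS_fp( mol ) for mol in mols ]
    return np.array( fp_array )

# read the SD file given on the command line, skipping records RDKit cannot parse
mols = [ mol for mol in Chem.SDMolSupplier( sys.argv[1] ) if mol is not None ]
X = make_fp_array( mols )
Y = [ float(mol.GetProp( "Class" )) for mol in mols ]
model = BernoulliNB( alpha=0.1 )
# the first molecule is left out of the training set
model.fit( X[1:], Y[1:] )
# convert log probabilities back to probabilities for the visualisation
conditional_probs = np.exp( model.feature_log_prob_ )
prior = np.exp( model.class_log_prior_ )
print( 'conditional feature prob', conditional_probs )
print( 'class prior', prior )
nbviz.visualize_model( conditional_probs, prior, names=maccskey.names, groups=maccskey.groups )
nbviz.visualize_prediction( X, conditional_probs, prior, names=maccskey.names, groups=maccskey.groups )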
Let's run the script!
modelviz iwatobipen$ python view_model_demo.py mol_viz_demo/cox2_test.sdf
Then two figures are generated.
Red and blue circles indicate positive and negative influence of the features, and the distance indicates the log odds ratio.
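To make the log odds ratio concrete, here is a toy calculation in plain Python. It is only a sketch of the general idea for a Bernoulli naive Bayes model, not necessarily the exact quantity the library plots: for each feature it compares P(feature present | active) with P(feature present | inactive). The smoothing form (count + alpha) / (n + 2 * alpha) is, as I understand it, the one scikit-learn's BernoulliNB uses; the toy data and function names are made up.

```python
import math


def bernoulli_nb_feature_probs(X, y, alpha=0.1):
    """Return {class: [P(x_i = 1 | class), ...]} with additive smoothing."""
    n_features = len(X[0])
    probs = {}
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        probs[cls] = [
            (sum(row[i] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return probs


def log_odds(p_active, p_inactive):
    """Per-feature log of P(x_i = 1 | active) / P(x_i = 1 | inactive)."""
    return [math.log(pa / pi) for pa, pi in zip(p_active, p_inactive)]


# toy data: 4 compounds x 3 binary features, class 1 = active
X = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
y = [1, 1, 0, 0]

probs = bernoulli_nb_feature_probs(X, y)
scores = log_odds(probs[1], probs[0])
for i, score in enumerate(scores):
    print("feature %d: log odds = %+.3f" % (i, score))
```

In this toy set the first feature appears only in actives and gets a positive score, the second appears only in inactives and gets a negative score, and the third is equally common in both classes and scores zero.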
The approach is useful for discussion, because the figures give chemists information about why the model considers certain substructures effective.
But it is hard for me to visualise each target….