Build a QSAR model using RDKit

I’m interested in deep learning.
A few days ago, I read the following paper.
Prediction of New Bioactive Molecules using a Bayesian Belief Network
The paper shows that the Bayesian belief network for classification (BBNC) method is a useful addition to the computational chemist’s toolbox.
So today I tried to write a script that builds QSAR models.

First, calculate the molecular descriptors.
The code is as follows.

import sys, cPickle
import numpy as np
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors import MoleculeDescriptors
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()

trainset = sys.argv[1]
testset = sys.argv[2]
trainset = [mol for mol in Chem.SDMolSupplier(trainset) if mol is not None]
testset = [mol for mol in Chem.SDMolSupplier(testset) if mol is not None]

nms=[x[0] for x in Descriptors._descList]
calc = MoleculeDescriptors.MolecularDescriptorCalculator(nms)

trainDescrs = [calc.CalcDescriptors(x) for x in trainset]
testDescrs  = [calc.CalcDescriptors(x) for x in testset]
trainDescrs = np.array(trainDescrs)
testDescrs = np.array(testDescrs)

# Fit the scaler on the training set only, then apply the same scaling to the test set.
x_train_minmax = min_max_scaler.fit_transform( trainDescrs )
x_test_minmax = min_max_scaler.transform( testDescrs )

classes={'(A) low':0,'(B) medium':1,'(C) high':1}
train_acts = np.array([classes[mol.GetProp("SOL_classification")] for mol in trainset],dtype="int")
test_acts = np.array([classes[mol.GetProp("SOL_classification")] for mol in testset],dtype="int")

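# The training set is reused in the validation slot, so the tuple is (train, valid, test).
dataset = ( (x_train_minmax, train_acts),(x_train_minmax, train_acts), (x_test_minmax, test_acts) )

# Write the data set to a pkl file.
f = open("rdk_sol_set_norm_descs.pkl", "wb")
cPickle.dump( dataset, f )
f.close()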

Now I have the training and test data sets as a pkl file.
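One detail of the normalization step is worth a small illustration: the min and max are usually learned from the training set only and then reused for the test set. The following is a minimal pure-Python sketch of that idea, not the scikit-learn implementation. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are made up for this example, and it is written for Python 3.

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, params):
            span = hi - lo
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy descriptor rows (made-up numbers, two descriptors per molecule).
train_rows = [[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]]
test_rows = [[150.0, 2.0], [400.0, 5.0]]

params = fit_min_max(train_rows)                 # learned from the training set only
train_scaled = apply_min_max(train_rows, params)
test_scaled = apply_min_max(test_rows, params)   # reuses the training min/max
```

With this approach a test value can fall outside the 0 to 1 range when it lies outside the range seen in training, which is expected.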
Next, build the models using scikit-learn.
The code builds models using random forest, SVM, naive Bayes, and a restricted Boltzmann machine combined with an SVM classifier (RBM-SVM).
Scikit-learn can join the RBM and the SVM using its Pipeline class.
A model can be saved as a pkl file using cPickle. (The following code only prints the results. ;-) )
Scikit-learn is very simple to use, and powerful.
I posted the same code and example files (taken from RDKit) here.

import sys, cPickle
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
from sklearn import metrics
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

f = open(sys.argv[1], "rb")
train, valid, test = cPickle.load(f)

train_x, train_y = train
test_x, test_y = test

nclf = RandomForestClassifier( n_estimators=100, max_depth=5, random_state=0, n_jobs=1 )
nclf = nclf.fit( train_x, train_y )
preds = nclf.predict( test_x )
print metrics.confusion_matrix(test_y, preds)
print metrics.classification_report(test_y, preds)
accuracy = nclf.score(test_x, test_y)
print accuracy

print "SVM"
clf_svm = svm.SVC( gamma=0.001, C=100. )
clf_svm = clf_svm.fit( train_x, train_y )
preds_SVM = clf_svm.predict( test_x )
print metrics.confusion_matrix( test_y, preds_SVM )
print metrics.classification_report( test_y, preds_SVM )
accuracy = clf_svm.score( test_x, test_y )

print accuracy

print "NB"
gnb = GaussianNB()
clf_NB = gnb.fit( train_x, train_y )
preds_NB = clf_NB.predict( test_x )
print metrics.confusion_matrix( test_y, preds_NB )
print metrics.classification_report( test_y, preds_NB )

# score() is a method of the fitted classifier, not of the prediction array.
accuracy = clf_NB.score( test_x, test_y )
print accuracy

print "RBM"
cls_svm2 = svm.SVC( gamma=0.001, C=100. )
rbm = BernoulliRBM(random_state = 0, verbose = True)
classifier = Pipeline( steps=[("rbm", rbm), ("cls_svm2", cls_svm2)] )
rbm.learning_rate = 0.06
rbm.n_iter = 20
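rbm.n_components = 1000
# Fit the whole pipeline: the RBM learns features, then the SVM is trained on them.
classifier.fit( train_x, train_y )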
pred_RBM = classifier.predict(test_x)
print metrics.confusion_matrix(test_y, pred_RBM)
print metrics.classification_report(test_y, pred_RBM)
accuracy = classifier.score( test_x, test_y )
print accuracy
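The script above only prints results, so here is a small sketch of the two remaining ideas from the text: joining an RBM and an SVM with a pipeline, and saving the fitted model with pickle. It assumes Python 3 with NumPy and a recent scikit-learn (so it uses `pickle` instead of `cPickle`). The data are random numbers standing in for the scaled descriptors, and `n_components=16` is chosen only to keep the example fast.

```python
import pickle

import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the scaled descriptors: 40 samples, 8 features in [0, 1].
rng = np.random.RandomState(0)
x = rng.rand(40, 8)
y = (x[:, 0] > 0.5).astype(int)  # made-up binary labels

# The RBM learns hidden features; the SVM classifies on those features.
model = Pipeline(steps=[
    ("rbm", BernoulliRBM(n_components=16, learning_rate=0.06,
                         n_iter=20, random_state=0)),
    ("svm", SVC(gamma=0.001, C=100.0)),
])
model.fit(x, y)

# Serialize the fitted pipeline to bytes and restore it again.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

preds = model.predict(x)
restored_preds = restored.predict(x)
```

To write the model to a file instead of a bytes object, `pickle.dump(model, f)` with a file opened in "wb" mode works the same way. Passing the RBM settings to the constructor, as done here, is an alternative to setting the attributes after building the pipeline.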

