QSAR with chainer.

I wrote a blog post about deep learning (DL) before.
I used nolearn for QSAR in that post.
There are several Python libraries for DL. You know… Theano, nolearn, pylearn2, etc.

Chainer is a new Python library for neural networks.
If you are interested in the library, please search for chainer on Google; you will find lots of sites and presentations about it.

Today I used chainer for QSAR.
First, I got the dataset from the following link.

Then I made the dataset for chainer.

from __future__ import print_function
from rdkit import Chem
import numpy as np
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import cPickle
# read mols, skipping records that RDKit could not parse.
ms = [m for m in Chem.SDMolSupplier("cas_4337.sdf") if m is not None]
# define fingerprint calculator (Morgan, radius 2, default 2048 bits).
def calcfp(mol):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
# convert mutagen flag to int flag (1 = mutagen, 0 = otherwise).
toxcls = []
for mol in ms:
    if mol.GetProp("Ames test categorisation") == "mutagen":
        toxcls.append(1)
    else:
        toxcls.append(0)
# sorry...I did not use np.vectorize function. ;-)
np_fps = [calcfp(mol) for mol in ms]
dataset = [np_fps, toxcls]

# data split for QSAR.
# note: test_size=0.8 puts 80% of the data into fp_test / cls_test.
fp_train, fp_test, cls_train, cls_test = train_test_split(np_fps, toxcls, test_size=0.8, random_state=794)
# save data
with open('dataset.pkl', 'wb') as f:
    cPickle.dump([fp_train, fp_test, cls_train, cls_test], f)

Now I had the datasets for training and testing.
Next, I tried to build a model and make predictions with chainer.
I used the official chainer site as a reference, so most of the code is the same.
For chainer, X must be float32 and Y must be int32.
The definition of the NN model is very simple.
The fingerprint is 2048 bits, so the input layer has 2048 units.
The output is binary (mutagen or not mutagen), so the last layer has 2 units.
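
The dtype requirement is easy to check before any training. Here is a minimal sketch with random stand-in data (not the Ames dataset). The helper name to_chainer_arrays is made up for this example and is not part of chainer; it only shows the float32 / int32 casts and the 2048-wide input.

```python
import numpy as np

def to_chainer_arrays(fps, labels):
    """Cast fingerprints to float32 and class labels to int32.

    Hypothetical helper for illustration; not part of chainer.
    """
    x = np.asarray(fps, dtype=np.float32)
    y = np.asarray(labels, dtype=np.int32)
    if x.ndim != 2 or x.shape[0] != y.shape[0]:
        raise ValueError("expected one fingerprint row per label")
    return x, y

# Stand-in data: 4 random 2048-bit "fingerprints" and 4 binary labels.
rng = np.random.RandomState(0)
fake_fps = rng.randint(0, 2, size=(4, 2048))
fake_labels = [1, 0, 0, 1]

x, y = to_chainer_arrays(fake_fps, fake_labels)
print(x.dtype, x.shape)
print(y.dtype, y.shape)
```

The second dimension of x has to match the input size of the first layer, and the labels have to be class indices (0 or 1 here), not one-hot vectors.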

In the chainer script below I used the rectifier (ReLU) as the activation function, but you can use any other function, for example the hyperbolic tangent (tanh).
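
To make the activation choice concrete, here is a plain NumPy sketch of the same 2048-500-100-2 layout, separate from the chainer script that follows. The weights are random and untrained, and the function names (relu, mlp_forward) are mine, so this only illustrates how swapping the activation works, not what chainer does internally.

```python
import numpy as np

def relu(z):
    """Rectifier: negative values become 0."""
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, activation):
    """Forward pass: hidden layers use `activation`, the last layer is linear."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = activation(h.dot(w) + b)
    return h.dot(weights[-1]) + biases[-1]

rng = np.random.RandomState(0)
sizes = [2048, 500, 100, 2]
# Small random weights; a real model would learn these during training.
weights = [(rng.randn(n_in, n_out) * 0.01).astype(np.float32)
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out, dtype=np.float32) for n_out in sizes[1:]]

# Three random stand-in fingerprints.
x = rng.randint(0, 2, size=(3, 2048)).astype(np.float32)
out_relu = mlp_forward(x, weights, biases, relu)
out_tanh = mlp_forward(x, weights, biases, np.tanh)
print(out_relu.shape, out_tanh.shape)
```

Only the function passed as activation changes; the layer sizes and the two output scores per molecule stay the same.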

import cPickle
import numpy as np
from chainer import FunctionSet, Variable
import chainer.functions as F
from chainer import optimizers

with open('dataset.pkl', 'rb') as fobj:
    dataset = cPickle.load(fobj)

# dataset is [fp_train, fp_test, cls_train, cls_test]. The split was made with
# test_size=0.8, so the larger "test" part is used for training here.
test_x = np.array(dataset[0], dtype=np.float32)
train_x = np.array(dataset[1], dtype=np.float32)
test_y = np.array(dataset[2], dtype=np.int32)
train_y = np.array(dataset[3], dtype=np.int32)

model = FunctionSet(
    l1 = F.Linear(2048, 500),
    l2 = F.Linear(500, 100),
    l3 = F.Linear(100, 2))

optimizer = optimizers.SGD()
# assumption: this call is not shown in the post; it follows the chainer v1 tutorial.
optimizer.setup(model.collect_parameters())

def forward(x_data, y_data):
    x = Variable(x_data)
    t = Variable(y_data)
    h1 = F.relu(model.l1(x))
    h2 = F.relu(model.l2(h1))
    y = model.l3(h2)
    return F.softmax_cross_entropy(y, t), F.accuracy(y, t)

datasize = len(train_y)
batchsize = 10  # assumption: the value used in the post is not shown

log_f = open('train_log.txt', 'w')
counter = 0
for epoch in range(100):
    print("epoch %d" % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        counter += 1
        xbatch = train_x[indexes[i: i+batchsize]]
        ybatch = train_y[indexes[i: i+batchsize]]
        # assumption: the update steps are missing from the post;
        # these follow the chainer v1 tutorial.
        optimizer.zero_grads()
        loss, accuracy = forward(xbatch, ybatch)
        loss.backward()
        optimizer.update()
        log_f.write("%s,%s,%s,%s,%s\n" % (counter, epoch, i, loss.data[0], accuracy.data))
log_f.close()

# evaluate on the held-out part, weighting each batch by its actual size.
sum_loss, sum_accuracy = 0, 0
for i in range(0, len(test_y), batchsize):
    x_batch = test_x[i: i+batchsize]
    y_batch = test_y[i: i+batchsize]
    loss, accuracy = forward(x_batch, y_batch)
    sum_loss += loss.data * len(y_batch)
    sum_accuracy += accuracy.data * len(y_batch)
mean_loss = sum_loss / len(test_y)
mean_accuracy = sum_accuracy / len(test_y)
print(mean_loss)
print(mean_accuracy)

I ran the chainer script…
I got a log file and the output for the test set.

iwatobipen$ python cahinertester.py
epoch 0
epoch 99
[ 0.69697642]

The test loss was high… ;-( but the accuracy was good.
…Overfitting?
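
One reference point for judging whether a loss is high: with two classes, a model that always predicts 50/50 has a softmax cross entropy of ln 2 ≈ 0.693, whatever the labels are. The sketch below checks that with plain NumPy. The labels are made up and the function is my own NumPy version, so this is an illustration and not a re-run of the experiment.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross entropy over a batch (NumPy version for illustration)."""
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Made-up labels; equal scores for both classes mean a 50/50 prediction.
labels = np.array([0, 1, 1, 0, 1], dtype=np.int32)
uninformative_logits = np.zeros((len(labels), 2), dtype=np.float32)

baseline = softmax_cross_entropy(uninformative_logits, labels)
print(baseline, np.log(2.0))
```

A test loss near this value together with a good accuracy would be worth a closer look, for example by comparing it with the training loss in the log.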

I checked the progress of training.
It seems to converge after 10K cycles.
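
The log file has one line per mini-batch in the form counter,epoch,i,loss,accuracy, so the progress can be inspected with a few lines of NumPy. The sketch below is one possible way to do it, not necessarily how the check above was made. It smooths the loss with a moving average; the sample lines and the window size are invented for the example.

```python
import numpy as np

def read_loss_column(lines):
    """Return the 4th comma-separated field (the loss) of each non-empty line."""
    return np.array([float(line.split(",")[3]) for line in lines if line.strip()])

def moving_average(values, window):
    """Simple moving average; the result is shorter by window - 1 entries."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Invented sample in the same format as train_log.txt.
sample_log = [
    "1,0,0,0.70,0.50",
    "2,0,10,0.68,0.60",
    "3,0,20,0.66,0.60",
    "4,0,30,0.60,0.70",
]

losses = read_loss_column(sample_log)
smoothed = moving_average(losses, window=2)
print(smoothed)
```

For the real log, pass the opened file (open('train_log.txt')) instead of sample_log and use a larger window, since the per-batch loss is noisy.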

I think chainer is a good tool for Pythonistas.
I wonder whether DL is really useful for QSAR.
Do fingerprints or molecular descriptors really represent the features that make a compound active or inactive against a target?
Which is more suitable, 2D or 3D?

