QSAR with chainer.

I wrote a blog post about deep learning before.
https://iwatobipen.wordpress.com/2014/07/07/deep-learning-with-python/
In that post I used nolearn for QSAR.
There are several Python libraries for deep learning. You know… Theano, nolearn, pylearn2, etc.

And chainer is a new Python library for neural networks.
If you are interested in the library, search for chainer on Google and you will find lots of sites and presentations about it.

Today I used chainer for QSAR.
First, I got the dataset from the following link.
http://www.cheminformatics.org/datasets/bursi/

Then I made a dataset for chainer.

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import cPickle
#read mols.
ms = [m for m in Chem.SDMolSupplier("cas_4337.sdf") if m is not None]
#define fingerprint calculator.
def calcfp(mol):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
#convert the mutagen flag to an int flag.
toxcls = []
for mol in ms:
    if mol.GetProp("Ames test categorisation") == "mutagen":
        toxcls.append(1)
    else:
        toxcls.append(0)

#sorry...I did not use np.vectorize. 😉
#compute the fingerprint array for every molecule directly.
np_fps = [calcfp(mol) for mol in ms]

# data split for QSAR.
fp_train, fp_test, cls_train, cls_test = train_test_split(np_fps, toxcls, test_size=0.8, random_state=794)
#save data
f = open('dataset.pkl','wb')
cPickle.dump([fp_train, fp_test, cls_train, cls_test],f)
f.close()
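As an aside, the mutagen-flag loop and the fingerprint step above can both be written more compactly. This is a minimal sketch with hypothetical label strings and all-zero placeholder fingerprints, so it runs without RDKit; the real code would use `mol.GetProp(...)` and `calcfp` instead:

```python
import numpy as np

# hypothetical property values, standing in for mol.GetProp("Ames test categorisation")
labels = ["mutagen", "nonmutagen", "mutagen", "nonmutagen"]

# same logic as the for-loop in the post, as a one-line comprehension
toxcls = [1 if lab == "mutagen" else 0 for lab in labels]
print(toxcls)  # [1, 0, 1, 0]

# fingerprints can likewise be stacked into one float32 matrix for chainer
fps = [np.zeros(2048, dtype=np.float32) for _ in labels]
X = np.vstack(fps)
print(X.shape, X.dtype)  # (4, 2048) float32
```

Stacking everything into a single float32 matrix up front avoids per-sample conversions later in the training loop.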

Now I have datasets for training and testing.
Then I tried to build a model and make predictions using chainer.
I used the chainer official site as a reference, so most of the code is the same.
For chainer, X must be float32 and Y must be int32.
Defining the NN model takes very little code.
The fingerprint is 2048 bits, so the input layer has 2048 units.
The output is binary (mutagen or not mutagen), so the last layer has 2 units.

In the following code I used the rectifier (ReLU) as the activation function, but you can use any other function, for example the hyperbolic tangent (tanh).
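To make the layer sizes concrete, here is a plain-numpy sketch of the same 2048 → 500 → 100 → 2 forward pass. The weights are random placeholders, not a trained model, and swapping `np.tanh` in for the ReLU is a one-line change:

```python
import numpy as np

rng = np.random.RandomState(0)

def relu(x):
    # rectifier: elementwise max(x, 0)
    return np.maximum(x, 0.0)

# placeholder weights with the same shapes as F.Linear(2048,500) etc.
W1 = (rng.randn(2048, 500) * 0.01).astype(np.float32)
W2 = (rng.randn(500, 100) * 0.01).astype(np.float32)
W3 = (rng.randn(100, 2) * 0.01).astype(np.float32)

x = rng.rand(30, 2048).astype(np.float32)  # one minibatch of 30 fingerprints
h1 = relu(x.dot(W1))                       # (30, 500)
h2 = relu(h1.dot(W2))                      # (30, 100)
y = h2.dot(W3)                             # (30, 2) raw scores for the 2 classes
print(y.shape)  # (30, 2)
```

In chainer the `F.Linear` layers also carry bias terms and the gradients are handled for you; this sketch only shows how the shapes flow through the network.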

import cPickle
import numpy as np
from chainer import cuda, Function, FunctionSet, gradient_check, Variable, utils
import chainer.functions as F
from chainer import optimizers

fobj = open('dataset.pkl', 'rb')
dataset = cPickle.load(fobj)

# dataset.pkl stores [fp_train, fp_test, cls_train, cls_test];
# the split above used test_size=0.8, so the "test" chunk is the
# larger one and is used here for training.
train_x = np.array(dataset[1], dtype=np.float32)
test_x = np.array(dataset[0], dtype=np.float32)
train_y = np.array(dataset[3], dtype=np.int32)
test_y = np.array(dataset[2], dtype=np.int32)

model = FunctionSet(
    l1 = F.Linear(2048, 500),
    l2 = F.Linear(500, 100),
    l3 = F.Linear(100, 2))

optimizer = optimizers.SGD()
optimizer.setup(model.collect_parameters())


def forward( x_data, y_data ):
    x = Variable(x_data)
    t = Variable(y_data)
    h1 = F.relu( model.l1(x) )
    h2 = F.relu( model.l2(h1) )
    y=model.l3(h2)
    return F.softmax_cross_entropy(y, t), F.accuracy( y, t )


batchsize=30
datasize = len(train_y)

log_f = open('train_log.txt', 'w')
counter = 0
for epoch in range(100):
    print("epoch %d" % epoch)
    indexes = np.random.permutation(datasize)
    for i in range(0, datasize, batchsize):
        counter += 1
        x_batch = train_x[indexes[i: i+batchsize]]
        y_batch = train_y[indexes[i: i+batchsize]]
        optimizer.zero_grads()
        loss, accuracy = forward(x_batch, y_batch)
        loss.backward()
        log_f.write("%s,%s,%s,%s,%s\n" % (counter, epoch, i, loss.data[0], accuracy.data))
        optimizer.update()
log_f.close()
sum_loss, sum_accuracy = 0, 0
for i in range(0, len(test_y), batchsize):
    x_batch = test_x[i: i+batchsize]
    y_batch = test_y[i: i+batchsize]
    loss, accuracy = forward(x_batch, y_batch)
    # weight by the actual batch length so a short final batch is not over-counted
    sum_loss += loss.data * len(y_batch)
    sum_accuracy += accuracy.data * len(y_batch)
mean_loss = sum_loss / len(test_y)
mean_accuracy = sum_accuracy / len(test_y)
print(mean_loss)
print(mean_accuracy)
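For reference, the accuracy that `F.accuracy` reports is just the fraction of rows whose arg-max score matches the label. A numpy equivalent, with made-up scores, looks like:

```python
import numpy as np

# made-up raw scores for 3 samples over 2 classes
y = np.array([[2.0, 0.1],   # predicts class 0
              [0.3, 1.5],   # predicts class 1
              [0.9, 0.2]])  # predicts class 0
t = np.array([0, 1, 1], dtype=np.int32)  # true labels

pred = y.argmax(axis=1)          # predicted class per row
accuracy = (pred == t).mean()    # 2 of 3 correct
```

Note that the raw scores never need to be passed through a softmax for this: softmax is monotonic, so the arg-max is the same either way.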

Run the code…
I got a log file and output for the test set.

iwatobipen$ python cahinertester.py
epoch 0
...
epoch 99
[ 0.69697642]
0.829680887886

The test loss was high… ;-( but the accuracy was good.
…overfitting?

I checked the progress of training.
It seems to converge after about 10K iterations.
[Figure: training loss and accuracy over iterations]
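The training loop writes one `counter,epoch,batch_index,loss,accuracy` line per minibatch, so checking convergence is just a matter of averaging the loss per epoch. A minimal sketch, using a synthetic stand-in for `train_log.txt` so it runs on its own:

```python
import csv
import io

# synthetic stand-in for train_log.txt, in the same
# counter,epoch,batch_index,loss,accuracy format as the training loop
log_text = """1,0,0,0.95,0.50
2,0,30,0.90,0.55
3,1,0,0.70,0.70
4,1,30,0.65,0.75
"""

# collect the per-minibatch losses for each epoch
per_epoch = {}
for counter, epoch, i, loss, acc in csv.reader(io.StringIO(log_text)):
    per_epoch.setdefault(int(epoch), []).append(float(loss))

# mean loss per epoch; a plateau here suggests convergence
for epoch, losses in sorted(per_epoch.items()):
    print(epoch, round(sum(losses) / len(losses), 3))
# 0 0.925
# 1 0.675
```

For the real log, replace the `StringIO` with `open('train_log.txt')`; plotting the same series (e.g. with matplotlib) gives the figure above.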

I think chainer is a good tool for Pythonistas.
But…
Is DL really useful for QSAR?
Do fingerprints or molecular descriptors really represent the features that separate active from inactive compounds?
Which is more suitable, 2D or 3D?
Hmm….
