New library for deep learning

Deep learning is an old-but-new machine learning technology. I have been interested in it for a while, but I found it difficult to optimise its many parameters.

I have posted about a Python library named ‘chainer’ on this blog before; Chainer is a flexible framework for neural networks.
A few days ago, I found a new library named ‘keras’:
http://keras.io/
Keras runs on top of Theano or TensorFlow, and it lets you develop code more simply.
So, I wrote some sample code to test the library.
First, I made a sample dataset from a ChEMBL hERG assay.
The input layer is an ECFP-like Morgan fingerprint calculated with RDKit.
I wrote the following code for a Python 3.x environment.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
# in newer scikit-learn, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
# keep binding assays ("B") that have a reported standard value
df_bind = df[ df.ASSAY_TYPE == "B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind.reset_index( drop=True )
rows = df_bind.shape[0]
mols = []
act = []
fps = []

# binarise activity: a standard value > 10000 (i.e. > 10 uM, assuming nM units)
# is treated as inactive (0), anything below as active (1)
def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    smi = df_bind.CANONICAL_SMILES[i]
    # some records may lack a parsable SMILES; skip them
    mol = Chem.MolFromSmiles( smi ) if isinstance( smi, str ) else None
    if mol is not None:
        mols.append( mol )
        act.append( act2bin( df_bind.STANDARD_VALUE[i] ) )

# 256-bit Morgan fingerprints (radius 2, ECFP4-like) as numpy arrays
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=256 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    fps.append( arr )

fps = np.array( fps, dtype=np.float32 )
act = np.array( act, dtype=np.int32 )
train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=455 )

with open( 'dataset_fp256.pkl', 'wb' ) as f:
    pickle.dump( [ train_x, train_y, test_x, test_y ], f )
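
As a quick check (a hypothetical snippet, not part of the original script), you can reload the pickle and look at the array shapes and the class balance before training:

import pickle
import numpy as np

with open( 'dataset_fp256.pkl', 'rb' ) as f:
    train_x, train_y, test_x, test_y = pickle.load( f )
print( train_x.shape, test_x.shape )   # expected: (n_train, 256) and (n_test, 256)
print( 'actives in train set:', int( train_y.sum() ), '/', len( train_y ) )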

Then I built an NN model using Keras (Keras is easily installed with pip).
First, you need to choose a model class; Keras has two, Sequential and Graph. I used the former.
Then, define the model structure.
I think this style is very easy to understand: each layer is added with ‘model.add(….)’.
Dropout is also handled as a layer.
Of course, Keras has many activation functions: tanh, ReLU, sigmoid, etc.
After defining the model, you need to compile it.
That’s all!
The fit function builds the predictive model; it feels just like scikit-learn ;-).
When you set the validation_split argument, Keras automatically splits the data into training and validation sets. Cool!
Finally, model.evaluate() is used to score the model on the test data.
It is a very simple workflow.
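
To make the steps concrete, here is a minimal sketch of the pattern (the full script I actually ran, with data loading and plotting, is at the end of the post):

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

model = Sequential()                                 # pick the model class
model.add( Dense( output_dim=50, input_dim=256 ) )   # add layers one by one
model.add( Activation( 'tanh' ) )                    # activations are layers too
model.add( Dropout( 0.5 ) )                          # so is dropout
model.add( Dense( 1 ) )
model.add( Activation( 'sigmoid' ) )
model.compile( optimizer='adagrad', loss='binary_crossentropy' )  # compile before training
# validation_split holds out 10% of the training data for validation
model.fit( train_x, train_y, nb_epoch=100, batch_size=500, validation_split=0.1 )
score = model.evaluate( test_x, test_y )             # score on held-out test data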
Another nice point: while the code is running, you can follow the progress of training in the console, like this:

3500/4861 [====================>.........] - ETA: 0s - loss: 0.5465 - acc: 0.7146
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5412 - acc: 0.7195
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5358 - acc: 0.7231
4861/4861 [==============================] - 0s - loss: 0.5390 - acc: 0.7219 - val_loss: 0.5399 - val_acc: 0.7338
Epoch 97/500

 500/4861 [==>...........................] - ETA: 0s - loss: 0.5182 - acc: 0.7360
1000/4861 [=====>........................] - ETA: 0s - loss: 0.5208 - acc: 0.7240
1500/4861 [========>.....................] - ETA: 0s - loss: 0.5250 - acc: 0.7240
2000/4861 [===========>..................] - ETA: 0s - loss: 0.5311 - acc: 0.7240
2500/4861 [==============>...............] - ETA: 0s - loss: 0.5264 - acc: 0.7280
3000/4861 [=================>............] - ETA: 0s - loss: 0.5302 - acc: 0.7240
3500/4861 [====================>.........] - ETA: 0s - loss: 0.5374 - acc: 0.7191
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5411 - acc: 0.7180
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5396 - acc: 0.7202
4861/4861 [==============================] - 0s - loss: 0.5402 - acc: 0.7194 - val_loss: 0.5379 - val_acc: 0.7412
Epoch 98/500

And when a GPU is available, you can run the script with a command like:
iwatobipen$ time THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python hogehoge.py
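
The same Theano settings can also be kept in ~/.theanorc so you don’t have to type the flags every time (standard Theano configuration, shown here as a convenience):

# ~/.theanorc
[global]
mode = FAST_RUN
device = gpu
floatX = float32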
The training history is stored in the object returned by model.fit, and you can inspect it later (see the plotting code at the end of the post).
This is useful during model development.
However, I have not managed to optimise the model yet; it is very sensitive to the hyperparameters.
(In my case, too many fingerprint bits, the ReLU activation, or too many layers did not work well.)
Is the training data too small to build a good model?

If you have any ideas, I’d like to hear them.

import pickle
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad

nb_epoch = 500
with open( 'dataset_fp256.pkl', 'rb' ) as f:
    trainx, trainy, testx, testy = pickle.load( f )

model = Sequential()
# 256-bit fingerprint in, one hidden layer of 50 units
model.add( Dense( output_dim=50, init='uniform', input_dim=256 ) )
model.add( Activation( 'tanh' ) )
#model.add( Dropout( 0.5 ) )
model.add( Dense( 1 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer='adagrad', loss='binary_crossentropy' )

# note: nb_epoch / show_accuracy are the old Keras API;
# newer versions use epochs= and metrics=['accuracy'] in compile()
hist = model.fit( trainx, trainy,
                  nb_epoch=nb_epoch,
                  batch_size=500,
                  validation_split=0.1,
                  show_accuracy=True )

score = model.evaluate( testx, testy, show_accuracy=True, verbose=1 )
print( 'test score:', score[0], 'test accuracy:', score[1] )

# plot the training history stored in the fit() return value
loss = hist.history['loss']
val_loss = hist.history['val_loss']
plt.plot( range( nb_epoch ), loss, label='loss' )
plt.plot( range( nb_epoch ), val_loss, label='val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.grid()
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res.png' )

[Figure: training and validation loss per epoch (res_gpu)]

