Deep learning is an old but newly revived machine learning technology. I have been interested in it for a while, but it has been difficult to optimise the large number of parameters.
I posted on my blog about a Python library named ‘chainer’ before. Chainer is a flexible framework for neural networks.
And some days ago, I found a new library named ‘keras’:
http://keras.io/
The library runs on top of Theano or TensorFlow, and it lets the user write code more simply.
So, I wrote some sample code to test the library.
First, I made a sample dataset from ChEMBL hERG assay data.
The input layer takes ECFP (Morgan fingerprints) calculated with RDKit.
I wrote the following code for a Python 3.x environment.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import pickle

# load the ChEMBL hERG export and keep binding assays with an activity value
df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE == "B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]  # drop rows with missing activity values
rows = df_bind.shape[0]

mols = []
act = []
fps = []

def act2bin( val ):
    # binary label: inactive (0) if IC50 > 10000 nM, otherwise active (1)
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    # the filtered frame keeps its original index, so missing rows are skipped by the try/except
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol != None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i] ) )
        else:
            pass
    except:
        pass

# 256-bit Morgan fingerprints (radius 2) converted to numpy arrays
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=256 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    fps.append( arr )  # append the numpy array, not the raw bit vector

fps = np.array( fps, dtype=np.float32 )
act = np.array( act, dtype=np.int32 )
#act = np.array( act, dtype=np.float32 )

train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=455 )
f = open( 'dataset_fp256.pkl', 'wb' )
pickle.dump( [ train_x, train_y, test_x, test_y ], f )
f.close()
Then I made an NN model using keras. (keras is easily installed with pip.)
First, I need to choose the model class; keras has two model classes, Sequential and Graph. I used the first one.
Then, define the model structure.
I think this style is very easy to understand: each layer is added with ‘model.add(….)’.
Dropout is also handled as a layer.
Of course, keras has many activation functions: tanh, ReLU, sigmoid, etc.
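As a rough sketch (using the same layer sizes and calls as the full script at the end of this post), stacking layers looks like this:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

# build the network layer by layer; each model.add() call appends one layer
model = Sequential()
model.add( Dense( output_dim=50, init='uniform', input_dim=256 ) )  # 256-bit ECFP input -> 50 hidden units
model.add( Activation( 'tanh' ) )      # the activation is itself a layer
model.add( Dropout( 0.5 ) )            # dropout is handled as a layer too
model.add( Dense( 1 ) )
model.add( Activation( 'sigmoid' ) )   # sigmoid output for the binary hERG label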
After defining the model, the user needs to compile it.
That’s all!
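For this binary classification task, the compile step from the script below is just:

# pair the model with an optimizer and a loss function
model.compile( optimizer='adagrad', loss='binary_crossentropy' )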
The fit function builds the predictive model! It’s like scikit-learn ;-).
When the user sets the validation_split argument, keras automatically splits the data into training and validation sets. Cool!
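Continuing the sketch above (with the same arguments and variable names as the full script), fitting with a 10% validation split looks like this:

# validation_split=0.1 holds out 10% of the training data for validation each epoch
hist = model.fit( trainx, trainy,
                  nb_epoch=500,
                  batch_size=500,
                  validation_split=0.1,
                  show_accuracy=True )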
Finally, model.evaluate() was used to score the model on the test data.
It all seems very simple.
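With show_accuracy=True, evaluate() returns the test loss and accuracy, as in the script below:

# score the trained model on the held-out test set
score = model.evaluate( testx, testy, show_accuracy=True, verbose=1 )
print( 'test loss:', score[0], ' test accuracy:', score[1] )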
An interesting point is that, while the code is running, the user can watch the training progress in the console.
Like this:
3500/4861 [====================>.........] - ETA: 0s - loss: 0.5465 - acc: 0.7146
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5412 - acc: 0.7195
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5358 - acc: 0.7231
4861/4861 [==============================] - 0s - loss: 0.5390 - acc: 0.7219 - val_loss: 0.5399 - val_acc: 0.7338
Epoch 97/500
 500/4861 [==>...........................] - ETA: 0s - loss: 0.5182 - acc: 0.7360
1000/4861 [=====>........................] - ETA: 0s - loss: 0.5208 - acc: 0.7240
1500/4861 [========>.....................] - ETA: 0s - loss: 0.5250 - acc: 0.7240
2000/4861 [===========>..................] - ETA: 0s - loss: 0.5311 - acc: 0.7240
2500/4861 [==============>...............] - ETA: 0s - loss: 0.5264 - acc: 0.7280
3000/4861 [=================>............] - ETA: 0s - loss: 0.5302 - acc: 0.7240
3500/4861 [====================>.........] - ETA: 0s - loss: 0.5374 - acc: 0.7191
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5411 - acc: 0.7180
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5396 - acc: 0.7202
4861/4861 [==============================] - 0s - loss: 0.5402 - acc: 0.7194 - val_loss: 0.5379 - val_acc: 0.7412
Epoch 98/500
And when a GPU is available, the script can be run with a command like this:
iwatobipen$ time THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python hogehoge.py
The training history is stored in the object returned by model.fit(), and the user can access it later.
This is useful for model development.
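For instance, the History object returned by fit() keeps the per-epoch metrics in a dictionary, so the learning curves can be pulled out afterwards (this is what the plotting part of the script below does):

# per-epoch training and validation loss from the fit above
loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]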
However, I could not optimise the model well yet. The model's performance is very sensitive to the parameters.
(In my case, settings with too many fingerprint bits, the ReLU activation function, or too many layers did not work well.)
Is the training dataset too small to build a good model?
If you have any ideas, I’d like to hear them.
import pickle
import matplotlib
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad

nb_epoch = 500

# load the fingerprint dataset prepared above
f = open( 'dataset_fp256.pkl', 'rb' )
trainx, trainy, testx, testy = pickle.load( f )
f.close()

# define the network: 256-bit ECFP input -> 50 tanh units -> sigmoid output
model = Sequential()
model.add( Dense( output_dim=50, init='uniform', input_dim=256 ) )
model.add( Activation( 'tanh' ) )
#model.add( Dropout( 0.5 ) )
model.add( Dense( 1 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer='adagrad', loss='binary_crossentropy' )

# train; validation_split holds out 10% of the training data for validation
hist = model.fit( trainx, trainy,
                  nb_epoch=nb_epoch,
                  batch_size=500,
                  validation_split=0.1,
                  show_accuracy=True )

# evaluate on the held-out test set
score = model.evaluate( testx, testy, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )

# plot training / validation loss curves and save the figure
loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( nb_epoch ), loss, label='loss' )
plt.plot( range( nb_epoch ), val_loss, label='val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.grid()
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res.png' )