Callback functions in Keras

I’m still building QSAR models using deep learning, and I ran into what looked like an overfitting problem. 🙂
The training error kept decreasing, but the validation error increased as the number of epochs grew. :-/
That looks like overfitting, and I could not avoid it even when I used dropout.

I tried lots of training conditions, but every attempt failed. In the end I concluded that training for too long does not help QSAR models.
I thought about why the overfitting occurred.
1st) I could not optimise the training conditions.
2nd) I did not have a large enough training dataset. That problem is hard to solve in an actual drug discovery project, though.
3rd) Training for too long was harmful, so early stopping should work better.
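
The idea behind the third point can be sketched in plain Python, without any deep learning library. This is only a simplified illustration of patience-based early stopping, not how any particular library implements it: the function name, the `patience` parameter and the loss values are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=4):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the waiting counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: improving for three epochs, then rising.
val_losses = [0.70, 0.62, 0.58, 0.60, 0.63, 0.66, 0.70, 0.75]
print("stopped at epoch", early_stopping_epoch(val_losses, patience=4))
```

With these made-up numbers the best loss is at epoch 2, and the loop gives up four epochs later instead of running to the end.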

Keras has some callback functions, and one of them does early stopping!
So, I wrote some code.
First, I made a dataset for hERG binary classification.


from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
# sklearn.cross_validation was removed in newer scikit-learn; model_selection replaces it
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
# "!= None" does not drop missing values (NaN); notnull() does
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
fps = []
def act2bin( val ):
    # Values above 10000 get label 0; everything else gets label 1.
    if val > 10000:
        return 0
    return 1

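# Iterate over the column values directly: after filtering, the DataFrame
# index is no longer 0..rows-1, so looking up df_bind.CANONICAL_SMILES[i] can fail.
for smi, val in zip( df_bind.CANONICAL_SMILES, df_bind.STANDARD_VALUE ):
    mol = Chem.MolFromSmiles( smi )
    if mol is not None:
        mols.append( mol )
        act.append( act2bin( val ) )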
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=1024 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    fps.append( arr )  # keep the converted NumPy array, not the RDKit bit vector

fps = np.array( fps, dtype = np.float32 )
act = np.array( act, dtype = np.int32 )

train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=5 )

# Use a context manager so the file is flushed and closed after writing.
with open( 'dataset_fp1024.pkl', 'wb' ) as f:
    pickle.dump( [train_x, train_y, test_x, test_y ], f )

[Figure: distribution of logIC50]



Then I wrote a model builder using Keras.
The EarlyStopping class provides early stopping. Here it monitors val_loss with patience = 4.
Using it is easy: just pass it in the callbacks option of fit().
In the following code I also set another callback, ‘ModelCheckpoint’, to save the parameters of the best model.
In the end I got better results not only on the training set but also on the validation set.
Keras is an easy and good tool for deep learning.
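
To make the role of the checkpoint callback a bit more concrete, here is a rough plain-Python sketch of what "save best only" means: a checkpoint is written only in epochs where the validation loss beats the best value seen so far. It is an illustration under my own assumptions, not the Keras implementation; the function name and the loss values are invented.

```python
def best_only_checkpoints(val_losses):
    """Return the epochs at which a save-best-only checkpoint would be written."""
    best = float("inf")
    saved_epochs = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # The validation loss improved, so this epoch's weights would
            # overwrite the previously saved file.
            best = loss
            saved_epochs.append(epoch)
    return saved_epochs


# Hypothetical validation losses with a late improvement at epoch 4.
val_losses = [0.70, 0.62, 0.58, 0.60, 0.57, 0.66]
saved = best_only_checkpoints(val_losses)
print("checkpoints written at epochs", saved)
print("file on disk holds the weights from epoch", saved[-1])
```

Combined with early stopping, this means the saved file keeps the best weights even though training itself runs for a few more (worse) epochs before it stops.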

import pickle
import matplotlib
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad, Adam, Adadelta
from keras.utils.visualize_util import plot
from keras.utils import np_utils
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

earlystopping = EarlyStopping( monitor='val_loss',
                             patience = 4,
                             mode='auto' )

modelcheckpoint = ModelCheckpoint( './bestmodel.hdf5',
                                   save_best_only=True )

nb_epoch = 250
# Use a context manager so the file is closed after loading.
with open( 'dataset_fp1024.pkl', 'rb' ) as f:
    train_x, train_y, test_x, test_y = pickle.load( f )
train_y = np_utils.to_categorical(train_y, 2)
test_y = np_utils.to_categorical(test_y, 2)

model = Sequential()
model.add( Dense( output_dim = 500, input_shape=(1024,) ) )
model.add( Activation( 'sigmoid' ) )
model.add( Dropout(0.2) )
model.add( Dense( output_dim = 100 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( output_dim = 20 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( 2 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer=Adadelta(lr=0.95), loss='categorical_crossentropy' )

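# The fit() call was truncated in the original post. nb_epoch comes from above;
# validation_split=0.1 and show_accuracy=True are assumed values (val_loss and
# the 'acc' history keys need validation data and accuracy tracking). The
# undefined name 'monitor' was dropped from the callbacks list.
hist = model.fit( train_x, train_y,
                  nb_epoch=nb_epoch,
                  validation_split=0.1,
                  show_accuracy=True,
                  callbacks=[earlystopping, modelcheckpoint] )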
score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )

loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( len(val_loss ) ), loss, label='loss' )
plt.plot( range( len( val_loss ) ), val_loss,  label = 'val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res_gpu.png' )
plt.figure()  # start a new figure so the loss curves do not end up in the accuracy plot
acc = hist.history[ 'acc' ]
val_acc = hist.history[ 'val_acc' ]
plt.plot( range( len( acc )), acc, label='accuracy' )
plt.plot( range( len( val_acc )), val_acc, label='val_accuracy' )
plt.legend( loc='best', fontsize=10 )
plt.xlabel( 'epoch' )
plt.ylabel( 'accuracy' )
plt.savefig( 'acc_gpu.png' )
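
One step in the script above that may not be obvious is np_utils.to_categorical(train_y, 2), which turns the 0/1 labels into two-column one-hot rows so that they match the two output units of the network. The following plain-Python sketch shows the idea; to_one_hot is a hypothetical helper written for illustration, not the Keras function itself.

```python
def to_one_hot(labels, num_classes):
    """Convert integer class labels into one-hot encoded rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the column that belongs to this class
        rows.append(row)
    return rows


# Three made-up labels: class 0, class 1, class 1.
print(to_one_hot([0, 1, 1], 2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```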


