Callback functions of Keras.

I’m still building QSAR models using deep learning, and I think I ran into an overfitting problem. :-)
The training error kept decreasing, but the validation error increased as the number of epochs grew. :-/
It looked like overfitting, and I could not avoid it even when I used dropout.

I tried lots of learning conditions, but every attempt failed… Finally I concluded that training for too long is not good for QSAR.
I thought about why overfitting occurred:
1st) I could not optimise the learning conditions.
2nd) I did not have a large enough training dataset. This problem is hard to solve in an actual drug discovery project.
3rd) Training for too long was not good, so early stopping is better.

Keras has some callback functions, and there is an early stopping function too!
So I wrote some code.
First, make a dataset for hERG binary classification.


from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]
df_bind = df_bind.reset_index( drop=True )  # re-index so positional access below works

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
fps = []
def act2bin( val ):
    # IC50 > 10000 nM (10 uM) is treated as inactive
    if val > 10000:
        return 0
    return 1

for i in range( rows ):
    smi = df_bind.CANONICAL_SMILES[i]
    mol = Chem.MolFromSmiles( smi )
    if mol is not None:
        mols.append( mol )
        act.append( act2bin( df_bind.STANDARD_VALUE[i] ) )
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=1024 )
    DataStructs.ConvertToNumpyArray( fp, arr )  # arr is resized to 1024 elements
    fps.append( arr )  # append the numpy array, not the RDKit bit vector

fps = np.array( fps, dtype = np.float32 )
act = np.array( act, dtype = np.int32 )

train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=5 )

f = open( 'dataset_fp1024.pkl', 'wb' )
pickle.dump( [ train_x, train_y, test_x, test_y ], f )
f.close()
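
By the way, it is worth checking the shapes and the active/inactive balance of the split before training (all names here are from the script above):

print( 'train:', train_x.shape, 'test:', test_x.shape )
print( 'actives in train: %d / %d' % ( train_y.sum(), len( train_y ) ) )
print( 'actives in test: %d / %d' % ( test_y.sum(), len( test_y ) ) )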

[Figure: distribution of logIC50 in the dataset]
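The post shows only the image; a minimal sketch of how such a histogram could be produced from df_bind (assuming STANDARD_VALUE holds IC50 in nM, and not the exact code used for the figure) is:

import matplotlib.pyplot as plt
# log-transform the IC50 values; keep only positive values for log10
logic50 = np.log10( df_bind.STANDARD_VALUE[ df_bind.STANDARD_VALUE > 0 ] )
plt.hist( logic50, bins=30 )
plt.xlabel( 'logIC50' )
plt.ylabel( 'count' )
plt.savefig( 'logic50_dist.png' )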

Then I wrote the model builder using Keras.
The EarlyStopping class provides the early stopping function, and it is easy to use: just set the callbacks option in fit().
In the following code I set another callback, ModelCheckpoint, to save the model parameters.
Finally I got better results, not only on the training set but also on the validation set.
Keras is an easy and good tool for deep learning.

import pickle
import matplotlib
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad, Adam, Adadelta
from keras.utils import np_utils
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

# stop training when val_loss has not improved for 4 consecutive epochs
earlystopping = EarlyStopping( monitor='val_loss',
                               patience=4,
                               mode='auto' )

# save the weights of the best model (lowest val_loss) seen so far
modelcheckpoint = ModelCheckpoint( './bestmodel.hdf5',
                                   save_best_only=True )

nb_epoch = 250
f = open( 'dataset_fp1024.pkl', 'rb' )
train_x, train_y, test_x, test_y = pickle.load( f )
f.close()
train_y = np_utils.to_categorical(train_y, 2)
test_y = np_utils.to_categorical(test_y, 2)

model = Sequential()
model.add( Dense( output_dim = 500, input_shape=(1024,) ) )
model.add( Activation( 'sigmoid' ) )
model.add( Dropout(0.2) )
model.add( Dense( output_dim = 100 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( output_dim = 20 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( 2 ) )
model.add( Activation( 'softmax' ) )  # softmax pairs with categorical_crossentropy

model.compile( optimizer=Adadelta(lr=0.95), loss='categorical_crossentropy' )

hist = model.fit( train_x, train_y,
                  nb_epoch=nb_epoch,
                  validation_split=0.1,  # hold out part of the training data to monitor val_loss; 0.1 is an arbitrary choice
                  show_accuracy=True,    # needed so 'acc'/'val_acc' appear in history (Keras 1)
                  callbacks=[ earlystopping, modelcheckpoint ] )
score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )

loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( len( loss ) ), loss, label='loss' )
plt.plot( range( len( val_loss ) ), val_loss, label='val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res_gpu.png' )

plt.clf()  # start a new figure; otherwise the accuracy curves overlay the loss plot
acc = hist.history[ 'acc' ]
val_acc = hist.history[ 'val_acc' ]
plt.plot( range( len( acc ) ), acc, label='accuracy' )
plt.plot( range( len( val_acc ) ), val_acc, label='val_accuracy' )
plt.legend( loc='best', fontsize=10 )
plt.xlabel( 'epoch' )
plt.ylabel( 'accuracy' )
plt.savefig( 'acc_gpu.png' )
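
Because ModelCheckpoint with save_best_only=True saved the best weights to bestmodel.hdf5, they can be loaded back into the same model and re-evaluated, for example:

# restore the best weights saved by ModelCheckpoint and evaluate them on the test set
model.load_weights( './bestmodel.hdf5' )
best_score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=0 )
print( ' best testscore: ', best_score[0], ' best testaccuracy: ', best_score[1] )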


