Callback function of keras.

I’m still building QSAR models using deep learning. And I thought I got problem of over fitting. :-)
Training error was decreasing but, validation error was increasing depend on number of epochs. :-/
It seems over fitting and I could not avoid the event even if I used drop out function.

Tried lots of learning conditions but all challenge was failed…..finally I thought that too long learning did not have good effect for QSAR.
I thought reason why over fitting was occurred.
1st) I could not optimise learning conditions.
2nd) I did not have enough amount of training dataset. But the problem is difficult to solve in the actual drug discovery project.
3rd) Long learning time was not good, so early stopping is better.

Keras has same callback functions. And there is early stopping function too!
So, I wrote some code.
1st, make dataset for hERG binary classification.

 

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE != None ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
fps = []
def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol != None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i]) )
        else:
            pass
    except:
        pass
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=1024 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    fps.append( fp )

fps = np.array( fps, dtype = np.float32 )
act = np.array( act, dtype = np.int32 )

train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=5 )

f = open('dataset_fp1024.pkl', 'wb')
pickle.dump( [train_x, train_y, test_x, test_y ], f )
f.close()

Distribution of logIC50

ic50dist

 

Then wrote model builder using keras.
EarlyStopping Class provides early stop function.
And If I want to use the function, it’s easy. Just set callbacks option in fit().
Following code, I set another callback function ‘ModelCheckpoint’ to save parameters.
Finally I got better results not only training set but also validation set.
Keras is easy and good tool for deep learning.

import pickle
import matplotlib
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad, Adam, Adadelta
from keras.utils.visualize_util import plot
from keras.utils import np_utils
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint

earlystopping = EarlyStopping( monitor='val_loss',
                             patience = 4,
                             verbose=0,
                             mode='auto' )

modelcheckpoint = ModelCheckpoint( './bestmodel.hdf5',
                                   monitor='val_loss',
                                   verbose=0,
                                   save_best_only=True )

nb_epoch = 250
f = open( 'dataset_fp1024.pkl', 'rb' )
train_x, train_y, test_x, test_y = pickle.load( f )
train_y = np_utils.to_categorical(train_y, 2)
test_y = np_utils.to_categorical(test_y, 2)

model = Sequential()
model.add( Dense( output_dim = 500, input_shape=(1024,) ) )
model.add( Activation( 'sigmoid' ) )
model.add( Dropout(0.2) )
model.add( Dense( output_dim = 100 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( output_dim = 20 ) )
model.add( Activation( 'sigmoid' ))
model.add( Dropout(0.2) )

model.add( Dense( 2 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer=Adadelta(lr=0.95), loss='categorical_crossentropy' )

hist = model.fit( train_x, train_y ,
                     nb_epoch=nb_epoch,
                     batch_size=50,
                     validation_split=0.1,
                     show_accuracy=True,
                     callbacks=[earlystopping, modelcheckpoint, monitor])
score = model.evaluate( test_x, test_y, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )
model.summary()

loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( len(val_loss ) ), loss, label='loss' )
plt.plot( range( len( val_loss ) ), val_loss,  label = 'val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.grid()
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res_gpu.png' )
plt.close()
acc = hist.history[ 'acc' ]
val_acc = hist.history[ 'val_acc' ]
plt.plot( range( len( acc )), acc, label='accuracy' )
plt.plot( range( len( val_acc )), val_acc, label='val_accuracy' )
plt.xlabel( 'epoch' )
plt.ylabel( 'accuracy' )
plt.savefig( 'acc_gpu.png' )

Accuracy…
acc_gpu
Loss….
res_gpu

Retrieve MMP square from MMP database.

I wrote blog post about mmp and neo4j somedays ago.
I thought that I could retrieve mmp square from neo4j.
MMP square means 4 molecules pairs like a following relationship.
mol1 => mol2 (MMP), mol2 => mol3 (MMP), mol3 => mol4 (MMP), mol4 => mol1(MMP)
The relationship is important to think about additive or non additive SAR.
And cypher can search the square very simply.
Let’s try it.
At first, I prepare dataset from chemblDB CYP3A4 inhibition data.

import pandas as pd
import numpy as np
df = pd.read_table( "bioactivity-16_12-40-28.txt", header=0, low_memory=False )
df.shape
Out[6]:(17143, 55)

df2 = df[["CANONICAL_SMILES", "MOLREGNO" ]]
df2.to_csv('chembl_cyp3a4.csv', index=False)

from rdkit import Chem
from rdkit import rdBase

!python mmpa/rfrag.py < ./chembl_cyp3a4.csv > ./cyp3a4_frag.txt
!python mmpa/indexing.py -s -r 0.1 < ./cyp3a4_frag.txt > ./cyp3a4_mmp.txt
mmps= pd.read_csv(  'cyp3a4_mmp.txt' , header=None, names = ('smi1','smi2','id1','id2','tform','core'))
mmps.shape

Out[23]:(45096, 6)

Data preparation was finished.
Then read data from neo4j-shell
I used LOAD CSV WITH HEADERS function to do it.

neo4j-sh (?)$ LOAD CSV WITH HEADERS FROM 'file:///path/mmp_cyp3a4/chembl_cyp3a4mmp.csv' AS line
> MERGE (a:mol {smi:line.smi1, molregno: line.id1})
> MERGE (b:mol {smi:line.smi2, molregno: line.id2})
> MERGE (a)-[:MMP {tform:line.tform, core:line.core} ]->(b);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 2463
Relationships created: 36388
Properties set: 77702
Labels added: 2463
191456 ms

OK, Finally Search mmp square using cypher.
Cypher does not allow query that has same node symbol in a path, so I wrote comma separated query.

neo4j-sh (?)$ MATCH (n)-[r1]->(a)-[r2]->(b)-[r3]->(c), (c)-[r4]->(n) RETURN n.smi,r1.tform,a.smi,r2.tform,b.smi,r3.tform,c.smi,r4.tform LIMIT 1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n.smi                               | r1.tform          | a.smi                              | r2.tform          | b.smi                               | r3.tform          | c.smi                              | r4.tform          |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "Cc1cccc(CNc2cc(ncn2)c3ccccc3Cl)c1" | "Cl[*:1]>>C[*:1]" | "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" | "C[*:1]>>CO[*:1]" | "COc1ccccc1c2cc(NCc3cccc(C)c3)ncn2" | "CO[*:1]>>C[*:1]" | "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" | "C[*:1]>>Cl[*:1]" |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
14266 ms

Wow! It works fine!
Next visualise result using rdkit!
New version of rdkit can draw molecule as SVG very easily.

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.ipython_useSVG=True
tforms= "Cc1cccc(CNc2cc(ncn2)c3ccccc3Cl)c1", "Cl[*:1]>>C[*:1]" , "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" , "C[*:1]>>CO[*:1]" , "COc1ccccc1c2cc(NCc3cccc(C)c3)ncn2" , "CO[*:1]>>C[*:1]" , "Cc1cccc(CNc2cc(ncn2)c3ccccc3C)c1" ,"C[*:1]>>Cl[*:1]"
# Ignored following error / the following code can not read transforms.
molobj = [ Chem.MolFromSmiles(smi) for smi in tforms ]
# Draw them!
Draw.MolsToGridImage( [molobj[0],molobj[2], molobj[4], molobj[6]], molsPerRow=4 )

mmp_square

I want to develop chemoinformatics tools using rdkit and neo4j.

visualise progress of loop.

Somedays ago, I read a blog post about useful python library and I found cool library.
The name of library was ‘tqdm’.
The library can visualise progress bar very easily.
For example, if I want to visualise progress of loops, I could do it only call tqdm.
That’s all. ;-)
Then I could get following progress bar.
Hmm That’s cool.

In [1]: from rdkit import Chem
In [2]: from tqdm import tqdm
In [3]: sdf = Chem.SDMolSupplier( 'cdk2.sdf' )
In [4]: mols = [ mol for mol in tqdm( sdf ) ]
100%|█████████████████████████████████████████| 47/47 [00:00<00:00, 1081.43it/s]
In [8]: for mol in tqdm( mols ):
   ...:     print( Chem.MolToSmiles( mol ) )
   ...:     
  0%|                                                    | 0/47 [00:00<?, ?it/s]CC(C)C(=O)COc1nc(N)nc2[nH]cnc12
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1
~~~~~~~~~cut~~~~~~~~~~~~~~~~
O=C1Nc2ccc3ncsc3c2C1=CNc1ccc(S(=O)(=O)Nc2ccccn2)cc1
100%|█████████████████████████████████████████| 47/47 [00:00<00:00, 3283.24it/s]

Use Neo4j to store MMP data.

I read news about ‘panama papers’.
Data of panama pares was analysed with graph database! It was exciting for me.
http://neo4j.com/blog/analyzing-panama-papers-neo4j/
So, I’m interested in neo4j.
Fortunately, Mac user can install neo4j by using homebrew. ;-)
Neo4j has original SQL like language named Cypher.
I used cypher to read data following reason.
At first I used py2neo but it was difficult to handle large dataset. Because py2neo communicates with neo4j server using REST and it took long time to read data, caused time out.

Fist step of starting Cyper is Create node, and relation.
It can with simple way.

Node is created following command.
CREATE ( n: name { property:value, …. } )
And relation is created following.
CREATE (n)-[ r:name {property: value, ….} ]->(n1)
(n)-[r]->() represents relation represents directed graph.
And It was easy to set properties. ;-)
Let’s make sample dataset. I got mmp data from following url.
https://zenodo.org/record/8418#.VxLvURN97Uo
I renamed data and checked data.

iwatobipen$ head ChEMBL17_IC50_RECAP_MMP_list.csv 
Target_ChEMBLID,Cpd1_ChEMBLID,Cpd2_ChEMBLID,KeyFragment,Transformation,Cpd1_SMILES,Cpd2_SMILES
CHEMBL1075097,CHEMBL2348488,CHEMBL2326095,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097,CHEMBL2326085,CHEMBL2348486,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097,CHEMBL2348488,CHEMBL2348486,[R1]CCC(CCCCB(O)O)(C(=O)O)N,[R1]N(CC)CC>>[R1]N1CCCC1,OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-],OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
iwatobipen$ wc ChEMBL17_IC50_RECAP_MMP_list.csv 
  240323  240323 60542670 ChEMBL17_IC50_RECAP_MMP_list.csv

The data has mmp information of ChEMBL17.
Then load data from neo4j-shell.
The csv file has header, so I used command LOAD CSV WITH HEADERS FROM…
I got 10 thousand data from original dataset.
iwatobioen $ head -n 10000 ChEMBL17_IC50_RECAP_MMP_list.csv > ChEMBL17_IC50_RECAP_MMP_10000list.csv
To load 10 thousand of data, it took lots time.

neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:////Users/iwatobipen/develop/py3env/neo4jtest/ChEMBL17_IC50_RECAP_MMP_10000list.csv" AS line
> MERGE (m1:mol { molid: line.Cpd1_ChEMBLID, smi: line.Cpd1_SMILES })
> MERGE (m2:mol { molid: line.Cpd2_ChEMBLID, smi: line.Cpd2_SMILES })
> CREATE (m1)-[r:MMP { transform:line.Transformation, targetid:line.Target_ChEMBLID}]->(m2); # Maybe MERGE is better...
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 3522
Relationships created: 9999
Properties set: 27042
Labels added: 3522
63150 ms

Hmm It needed too long time, I want to know more efficient way.
Anyway, I loaded data to graph database.
Then access http://localhost:7474 (default settings.)

I got following image …
neo4jtop

Write Query. “Select node and count relation as degree”.

$MATCH (node)-[r]->() RETURN node, count (r) AS degree ORDER BY degree DESC LIMIT 25;

query_res_cypher2

query_cypher1

I think graph database is useful to detect molecular matched series(MMS) because MMS is represented a path like nodeA -> nodeB -> nodeC ….
And also Molecular Matched square is represented a path like nodeA->nodeB->nodeC->nodeA.
These path is easily detect by using Cypher.

MATCH (n)-->(a)-->(b)
RETURN n, a, b

or

MATCH (n)-->(a)-->(b)-->(n)
RETURN n, a, b

Try it…

$MATCH (n)-->(a)-->(n) RETURN n,a LIMIT 100;

Then I got following result.
query_res_cypher3

I think neo4j is interesting for chemoinformatics and I want to analyse in-house and public data ASAP.

New library for deep learning

Deep learning is old but new technology of machine learning. I have been interested in this technology however it was difficult to optimise lots of parameters.

I posted on my blog about python library named ‘chainer’ before. Chainer is one of flexible frame work for NN.
And somedays ago, I found new library named ‘keras’
http://keras.io/
The library uses theano or tensorflow. And user be able to develop code more simply.
So, I coded sample code to test the library.
First, I made sample dataset from Chembl hERG assay.
Input layer was calculated by rdkit as ECFP.
I wrote following code python3.x env.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.cross_validation import train_test_split
import pandas as pd
import numpy as np
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE=="B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE != None ]
rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]
fps = []
def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    try:
        smi = df_bind.CANONICAL_SMILES[i]
        mol = Chem.MolFromSmiles( smi )
        if mol != None:
            mols.append( mol )
            act.append( act2bin( df_bind.STANDARD_VALUE[i]) )
        else:
            pass
    except:
        pass
for mol in mols:
    arr = np.zeros( (1,) )
    fp = AllChem.GetMorganFingerprintAsBitVect( mol, 2, nBits=256 )
    DataStructs.ConvertToNumpyArray( fp, arr )
    fps.append( fp )

fps = np.array( fps, dtype = np.float32 )
act = np.array( act, dtype = np.int32 )
#act = np.array(act, dtype=np.float32)
train_x, test_x, train_y, test_y = train_test_split( fps, act, test_size=0.3, random_state=455 )

f = open('dataset_fp256.pkl', 'wb')
pickle.dump( [train_x, train_y, test_x, test_y ], f )
f.close()

Then I made NN model using keras. (keras is easily installed by pip.)
First, I need to set model class, keras has two model class, Sequential, Graph. I used first one.
Then, define the model structure.
I think the manner is very easy to understand, I set each layer using ‘model.add(….)’.
Drop out is also handled as a layer.
Of course, keras has many activation function. Tanh, ReLu, Sigmoid, etc.
After define the model, user need to compile the model.
That’s all!
Fit function can build the predictive model! It’s like scikit-learn ;-).
When user set validation_split argument, keras automatically split train and validation data. Cool!
Finally to predict the data, model.evaluate() was used.
It seems very simple way.
And interesting point was that, when running the code, user can check the progress of learning through the console.
Like that.

3500/4861 [====================>.........] - ETA: 0s - loss: 0.5465 - acc: 0.7146
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5412 - acc: 0.7195
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5358 - acc: 0.7231
4861/4861 [==============================] - 0s - loss: 0.5390 - acc: 0.7219 - val_loss: 0.5399 - val_acc: 0.7338
Epoch 97/500

 500/4861 [==>...........................] - ETA: 0s - loss: 0.5182 - acc: 0.7360
1000/4861 [=====>........................] - ETA: 0s - loss: 0.5208 - acc: 0.7240
1500/4861 [========>.....................] - ETA: 0s - loss: 0.5250 - acc: 0.7240
2000/4861 [===========>..................] - ETA: 0s - loss: 0.5311 - acc: 0.7240
2500/4861 [==============>...............] - ETA: 0s - loss: 0.5264 - acc: 0.7280
3000/4861 [=================>............] - ETA: 0s - loss: 0.5302 - acc: 0.7240
3500/4861 [====================>.........] - ETA: 0s - loss: 0.5374 - acc: 0.7191
4000/4861 [=======================>......] - ETA: 0s - loss: 0.5411 - acc: 0.7180
4500/4861 [==========================>...] - ETA: 0s - loss: 0.5396 - acc: 0.7202
4861/4861 [==============================] - 0s - loss: 0.5402 - acc: 0.7194 - val_loss: 0.5379 - val_acc: 0.7412
Epoch 98/500

And when user can use GPU, type command like
iwatobipen$ time THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python hogehoge.py
Training history was stored in model.fit object and user can call history later.
It was useful for new model development.
However I could not optimise model yet. Model response is very sensitive to the parameters.
(In my case, too many bits, ReLu function, too many number of layers setting was not worked well.)
Is training data too small to build model?

If you have any ideas, I’d like to hear them.

import pickle
import matplotlib
import matplotlib.pyplot as plt

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adagrad
nb_epoch = 500
f = open( 'dataset_fp256.pkl', 'rb' )
trainx, trainy, testx, testy = pickle.load( f )

model = Sequential()
model.add( Dense( output_dim = 50, init='uniform', input_dim=256 ) )
model.add( Activation( 'tanh' ) )
#model.add(Dropout(0.5))
model.add( Dense( 1 ) )
model.add( Activation( 'sigmoid' ) )

model.compile( optimizer='adagrad', loss='binary_crossentropy' )

hist = model.fit( trainx, trainy ,
                     nb_epoch=nb_epoch,
                     batch_size=500,
                     validation_split=0.1,
                     show_accuracy=True)

score = model.evaluate( testx, testy, show_accuracy=True, verbose=1 )
print( ' testscore: ', score[0], ' testaccuracy: ', score[1] )

loss = hist.history[ 'loss' ]
val_loss = hist.history[ 'val_loss' ]
plt.plot( range( nb_epoch ), loss, label='loss' )
plt.plot( range( nb_epoch ), val_loss,  label = 'val_loss' )
plt.legend( loc='best', fontsize=10 )
plt.grid()
plt.xlabel( 'epoch' )
plt.ylabel( 'loss' )
plt.savefig( 'res.png' )

res_gpu


					

Scaffold replacements in VEGFR-2 inhibitors.

I often think about scaffold replacement strategy in my project.

Sometime, scaffold replacement is powerful strategy. However I think brute-force search of another scaffold is not effective. It too heavy task to build new core frame if there are no commercially available starting material.

Recently researcher in Novartis was reported a letter in JMCL. The title is attractive for me.

http://pubs.acs.org/doi/abs/10.1021/acsmedchemlett.6b00018

The author had already potent compound that has some issues, hERG, solubility etc.

To solve these issues, they tried to replace core from indole to any other 56fused rings.

In table1, There are some examples. And Entry 5 showed good potency and improve solubility and hERG. I think the result depends on reduce basicity and break aromaticity of the scaffold.

And the author described some potent derivatives. Azaindole derivative showed good profile (but loss potency…).

SAR expansion of core structure was informative for me, but unfortunately, there aren’t the discussion about why they chose these core structures.

Is it from medchem’s experience, or rationally designed strategy, or ….. ?

 

Where is the key to open the door?

Ran Nihondaira-sakura marathon.

I entered nihondaira-sakura marathon and ran 23.5km! today. ;-)
http://nihondaira-sakura-marathon.jp/
I was out of practice, but fortunately I could finish at the last minute.

The course was tough. At first 10km was uphill and the downhill 7km…..
img_map_big

Race condition was very good. Today was cloudy and not so hot.
I enjoyed the marathon festival, and cherry blossom. Also! I was exhausted….
Thanks a lot for warm and kindness cheer.

And I’ll start running training again!