molecular classification using molecular images (failed)

In my opinion, good medicinal chemists have a good eye.
I think they can classify molecules efficiently. Obviously, though, most of them can't read molecular fingerprints. My question is: how do they design molecules, in 3D, in 2D, or from their own experience?
If they design molecules based on 2D structures, deep learning could be quite useful.
Deep learning is a powerful method for image classification these days.
So I tried to write molecular image classification code.
First, I wrote a mol-to-image script, shown below.
I used a hERG dataset from ChEMBLDB.

from rdkit import Chem
from rdkit.Chem import Draw
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import cv2 as cv
import glob
import pickle

df = pd.read_table('bioactivity-15_13-09-28.txt', header=0)
df_bind = df[ df.ASSAY_TYPE == "B" ]
df_bind = df_bind[ df_bind.STANDARD_VALUE.notnull() ]
df_bind = df_bind[ df_bind.STANDARD_VALUE >= 0 ]
df_bind = df_bind.reset_index( drop=True )  # positional indexing below relies on a clean index

rows = df_bind.shape[ 0 ]
mols = [ ]
act = [ ]

# binarise activity: inactive ( > 10000 nM ) -> 0, active -> 1
def act2bin( val ):
    if val > 10000:
        return 0
    else:
        return 1

for i in range( rows ):
    smi = df_bind.CANONICAL_SMILES[i]
    mol = Chem.MolFromSmiles( smi )
    if mol is not None:
        mols.append( mol )
        act.append( act2bin( df_bind.STANDARD_VALUE[i]) )

# save molecule images ( the ./posi and ./nega directories must exist )
for idx, mol in enumerate( mols ):
    if act[ idx ] == 1:
        Draw.MolToFile( mol, "./posi/idx_{}.png".format( idx ), size=(150,150) )
    else:
        Draw.MolToFile( mol, "./nega/idx_{}.png".format( idx ), size=(150,150) )

# transpose image shape from [ 150, 150, 3 ] (HWC) to [ 3, 150, 150 ] (CHW)
def transimage( image ):
    im = cv.imread( image )
    im = im.transpose( [ 2,0,1 ] )
    return im
# get image filenames
pos = glob.glob( 'posi/*.png' )
neg = glob.glob( 'nega/*.png' )

X = []
Y = []

for posf in pos:
    x = transimage( posf )
    X.append( x )
    Y.append( 1 )
for negf in neg:
    x = transimage( negf )
    X.append( x )
    Y.append( 0 )
X = np.asarray(X)
Y = np.asarray(Y)

x_train, x_test, y_train, y_test = train_test_split( X,Y, test_size=0.2, random_state=123 )

f = open( 'imagedataset.pkl', 'wb' )
pickle.dump([ ( x_train,y_train ), ( x_test, y_test ) ], f)
f.close()
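Before training, it is worth checking the class balance of the labels produced by act2bin, because a heavily skewed split makes raw accuracy misleading. A minimal sketch (the label array here is synthetic, standing in for the real `act` list above):

```python
import numpy as np

def class_balance(y):
    """Return the fraction of samples in each class."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes.tolist(), (counts / len(y)).tolist()))

# synthetic labels for illustration only
labels = [1] * 62 + [0] * 38
print(class_balance(labels))  # {0: 0.38, 1: 0.62}
```

If one class dominates, accuracy alone will not tell you whether the network learned anything.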

Next, I coded the learner and classifier.
I used a CNN.

import numpy as np

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.utils import np_utils
import pickle

( x_train, y_train ), ( x_test, y_test ) = pickle.load( open('imagedataset.pkl', 'rb') )

batch_size = 100
nb_classes = 2
nb_epoch = 20
im_rows , im_cols = 150, 150
im_channels = 3

x_train = x_train.astype( "float32" )
x_test = x_test.astype( "float32" )
x_train /= 255
x_test /= 255
y_train = np_utils.to_categorical( y_train, nb_classes )
y_test = np_utils.to_categorical( y_test, nb_classes )

print( x_train.shape[0], 'train samples' )
print( x_test.shape[0], 'test samples' )


model = Sequential()
model.add( Convolution2D( 10, 3, 3,
                            border_mode = 'same',
                            input_shape = ( im_channels, im_rows, im_cols ) ) )
model.add( Activation( 'relu' ) )
model.add( MaxPooling2D( pool_size=(2, 2) ) )
model.add( Convolution2D( 10, 3, 3 ) )
model.add( Activation( 'relu' ) )
model.add( Convolution2D( 10, 3, 3 ) )
model.add( Activation( 'relu' ) )
model.add( MaxPooling2D( pool_size=( 2, 2 ) ) )
model.add( Flatten() )
model.add( Dense( 10 ) )
model.add( Activation( 'relu' ) )
model.add( Dense(nb_classes) )
model.add( Activation('softmax') )
model.compile( loss='categorical_crossentropy',
               optimizer='adadelta',
               metrics=['accuracy'],
               )

hist = model.fit( x_train, y_train,
                  batch_size = batch_size,
                  nb_epoch = nb_epoch,
                  verbose = 1,
                  validation_data = ( x_test, y_test ))
print( model.summary() )
score = model.evaluate( x_test, y_test, verbose=0 )

Then I ran the code.

iwatobipen$ python learn_mol_image.py 
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is disabled, cuDNN 5004)
4364 train samples
1092 test samples
Train on 4364 samples, validate on 1092 samples
Epoch 1/20
4364/4364 [==============================] - 15s - loss: 5.9538 - acc: 0.6244 - val_loss: 6.2288 - val_acc: 0.6136
Epoch 2/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
(epochs 3 to 19 are identical to epoch 2)
Epoch 20/20
4364/4364 [==============================] - 15s - loss: 6.0461 - acc: 0.6249 - val_loss: 6.2288 - val_acc: 0.6136
____________________________________________________________________________________________________
Layer (type)                       Output Shape        Param #     Connected to                     
====================================================================================================
convolution2d_1 (Convolution2D)    (None, 10, 150, 150)280         convolution2d_input_1[0][0]      
____________________________________________________________________________________________________
activation_1 (Activation)          (None, 10, 150, 150)0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)      (None, 10, 75, 75)  0           activation_1[0][0]               
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)    (None, 10, 73, 73)  910         maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
activation_2 (Activation)          (None, 10, 73, 73)  0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)    (None, 10, 71, 71)  910         activation_2[0][0]               
____________________________________________________________________________________________________
activation_3 (Activation)          (None, 10, 71, 71)  0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)      (None, 10, 35, 35)  0           activation_3[0][0]               
____________________________________________________________________________________________________
flatten_1 (Flatten)                (None, 12250)       0           maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
dense_1 (Dense)                    (None, 10)          122510      flatten_1[0][0]                  
____________________________________________________________________________________________________
activation_4 (Activation)          (None, 10)          0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                    (None, 2)           22          activation_4[0][0]               
____________________________________________________________________________________________________
activation_5 (Activation)          (None, 2)           0           dense_2[0][0]                    
====================================================================================================
Total params: 124632
____________________________________________________________________________________________________
None

It’s hopeless…. ;-(
The model can’t learn anything from this dataset.

Hmm….. Is the image size too small?
Maybe describing a molecule as an image is not the right approach…..
I thought it might work at least a little, but perhaps that was naive.
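One diagnostic for a run like this: when loss and accuracy are frozen from epoch 2 onward, the network has usually collapsed to predicting the majority class, and val_acc equals the majority-class baseline. A small sketch (the 670/422 split is hypothetical, picked only so the ratio matches the reported 0.6136):

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent class."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)

# hypothetical validation labels whose ratio matches the log above
y_val = [1] * 670 + [0] * 422
print(round(majority_baseline(y_val), 4))  # 0.6136
```

Since the baseline matches the flat val_acc, the CNN is doing no better than a constant predictor.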


2 thoughts on “molecular classification using molecular images (failed)”

  1. Maybe try a target other than hERG. The hERG set may be one of the most diverse data sets (it is an anti-target, reported across diverse series, and generally not something people want to go for).
    Also, one thing that could be done to see whether images work is to take a kinase and a GPCR target and see whether the two can be distinguished using the image as a descriptor.

    It would have been nice if it had worked here, but the failure is not that unexpected.

    • Thank you for your comment! Yes, the hERG binder data set is very diverse. Also, with this method I have to think about rotation and/or translation of the images (equivalent to molecular alignment).
      I’ll try a kinase or GPCR target.
      BTW, ideally I expect that deep learning can extract molecular features from images (i.e. pharmacophores).
      That would mean artificial intelligence for drug design…..
      I’m still thinking about what the image of a molecule is. 😉
