Molecular encoder/decoder (VAE) with python3.x (not new topic) #rdkit #chemoinformatics

The first day of the 10-day holiday is rainy, and my kid and I are going to a dodgeball tournament.

Two years ago, I tried to modify keras-molecules, which is an implementation of a molecular encoder/decoder. The code was written for Python 2.x, so I wanted to get it running on Python 3.6. I stopped the modification because I mostly use PyTorch and had found molencoder, a PyTorch implementation. https://github.com/cxhernandez/molencoder

Today I tried to modify keras-molecules again.
I think it is almost done, and the code now runs on Python 3.x with Keras 2.x and the TensorFlow backend.

https://github.com/iwatobipen/keras-molecules

I checked the code on Google Colab, because Colab provides a GPU environment, which is useful for deep learning. At first, I installed conda and several packages that are used in keras-molecules.

The following code was written on Google Colab.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
# install packages via conda command
!time conda install -y -c conda-forge rdkit
!conda install -y -c conda-forge h5py
!conda install -y -c conda-forge scikit-learn
!conda install -y -c conda-forge pytables
!conda install -y -c conda-forge progressbar
!conda install -y -c conda-forge pandas
!conda install -y -c conda-forge keras

In my case, the RDKit installation took a very long time, 20 min or more… I waited patiently.

After installation, check RDKit. An old version was installed, but it does not matter.

from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
> 2018.09.1

Then clone the code from the GitHub repo, change the current directory, and download the sample SMILES. keras-molecules uses git-lfs for data storage, so the files stored in the repo are not the real .h5 files. I skipped installing git-lfs on Colab.

!git clone https://github.com/iwatobipen/keras-molecules
os.chdir('keras-molecules/')
# original data is stored with git-lft. So it can't get directly with git.
# So I uploaded small sample data to my repo.
!wget https://github.com/iwatobipen/playground/raw/master/smiles_50k.h5

Now everything is ready; preprocessing is needed first, and it is easy to do.
After it finishes, the processed.h5 file is stored in data/.

!python preprocess.py smiles_50k.h5 data/processed.h5

Then load the data and run training. I did it in a notebook, so the following code is different from the original README section.

import train
data_train, data_test, charset = train.load_dataset('data/processed.h5')
model = train.MoleculeVAE()
model.create(charset,  latent_rep_size=292)
checkpointer = train.ModelCheckpoint(filepath = 'model.h5',
                                   verbose = 1,
                                   save_best_only = True)

reduce_lr = train.ReduceLROnPlateau(monitor = 'val_loss',
                                  factor = 0.2,
                                  patience = 3,
                                  min_lr = 0.0001)
model.autoencoder.fit( data_train,
                  data_train,
                  shuffle = True,
                  epochs = 20,
                  batch_size = 600,
                  callbacks = [checkpointer, reduce_lr],
                  validation_data = (data_test, data_test)
            )

Training took 20-30 min on Google Colab with a GPU. I don't recommend running the code on a PC without a GPU; training would take too long.

After training, I ran interpolation. The interpolate script samples molecules between SOURCE and DEST in latent space. Ideally, the approach can change a molecule gradually because the latent space is continuous. It sounds interesting.
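Conceptually, the interpolation just walks along the straight line between the two latent vectors and decodes each intermediate point. Here is a minimal sketch of that idea (my own illustration, not the actual keras-molecules API):

import numpy as np

def latent_walk(z_source, z_dest, steps):
    # yield `steps` points on the straight line from z_source to z_dest
    for i in range(steps):
        yield z_source + (z_dest - z_source) * i / float(steps)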

import interpolate
h5f = interpolate.h5py.File('data/processed.h5' ,'r')
charset = list(h5f['charset'][:])
charset = [ x.decode('utf-8') for x in charset ]
h5f.close()

LATENT_DIM = 292
WIDTH = 120
model.load(charset, 'model.h5', latent_rep_size = LATENT_DIM)

I set two molecules for SOURCE and DEST.

SOURCE = 'CC2CCN(C(=O)CC#N)CC2N(C)c3ncnc1[nH]ccc13'
DEST = 'CCS(=O)(=O)N1CC(C1)(CC#N)N2C=C(C=N2)C3=C4C=CNC4=NC=N3'
STEPS = 10
results = interpolate.interpolate(SOURCE, DEST, STEPS, charset, model, LATENT_DIM, WIDTH)
for result in results:
  print(result[2])

The result was….

Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCcccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCCCccccccccccccc Cr1ccccCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccc Cr1CcccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCccccccc Cr1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccc CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Oops! None of the outputs are valid molecules; they are almost all just runs of C/c…

The model should improve with a larger number of epochs and more training data.

A naive VAE approach has difficulty generating valid molecules. Many approaches to molecule generation have now been reported: JT-VAE, Grammar-VAE, etc. What kind of generator do you like?

Today's code is redundant, but I uploaded it to my GitHub repo.

https://github.com/iwatobipen/playground/blob/master/mol_vae.ipynb


Try GCN QSPR with pytorch based graph library #RDKit #Pytorch #dgl

Recently, many machine learning articles use PyTorch for their implementations. And I found a very attractive package for graph-based deep learning named DGL (Deep Graph Library). The package supports PyTorch and MXNet as backends. The authors provide not only the package but also very nice documentation. I read the documentation and tried a GCN for QSPR with DGL.

https://www.dgl.ai/pages/about.html

The package can build a graph object from a networkx graph and convert a graph object back to a networkx object.
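For example, a quick round trip looks like this (a minimal sketch assuming the DGL 0.x API used in this post, where DGLGraph accepts a networkx graph):

import networkx as nx
import dgl

nx_g = nx.path_graph(4)        # simple 4-node path graph
g = dgl.DGLGraph(nx_g)         # build a DGL graph from the networkx graph
nx_back = g.to_networkx()      # convert it back to a networkx graph
print(g.number_of_nodes(), g.number_of_edges())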
The following code example is based on their tutorial about batched graph convolution.
https://docs.dgl.ai/tutorials/basics/4_batch.html
At first, import the packages that will be used.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim

import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np

Next, define the function that converts a molecule to a graph. In the following example, only node features are used; bond features are not used.

# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM =60
BOND_FDIM = 5 
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []

    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs

Next, I defined a custom collate function for the data loader, and defined the msg and reduce functions.
The msg function gathers messages from neighboring nodes, and the reduce function aggregates the messages.

msg = fn.copy_src(src="h", out="m")
def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
def reduce(nodes):
    # aggregate messages by mean instead of sum (this part differs from the tutorial)
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

Then I defined the network. With DGL, the user can easily access nodes and their features; for example, graph.ndata['name'] accesses the node feature named 'name'. NodeApplyModule is used for the per-node computation.
It is worth knowing that by using torch's nn.ModuleList, the user can write the network in a 'Keras-like' way.

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    
    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}
    

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h =  g.ndata.pop('h')
        #print(h.shape)
        return h
    
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Let's load the data. I used the solubility dataset that ships with RDKit.

solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]
train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)

Finally, run training and check the performance.

model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)        
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318
plt.plot(epoch_losses, c='b')
model.eval()
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
test_bg.set_e_initializer(dgl.init.zero_initializer)
test_bg.set_n_initializer(dgl.init.zero_initializer)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs,1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)
>0.7587548638132295
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257

Hmm, not so bad. OK, I tried a random forest next. I wanted to use RDKit's new 'dragon-type' descriptors, so I used them ;)

https://github.com/rdkit/UGM_2018/blob/master/Notebooks/Landrum_Whats_New.ipynb

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
# generate 3D conf
train_mols2 = copy.deepcopy(train_mols)
test_mols2 = copy.deepcopy(test_mols)

ps = AllChem.ETKDGv2()
for m in train_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
for m in test_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

For convenience, I used only one conformer per molecule in the code above.
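If one wanted to reduce the conformer dependence, the descriptors could be averaged over several ETKDG embeddings. A rough sketch (my own, assuming the confId keyword of RDKit's 3D descriptor calculators) is shown below; it was not used for the results above.

from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors
import numpy as np

def calc_averaged_desc(mol, n_confs=5):
    # embed several conformers and average the dragon-type descriptors over them
    mh = Chem.AddHs(mol)
    ps = AllChem.ETKDGv2()
    cids = AllChem.EmbedMultipleConfs(mh, n_confs, ps)
    descs = [rdMolDescriptors.CalcAUTOCORR3D(mh, confId=cid)
             + rdMolDescriptors.CalcMORSE(mh, confId=cid)
             + rdMolDescriptors.CalcRDF(mh, confId=cid)
             + rdMolDescriptors.CalcWHIM(mh, confId=cid)
             for cid in cids]
    return np.mean(descs, axis=0)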

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
>0.7587548638132295
print(classification_report(test_y, rf_pred_y))
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257

Random forest showed the same performance as the GCN in this test.
I think DGL is a very useful package for graph-based deep learning. The original repo provides many example codes! Readers who are interested should check the repo.

I uploaded today's code to my GitHub; it can be checked at the following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/GCN_chemo.ipynb

mol encoder with Pytorch

A variational autoencoder (VAE) is a method used for learning latent representations. A VAE encodes a discrete input vector into a continuous vector in latent space. There are lots of examples on GitHub.
In 2016, Alán Aspuru-Guzik and coworkers reported a new de novo design method using a VAE. The approach represents molecules as SMILES, and the SMILES strings are converted to one-hot vectors; these vectors are used for deep learning.
https://arxiv.org/abs/1610.02415
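As a small illustration of that idea (my own sketch, not the paper's or keras-molecules' code), each SMILES string is padded to a fixed width and turned into a one-hot matrix over a character set:

import numpy as np

def smiles_to_onehot(smiles, charset, width=120):
    # pad the SMILES to `width` and one-hot encode it over `charset`
    padded = smiles.ljust(width)
    onehot = np.zeros((width, len(charset)), dtype=np.float32)
    for i, ch in enumerate(padded):
        onehot[i, charset.index(ch)] = 1.0
    return onehot

# e.g. smiles_to_onehot('c1ccccc1', [' ', '1', 'c']) returns a (120, 3) matrix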
And maxhodak implemented molencoder with Keras! It is nice work and, I think, it has had a strong impact on the chemoinformatics area.
The repository is below.
https://github.com/maxhodak/keras-molecules

The code depends on Keras 1.1.1, Python 2.7, and some other packages. The current Keras version is 2.x, and most code now supports Python 3.x. I wanted to test the molecular VAE in a Python 3.x environment. Also, I am now learning PyTorch, so I wanted to convert the code from Keras to PyTorch.
I spent several days coding, but then I found there was already a pioneer!
A researcher in Pande's group has published nice work.
https://github.com/cxhernandez/molencoder

molencoder works on Python 3.5/3.6 and CUDA 8.0 environments with PyTorch. The code seems almost the same as keras-molecules, but there are some differences.
The main difference is the activation function: the original code uses ReLU, but this code uses SELU.
I have not checked the effect of that difference.
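Just to make that difference concrete, here is a tiny sketch (my own, not taken from either repository) that swaps the activation module on the same linear layer:

import torch
import torch.nn as nn

x = torch.randn(4, 16)
layer = nn.Linear(16, 16)
out_relu = nn.ReLU()(layer(x))   # activation used in keras-molecules
out_selu = nn.SELU()(layer(x))   # self-normalizing activation used in molencoder
print(out_relu.mean().item(), out_selu.mean().item())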
I wrote an interpolation script using molencoder.
The code is below.

A description of some arguments:
SOURCE; the starting point of the interpolation.
DEST; the end point of the interpolation.
STEPS; how many SMILES strings the model will generate from latent space between source and dest.
* The following code will run only on a GPU machine because I hard-coded 'someprop.cuda()' instead of guarding it with a condition.

import numpy as np
import torch
import argparse
from torch.autograd import Variable
from rdkit import Chem
from molencoder.models import MolEncoder, MolDecoder
from molencoder.utils import( load_dataset, initialize_weights,ReduceLROnPlateau, save_checkpoint, validate_model)
from molencoder.featurizers import Featurizer, OneHotFeaturizer


SOURCE = 'c1ccccn1'
DEST =  'c1ccccc1'
STEPS = 200
#charset from chembl
WIDTH=120

charset = [' ', '#', '%', '(', ')', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', '@', 'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'S', 'T', 'V', 'X', 'Z', '[', '\\', ']', 'a', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't']

def decode_smiles_from_index(vec, charset):
    return "".join(map(lambda x:charset[x],vec)).strip()


def get_arguments():
    parser = argparse.ArgumentParser(description="Interpolate from source to dest in steps")
    parser.add_argument("--source", type=str, default=DEST)
    parser.add_argument("--dest", type=str, default=SOURCE)
    parser.add_argument("--steps", type=int, default=STEPS)
    return parser.parse_args()

def interpolate(source, dest, steps, charset, encoder, decoder):
    width=WIDTH
    source_just = source.ljust(width)
    dest_just = dest.ljust(width)
    onehot = OneHotFeaturizer(charset=charset)
    sourcevec = onehot.featurize(smiles=[source_just])
    destvec = onehot.featurize(smiles=[dest_just])
    source_encoded = Variable(torch.from_numpy(sourcevec).float()).cuda()
    dest_encoded = Variable(torch.from_numpy(destvec).float()).cuda()
    source_x_latent = encoder(source_encoded)
    dest_x_latent = encoder(dest_encoded)
    step = (dest_x_latent-source_x_latent)/float(steps)
    results = []
    for i in range(steps):
        item = source_x_latent + (step*i)
        sampled = np.argmax(decoder(item).cpu().data.numpy(), axis=2)
        #print(sampled)
        decode_smiles = decode_smiles_from_index(sampled[0], charset)
        results.append((i, item, decode_smiles))
    return results

def main():
    args= get_arguments()
    encoder = MolEncoder( c = len(charset))
    decoder = MolDecoder( c = len(charset))
    encoder.apply(initialize_weights)
    decoder.apply(initialize_weights)
    
    print( torch.cuda.is_available() )
    encoder = MolEncoder( c = len(charset)).cuda()
    encoder.apply(initialize_weights)
    
    decoder = MolDecoder( c = len(charset)).cuda()
    decoder.apply(initialize_weights)
    
    bestmodel = torch.load("model_best.pth.tar")
    #bestmodel = torch.load("tempcheckpoint.pth.tar")
    encoder.load_state_dict(bestmodel["encoder"])
    decoder.load_state_dict(bestmodel["decoder"])

    results = interpolate( args.source, args.dest, args.steps, charset, encoder, decoder )
    for result in results:
        print(result[0], result[2])

if __name__=="__main__":
    main()

Just a note for me.

Build QSAR model with pytorch and rdkit #RDKit

There are many deep learning frameworks in Python, for example Chainer, Keras, Theano, TensorFlow, and PyTorch.
I have tried Keras, Chainer, and TensorFlow for QSAR modeling, so this time I tried to build a QSAR model using PyTorch and RDKit.
You know, PyTorch has dynamic neural networks ('define-by-run') like Chainer.
I used the solubility data that is provided with RDKit; I have used this dataset before.
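'Define-by-run' means the computation graph is built while the forward pass executes, so ordinary Python control flow can decide the network's behaviour at run time. A minimal sketch (not part of the QSAR script below):

import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super(DynamicNet, self).__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x, n_repeats):
        # the number of layer applications is chosen at run time
        for _ in range(n_repeats):
            x = torch.relu(self.fc(x))
        return x

out = DynamicNet()(torch.randn(2, 8), 3)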

Let's start coding.
At first I imported the packages needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing


def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype = np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

The torch.from_numpy function converts a numpy array to a torch tensor, which is very convenient.
Then I defined the neural network. This way of working feels quite different to me because I mostly use Keras for deep learning.
To build a model in PyTorch, I need to define each layer and the whole forward structure.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(),Y_train.size())
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model, I ran training and prediction.
The following code covers the training and prediction parts.

model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format(loss.data[0]))
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted.data[i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(predicted.data, Y_test))/len(Y_test)))

Check the code.

iwatobipen$ python qsar_pytorch.py solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
-snip-
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
-snip-
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, the accuracy is not so high. I believe there is still room for improvement. I am a newbie with PyTorch, so I will keep practicing it next year.

This is my last code of the year. I would like to post to my blog more next year.
If any readers find mistakes in my code, please let me know.

Have a happy new year !!!!
;-)

Can machine learn important feature from SMILES?

Today I found a challenging article on arXiv.
It describes SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
You know, word2vec is a very attractive and widely used technique in the ML area, and SMILES2Vec follows the same concept.
It converts SMILES to vectors and learns which characters are important. The authors use a 'black box' explanation method with the model. I am not sure about the 'black box' method, but I think it is like leave-one-out: mask some features, build a model, and find the important features.

Using this method, SMILES2Vec can find the important characters in a given SMILES.
They found that a CNN-GRU model gives the best result for solubility prediction. My question is: why does convolution over SMILES work so well?
My opinion is that solubility or logP depends on the presence or absence of substituents such as hydroxyl or amino groups, and in some cases does not depend strongly on their position. So I think the method is interesting, but it would have difficulty predicting biological affinity.
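To illustrate the masking idea in the simplest possible way (my own sketch, not the method used in the paper): mask one character at a time and see how much the prediction moves. Here `predict` is a hypothetical callable that maps a SMILES string to a predicted value.

def char_importance(smiles, predict, mask_char=' '):
    # score each character by the prediction change when it is masked out
    base = predict(smiles)
    scores = []
    for i in range(len(smiles)):
        masked = smiles[:i] + mask_char + smiles[i + 1:]
        scores.append(abs(predict(masked) - base))
    return scores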

SMILES strings are the major input format in the deep learning area, and I often use SMILES too. ;-) But I want to find another format for DL.

ref for black box model
https://arxiv.org/pdf/1602.07043.pdf

molecule encoder/decoder in deepchem #rdkit #deepchem

Today I updated deepchem on my Mac.
It was easy to install the new version of deepchem on a Mac.

iwatobipen$ git clone https://github.com/deepchem/deepchem.git
iwatobipen$ cd deepchem
iwatobipen$ bash scripts/install_deepchem_conda.sh

That’s all. ;-)

The new version of deepchem implements MoleculeVAE. MoleculeVAE generates new molecules by using a predefined model.
Deepchem can use a predefined model that was trained on the ZINC dataset.
OK, let's run the code.
I tested MoleculeVAE with reference to the following web page.
https://www.deepchem.io/_modules/deepchem/models/autoencoder_models/test_tensorflowEncoders.html

Deepchem provides lots of useful functions for data preparation,
for example converting SMILES to one-hot vectors and vice versa.
I used cdk2.sdf for structure generation.

from __future__ import print_function
import os
from rdkit import Chem, RDConfig
from rdkit.Chem import Draw
import deepchem as dc
from deepchem.models.autoencoder_models.autoencoder import TensorflowMoleculeEncoder, TensorflowMoleculeDecoder
from deepchem.feat.one_hot import zinc_charset
from deepchem.data import DiskDataset

datadir = os.path.join( RDConfig.RDDocsDir, 'Book/data/cdk2.sdf' )

mols = [ mol for mol in Chem.SDMolSupplier( datadir ) ]
smiles = [ Chem.MolToSmiles( mol ) for mol in mols ]
print( len( smiles ))

tf_encoder = TensorflowMoleculeEncoder.zinc_encoder()
tf_decoder = TensorflowMoleculeDecoder.zinc_decoder()

featurizer = dc.feat.one_hot.OneHotFeaturizer( zinc_charset, 120 )

# default setting ; encode smiles to one_hot vector and padding to 120 character.
features = featurizer( mols )
print( features.shape )

dataset = DiskDataset.from_numpy( features, features )
prediction = tf_encoder.predict_on_batch( dataset.X )
one_hot_dec = tf_decoder.predict_on_batch( prediction )
decoded_smiles = featurizer.untransform( one_hot_dec )

for smiles in decoded_smiles:
    print( smiles[0] )
    print( Chem.MolFromSmiles( smiles[0] ))
mols = [ Chem.MolFromSmiles( smi[0] ) for smi in decoded_smiles ]
im = Draw.MolsToGridImage( mols )
im.save( 'res.png' )

And the results were… ;-(

iwatobipen$ python molVAE.py
/Users/iwatobipen/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Using TensorFlow backend.
47
2017-10-10 22:25:23.051035: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051056: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051060: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-10 22:25:23.051064: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
(47, 120, 35)
TIMING: dataset construction took 0.059 s
Loading dataset from disk.
CCCCNC(=O)CCn1c(=O)ncc[n1)ccn2
[22:25:29] SMILES Parse Error: syntax error for input: 'CCCCNC(=O)CCn1c(=O)ncc[n1)ccn2'
None
CC(C))CCN1CCCC1)c2nc[nH+]nn2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C))CCN1CCCC1)c2nc[nH+]nn2'
None
CC(C)(C)C(CCC(=O)NCCc1cc[nH+]cn1
[22:25:29] SMILES Parse Error: extra open parentheses for input: 'CC(C)(C)C(CCC(=O)NCCc1cc[nH+]cn1'
None
CC(C)(C)N1CCCCC1)c2nc[nH+]nn2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(C)N1CCCCC1)c2nc[nH+]nn2'
None
CC(C)CCNCCCC)CN=C)c1cc[nH+]cn1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)CCNCCCC)CN=C)c1cc[nH+]cn1'
None
Cc1ccnnc1SCc2ccccc2Crc2NCCC))F)C
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1ccnnc1SCc2ccccc2Crc2NCCC))F)C'
None
CC((C)))C(=O)NCc1ccccc1)c(/c(=C)C(C(=])C
[22:25:29] SMILES Parse Error: syntax error for input: 'CC((C)))C(=O)NCc1ccccc1)c(/c(=C)C(C(=])C'
None
Cc1ccnn1CCC(=O)N((C)(C)Ccc2[n(ccn2)CCC)(C)C
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1ccnn1CCC(=O)N((C)(C)Ccc2[n(ccn2)CCC)(C)C'
None
CC(CC)(C)CC(C)(C)CCNCCC(CCC)N
<rdkit.Chem.rdchem.Mol object at 0x10079c210>
COc1ccc2c(c1)CC)Cc2cc((cn=O)C(=O)N3
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'COc1ccc2c(c1)CC)Cc2cc((cn=O)C(=O)N3'
None
Cc1ncsc(=O)n1C)CC(=O)Nc2ccss[N/C(=O)[C)[O-])C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1ncsc(=O)n1C)CC(=O)Nc2ccss[N/C(=O)[C)[O-])C'
None
Cc1c(c(=O)n2c(n1)C(=O)CC(C)C)C)C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1c(c(=O)n2c(n1)C(=O)CC(C)C)C)C'
None
CN(C))(O)CNN(Cc1cccn1+c2cc[nH]c2c2N
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CN(C))(O)CNN(Cc1cccn1+c2cc[nH]c2c2N'
None
CC11ccc2n1c(ccnH=)Nc3cc((c(c3=O)C)NCC2OO
[22:25:29] SMILES Parse Error: syntax error for input: 'CC11ccc2n1c(ccnH=)Nc3cc((c(c3=O)C)NCC2OO'
None
CCNH+]1CCc2cnc2c1c(n(c3=))c4cc(cc(cc3C)C))CC1=O
[22:25:29] SMILES Parse Error: syntax error for input: 'CCNH+]1CCc2cnc2c1c(n(c3=))c4cc(cc(cc3C)C))CC1=O'
None
CC(=O)Nc1cc(ccc1OC)CCc2c3c([nH+nn2)cccco3
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(=O)Nc1cc(ccc1OC)CCc2c3c([nH+nn2)cccco3'
None
COc1ccc2c(n1)C)C)ccnc3ccccc3)NCC=)))C(=O)CC
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'COc1ccc2c(n1)C)C)ccnc3ccccc3)NCC=)))C(=O)CC'
None
CCS(=O)(=O)c1ccc(c(c1)C)2CC=c3cc(ccc3)NC(C)C)C(=O)N2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCS(=O)(=O)c1ccc(c(c1)C)2CC=c3cc(ccc3)NC(C)C)C(=O)N2'
None
CC(=O)Nc1cccccc1OC)CCc2c3c([nH+]n23cccccc4
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(=O)Nc1cccccc1OC)CCc2c3c([nH+]n23cccccc4'
None
Cc1cccc2c11[nH]nc(c2))CCC(=O)N(C3)CCc3ccc(H+]c)C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1cccc2c11[nH]nc(c2))CCC(=O)N(C3)CCc3ccc(H+]c)C'
None
Cc1cc(cc211[nH]cc(c2)CCCC(=O)CC(=O)NC3CCCn4ccc(=O)n4C
[22:25:29] SMILES Parse Error: extra open parentheses for input: 'Cc1cc(cc211[nH]cc(c2)CCCC(=O)CC(=O)NC3CCCn4ccc(=O)n4C'
None
Ccc1ccc2ccnn))n3c2ccc1)c4ccccc4F)
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Ccc1ccc2ccnn))n3c2ccc1)c4ccccc4F)'
None
Cc1nc(c2c(nn))n3c2ccc3)S4ccs2)C(C)(C[NH+]CCC))C)))n
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1nc(c2c(nn))n3c2ccc3)S4ccs2)C(C)(C[NH+]CCC))C)))n'
None
CC(CCc1ccncc1)Cc2cccc3c2NN3CCNCC3=O
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CC(CCc1ccncc1)Cc2cccc3c2NN3CCNCC3=O'
None
CC(=O)(=OOc1ccn1C(=C)C(=O)N)2cc(=O)(c2=O)())
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(=O)(=OOc1ccn1C(=C)C(=O)N)2cc(=O)(c2=O)())'
None
CNS(=O)(=O)c(ccn1C((C)C(=O)Nc2ccc(cc2Cl)F
[22:25:29] SMILES Parse Error: syntax error for input: 'CNS(=O)(=O)c(ccn1C((C)C(=O)Nc2ccc(cc2Cl)F'
None
CCCC)CCC(=O)(=O)c1cO)1C(CC)C(=O)Nc2ccccc2F)/s1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCCC)CCC(=O)(=O)c1cO)1C(CC)C(=O)Nc2ccccc2F)/s1'
None
CC((Cc1c(=O)nnc(=O)C(=O)CC(C)))c2ccccc2C2=O
[22:25:29] SMILES Parse Error: syntax error for input: 'CC((Cc1c(=O)nnc(=O)C(=O)CC(C)))c2ccccc2C2=O'
None
c1cs=c1C(=O)N(C2(CCCC2))c3c[nH]c(=O)])
[22:25:29] SMILES Parse Error: syntax error for input: 'c1cs=c1C(=O)N(C2(CCCC2))c3c[nH]c(=O)])'
None
CS(=O)(=O)c1cc(1CCc=O)NCC2CCC(CC2)c3c[nH+]c(=+)n3
[22:25:29] SMILES Parse Error: syntax error for input: 'CS(=O)(=O)c1cc(1CCc=O)NCC2CCC(CC2)c3c[nH+]c(=+)n3'
None
CN(Cc1ccccc1)c2nc(nnn+]c2c(c2SCc4ccccc4
[22:25:29] SMILES Parse Error: syntax error for input: 'CN(Cc1ccccc1)c2nc(nnn+]c2c(c2SCc4ccccc4'
None
CS(=O)(=O)c1cc==O)2c(n3c2cccc1SCC(=O)NO)))c(cc)/(([O)on
[22:25:29] SMILES Parse Error: syntax error for input: 'CS(=O)(=O)c1cc==O)2c(n3c2cccc1SCC(=O)NO)))c(cc)/(([O)on'
None
CN(C)C(=O)CC(C(=O)[O-])C(=O)CSc1ccc[nH+]c1=N
[22:25:29] Can't kekulize mol.  Unkekulized atoms: 14 15 16 17 18

None
Cc1c=O(c=n1Cc2cccn2)c3c(n2)CC((((=C)C)])CC-/)
[22:25:29] SMILES Parse Error: syntax error for input: 'Cc1c=O(c=n1Cc2cccn2)c3c(n2)CC((((=C)C)])CC-/)'
None
CC(C)(CC))C(=O)NCc1cccnc1Cl)CCc(=O)(=))CC///n2
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(CC))C(=O)NCc1cccnc1Cl)CCc(=O)(=))CC///n2'
None
CC(C)(CCN)C(=O)CSc1cc(=O)c(=CC))]))))))))S2/c2)=O)C))/s)
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CC(C)(CCN)C(=O)CSc1cc(=O)c(=CC))]))))))))S2/c2)=O)C))/s)'
None
CCN1CC11C(C2)C(=O)N2C2N(C2S((=O)NC3CC2)C(=O)OCC
[22:25:29] SMILES Parse Error: syntax error for input: 'CCN1CC11C(C2)C(=O)N2C2N(C2S((=O)NC3CC2)C(=O)OCC'
None
CCN=N)c1cc=1Cc==)(=O)C)(C))C)CC2CCCCCC2)))())
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCN=N)c1cc=1Cc==)(=O)C)(C))C)CC2CCCCCC2)))())'
None
CC(CCc2ccccc2)C[=O)c3c(=O)n(c)+n3)c3nc(nc))c1
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(CCc2ccccc2)C[=O)c3c(=O)n(c)+n3)c3nc(nc))c1'
None
CCS(=O)(=O)CS(=O)NC1CC)C(=O)Nc1ccc2c1cccc1F))n1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'CCS(=O)(=O)CS(=O)NC1CC)C(=O)Nc1ccc2c1cccc1F))n1'
None
CC1CCc2c(s2c2nc(n(=O)n)NC(=O)c3cccc(c3)SS(=O)=OO)C1
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CC1CCc2c(s2c2nc(n(=O)n)NC(=O)c3cccc(c3)SS(=O)=OO)C1'
None
C[NH+]1CCCC(=O)[O-])CCc1c(=O)oc2c2c2cccc2C(s=])CCCC2=O
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'C[NH+]1CCCC(=O)[O-])CCc1c(=O)oc2c2c2cccc2C(s=])CCCC2=O'
None
Cc1cn(c(=O)c1C)CC(=O)Nc2ccss[N]S(=O)N)C(O)C)3(CCCC(C)\C)//)))C
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'Cc1cn(c(=O)c1C)CC(=O)Nc2ccss[N]S(=O)N)C(O)C)3(CCCC(C)\C)//)))C'
None
C[NH]11CCCC(=O)[O-])NC(=O)CN1C)c2ccn12)C((O)/O))C)CCCC)2)C1
[22:25:29] SMILES Parse Error: extra close parentheses for input: 'C[NH]11CCCC(=O)[O-])NC(=O)CN1C)c2ccn12)C((O)/O))C)CCCC)2)C1'
None
CCN1CC11(CCN)n(c(=NCc2ccccc2))4ccn23C4CCCC4=O2CCC
[22:25:29] SMILES Parse Error: unclosed ring for input: 'CCN1CC11(CCN)n(c(=NCc2ccccc2))4ccn23C4CCCC4=O2CCC'
None
CC(C=()Scc1n[nH]c2c(n1)CCCC(=O)CC(=O)N3CC[NH+](CC3)CCc4cccon)))
[22:25:29] SMILES Parse Error: syntax error for input: 'CC(C=()Scc1n[nH]c2c(n1)CCCC(=O)CC(=O)N3CC[NH+](CC3)CCc4cccon)))'
None
CC1CCc2c(c3cccccn2)C[H+]ccc(1O((((=+]3)Cc4ccss4Cl)1
[22:25:29] SMILES Parse Error: syntax error for input: 'CC1CCc2c(c3cccccn2)C[H+]ccc(1O((((=+]3)Cc4ccss4Cl)1'
None
-snip-
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x12a606128>>
Traceback (most recent call last):
  File "/Users/iwatobipen/.pyenv/versions/anaconda3-2.4.0/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 701, in __del__
TypeError: 'NoneType' object is not callable

Hmm… I could not get suitable structures from the molecular autoencoder.
Molecule generation is difficult here because the structure of the input data is based on SMILES strings, and the ratio of invalid SMILES was high. But I think DeepChem and RDKit make a nice combination for chemoinformatics.
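As a quick way to quantify that, the fraction of decoder outputs that RDKit can actually parse can be computed like this (a small sketch, not part of the original script):

from rdkit import Chem

def valid_ratio(smiles_list):
    # fraction of SMILES strings that RDKit parses into a molecule
    n_valid = sum(Chem.MolFromSmiles(smi) is not None for smi in smiles_list)
    return n_valid / len(smiles_list)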

Graph convolution classification with deepchem

I previously posted about graph convolution regression using deepchem. Today, I tried graph convolution classification using deepchem.
The code is almost the same as the regression model; the only difference is using dc.models.MultitaskGraphClassifier instead of dc.models.MultitaskGraphRegressor.
I got sample (JAK3 inhibitor) data from ChEMBL and tried to build a model.

At first I used pandas to convert the activity class (active, non-active) to a 0/1 label. It is easy to do.

import pandas as pd

df = pd.read_table('jak3_chembl.txt', header=0)
# pd.factorize returns (codes, uniques); use the integer codes as the class label
df['activity_class'] = pd.factorize(df.ACTIVITY_COMMENT)[0]

df.to_csv('./preprocessed_jak3.csv', index=False)

Next, I wrote the model and tested it.

import tensorflow as tf
import deepchem as dc
import numpy as np
import pandas as pd

graph_featurizer = dc.feat.graph_features.ConvMolFeaturizer()
loader = dc.data.data_loader.CSVLoader( tasks=['activity_class'], smiles_field="CANONICAL_SMILES", id_field="CMPD_CHEMBLID", featurizer=graph_featurizer )
dataset = loader.featurize( './preprocessed_jak3.csv' )

splitter = dc.splits.splitters.RandomSplitter()
trainset,testset = splitter.train_test_split( dataset )

hp = dc.molnet.preset_hyper_parameters
param = hp.hps[ 'graphconv' ]
print(param['batch_size'])
g = tf.Graph()
graph_model = dc.nn.SequentialGraph( 75 )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), 75, activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.GraphConv( int(param['n_filters']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphPool() )
graph_model.add( dc.nn.Dense( int(param['n_fully_connected_nodes']), int(param['n_filters']), activation='relu' ))
graph_model.add( dc.nn.BatchNormalization( epsilon=1e-5, mode=1 ))
graph_model.add( dc.nn.GraphGather( 10 , activation='tanh'))

with tf.Session() as sess:
    model_graphconv = dc.models.MultitaskGraphClassifier( graph_model,
                                                      1,
                                                      75,
                                                     batch_size=10,
                                                     learning_rate = param['learning_rate'],
                                                     optimizer_type = 'adam',
                                                     beta1=.9,beta2=.999)
    model_graphconv.fit( trainset, nb_epoch=50 )

train_scores = {}
#regression_metric = dc.metrics.Metric( dc.metrics.pearson_r2_score, np.mean )
classification_metric = dc.metrics.Metric( dc.metrics.roc_auc_score, np.mean )
train_scores['graphconvreg'] = model_graphconv.evaluate( trainset,[ classification_metric ]  )
p=model_graphconv.predict( testset )

for i in range( len(p )):
    print( p[i], testset.y[i] )

print(train_scores) 

root@08d8f729f78b:/deepchem/pen_test# python graphconv_jak3.py

And the log output was…

Loading raw samples now.
shard_size: 8192
About to start loading CSV from ./preprocessed_jak3.csv
Loading shard 1 of size 8192.
Featurizing sample 0
TIMING: featurizing shard 0 took 2.023 s
TIMING: dataset construction took 3.830 s
Loading dataset from disk.
TIMING: dataset construction took 2.263 s
Loading dataset from disk.
TIMING: dataset construction took 1.147 s
Loading dataset from disk.
50
Training for 50 epochs
Starting epoch 0
On batch 0
...............
On batch 0
On batch 50
computed_metrics: [0.97176380945032259]
{'graphconvreg': {'mean-roc_auc_score': 0.97176380945032259}}

Not so bad.
The classification model gives a better result than the regression model.
All the code is pushed to my GitHub repository.
https://github.com/iwatobipen/deeplearning