Particle Swarm Optimization for molecular design #RDKit #Chemoinformatics

I participated RDKit UGM last week. It was worth to go I think. And in the meeting I got useful information for de novo molecular design. You can find the slide deck following URL. https://github.com/rdkit/UGM_2019/blob/master/Presentations/Montanari_Winter_Utilizing_in_silico_models.pdf

They used Particle Swarm Optimization(PSO) for de novo molecular design. PSO is very simple method for parameter optimization. Details are described in wikipedia.

The algorithm of PSO is similar to Q-learning. PSO try to improve objective function by updating velocity during the iteration.

Fortunately, PSO algorithm for molecular generation is disclosed in github. ;)

So I installed mso from github and tried to use it.

At first I installed cddd because mso is depended with cddd. cddd encode molecules to latent space and decode latent space to molecules.
. In the readme of cddd, tensorflow==1.10 is required but it worked tensorflow==1.13. Then installed mso. Both library can install from github.

$ git clone https://github.com/jrwnter/cddd.git
$ cd cddd
$ pip install -e .
$ cd ../
$ git clone https://github.com/jrwnter/mso.git
$ cd mso
$ pip install -e .

After installing the package I used mso. To convenience, I used pretrained model which is provided from cddd repo. Downloaded default model from following URL and stored ./cddd/data folder and unzip it. URL: https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Now ready, let’s try.

MSO optimize particle positions with objective functions. The package have many functions. Such as QED, substructure and alert, etc.
Multiple combination is scoring function is also available.

Following example is simple. At first, I used substructure and QED function. Substructure function return 1 if generated molecule has user defined structure. It is useful because RNN based generator often generates molecule randomly so it is difficult to keep specific substructure.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

from mso.optimizer import BasePSOptimizer
from mso.objectives.scoring import ScoringFunction
from mso.objectives.mol_functions import qed_score
from mso.objectives.mol_functions import sa_score
from mso.objectives.mol_functions import substructure_match_score
from functools import partial
from cddd.inference import InferenceModel

sub_match = partial(substructure_match_score, query=Chem.MolFromSmiles('c1ccncc1'))


init_smiles = "c1c(C)cccc1" 
scoring_functions = [ScoringFunction(func=qed_score, name="qed", is_mol_func=True), ScoringFunction(func=sub_match, name='subst', is_mol_func=True)]

substructure_match_score function requires several arguments including substructure. To use it, partial function is used for freeze several args. And then pass functions to ScoringFunction class.

infermodel = InferenceModel()
opt = BasePSOptimizer.from_query(
    init_smiles=init_smiles,
    num_part=200,
    num_swarms=1,
    inference_model=infermodel,
    scoring_functions=scoring_functions)

Then defined optimizer instance. num_part shows number of particles which means number of molecules. And finally run the optimization.

res = opt.run(20)
res0 = res[0]

res0 has best scored smiles and other generated smiles. I retrieve them and draw. Fitness is normalized score.


mols = []
for idx, smi in enumerate(res0.smiles):
    
    try:
        mol = Chem.MolFromSmiles(smi)
        Chem.SanitizeMol(mol)
        if mol != None and res0.fitness[idx] > 0.7:
            print(smi)
            mols.append(mol)
    except:
        pass

OK, generated molecules has use defined substructure. However it seems similar compounds, does it depend on training set?

opt instance has fitness history as pandas dataframe. Fortunately rdkit can PandasTools. ;)


from IPython.display import HTML
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(opt.best_fitness_history, smilesCol='smiles')
HTML(opt.best_fitness_history.to_html())

It’s very easy. MSO optimizer does not use GPU, so the process works very fast if user has many CPUs.

MSO seems useful package for molecular generation. User can also define your own objective functions. I use it more deeply.

Today’s code is uploaded to gist.

Transfer learning of DGMG for focused library gegneration #DGL #Chemoinformatics

Transfer learning is very useful method in deeplearning. Because it can use pre trained model and can re train with few parameters.

I think it is useful for molecular generator too. If it is useful for the generator, it can use for focused library generation. I posted about DGL molecular generation. So I tried to apply transfer learning with DGL.

At first, import several packages for coding.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem import Draw
from dgl.data.chem import utils
from dgl.model_zoo.chem import pretrain
from dgl.model_zoo.chem.dgmg import MoleculeEnv
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
import copy
mols = Chem.SDMolSupplier(f"{RDPaths.RDDocsDir}/Book/data/cdk2.sdf")
model = pretrain.load_pretrained('DGMG_ChEMBL_canonical')

Then made three copy of the DGMG model.

model1 = copy.deepcopy(model)
model2 = copy.deepcopy(model)
model3 = copy.deepcopy(model)

Down load utility function for chemical structure handling from DGL original repository.

!wget https://raw.githubusercontent.com/dmlc/dgl/master/examples/pytorch/model_zoo/chem/generative_models/dgmg/utils.py
!wget https://raw.githubusercontent.com/dmlc/dgl/master/examples/pytorch/model_zoo/chem/generative_models/dgmg/sascorer.py

Freeze upper layer and last choose_dest_agent layer set requires grad True.

from utils import MoleculeDataset
params = model1.parameters()
for idx, param in enumerate(params):
    param.requires_grad = False

params_ = model1.choose_dest_agent.parameters()
for idx, param in enumerate(params_):
    param.requires_grad = True
    
params = model2.parameters()
for idx, param in enumerate(params):
    param.requires_grad = False

params_ = model2.choose_dest_agent.parameters()
for idx, param in enumerate(params_):
    param.requires_grad = True

params = model3.parameters()
for idx, param in enumerate(params):
    param.requires_grad = False

params_ = model3.choose_dest_agent.parameters()
for idx, param in enumerate(params_):
    param.requires_grad = True

For convenience, each model learning only 10 epochs and check generated structure. At frist, check default output.

genmols = []
i = 0
while i < 20:
    SMILES = model(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
Draw.MolsToGridImage(genmols)/sourcecode]
<!-- /wp:shortcode -->

<!-- wp:image {"id":2814,"sizeSlug":"large"} -->
<figure class="wp-block-image size-large"><img src="https://iwatobipen.files.wordpress.com/2019/09/def1.png?w=439" alt="" class="wp-image-2814" /></figure>
<!-- /wp:image -->

<!-- wp:paragraph -->
<p>Defined cdk2 molecules, linear and cyclic molecules for additional learning.</p>
<!-- /wp:paragraph -->

<!-- wp:shortcode -->

atom_types = ['O', 'Cl', 'C', 'S', 'F', 'Br', 'N']
bond_types = [Chem.rdchem.BondType.SINGLE,
                               Chem.rdchem.BondType.DOUBLE,
                               Chem.rdchem.BondType.TRIPLE]
env = MoleculeEnv(atom_types, bond_types)
from utils import Subset
from utils import Optimizer
subs1 = Subset([Chem.MolToSmiles(mol) for mol in mols], 'canonical', env)
subs2 = Subset(['C1NCOC1' for _ in range(10)], 'canonical', env)
subs3 = Subset(['CNCOCC' for _ in range(10)], 'canonical', env)
loader1  = DataLoader(subs1, 1)
loader2  = DataLoader(subs2, 1)
loader3  = DataLoader(subs3, 1)

First trial is CDK2 molecules.

optimizer = Optimizer(0.1, Adam(model1.parameters(), 0.1))
model1.train()
for i in range(10):
    for data in loader1:
        optimizer.zero_grad()
        logp = model1(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model1.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model1(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

Hmm it seems bicyclic compound is more generated….?

Then learn cyclic molecule.

optimizer = Optimizer(0.1, Adam(model2.parameters(), 0.1))
model2.train()
for i in range(10):
    for data in loader2:
        optimizer.zero_grad()
        logp = model2(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model2.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model2(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

As expected, cyclic molecules are generated.

Finally let’s check linear molecule as learning data.

optimizer = Optimizer(0.1, Adam(model3.parameters(), 0.1))
model3.train()
for i in range(10):
    for data in loader3:
        optimizer.zero_grad()
        logp = model3(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model3.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model3(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

Wow many linear molecules are generated.

This is very simple example for transfer learning. I think it is useful method because by using the method, user don’t need to learn huge number of parameters.

My code is not so efficient because I don’t fully understand how to learn dgmg.

I would like to read document more deeply.

Reader who has interest the code, you can find whole code from following url.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/transf.ipynb

Molecular encoder/decoder (VAE) with python3.x (not new topic) #rdkit #chemoinformatics

The first day of 10-day holiday is rainy. And I and my kid will go to dodge ball tournament.

Two years ago, I tried to modify the keras-molecule which is code of molecular encoder decoder. The code is written for python 2.x. So I would like to run the code on python 3.6. I stopped the modification because I often use pytorch and found molencoder for pytorch. https://github.com/cxhernandez/molencoder

Today I tried to modify keras-molecule again.
I think almost done and the code will run on python 3.x, with keras 2.x tensorflow backend.

https://github.com/iwatobipen/keras-molecules

I checked the code on google colab. Because google colab provides GPU environment. It is useful for deep learning. At first, I installed conda and several packages which are used in keras-molecule.

Following code was written on google colab.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
# install packages via conda command
!time conda install -y -c conda-forge rdkit
!conda install -y -c conda-forge h5py
!conda install -y -c conda-forge scikit-learn
!conda install -y -c conda-forge pytables
!conda install -y -c conda-forge progressbar
!conda install -y -c conda-forge pandas
!conda install -y -c conda-forge keras

For in my case, rdkit installation took very long time, 20min or more…. I waited patiently.

After installation, check rdkit. Installed old version. But it does not matter.

from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)
> 2018.09.1

Then clone code from github repo, change current dir and download sample smiles. Keras-molecule uses git-lfs for data strage. So files which are stored is not real .h5 files. I skipped installation of git-lfs to colab.

!git clone https://github.com/iwatobipen/keras-molecules
os.chdir('keras-molecules/')
# original data is stored with git-lft. So it can't get directly with git.
# So I uploaded small sample data to my repo.
!wget https://github.com/iwatobipen/playground/raw/master/smiles_50k.h5

Now ready, preprocess is needed at first. It is easy to do it.
After finished processed.h5 file will be stored in data/ .

!python preprocess.py smiles_50k.h5 data/processed.h5

Then load the data and do training. I did it on notebook, so following code is different from original README section.

import train
data_train, data_test, charset = train.load_dataset('data/processed.h5')
model = train.MoleculeVAE()
model.create(charset,  latent_rep_size=292)
checkpointer = train.ModelCheckpoint(filepath = 'model.h5',
                                   verbose = 1,
                                   save_best_only = True)

reduce_lr = train.ReduceLROnPlateau(monitor = 'val_loss',
                                  factor = 0.2,
                                  patience = 3,
                                  min_lr = 0.0001)
model.autoencoder.fit( data_train,
                  data_train,
                  shuffle = True,
                  epochs = 20,
                  batch_size = 600,
                  callbacks = [checkpointer, reduce_lr],
                  validation_data = (data_test, data_test)
            )

It took 20-30min for training on google colab with GPU. I don’t recommend run the code on PC which does not have GPU. I will take too long time for training.

After training, I ran interpolate. Interpolate samples molecules between SORUCE and DEST from latent space. Ideally the approach can change molecule gradually because latent space is continuous space. It sounds interesting.

import interpolate
h5f = interpolate.h5py.File('data/processed.h5' ,'r')
charset = list(h5f['charset'][:])
charset = [ x.decode('utf-8') for x in charset ]
h5f.close()

LATENT_DIM = 292
WIDTH = 120
model.load(charset, 'model.h5', latent_rep_size = LATENT_DIM)

I set two molecules for SOURCE and DEST.

SOURCE = 'CC2CCN(C(=O)CC#N)CC2N(C)c3ncnc1[nH]ccc13'
DEST = 'CCS(=O)(=O)N1CC(C1)(CC#N)N2C=C(C=N2)C3=C4C=CNC4=NC=N3'
STEPS = 10
results = interpolate.interpolate(SOURCE, DEST, STEPS, charset, model, LATENT_DIM, WIDTH)
for result in results:
  print(result[2])

The result was….

Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCcccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCccccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCcccccccccccccc Cr1cccccCCCCCCCCCCCCCCCCCCCCCccccccccccccc Cr1ccccCCCCCCCCCCCCCCCCCCCCCCCCCCcccccccccc Cr1CcccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCccccccc Cr1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCcccc CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC CC1CccCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Opps! All molecules are not molecules. It just almost C/c……

The model will be improved if I set more large number of epochs and give more learning data.

Naive VAE approach is difficult to generate valid molecules. Now there are many approaches about molecule generation are reported. JT-VAE, Grammer-VAE, etc, etc…. What kind of generator do you like?

Today’s code is redundant but uploaded my github repo.

https://github.com/iwatobipen/playground/blob/master/mol_vae.ipynb

Try GCN QSPR with pytorch based graph library #RDKit #Pytorch #dgl

Recently many machine learning articles use pytorch for their implementation. And I found very attractive package for graph based deep learning, named ‘DGL;Deep Graph Library’. The package supports pytorch and mxnet for backend. The author provides not only package but also very nice documentation. I read the document and try GCN for QSPR with DGL.

https://www.dgl.ai/pages/about.html

The package can make graph object from networkx and can convert graph object to networkx object.
Following code example is based their example about batched graph convolution.
https://docs.dgl.ai/tutorials/basics/4_batch.html
At first, import packages which will use.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim

import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np

Next define mol2graph function. Following example only node features used bond features did not used.

# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM =60
BOND_FDIM = 5 
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []

    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs

Next defined the original collate function for data loader. And defined msg and reduce function.
msg function get message from neighbor node and reduce function aggregates the massages.

msg = fn.copy_src(src="h", out="m")
def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
def reduce(nodes):
    # summazation by avarage is different part
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

Then defined the network. By using the dgl user can easily access node and features. For example graph.ndata[‘name’] method can access node features named ‘name’. NodeApplyModule is used for calculation of each node.
It is worth to know that by using torch nn.ModuleList, user can write network like ‘Keras’.

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    
    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}
    

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h =  g.ndata.pop('h')
        #print(h.shape)
        return h
    
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Let’s load data. I used solubility dataset in RDKit

solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]
train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)

Finally run train and check the performance.

model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)        
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318
plt.plot(epoch_losses, c='b')
model.eval()
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
test_bg.set_e_initializer(dgl.init.zero_initializer)
test_bg.set_n_initializer(dgl.init.zero_initializer)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs,1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)
>0.7587548638132295
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257

Hmm not so bad. OK I tried random forest next. I would like to use RDKit new descriptor, ‘dragon-type descriptor’. So I used it ;)

https://github.com/rdkit/UGM_2018/blob/master/Notebooks/Landrum_Whats_New.ipynb

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
# generate 3D conf
train_mols2 = copy.deepcopy(train_mols)
test_mols2 = copy.deepcopy(test_mols)

ps = AllChem.ETKDGv2()
for m in train_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
for m in test_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

For convenience I use only one conformer the above code.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
>0.7587548638132295
print(classification_report(test_y, rf_pred_y))
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257

Random Forest showed same performance with GCN in this test.
DGL is very useful package for graph based deeplearning I think. Their original repo provides many example codes! Reader who is interested in pls check the repo.

I uploaded today’s code my git hub and it can check following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/GCN_chemo.ipynb

mol encoder with Pytorch

Variable Auto Encoder (VAE) is unique method that is used for learning latent representations. VAE encodes discriminative vector to continuous vector in latent space. There are lots of examples in github.
In 2016, Alán Aspuru-Guzik reported new de novo design method by using VAE. The approach represents molecules as SMLIES and SMILES strings are converted one-hot vector. The vectors used for deep learning.
https://arxiv.org/abs/1610.02415
And Maxhodak implemented molencoder with Keras! It is nice work and the strong impact is given to the chemoinformatics area I think.
Repository is below.
https://github.com/maxhodak/keras-molecules

The code depends on keras 1.1.1 and python 2.7 and some packages. Recently keras version is 2.x and major code supports python 3.x. I would like to test mol VAE in python 3.x environment. Also I am now learning pytorch, so I would like to convert the code from keras based to pytorch based.
I spent several days for coding, but I found that it was pioneer!
Researcher who is Pande’s group published nice work.
https://github.com/cxhernandez/molencoder

The molencoder works on python3.5, 3.6, and CUDA-8.0 environments with pytorch. The code seems almost same as keras-molecules but there are some differences.
Main difference is activation function. The original code uses ReLU but the code uses SeLU for activation.
I have not check the effect of difference of the function.
I wrote interpolation script by using molencoder.
The code is below.

Description about some arguments.
SOUCE; The starting point of interpolation.
TARGET; The goal point of the experiment.
STEPS; How many steps the model will generate SMILES strings from latent space between source and target.
* Following code will run only with GPU machine because I hard-coded “someprop.cuda()” instead of using if function.

import numpy as np
import torch
import argparse
from torch.autograd import Variable
from rdkit import Chem
from molencoder.models import MolEncoder, MolDecoder
from molencoder.models import MolEncoder, MolDecoder
from molencoder.utils import( load_dataset, initialize_weights,ReduceLROnPlateau, save_checkpoint, validate_model)
from molencoder.featurizers import Featurizer, OneHotFeaturizer


SOURCE = 'c1ccccn1'
DEST =  'c1ccccc1'
STEPS = 200
#charset from chembl
WIDTH=120

charset = [' ', '#', '%', '(', ')', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', '@', 'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'S', 'T', 'V', 'X', 'Z', '[', '\\', ']', 'a', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't']

def decode_smiles_from_index(vec, charset):
    return "".join(map(lambda x:charset[x],vec)).strip()


def get_arguments():
    parser = argparse.ArgumentParser(description="Interpolate from source to dest in steps")
    parser.add_argument("--source", type=str, default=DEST)
    parser.add_argument("--dest", type=str, default=SOURCE)
    parser.add_argument("--steps", type=int, default=STEPS)
    return parser.parse_args()

def interpolate(source, dest, steps, charset, encoder, decoder):
    width=WIDTH
    source_just = source.ljust(width)
    dest_just = dest.ljust(width)
    onehot = OneHotFeaturizer(charset=charset)
    sourcevec = onehot.featurize(smiles=[source_just])
    destvec = onehot.featurize(smiles=[dest_just])
    source_encoded = Variable(torch.from_numpy(sourcevec).float()).cuda()
    dest_encoded = Variable(torch.from_numpy(destvec).float()).cuda()
    source_x_latent = encoder(source_encoded)
    dest_x_latent = encoder(dest_encoded)
    step = (dest_x_latent-source_x_latent)/float(steps)
    results = []
    for i in range(steps):
        item = source_x_latent + (step*i)
        sampled = np.argmax(decoder(item).cpu().data.numpy(), axis=2)
        #print(sampled)
        decode_smiles = decode_smiles_from_index(sampled[0], charset)
        results.append((i, item, decode_smiles))
    return results

def main():
    args= get_arguments()
    encoder = MolEncoder( c = len(charset))
    decoder = MolDecoder( c = len(charset))
    encoder.apply(initialize_weights)
    decoder.apply(initialize_weights)
    
    print( torch.cuda.is_available() )
    encoder = MolEncoder( c = len(charset)).cuda()
    encoder.apply(initialize_weights)
    
    decoder = MolDecoder( c = len(charset)).cuda()
    decoder.apply(initialize_weights)
    
    bestmodel = torch.load("model_best.pth.tar")
    #bestmodel = torch.load("tempcheckpoint.pth.tar")
    encoder.load_state_dict(bestmodel["encoder"])
    decoder.load_state_dict(bestmodel["decoder"])

    results = interpolate( args.source, args.dest, args.steps, charset, encoder, decoder )
    for result in results:
        print(result[0], result[2])

if __name__=="__main__":
    main()

Just a note for me.

Build QSAR model with pytorch and rdkit #RDKit

There are many frameworks in python deeplearning. For example chainer, Keras, Theano, Tensorflow and pytorch.
I have tried Keras, Chainer and Tensorflow for QSAR modeling. And I tried to build QSAR model by using pytorch and RDKit.
You know, pytorch has Dynamic Neural Networks “Define-by-Run” like chainer.
I used solubility data that is provided from rdkit and I used the dataset before.

Let’s start coding.
At first I imported package that is needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing


def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype = np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

torch.from_numpy function can convert numpy array to torch tensor. It is very convenient for us.
And then I defined neural network. I feel this method is very unique because I mostly use Keras for deep learning.
To build the model in pytorch, I need define the each layer and whole structure.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(),Y_train.size())
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model I tried to lean and prediction.
Following code is training and prediction parts.

model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format(loss.data[0]))
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted.data[i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(predicted.data, Y_test))/len(Y_test)))

Check the code.

iwatobipen$ python qsar_pytorch.py solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
-snip-
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
-snip-
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, accuracy is not so high. I believe there is still room for improvement. I am newbie of pytorch. I will try to practice pytorch next year.

This is my last code of this year. I would like to post my blog more in next year.
If readers who find mistake in my code, please let me know.

Have a happy new year !!!!
;-)

Can machine learn important feature from SMILES?

Today I found challenging article in arxiv.
It describes about SMILES2Vec. https://arxiv.org/pdf/1712.02034.pdf
You know word2vec is very attractive and major application for ML area and SMILES2Vec has same concept.
It converts smiles to vector and learn which character is important. The author use “black box” models for building model. I am not sure about “black box” model but I think it likes leave one out. The method masks some features, builds model and finds important features.

To use the method, SMILES2Vec can find important characters in the given smiles.
They found CNN-GRU model gives best result for solubility prediction. My question is … Why convolution of SMILES work fine???
My opinion is that solubility or logP depends on the presence or absence of substituents such as hydroxyl or amino groups, they do not strongly depend on the position some case. So I think the method is interesting but difficult to predict biological affinity.

SMILES strings is major input format for deep learning area. Also I often use SMILES. ;-) But I want to find another format for DL.

ref for black box model
https://arxiv.org/pdf/1602.07043.pdf