Use pytorch for QSAR model building more simply like scikit-learn #pytorch #chemoinformatics #RDKit

I often use pytorch for deep learning framework. I like pytorch because it is very flexible and many recent articles are used it for their implementation.

But to build model and train the model, I need to define training method. So it seems nice if I can train pytorch model just calling fit like scikit-learn doesn’t it?

Fortunately by using skorch, it will be enable training process to be simple.

Sktorch is a scikit-learn compatible network library that waps pytorch. Wow it’s cool! Skorch can be installed via pip command.

I installed it and then tried to build QSAR model with skorch.
At first, import packages and make dataset for training and test. For classification task, class index should be float32. float64 will cause Error.

import os
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')
prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

nbits = 1024
ncls = 3

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in trainmols]
train_X = np.array([fp2arr(fp) for fp in train_fps], dtype=np.float32)

train_y = np.array([ohencoder(prop_dict[m.GetProp('SOL_classification')], ncls) for m in trainmols])
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in testmols]
test_X = np.array([fp2arr(fp) for fp in test_fps], dtype=np.float32)

test_y = np.array([ohencoder(prop_dict[m.GetProp('SOL_classification')], ncls) for m in testmols])
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

Then import pytorch and skorch. I checked whether GPU is available or not.

import torch
if torch.cuda.is_available():
    print('use GPU')
    print('use CPU')

from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

Next, define simple multi-layer perceptron model. Following example is multilabel classification so I used softmax layer at the last layer.

class SimpleMLP(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(SimpleMLP, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, ncls)
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = F.softmax(self.output(x), dim=-1)
        return x

Next, make NeuralNetClassifier with defined model. It is very easy to use GPU. If you pass the device=’cuda’ option to the object, that’s all! ;-) After that let’s train the model with fit method.

net = NeuralNetClassifier(SimpleMLP,
                         ), train_y)

 epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.1241       0.3902        1.1102  0.1851
      ----    snip---------
    100        0.0831       0.6146        1.0327  0.0251
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=3, bias=True)

Ok, let check the model performance.

pred = net.predict(test_X)
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(test_y, pred)
array([[83, 18,  1],
       [35, 72,  8],
       [ 0, 11, 29]])

print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       0.70      0.81      0.75       102
           1       0.71      0.63      0.67       115
           2       0.76      0.72      0.74        40

    accuracy                           0.72       257
   macro avg       0.73      0.72      0.72       257
weighted avg       0.72      0.72      0.71       257

It seems work fine.

Then I tried to build regression model. NeuralNetRegressor is used to do it. The model structure is almost same as above but last layer is little bit different.

from skorch import NeuralNetRegressor
class MLP_regressor(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(MLP_regressor, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 1)
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = self.output(x)
        return x

regressor = NeuralNetRegressor(MLP_regressor,

Then make dataset for regression task.

train_y_reg = np.array([float(m.GetProp('SOL')) for m in trainmols], dtype=np.float32).reshape(-1,1)
test_y_reg = np.array([float(m.GetProp('SOL')) for m in testmols], dtype=np.float32).reshape(-1,1), train_y_reg)
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        8.3883       17.1863  0.0306
    100        0.3729        2.7019  0.0245
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=1, bias=True)

Check the model.

reg_y = regressor.predict(test_X)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.arange(-9,2), np.arange(-9,2), color='pink', alpha=0.7)
plt.scatter(reg_y, test_y_reg)
from sklearn.metrics import r2_score
print(f"R2 score of testset was {r2_score(test_y_reg, reg_y):.2f}")
> R2 score of testset was 0.65

Current models have room for improvement because I didn’t optimize model structure and parameters. ;)

In summary skorch is very useful package for programmer who would like to write code for model building and training quickly.

I just uploaded today’s code on my githubrepo.

And it can run on binder.


Scaffold growing with RNN #RDKit #Pytorch #Chemoinformatics

My favorite molecular generator is REINVENT which is SMILES RNN based generator. Because it is very flexible and easy to modify.

And recently same group in Astrazeneca published new version of REINVENT, its title is SMILES-Based Deep Generative Scaffold Decorator for De-Novo Drug Design

It seems very exciting for me! Because there are many molecular generator in these days but there are few implementation for scaffold growing. Some approach uses conditional RNN or conditional graph based approach but it can’t specify the position of growing substituents. However their implementation can decorate user defined positions.

Fortunately the code is able to get from github!

Their approach has two model one is scaffold generator and the another is scaffold decorator. Scaffold decorator is trained with parts of compounds which pass the Rule of 3. It is important point because the idea prevents generation of too large undruggable compounds I think.

OK let’s test the code. At first, this code use pyspark so user need to install spark and pyspark at first. And pyspark work java version 8, not work version 11. Please check your java version. And another key point is that attention mechanism is used for decorator model. By using attention mechanism, the model can find the position where attach the substituents.

I think it is very reasonable and efficient approach for scaffold decoration.

You can read more details in original article so I skipped exprain session and go to code!

Are you ready? Let’s go.

Following code is almost same as in original repo. I used quinoline as an example scaffold and used sample trained model for convenience.

$> git clone 
$> cd reinvent-scaffold-decorator
$> echo "c1c([*:0])cc2c(c1)c([*:1])ccn2" > scaffold.smi
$> git clone 
$> cd reinvent-scaffold-decorator
$> echo "c1c([*:0])cc2c(c1)c([*:1])ccn2" > scaffold.smi
$> ./ -m drd2_decorator/models/model.trained.50 -i scaffold.smi -o generated_molecules.parquet -r 32 -n 32 -d multi

The code above will take few minutes in my env (GPU GTX 1650). GPU machine is recommended. After run the code generated molecules are stored as parquet file.

The file can read by using pyspark. The result image is below. You see, all decorated molecule has two substituents with defined positions. I posed scaffold and linker cutting method before. It seems interesting to apply the code I think ;)

And whole code can be found in my gist.

RDKit + deep learning framework is good friend for chemoinformatics! Let’s enjoy!

Try GCN QSPR with pytorch based graph library #RDKit #Pytorch #dgl

Recently many machine learning articles use pytorch for their implementation. And I found very attractive package for graph based deep learning, named ‘DGL;Deep Graph Library’. The package supports pytorch and mxnet for backend. The author provides not only package but also very nice documentation. I read the document and try GCN for QSPR with DGL.

The package can make graph object from networkx and can convert graph object to networkx object.
Following code example is based their example about batched graph convolution.
At first, import packages which will use.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn

import torch
import torch.nn as nn
import torch.nn.functional as F
from import Dataset, TensorDataset
from import DataLoader
import torch.optim as optim

import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np

Next define mol2graph function. Following example only node features used bond features did not used.

# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []

    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
    return cand_graphs

Next defined the original collate function for data loader. And defined msg and reduce function.
msg function get message from neighbor node and reduce function aggregates the massages.

msg = fn.copy_src(src="h", out="m")
def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
def reduce(nodes):
    # summazation by avarage is different part
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

Then defined the network. By using the dgl user can easily access node and features. For example graph.ndata[‘name’] method can access node features named ‘name’. NodeApplyModule is used for calculation of each node.
It is worth to know that by using torch nn.ModuleList, user can write network like ‘Keras’.

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    def forward(self, node):
        h = self.linear(['h'])
        h = self.activation(h)
        return {'h': h}

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        h =  g.ndata.pop('h')
        return h
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Let’s load data. I used solubility dataset in RDKit

solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]
train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)

Finally run train and check the performance.

model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        pred = model(bg)
        loss = loss_func(pred, label)
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318
plt.plot(epoch_losses, c='b')
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs,1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257

Hmm not so bad. OK I tried random forest next. I would like to use RDKit new descriptor, ‘dragon-type descriptor’. So I used it ;)

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
# generate 3D conf
train_mols2 = copy.deepcopy(train_mols)
test_mols2 = copy.deepcopy(test_mols)

ps = AllChem.ETKDGv2()
for m in train_mols2:
    m = Chem.AddHs(m)
for m in test_mols2:
    m = Chem.AddHs(m)
def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

For convenience I use only one conformer the above code.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100), train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
print(classification_report(test_y, rf_pred_y))
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257

Random Forest showed same performance with GCN in this test.
DGL is very useful package for graph based deeplearning I think. Their original repo provides many example codes! Reader who is interested in pls check the repo.

I uploaded today’s code my git hub and it can check following URL.

Build QSAR model with pytorch and rdkit #RDKit

There are many frameworks in python deeplearning. For example chainer, Keras, Theano, Tensorflow and pytorch.
I have tried Keras, Chainer and Tensorflow for QSAR modeling. And I tried to build QSAR model by using pytorch and RDKit.
You know, pytorch has Dynamic Neural Networks “Define-by-Run” like chainer.
I used solubility data that is provided from rdkit and I used the dataset before.

Let’s start coding.
At first I imported package that is needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing

def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
    fps = np.array(fps, dtype = np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

torch.from_numpy function can convert numpy array to torch tensor. It is very convenient for us.
And then I defined neural network. I feel this method is very unique because I mostly use Keras for deep learning.
To build the model in pytorch, I need define the each layer and whole structure.

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model I tried to lean and prediction.
Following code is training and prediction parts.

model = QSAR_mlp()

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format([0]))

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format([i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(, Y_test))/len(Y_test)))

Check the code.

iwatobipen$ python solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
Loss: 1.1143544912338257
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, accuracy is not so high. I believe there is still room for improvement. I am newbie of pytorch. I will try to practice pytorch next year.

This is my last code of this year. I would like to post my blog more in next year.
If readers who find mistake in my code, please let me know.

Have a happy new year !!!!