Use pytorch for QSAR model building more simply like scikit-learn #pytorch #chemoinformatics #RDKit

I often use pytorch for deep learning framework. I like pytorch because it is very flexible and many recent articles are used it for their implementation.

But to build model and train the model, I need to define training method. So it seems nice if I can train pytorch model just calling fit like scikit-learn doesn’t it?

Fortunately by using skorch, it will be enable training process to be simple.

Sktorch is a scikit-learn compatible network library that waps pytorch. Wow it’s cool! Skorch can be installed via pip command.

I installed it and then tried to build QSAR model with skorch.
At first, import packages and make dataset for training and test. For classification task, class index should be float32. float64 will cause Error.

import os
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')
prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


nbits = 1024
ncls = 3

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in trainmols]
train_X = np.array([fp2arr(fp) for fp in train_fps], dtype=np.float32)

train_y = np.array([ohencoder(prop_dict[m.GetProp('SOL_classification')], ncls) for m in trainmols])
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in testmols]
test_X = np.array([fp2arr(fp) for fp in test_fps], dtype=np.float32)

test_y = np.array([ohencoder(prop_dict[m.GetProp('SOL_classification')], ncls) for m in testmols])
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

Then import pytorch and skorch. I checked whether GPU is available or not.

import torch
if torch.cuda.is_available():
    print('use GPU')
    device='cuda'
else:
    print('use CPU')
    device='cpu'

from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

Next, define simple multi-layer perceptron model. Following example is multilabel classification so I used softmax layer at the last layer.

class SimpleMLP(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(SimpleMLP, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, ncls)
    
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = F.softmax(self.output(x), dim=-1)
        return x

Next, make NeuralNetClassifier with defined model. It is very easy to use GPU. If you pass the device=’cuda’ option to the object, that’s all! ;-) After that let’s train the model with fit method.

net = NeuralNetClassifier(SimpleMLP,
                         max_epochs=100,
                         lr=0.1,
                         iterator_train__shuffle=True,
                         device=device
                         )

net.fit(train_X, train_y)

 epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.1241       0.3902        1.1102  0.1851
      ----    snip---------
    100        0.0831       0.6146        1.0327  0.0251
Out[10]:
[initialized](
  module_=SimpleMLP(
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=3, bias=True)
  ),
)

Ok, let check the model performance.

pred = net.predict(test_X)
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(test_y, pred)
array([[83, 18,  1],
       [35, 72,  8],
       [ 0, 11, 29]])

print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       0.70      0.81      0.75       102
           1       0.71      0.63      0.67       115
           2       0.76      0.72      0.74        40

    accuracy                           0.72       257
   macro avg       0.73      0.72      0.72       257
weighted avg       0.72      0.72      0.71       257

It seems work fine.

Then I tried to build regression model. NeuralNetRegressor is used to do it. The model structure is almost same as above but last layer is little bit different.

from skorch import NeuralNetRegressor
class MLP_regressor(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(MLP_regressor, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 1)
    
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = self.output(x)
        return x

regressor = NeuralNetRegressor(MLP_regressor,
                              max_epochs=100,
                              iterator_train__shuffle=True,
                              device=device)

Then make dataset for regression task.

train_y_reg = np.array([float(m.GetProp('SOL')) for m in trainmols], dtype=np.float32).reshape(-1,1)
test_y_reg = np.array([float(m.GetProp('SOL')) for m in testmols], dtype=np.float32).reshape(-1,1)
regressor.fit(train_X, train_y_reg)
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        8.3883       17.1863  0.0306
          ------snip------
    100        0.3729        2.7019  0.0245
[initialized](
  module_=MLP_regressor(
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=1, bias=True)
  ),
)

Check the model.

reg_y = regressor.predict(test_X)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.arange(-9,2), np.arange(-9,2), color='pink', alpha=0.7)
plt.scatter(reg_y, test_y_reg)
plt.xlabel('predicted')
plt.ylabel('exp')

from sklearn.metrics import r2_score
print(f"R2 score of testset was {r2_score(test_y_reg, reg_y):.2f}")
> R2 score of testset was 0.65

Current models have room for improvement because I didn’t optimize model structure and parameters. ;)

In summary skorch is very useful package for programmer who would like to write code for model building and training quickly.

I just uploaded today’s code on my githubrepo.

And it can run on binder.

Use pytorch for QSAR model building more simply like scikit-learn #pytorch #chemoinformatics #RDKit

Published by iwatobipen

Leave a comment Cancel reply

Share this:

Related

Published by iwatobipen

Leave a comment Cancel reply