There are many deep learning frameworks for Python, for example Chainer, Keras, Theano, TensorFlow and PyTorch.
I have tried Keras, Chainer and TensorFlow for QSAR modeling. This time I tried to build a QSAR model using PyTorch and RDKit.
As you may know, PyTorch builds dynamic neural networks in a “define-by-run” style, like Chainer.
I used the solubility data that is provided with RDKit, the same dataset I have used before.
Let’s start coding.
First I imported the packages needed for QSAR and defined some utility functions.
import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing

def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    # type=int so that a value given on the command line can be passed to range()
    parser.add_argument("--epochs", type=int, default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
# SDMolSupplier yields None for molecules it cannot parse, so skip those
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    # Morgan fingerprint (radius 2) as a bit vector, one row per molecule
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    # the builtin float is what the old np.float alias pointed to
    fps = np.array(fps, dtype=float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)
The torch.from_numpy function converts a NumPy array to a torch tensor, which is very convenient.
Then I defined the neural network. This approach feels very unique to me because I mostly use Keras for deep learning.
To build a model in PyTorch, I need to define each layer and the overall structure.
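As a small illustration of what torch.from_numpy does, here is a minimal sketch, assuming a recent PyTorch (0.4 or later, where tensors expose `.dtype`). The arrays `features` and `labels` are hypothetical stand-ins for the fingerprint matrix and the class labels. The tensor keeps the dtype of the array and shares its memory, which is why the float64 features are cast with `.float()` before they go into the network.

```python
import numpy as np
import torch

# Hypothetical stand-ins for the fingerprint matrix and the class labels
features = np.zeros((4, 8), dtype=np.float64)
labels = np.array([0, 1, 2, 1], dtype=np.int64)

x = torch.from_numpy(features)  # dtype follows the array: float64
y = torch.from_numpy(labels)    # int64, i.e. a "long" tensor
print(x.dtype, y.dtype)

# The tensor shares memory with the array, so a write to the array is visible in the tensor
features[0, 0] = 1.0
print(x[0, 0].item())

# nn.Linear weights are float32 by default, so the input is cast before use
x32 = x.float()
print(x32.dtype)
```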
X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_train.size(),Y_train.size())
# NOTE: this prints Y_train.size() a second time; Y_test.size() was probably intended
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        # 2048-bit fingerprint in, 3 solubility classes out
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)

    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output
After defining the model, I tried training and prediction.
The following code is the training and prediction part.
model = QSAR_mlp()
print(model)

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
# full-batch training: every epoch uses the whole training set at once
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    # loss.data[0] is the PyTorch 0.3 idiom; 0.4 and later use loss.item()
    print("Loss: {}".format(loss.data[0]))
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
# index of the largest output per row is the predicted class
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format(predicted.data[i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(predicted.data, Y_test))/len(Y_test)))
Let’s run the code.
iwatobipen$ python qsar_pytorch.py solubility.train.sdf solubility.test.sdf
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
-snip-
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
-snip-
pred:0, target:0
Accuracy: 0.642023346303502
Hmm, the accuracy is not so high. I believe there is still room for improvement. I am a PyTorch newbie, so I will practice PyTorch more next year.
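One thing that may be worth checking is the sigmoid on the last layer. F.cross_entropy applies log-softmax internally, so it expects raw scores; if the scores are first squashed into (0, 1) by a sigmoid, the per-sample loss for three classes cannot fall below log(1 + 2/e), which is about 0.551. The pure-Python sketch below only illustrates that arithmetic. The helper functions and the `raw` values are hypothetical, and it does not show that this is the actual reason for the accuracy above.

```python
import math

def cross_entropy(scores, target):
    """Softmax cross-entropy for a single sample (the per-sample term of F.cross_entropy)."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical, very confident last-layer values for a sample whose true class is 0
raw = [20.0, -20.0, -20.0]
squashed = [sigmoid(v) for v in raw]

loss_raw = cross_entropy(raw, 0)            # can get arbitrarily close to 0
loss_squashed = cross_entropy(squashed, 0)  # stuck near the floor
floor = math.log(1 + 2 / math.e)            # best case after sigmoid: scores (1, 0, 0)

print(loss_raw, loss_squashed, floor)
```

If this is relevant here, the change to try would be returning `self.fc4(h3)` from `forward` without the sigmoid, then comparing the loss curve and the accuracy again.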
This is my last code of this year. I would like to post to my blog more often next year.
If any readers find a mistake in my code, please let me know.
Have a happy new year !!!!
;-)
Traceback (most recent call last):
  File "qsar_pytorch.py", line 92, in
    print( "Accuracy: {}".format(sum(p==t for p,t in zip(predicted.data, Y_test))/len(Y_test)))
RuntimeError: value cannot be converted to type uint8_t without overflow: 257
PyTorch: 0.4
Hi,
My code works in torch 0.3, and I confirmed the same error in torch 0.4. There are many changes in this major update.
You can find the migration guide at the following URL.
https://pytorch.org/2018/04/22/0_4_0-migration-guide.html
If you change the last line as follows, the code will work.
print("Accuracy: {}".format(sum(p.item()==t.item() for p, t in zip(predicted.data, Y_test))/len(Y_test)))
or
print("Accuracy: {}".format(sum(p==t for p, t in zip(predicted.data.numpy(), Y_test.numpy()))/len(Y_test)))
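The error appears to come from the division: in 0.4 the element-wise comparisons give uint8 tensors, so the sum is still a uint8 tensor, and dividing it by len(Y_test) = 257 asks PyTorch to fit 257 into uint8. A vectorized variant that converts the count to a plain Python int before dividing could look like the sketch below. It assumes PyTorch 0.4 or later; the `accuracy` helper and the toy tensors are hypothetical and not part of the original script.

```python
import torch

def accuracy(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target label."""
    predicted = scores.argmax(dim=1)
    # .item() turns the 0-dim count tensor into a Python int,
    # so the division below is ordinary Python arithmetic
    correct = (predicted == targets).sum().item()
    return correct / targets.size(0)

# Hypothetical toy data: 4 samples, 3 classes
scores = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.3, 0.2, 0.9],
                       [1.2, 0.4, 0.1]])
targets = torch.tensor([0, 1, 1, 0])

print("Accuracy: {}".format(accuracy(scores, targets)))
```

In the script this would be called as `accuracy(pred_y, Y_test)`.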
Hope this will help you.