Recently, many machine learning articles use PyTorch for their implementation. I found a very attractive package for graph-based deep learning named DGL (Deep Graph Library). The package supports PyTorch and MXNet as backends, and the authors provide not only the package but also very nice documentation. I read the documentation and tried GCN for QSPR with DGL.
https://www.dgl.ai/pages/about.html
The package can build a graph object from a networkx graph and can convert a DGL graph back into a networkx object.
The following code example is based on their tutorial about batched graph convolution.
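For example, a quick round trip between networkx and DGL looks roughly like this (a minimal sketch based on the old DGLGraph API used throughout this post):

import networkx as nx
import dgl

# build a small networkx graph and wrap it in a DGLGraph
nxg = nx.path_graph(4)              # 0-1-2-3
g = dgl.DGLGraph(nxg)               # DGLGraph accepts a networkx graph
print(g.number_of_nodes(), g.number_of_edges())

# and convert back to networkx, e.g. for drawing
nxg2 = g.to_networkx()
nx.draw(nxg2, with_labels=True)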
https://docs.dgl.ai/tutorials/basics/4_batch.html
At first, import the packages that will be used.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim
import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np
Next, define the mol2graph function. In the following example only node (atom) features are used; bond features are not used.
# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM = 60
BOND_FDIM = 5
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans). Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE,
                          bt == Chem.rdchem.BondType.DOUBLE,
                          bt == Chem.rdchem.BondType.TRIPLE,
                          bt == Chem.rdchem.BondType.AROMATIC,
                          bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []
    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)

        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs
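As a quick sanity check of the function, a small hypothetical usage example on a single molecule could look like this (the printed shapes follow from the feature definition above):

# quick check of mol2dgl_single on one molecule (illustrative example only)
mol = Chem.MolFromSmiles('c1ccccc1O')     # phenol: 7 heavy atoms, 7 bonds
g = mol2dgl_single([mol])[0]
print(g.number_of_nodes())                # 7 nodes, one per heavy atom
print(g.number_of_edges())                # 14, each bond is added in both directions
print(g.ndata['h'].shape)                 # torch.Size([7, ATOM_FDIM]) = (7, 35)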
Next, define a custom collate function for the data loader, and define the msg and reduce functions.
The msg function gathers messages from neighboring nodes, and the reduce function aggregates those messages.
msg = fn.copy_src(src="h", out="m")

def collate(sample):
    graphs, labels = map(list, zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)

def reduce(nodes):
    # aggregation by mean is the part that differs from the sum-based tutorial
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}
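As a rough illustration of what collate does under the hood, dgl.batch simply merges several molecule graphs into one big disconnected graph, so a single forward pass handles the whole mini-batch (a sketch using the mol2dgl_single function above):

# illustrative only: batch two molecule graphs into one graph
g1 = mol2dgl_single([Chem.MolFromSmiles('CCO')])[0]        # 3 nodes
g2 = mol2dgl_single([Chem.MolFromSmiles('c1ccccc1')])[0]   # 6 nodes
bg = dgl.batch([g1, g2])
print(bg.number_of_nodes())   # 9 = 3 + 6
print(bg.batch_size)          # 2
# dgl.unbatch(bg) would recover the individual graphs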
Then define the network. With DGL the user can easily access node features: for example, graph.ndata['name'] accesses the node features stored under the name 'name'. NodeApplyModule is used for the per-node computation.
It is worth knowing that by using torch's nn.ModuleList, the user can write the network in a 'Keras-like' way.
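For instance, node features are just tensors stored under a key in ndata (a toy example, not part of the model below):

# toy example: node data access in DGL
g = DGLGraph()
g.add_nodes(3)
g.ndata['h'] = torch.ones(3, 4)    # one 4-dim feature vector per node
print(g.ndata['h'].shape)          # torch.Size([3, 4])
h = g.ndata.pop('h')               # pop removes the field and returns the tensor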
class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation

    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)

    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h = g.ndata.pop('h')
        #print(h.shape)
        return h

class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                     GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)

    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)
Let's load the data. I used the solubility dataset bundled with RDKit.
solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]

train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)
Finally, run training and check the performance.
model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318

plt.plot(epoch_losses, c='b')

model.eval()
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
test_bg.set_e_initializer(dgl.init.zero_initializer)
test_bg.set_n_initializer(dgl.init.zero_initializer)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs, 1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

accuracy_score(test_y, pred_y)
>0.7587548638132295

print(classification_report(test_y, pred_y))

              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257
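confusion_matrix was imported above but not used; for completeness, a quick per-class look could be done like this (a small sketch, output not shown):

# confusion matrix for the GCN predictions (rows: true class, columns: predicted class)
cm = confusion_matrix(test_y, pred_y)
sns.heatmap(cm, annot=True, fmt='d',
            xticklabels=list(solcls.keys()), yticklabels=list(solcls.keys()))
plt.xlabel('predicted')
plt.ylabel('true')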
Hmm, not so bad. OK, I tried random forest next. I would like to use RDKit's new 'dragon-type' descriptors, so I used them ;)
https://github.com/rdkit/UGM_2018/blob/master/Notebooks/Landrum_Whats_New.ipynb
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolDescriptors
from sklearn.preprocessing import normalize

# generate a 3D conformer for each molecule
# (Chem.AddHs returns a new molecule, so collect the returned copies and embed those)
train_mols2 = [Chem.AddHs(m) for m in train_mols]
test_mols2 = [Chem.AddHs(m) for m in test_mols]
ps = AllChem.ETKDGv2()
for m in train_mols2:
    AllChem.EmbedMolecule(m, ps)
for m in test_mols2:
    AllChem.EmbedMolecule(m, ps)

def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)

train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])
For convenience, I used only one conformer per molecule in the above code.
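If one wanted to average the descriptors over several conformers instead, it might look roughly like the sketch below. This is only an illustration: it assumes the 3D descriptor functions accept a confId argument, and I have not benchmarked it.

# sketch: embed multiple conformers and average the 3D descriptors over them
def calc_desc_multi_conf(mol, n_confs=5):
    mol = Chem.AddHs(mol)
    cids = AllChem.EmbedMultipleConfs(mol, n_confs, ps)
    descs = []
    for cid in cids:
        descs.append(rdMolDescriptors.CalcAUTOCORR3D(mol, confId=cid)
                     + rdMolDescriptors.CalcMORSE(mol, confId=cid)
                     + rdMolDescriptors.CalcRDF(mol, confId=cid)
                     + rdMolDescriptors.CalcWHIM(mol, confId=cid))
    return np.mean(descs, axis=0)   # average descriptor vector over conformers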
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
>0.7587548638132295

print(classification_report(test_y, rf_pred_y))

              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257
Random Forest showed comparable performance to GCN in this test; the overall accuracy was the same.
I think DGL is a very useful package for graph-based deep learning. Their original repo provides many example codes! Readers who are interested, please check the repo.
I uploaded today's code to my GitHub; it can be viewed at the following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/GCN_chemo.ipynb
Thank you for your interest in our project. We are now officially supporting a model zoo for chemistry, in which you might be interested: https://github.com/dmlc/dgl/tree/master/examples/pytorch/model_zoo/chem. Any feedback is appreciated, e.g. the utilities/models you'd like to see in the future.
Hi Mufei,
Thanks for your comment. It is awesome work! I think it is really useful. Sure, I will give feedback when I have any ideas or questions!