There are many publications about graph based approach for chemoinformatics area. I can’t cover all of them but still have interest these area. I think pytorch_geometric (PyG) and deep graph library (DGL) are very attractive and useful package for chemoinformaticians.
I wrote some posts about DGL and PyG. Recent DGL is more chemoinformatics friendly so I used DGL for GCN model building today.
At first I tried to use DGL from Skorch but it failed. So I use DGL only for model building. As usual, I used solubility data for the test. And used RDKit for compound structure handling. I used GCNClassifier.
To use DGL for chemoinformatics area, user should install mdtraj at first. Because mol_to_graph try to import the package but if the code can’t find mdtraj. User can’t get any error message and mol_to_graph related module can’t use.
https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
import os from rdkit import Chem from rdkit import RDPaths import numpy as np import torch import dgl if torch.cuda.is_available(): print('use GPU') device='cuda' else: print('use CPU') device='cpu' from dgl.model_zoo.chem import GCNClassifier from dgl.data.chem.utils import mol_to_graph from dgl.data.chem.utils import mol_to_complete_graph from dgl.data.chem import CanonicalAtomFeaturizer from dgl.data.chem import CanonicalBondFeaturizer from torch import nn import torch.nn.functional as F from torch.utils.data import DataLoader from torch.utils.data import Dataset from torch.nn import CrossEntropyLoss trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf') testsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf') trainmols = [m for m in Chem.SDMolSupplier(trainsdf)] testmols = [m for m in Chem.SDMolSupplier(testsdf)] prop_dict = { "(A) low": 0, "(B) medium": 1, "(C) high": 2 }
After data loaded, rdkit mol objects are converted to graph objects. I used canonical atom featurizer for the task. The features used in the module is the same as deepchem.
It is easy to make graph object with DGL. Just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILE strings. ;)
atom_featurizer = CanonicalAtomFeaturizer() n_feats = atom_featurizer.feat_size('h') print(n_feats) > 74 ncls = 3 train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols] train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols]) train_y = np.array(train_y, dtype=np.int64) test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols] test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols]) test_y = np.array(test_y, dtype=np.int64)
OK let’s define GCN model. GCNClassifier takes gcn_hidden_feats argument as list object. If user would like to add n GCN layers, user should pass list with n hidden layers parameters. I added 2 GCN layers with 60 and 20 hidden layers to the following model.
# define GCN NET with 4 GCN layers gcn_net = GCNClassifier(in_feats=n_feats, gcn_hidden_feats=[60,20], n_tasks=ncls, classifier_hidden_feats=10, dropout=0.5,) gcn_net = gcn_net.to(device)
Then define the collate function for original dataloader object.
def collate(sample): graphs, labels = map(list,zip(*sample)) batched_graph = dgl.batch(graphs) batched_graph.set_n_initializer(dgl.init.zero_initializer) batched_graph.set_e_initializer(dgl.init.zero_initializer) return batched_graph, torch.tensor(labels) train_data = list(zip(train_g, train_y)) train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)
Next, train the model. This example is multi label classification task so I used CrossEntropyLoss for loss function.
loss_fn = CrossEntropyLoss() optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01) gcn_net.train() epoch_losses = [] epoch_accuracies = [] for epoch in range(1,201): epoch_loss = 0 epoch_acc = 0 for i, (bg, labels) in enumerate(train_loader): labels = labels.to(device) atom_feats = bg.ndata.pop('h').to(device) atom_feats, labels = atom_feats.to(device), labels.to(device) pred = gcn_net(bg, atom_feats) loss = loss_fn(pred, labels) optimizer.zero_grad() loss.backward() optimizer.step() epoch_loss += loss.detach().item() pred_cls = pred.argmax(-1).detach().to('cpu').numpy() true_label = labels.to('cpu').numpy() epoch_acc += sum(true_label==pred_cls) / true_label.shape[0] epoch_acc /= (i + 1) epoch_loss /= (i + 1) if epoch % 20 == 0: print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}") epoch_accuracies.append(epoch_acc) epoch_losses.append(epoch_loss) > epoch: 20, LOSS: 0.384, ACC: 0.834 > epoch: 40, LOSS: 0.327, ACC: 0.842 > epoch: 60, LOSS: 0.290, ACC: 0.874 > epoch: 80, LOSS: 0.257, ACC: 0.893 > epoch: 100, LOSS: 0.244, ACC: 0.896 > epoch: 120, LOSS: 0.249, ACC: 0.899 > epoch: 140, LOSS: 0.222, ACC: 0.903 > epoch: 160, LOSS: 0.186, ACC: 0.928 > epoch: 180, LOSS: 0.158, ACC: 0.941 > epoch: 200, LOSS: 0.178, ACC: 0.928
Plot learning process with matplot lib.
%matplotlib inline import matplotlib.pyplot as plt plt.style.use('ggplot') plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss') plt.legend() plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc') plt.legend() plt.xlabel('epoch') plt.ylabel('loss/acc')

The learning curve indicates that the process works fine.
In summary DGL is useful package for chemoinformatics. So I hope more chemistry related example codes will be provided from original repository. ;)
Today’s code can be found in my repository and gist.
https://github.com/iwatobipen/playground/blob/master/gcn_dgl_test.ipynb
Very nice tutorial as always!
We are planning to separate code related to chemistry and biology from DGL into a DGL-LifeSci package. Currently it is placed in https://github.com/dmlc/dgl/tree/master/apps/life_sci. Mostly you just need to change the name space. If you want to migrate your code and encounter any issues please let me know.
2019-nCoV is still going on. Take care.
Hi, Mufei!
Thanks for your comment. I’m following DGL repo and knew that news. Currently DGL-LifeSci can’t install via conda or pypi I think. Do you have plan to registerate dgl-lifesci in anaconda/pypi? I’ll migrate to the package and will contact you when I have any issues. Thanks!
Hope 2019-nCov will be over soon. Take care too.
Currently you can only install it from source, which basically requires cloning the repo and installing it with pip install -e . We will release conda/pip distributions in the future. Thank you.
Hi,
You can now install DGL-LifeSci with pip or conda by following the REAMDE(https://github.com/dmlc/dgl/tree/master/apps/life_sci). Let me know if you have any comments, Thanks.
Thanks for your notification!
That sounds nice.
Thanks for your great work.
Just one question on the bond_featurizer in mol_to_complete_graph function. Why the bond_featurizer is not used for the classification task?
I have tried to uncomment the bond_featurizer
g = mol_to_complete_graph(trainmols[0],
add_self_loop=False,
atom_featurizer=atom_featurizer,
bond_featurizer= bond_featurizer
)
But it reported the following error msg:
DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.
Hi Mark,
Thanks for your query. I commented out bond feature in the code because GCN classifier doesn’t use bond feature, just use atom feature.
You can see the details of the model in following URL.
https://docs.dgl.ai/en/0.4.x/_modules/dgl/model_zoo/chem/classifiers.html#GCNClassifier
Hope this will be help for you.
Thanks,
Thanks a lot for your explanation! I have read the doc and source code indeed the classifier does not take bond features as inputs.
But just for curiosity, why the mol_to_complete_graph reported the error “DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.”, when we tried to generate the bond features for the input compound.
Hi Mark,
This is because that the edge_feature for mol_to_complete_graph requires all features of nodes even if the combination of nodes don’t construct bond.
In my code, trainmols[0] has 5 atoms. It’s means 5P2=20. You can see the details in line 190 of following code https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
However, CanonicalBondFeaturizer returns existing bond features only, the molecule has 4 bonds, its means 8(4×2) edges.
To use the featurizer, you can use mol_to_bigraph function. Both function returns same graph object if user pass only atom features.
Thanks,
i always get the error [[cannot import name ‘model_zoo’ from ‘dgl’]], i conda install DGL for the official website