Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph based approach for chemoinformatics area. I can’t cover all of them but still have interest these area. I think pytorch_geometric (PyG) and deep graph library (DGL) are very attractive and useful package for chemoinformaticians.

I wrote some posts about DGL and PyG. Recent DGL is more chemoinformatics friendly so I used DGL for GCN model building today.

At first I tried to use DGL from Skorch but it failed. So I use DGL only for model building. As usual, I used solubility data for the test. And used RDKit for compound structure handling. I used GCNClassifier.
To use DGL for chemoinformatics area, user should install mdtraj at first. Because mol_to_graph try to import the package but if the code can’t find mdtraj. User can’t get any error message and mol_to_graph related module can’t use.

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    print('use GPU')
    print('use CPU')

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem.utils import mol_to_graph
from dgl.data.chem.utils import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2

After data loaded, rdkit mol objects are converted to graph objects. I used canonical atom featurizer for the task. The features used in the module is the same as deepchem.

It is easy to make graph object with DGL. Just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILE strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
> 74

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

OK let’s define GCN model. GCNClassifier takes gcn_hidden_feats argument as list object. If user would like to add n GCN layers, user should pass list with n hidden layers parameters. I added 2 GCN layers with 60 and 20 hidden layers to the following model.

# define GCN NET with 4 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
gcn_net = gcn_net.to(device)

Then define the collate function for original dataloader object.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is multi label classification task so I used CrossEntropyLoss for loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
epoch_losses = []
epoch_accuracies = []
for epoch in range(1,201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        labels = labels.to(device)
        atom_feats = bg.ndata.pop('h').to(device)
        atom_feats, labels = atom_feats.to(device), labels.to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label==pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot learning process with matplot lib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss')
plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc')

The learning curve indicates that the process works fine.

In summary DGL is useful package for chemoinformatics. So I hope more chemistry related example codes will be provided from original repository. ;)

Today’s code can be found in my repository and gist.


9 thoughts on “Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

  1. Very nice tutorial as always!

    We are planning to separate code related to chemistry and biology from DGL into a DGL-LifeSci package. Currently it is placed in https://github.com/dmlc/dgl/tree/master/apps/life_sci. Mostly you just need to change the name space. If you want to migrate your code and encounter any issues please let me know.

    2019-nCoV is still going on. Take care.

    • Hi, Mufei!
      Thanks for your comment. I’m following DGL repo and knew that news. Currently DGL-LifeSci can’t install via conda or pypi I think. Do you have plan to registerate dgl-lifesci in anaconda/pypi? I’ll migrate to the package and will contact you when I have any issues. Thanks!

      Hope 2019-nCov will be over soon. Take care too.

  2. Just one question on the bond_featurizer in mol_to_complete_graph function. Why the bond_featurizer is not used for the classification task?

    I have tried to uncomment the bond_featurizer
    g = mol_to_complete_graph(trainmols[0],
    bond_featurizer= bond_featurizer

    But it reported the following error msg:
    DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.

    • Thanks a lot for your explanation! I have read the doc and source code indeed the classifier does not take bond features as inputs.

      But just for curiosity, why the mol_to_complete_graph reported the error “DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.”, when we tried to generate the bond features for the input compound.

      • Hi Mark,
        This is because that the edge_feature for mol_to_complete_graph requires all features of nodes even if the combination of nodes don’t construct bond.
        In my code, trainmols[0] has 5 atoms. It’s means 5P2=20. You can see the details in line 190 of following code https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
        However, CanonicalBondFeaturizer returns existing bond features only, the molecule has 4 bonds, its means 8(4×2) edges.
        To use the featurizer, you can use mol_to_bigraph function. Both function returns same graph object if user pass only atom features.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.