Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph based approach for chemoinformatics area. I can’t cover all of them but still have interest these area. I think pytorch_geometric (PyG) and deep graph library (DGL) are very attractive and useful package for chemoinformaticians.

I wrote some posts about DGL and PyG. Recent DGL is more chemoinformatics friendly so I used DGL for GCN model building today.

At first I tried to use DGL from Skorch but it failed. So I use DGL only for model building. As usual, I used solubility data for the test. And used RDKit for compound structure handling. I used GCNClassifier.
To use DGL for chemoinformatics area, user should install mdtraj at first. Because mol_to_graph try to import the package but if the code can’t find mdtraj. User can’t get any error message and mol_to_graph related module can’t use.
https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    print('use GPU')
    device='cuda'
else:
    print('use CPU')
    device='cpu'

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem.utils import mol_to_graph
from dgl.data.chem.utils import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

After data loaded, rdkit mol objects are converted to graph objects. I used canonical atom featurizer for the task. The features used in the module is the same as deepchem.

It is easy to make graph object with DGL. Just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILE strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
print(n_feats)
> 74

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

OK let’s define GCN model. GCNClassifier takes gcn_hidden_feats argument as list object. If user would like to add n GCN layers, user should pass list with n hidden layers parameters. I added 2 GCN layers with 60 and 20 hidden layers to the following model.

# define GCN NET with 4 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
                    gcn_hidden_feats=[60,20],
                    n_tasks=ncls,
                    classifier_hidden_feats=10,
                    dropout=0.5,)
gcn_net = gcn_net.to(device)

Then define the collate function for original dataloader object.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    batched_graph.set_n_initializer(dgl.init.zero_initializer)
    batched_graph.set_e_initializer(dgl.init.zero_initializer)
    return batched_graph, torch.tensor(labels)
train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is multi label classification task so I used CrossEntropyLoss for loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
gcn_net.train()
epoch_losses = []
epoch_accuracies = []
for epoch in range(1,201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        labels = labels.to(device)
        atom_feats = bg.ndata.pop('h').to(device)
        atom_feats, labels = atom_feats.to(device), labels.to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label==pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")
    epoch_accuracies.append(epoch_acc)
    epoch_losses.append(epoch_loss)

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot learning process with matplot lib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss')
plt.legend()
plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('loss/acc')

The learning curve indicates that the process works fine.

In summary DGL is useful package for chemoinformatics. So I hope more chemistry related example codes will be provided from original repository. ;)

Today’s code can be found in my repository and gist.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

https://github.com/iwatobipen/playground/blob/master/gcn_dgl_test.ipynb

Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

10 thoughts on “Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

  1. Very nice tutorial as always!

    We are planning to separate code related to chemistry and biology from DGL into a DGL-LifeSci package. Currently it is placed in https://github.com/dmlc/dgl/tree/master/apps/life_sci. Mostly you just need to change the name space. If you want to migrate your code and encounter any issues please let me know.

    2019-nCoV is still going on. Take care.

    1. Hi, Mufei!
      Thanks for your comment. I’m following DGL repo and knew that news. Currently DGL-LifeSci can’t install via conda or pypi I think. Do you have plan to registerate dgl-lifesci in anaconda/pypi? I’ll migrate to the package and will contact you when I have any issues. Thanks!

      Hope 2019-nCov will be over soon. Take care too.

      1. Currently you can only install it from source, which basically requires cloning the repo and installing it with pip install -e . We will release conda/pip distributions in the future. Thank you.

  2. Just one question on the bond_featurizer in mol_to_complete_graph function. Why the bond_featurizer is not used for the classification task?

    I have tried to uncomment the bond_featurizer
    g = mol_to_complete_graph(trainmols[0],
    add_self_loop=False,
    atom_featurizer=atom_featurizer,
    bond_featurizer= bond_featurizer
    )

    But it reported the following error msg:
    DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.

    1. Thanks a lot for your explanation! I have read the doc and source code indeed the classifier does not take bond features as inputs.

      But just for curiosity, why the mol_to_complete_graph reported the error “DGLError: Expect number of features to match number of edges. Got 8 and 20 instead.”, when we tried to generate the bond features for the input compound.

      1. Hi Mark,
        This is because that the edge_feature for mol_to_complete_graph requires all features of nodes even if the combination of nodes don’t construct bond.
        In my code, trainmols[0] has 5 atoms. It’s means 5P2=20. You can see the details in line 190 of following code https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
        However, CanonicalBondFeaturizer returns existing bond features only, the molecule has 4 bonds, its means 8(4×2) edges.
        To use the featurizer, you can use mol_to_bigraph function. Both function returns same graph object if user pass only atom features.
        Thanks,

  3. i always get the error [[cannot import name ‘model_zoo’ from ‘dgl’]], i conda install DGL for the official website

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: