Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph-based approaches in the chemoinformatics area. I can’t cover all of them, but I am still interested in this area. I think pytorch_geometric (PyG) and deep graph library (DGL) are very attractive and useful packages for chemoinformaticians.

I have written some posts about DGL and PyG before. Recent DGL is more chemoinformatics friendly, so I used DGL for GCN model building today.

At first I tried to use DGL from Skorch but it failed, so I used DGL on its own for model building. As usual, I used the solubility data for the test and RDKit for compound structure handling. I used GCNClassifier.
To use DGL for chemoinformatics, you should install mdtraj first. mol_to_graph tries to import the package, but if the code can’t find mdtraj, no error message is raised and the mol_to_graph related modules simply can’t be used.
https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
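
A quick way to surface this silent dependency is to check the import yourself before using the DGL chem utilities (a minimal sketch using only the standard library):

import importlib.util

# mol_to_graph silently depends on mdtraj; fail loudly here instead
if importlib.util.find_spec('mdtraj') is None:
    raise ImportError('mdtraj is not installed; the dgl.data.chem mol_to_graph utilities will not work')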

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    print('use GPU')
    device='cuda'
else:
    print('use CPU')
    device='cpu'

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem.utils import mol_to_graph
from dgl.data.chem.utils import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

# skip molecules that RDKit fails to parse
trainmols = [m for m in Chem.SDMolSupplier(trainsdf) if m is not None]
testmols = [m for m in Chem.SDMolSupplier(testsdf) if m is not None]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

After the data is loaded, the RDKit mol objects are converted to graph objects. I used the canonical atom featurizer for the task. The features used in the module are the same as deepchem’s.

It is easy to make a graph object with DGL. Just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILES strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
print(n_feats)
> 74
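
The SMILES route looks like this (a small sketch; I assume smiles_to_complete_graph takes the same node_featurizer argument as mol_to_complete_graph, and 'CCO' is just a placeholder molecule):

from dgl.data.chem.utils import smiles_to_complete_graph

# a complete graph over the heavy atoms, with 74-dim 'h' node features
g = smiles_to_complete_graph('CCO', node_featurizer=atom_featurizer)
print(g)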

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols], dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols], dtype=np.int64)
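
A quick sanity check on the converted graphs (the node features are stored under the 'h' key, the same key queried by feat_size above):

print(train_g[0])
print(train_g[0].ndata['h'].shape)  # (num_atoms, 74)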

OK, let’s define the GCN model. GCNClassifier takes the gcn_hidden_feats argument as a list: if you would like to stack n GCN layers, pass a list with n hidden-size entries. I added 2 GCN layers with 60 and 20 hidden units to the following model.

# define GCN NET with 2 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
                    gcn_hidden_feats=[60,20],
                    n_tasks=ncls,
                    classifier_hidden_feats=10,
                    dropout=0.5,)
gcn_net = gcn_net.to(device)

Then define the collate function for the DataLoader.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    batched_graph.set_n_initializer(dgl.init.zero_initializer)
    batched_graph.set_e_initializer(dgl.init.zero_initializer)
    return batched_graph, torch.tensor(labels)

train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is a multi-class classification task, so I used CrossEntropyLoss as the loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
gcn_net.train()
epoch_losses = []
epoch_accuracies = []
for epoch in range(1,201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        labels = labels.to(device)
        # pop the atom features from the batched graph and move them to the device
        atom_feats = bg.ndata.pop('h').to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label==pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")
    epoch_accuracies.append(epoch_acc)
    epoch_losses.append(epoch_loss)

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot the learning process with matplotlib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot(range(1, 201), epoch_losses, c='b', alpha=0.6, label='loss')
plt.plot(range(1, 201), epoch_accuracies, c='r', alpha=0.6, label='acc')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('loss/acc')

The learning curve indicates that the process works fine.

In summary, DGL is a useful package for chemoinformatics, so I hope more chemistry related example code will be provided in the original repository. ;)

Today’s code can be found in my repository and gist.

https://github.com/iwatobipen/playground/blob/master/gcn_dgl_test.ipynb

New trial of AttentiveFP with new atom feature #DGL #RDKit #Chemoinformatics

Recently I posted an example of AttentiveFP, and I found that the atom weights don’t directly reflect functional groups. I also got a useful suggestion via a comment from a DGL developer!

So I wondered how it would work to use functional group features to train the model.

But how can I detect functional groups in a molecule? Functional groups are, after all, a human-defined feature.

…. Fortunately, as you know, RDKit has a useful function to extract functional groups automatically!

The original article is below.
An algorithm to identify functional groups in organic molecules, Peter Ertl: https://jcheminf.springeropen.com/articles/10.1186/s13321-017-0225-z

And the implementation can be found at the following URL.
https://github.com/rdkit/rdkit/tree/master/Contrib/IFG

So I used the function to define a new atom featurizer. The code is below. The util function detects the functional groups of a molecule and stores each group’s type as an atom property, which can then be used by an atom featurizer.

#ifgutil.py
import sys
import os
from rdkit import Chem
from rdkit import RDPaths
from dgl.data.chem.utils import one_hot_encoding

ifg_path = os.path.join(RDPaths.RDContribDir, 'IFG')
sys.path.append(ifg_path)

import ifg


def map_fgs(mol):
    atoms = list(mol.GetAtoms())
    for atom in atoms:
        atom.SetProp("IFG_TYPE", "")
    fgs = ifg.identify_functional_groups(mol)
    for fg in fgs:
        for atmid in fg.atomIds:
            atom = mol.GetAtomWithIdx(atmid)
            atom.SetProp('IFG_TYPE', fg.type)
    return mol

def make_ifg_list(mols):
    res = set()
    for mol in mols:
        for atom in mol.GetAtoms():
            ifg = atom.GetProp('IFG_TYPE')
            res.add(ifg)
    return list(res)

def atom_ifg_one_hot(atom, allowable_set=None, encode_unknown=False):
    if allowable_set is None:
        raise ValueError('allowable_set must be given')
    try:
        ifg = atom.GetProp('IFG_TYPE')
    except KeyError:
        # map_fgs must be called on the molecule first so that IFG_TYPE is set
        raise RuntimeError('get IFG_TYPE first by calling map_fgs')

    return one_hot_encoding(ifg, allowable_set, encode_unknown=encode_unknown)

And I used the featurizer for AttentiveFP training.
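
A minimal sketch of how the pieces fit together, assuming mols is a list of RDKit mol objects read beforehand and using the BaseAtomFeaturizer/ConcatFeaturizer API from my previous posts (the 'h' key name is my choice):

from functools import partial
from dgl.data.chem import BaseAtomFeaturizer, ConcatFeaturizer
from ifgutil import map_fgs, make_ifg_list, atom_ifg_one_hot

# tag every atom with its IFG type, then collect the set of observed types
mols = [map_fgs(m) for m in mols]
allowable_ifgs = make_ifg_list(mols)

# the IFG one-hot can be concatenated with any other atom features
atom_featurizer = BaseAtomFeaturizer({
    'h': ConcatFeaturizer([
        partial(atom_ifg_one_hot, allowable_set=allowable_ifgs, encode_unknown=True),
    ])
})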

The whole code is uploaded to my gist. ;)

AttentiveFP with IFG.

In this case, the atom weights do not directly reflect functional groups, but it seems the model can pick up some functional group features, I think.

AttentiveFP uses a GRU, so its learning process is complex. I would like to apply the featurizer to a simpler algorithm such as GCN.

Visualize atom weight of AttentiveFP #DGL #RDKit #Chemoinformatics

Yesterday, I posted an example of DGL (almost the same as the original example code).

With it I could build a regression model on my own dataset. Fortunately, the DGL developers provide code for visualizing the atom weights of a trained model.

It means that, after building a model with AttentiveFP, you can visualize the atom weights of a given molecule, i.e. how much each atom contributes to the target value.

I have seen many examples of this approach but had never tried it myself. So I tried it today.

The following code is the same as yesterday’s.

%matplotlib inline 
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDPaths

import dgl
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from dgl import model_zoo

from dgl.data.chem.utils import mol_to_complete_graph, mol_to_bigraph

from dgl.data.chem.utils import atom_type_one_hot
from dgl.data.chem.utils import atom_degree_one_hot
from dgl.data.chem.utils import atom_formal_charge
from dgl.data.chem.utils import atom_num_radical_electrons
from dgl.data.chem.utils import atom_hybridization_one_hot
from dgl.data.chem.utils import atom_total_num_H_one_hot
from dgl.data.chem.utils import one_hot_encoding
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from dgl.data.chem import ConcatFeaturizer
from dgl.data.chem import BaseAtomFeaturizer
from dgl.data.chem import BaseBondFeaturizer

from dgl.data.utils import split_dataset

from functools import partial
from sklearn.metrics import roc_auc_score
def chirality(atom):
    try:
        return one_hot_encoding(atom.GetProp('_CIPCode'), ['R', 'S']) + \
               [atom.HasProp('_ChiralityPossible')]
    except KeyError:
        return [False, False] + [atom.HasProp('_ChiralityPossible')]
    
def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader.
    Parameters
    ----------
    data : list of 3-tuples or 4-tuples.
        Each tuple is for a single datapoint, consisting of
        a SMILES, a DGLGraph, all-task labels and optionally
        a binary mask indicating the existence of labels.
    Returns
    -------
    smiles : list
        List of smiles
    bg : BatchedDGLGraph
        Batched DGLGraphs
    labels : Tensor of dtype float32 and shape (B, T)
        Batched datapoint labels. B is len(data) and
        T is the number of total tasks.
    masks : Tensor of dtype float32 and shape (B, T)
        Batched datapoint binary mask, indicating the
        existence of labels. If binary masks are not
        provided, return a tensor with ones.
    """
    assert len(data[0]) in [3, 4], \
        'Expect the tuple to be of length 3 or 4, got {:d}'.format(len(data[0]))
    if len(data[0]) == 3:
        smiles, graphs, labels = map(list, zip(*data))
        masks = None
    else:
        smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    bg.set_n_initializer(dgl.init.zero_initializer)
    bg.set_e_initializer(dgl.init.zero_initializer)
    labels = torch.stack(labels, dim=0)
    
    if masks is None:
        masks = torch.ones(labels.shape)
    else:
        masks = torch.stack(masks, dim=0)
    return smiles, bg, labels, masks

atom_featurizer = BaseAtomFeaturizer(
                 {'hv': ConcatFeaturizer([
                  partial(atom_type_one_hot, allowable_set=[
                          'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At'],
                    encode_unknown=True),
                  partial(atom_degree_one_hot, allowable_set=list(range(6))),
                  atom_formal_charge, atom_num_radical_electrons,
                  partial(atom_hybridization_one_hot, encode_unknown=True),
                  lambda atom: [0], # A placeholder for aromatic information,
                    atom_total_num_H_one_hot, chirality
                 ],
                )})
bond_featurizer = BaseBondFeaturizer({
                                     'he': lambda bond: [0 for _ in range(10)]
    })

train=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

# load molecules, SMILES and solubility labels (same as the code in the next section)
train_mols = [m for m in Chem.SDMolSupplier(train) if m is not None]
train_smi = [Chem.MolToSmiles(m) for m in train_mols]
train_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in train_mols]).reshape(-1,1)

test_mols = [m for m in Chem.SDMolSupplier(test) if m is not None]
test_smi = [Chem.MolToSmiles(m) for m in test_mols]
test_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in test_mols]).reshape(-1,1)

train_graph = [mol_to_bigraph(mol,
                              atom_featurizer=atom_featurizer,
                              bond_featurizer=bond_featurizer) for mol in train_mols]

test_graph = [mol_to_bigraph(mol,
                             atom_featurizer=atom_featurizer,
                             bond_featurizer=bond_featurizer) for mol in test_mols]

def run_a_train_epoch(n_epochs, epoch, model, data_loader, loss_criterion, optimizer):
    model.train()
    losses = []

    for batch_id, batch_data in enumerate(data_loader):
        smiles, bg, labels, masks = batch_data
        if torch.cuda.is_available():
            bg.to(torch.device('cuda:0'))
            labels = labels.to('cuda:0')
            masks = masks.to('cuda:0')

        prediction = model(bg, bg.ndata['hv'], bg.edata['he'])
        # mask out datapoints without labels before averaging the loss
        loss = (loss_criterion(prediction, labels) * (masks != 0).float()).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.data.item())

    total_score = np.mean(losses)
    print('epoch {:d}/{:d}, training {:.4f}'.format(epoch + 1, n_epochs, total_score))
    return total_score

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
                                  edge_feat_size=10,
                                  num_layers=2,
                                  num_timesteps=2,
                                  graph_feat_size=200,
                                  output_size=1,
                                  dropout=0.2)
model = model.to('cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

loss_fn = nn.MSELoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=10 ** (-2.5), weight_decay=10 ** (-5.0),)
n_epochs = 100
epochs = []
scores = []
for e in range(n_epochs):
    score = run_a_train_epoch(n_epochs, e, model, train_loader, loss_fn, optimizer)
    epochs.append(e)
    scores.append(score)
model.eval()

OK, I built the predictive model. (Of course, the built model can be saved with torch.save(model.state_dict(), PATH).)
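
For example (the file name is just a placeholder):

PATH = 'attentivefp_solubility.pth'
torch.save(model.state_dict(), PATH)

# later: rebuild the model with the same hyperparameters, then load the weights
model.load_state_dict(torch.load(PATH))
model.eval()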

Let’s visualize molecule with atom weights!

At first, import packages for molecule visualization.

import copy
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from IPython.display import SVG
from IPython.display import display
import matplotlib
import matplotlib.cm as cm

Then define the visualization function. The following code is borrowed from the original repository, thanks a lot. The DGL model has a get_node_weight option, which returns the node weights of the graph. The model has two GRU layers, so the timestep must be 0 or 1; in the following code I used 0 as the timestep.

def drawmol(idx, dataset, timestep):
    smiles, graph, _ = dataset[idx]
    print(smiles)
    bg = dgl.batch([graph])
    atom_feats, bond_feats = bg.ndata['hv'], bg.edata['he']
    if torch.cuda.is_available():
        print('use cuda')
        bg.to(torch.device('cuda:0'))
        atom_feats = atom_feats.to('cuda:0')
        bond_feats = bond_feats.to('cuda:0')
    
    _, atom_weights = model(bg, atom_feats, bond_feats, get_node_weight=True)
    assert timestep < len(atom_weights), 'Unexpected id for the readout round'
    atom_weights = atom_weights[timestep]
    min_value = torch.min(atom_weights)
    max_value = torch.max(atom_weights)
    atom_weights = (atom_weights - min_value) / (max_value - min_value)
    
    norm = matplotlib.colors.Normalize(vmin=0, vmax=1.28)
    cmap = cm.get_cmap('bwr')
    plt_colors = cm.ScalarMappable(norm=norm, cmap=cmap)
    atom_colors = {i: plt_colors.to_rgba(atom_weights[i].data.item()) for i in range(bg.number_of_nodes())}

    mol = Chem.MolFromSmiles(smiles)
    rdDepictor.Compute2DCoords(mol)
    drawer = rdMolDraw2D.MolDraw2DSVG(280, 280)
    drawer.SetFontSize(1)

    mol = rdMolDraw2D.PrepareMolForDrawing(mol)
    drawer.DrawMolecule(mol, highlightAtoms=range(bg.number_of_nodes()),
                             highlightBonds=[],
                             highlightAtomColors=atom_colors)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    svg = svg.replace('svg:', '')
    if torch.cuda.is_available():
        atom_weights = atom_weights.to('cpu')
    return (Chem.MolFromSmiles(smiles), atom_weights.data.numpy(), svg)

Draw the test dataset molecules. The model predicts solubility, and the colors indicate each atom’s weight: red means a positive effect on solubility and blue a negative one.

target = test_loader.dataset
for i in range(len(target)):
    mol, aw, svg = drawmol(i, target, 0)
    display(SVG(svg))

Personally, I think a hydroxyl group has a positive effect on solubility, but the model shows that this is not always true. Hmm, is something wrong with my code? Or do I need to think about the details of the model some more?
I would like to try more predictive tasks and write helper code around DGL’s AttentiveFP for more convenient molecular visualization and model building.

Today’s whole code is uploaded below.
https://gist.github.com/iwatobipen/72a2d9dd616322f1f20469a152f2bb58

Any comments or suggestions will be highly appreciated. ;)

Molecular property regression with Attentive FP #RDKit #Chemoinformatics #DGL #DeepGraphLibrary

Recently, molecular graph based deep learning has been a hot area in chemoinformatics.
Some months ago, Xiong et al. published a new graph based QSAR model named ‘Attentive FP’ in J. Med. Chem.

As its name suggests, Attentive FP uses an attention mechanism in its architecture.

The authors disclosed their code. And fortunately, a recent version of DGL also provides Attentive FP!
Its repository provides an example of molecular property regression with Attentive FP. However, it was difficult for me to see how to use it on my own dataset.
So I updated DGL and tried to use Attentive FP. In the following code I used the solubility data that is provided with rdkit for my practice.

First, import several packages for deep learning. DGL has many functions for chemoinformatics tasks; users don’t need to implement the functions required for molecule-to-graph conversion themselves.

%matplotlib inline 
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDPaths

import dgl
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from dgl import model_zoo

from dgl.data.chem.utils import mol_to_complete_graph, mol_to_bigraph

from dgl.data.chem.utils import atom_type_one_hot
from dgl.data.chem.utils import atom_degree_one_hot
from dgl.data.chem.utils import atom_formal_charge
from dgl.data.chem.utils import atom_num_radical_electrons
from dgl.data.chem.utils import atom_hybridization_one_hot
from dgl.data.chem.utils import atom_total_num_H_one_hot
from dgl.data.chem.utils import one_hot_encoding
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from dgl.data.chem import ConcatFeaturizer
from dgl.data.chem import BaseAtomFeaturizer
from dgl.data.chem import BaseBondFeaturizer

from dgl.data.utils import split_dataset

from functools import partial
from sklearn.metrics import roc_auc_score

Then I defined some helper functions for the task. Almost all of the code is borrowed from the original dgl examples. Thanks for sharing the nice code!

def chirality(atom):
    try:
        return one_hot_encoding(atom.GetProp('_CIPCode'), ['R', 'S']) + \
               [atom.HasProp('_ChiralityPossible')]
    except KeyError:
        return [False, False] + [atom.HasProp('_ChiralityPossible')]
    
def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader.
    Parameters
    ----------
    data : list of 3-tuples or 4-tuples.
        Each tuple is for a single datapoint, consisting of
        a SMILES, a DGLGraph, all-task labels and optionally
        a binary mask indicating the existence of labels.
    Returns
    -------
    smiles : list
        List of smiles
    bg : BatchedDGLGraph
        Batched DGLGraphs
    labels : Tensor of dtype float32 and shape (B, T)
        Batched datapoint labels. B is len(data) and
        T is the number of total tasks.
    masks : Tensor of dtype float32 and shape (B, T)
        Batched datapoint binary mask, indicating the
        existence of labels. If binary masks are not
        provided, return a tensor with ones.
    """
    assert len(data[0]) in [3, 4], \
        'Expect the tuple to be of length 3 or 4, got {:d}'.format(len(data[0]))
    if len(data[0]) == 3:
        smiles, graphs, labels = map(list, zip(*data))
        masks = None
    else:
        smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    bg.set_n_initializer(dgl.init.zero_initializer)
    bg.set_e_initializer(dgl.init.zero_initializer)
    labels = torch.stack(labels, dim=0)
    
    if masks is None:
        masks = torch.ones(labels.shape)
    else:
        masks = torch.stack(masks, dim=0)
    return smiles, bg, labels, masks


def run_a_train_epoch(n_epochs, epoch, model, data_loader, loss_criterion, optimizer):
    model.train()
    losses = []

    for batch_id, batch_data in enumerate(data_loader):
        smiles, bg, labels, masks = batch_data
        if torch.cuda.is_available():
            bg.to(torch.device('cuda:0'))
            labels = labels.to('cuda:0')
            masks = masks.to('cuda:0')

        prediction = model(bg, bg.ndata['hv'], bg.edata['he'])
        # mask out datapoints without labels before averaging the loss
        loss = (loss_criterion(prediction, labels) * (masks != 0).float()).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.data.item())

    total_score = np.mean(losses)
    print('epoch {:d}/{:d}, training {:.4f}'.format(epoch + 1, n_epochs, total_score))
    return total_score

After that, I defined the atom and bond featurizer functions. Their settings are the same as in the original repository, but it is easy to modify the featurizers.

atom_featurizer = BaseAtomFeaturizer(
                 {'hv': ConcatFeaturizer([
                  partial(atom_type_one_hot, allowable_set=[
                          'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At'],
                    encode_unknown=True),
                  partial(atom_degree_one_hot, allowable_set=list(range(6))),
                  atom_formal_charge, atom_num_radical_electrons,
                  partial(atom_hybridization_one_hot, encode_unknown=True),
                  lambda atom: [0], # A placeholder for aromatic information,
                    atom_total_num_H_one_hot, chirality
                 ],
                )})
bond_featurizer = BaseBondFeaturizer({
                                     'he': lambda bond: [0 for _ in range(10)]
    })
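
As a quick sanity check, the concatenated atom feature size can be read back with feat_size (the same method used in my GCN post); it should match the node_feat_size passed to the model below:

# this should print 39, matching node_feat_size=39 for AttentiveFP
print(atom_featurizer.feat_size('hv'))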

If you would like to use the same featurizers as DeepChem, you can use CanonicalAtomFeaturizer / CanonicalBondFeaturizer.
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalAtomFeaturizer.html
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalBondFeaturizer.html

DGL seems friendly for chemoinformaticians, I think.

OK, let’s load the dataset. The mol_to_bigraph method with featurizers converts an rdkit mol object to a graph object. Also, the smiles_to_bigraph method can convert SMILES directly to a graph! Cool ;)

train=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

# materialize the suppliers as lists so they can be iterated more than once
train_mols = [m for m in Chem.SDMolSupplier(train) if m is not None]
train_smi = [Chem.MolToSmiles(m) for m in train_mols]
train_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in train_mols]).reshape(-1,1)

test_mols = [m for m in Chem.SDMolSupplier(test) if m is not None]
test_smi = [Chem.MolToSmiles(m) for m in test_mols]
test_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in test_mols]).reshape(-1,1)


train_graph =[mol_to_bigraph(mol,
                           atom_featurizer=atom_featurizer, 
                           bond_featurizer=bond_featurizer) for mol in train_mols]

test_graph =[mol_to_bigraph(mol,
                           atom_featurizer=atom_featurizer, 
                           bond_featurizer=bond_featurizer) for mol in test_mols]
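
The SMILES route mentioned above looks like this (a small sketch; I assume smiles_to_bigraph accepts the same featurizer arguments as mol_to_bigraph, and the phenol SMILES is just a placeholder):

from dgl.data.chem.utils import smiles_to_bigraph

g = smiles_to_bigraph('c1ccccc1O',
                      atom_featurizer=atom_featurizer,
                      bond_featurizer=bond_featurizer)
print(g)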

The AttentiveFP model is provided in model_zoo. Then define the dataloaders for training and test.

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
                                  edge_feat_size=10,
                                  num_layers=2,
                                  num_timesteps=2,
                                  graph_feat_size=200,
                                  output_size=1,
                                  dropout=0.2)
model = model.to('cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

DataLoader is a native PyTorch class. It generates an iterator over batches of the dataset, for example (a quick sketch using the collate function defined above):
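
smiles, bg, labels, masks = next(iter(train_loader))
print(len(smiles))    # up to 128 SMILES strings
print(bg)             # one batched DGLGraph for the whole batch
print(labels.shape)   # (batch_size, 1) solubility values
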
Now we’re almost there! Let’s move on to the learning process.

loss_fn = nn.MSELoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=10 ** (-2.5), weight_decay=10 ** (-5.0),)
n_epochs = 100
epochs = []
scores = []
for e in range(n_epochs):
    score = run_a_train_epoch(n_epochs, e, model, train_loader, loss_fn, optimizer)
    epochs.append(e)
    scores.append(score)

>>> output is below.
epoch 1/100, training 8.8096
----snip---
epoch 98/100, training 0.3706
epoch 99/100, training 0.3915
epoch 100/100, training 0.3003

plt.plot(epochs, scores)

It seems that the learning process goes well ;).

OK let’s validate the model!

model.eval()
all_pred = []
for test_data in test_loader:
    smi_lst, bg, labels, masks = test_data
    if torch.cuda.is_available():
        bg.to(torch.device('cuda:0'))
        labels = labels.to('cuda:0')
        masks = masks.to('cuda:0')
    pred = model(bg, bg.ndata['hv'], bg.edata['he'])
    all_pred.append(pred.data.cpu().numpy())
res = np.vstack(all_pred)
plt.clf()
plt.scatter(res, test_sol)
plt.xlabel('pred')
plt.ylabel('exp')
from sklearn.metrics import r2_score
print(r2_score(test_sol, res))
> 0.9098691301661277

Let’s compare to RandomForest.

from sklearn.ensemble import RandomForestRegressor
from rdkit import Chem
from rdkit.Chem import AllChem
train_fp = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in train_mols]
test_fp = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in test_mols]
# make RF regressor and train it (ravel the target to a 1d array for sklearn)
rfr = RandomForestRegressor()
rfr.fit(train_fp, train_sol.numpy().ravel())

Check the performance.

rfr_pred = rfr.predict(test_fp)
print(r2_score(test_sol, rfr_pred))
plt.clf()
plt.scatter(rfr_pred, test_sol)

The AttentiveFP model showed high performance for solubility prediction in this case. (My RandomForest code is not optimized, though.) The DGL example code is very useful for DGL beginners, but it was difficult to apply to my own dataset, so I needed to rewrite the code around my data.

Anyway, I would like to buy a beer for the DGL developers. DGL is a very nice package for chemoinformatics and for ‘RDKitds’. RDKitds is a new nickname for rdkit users, proposed at the RDKit UGM 2019 ;)
Today’s code is below.

my whole code.

Thanks,