Conformal prediction with python and rdkit_2 #RDKit #QSAR #Conformal_prediction

I posted about conformal prediction with python and rdkit some days ago. After that I could get very informative advice from @kjelljorner. Thanks a lot! His advice was below.


Kjell Jorner @kjelljorner3dReplying to @iwatobipen I can recommend the cross conformal prediction or bootstrapped conformal prediction (also in nonconformist) to avoid having to put aside data for calibration. There is also a related method called the jackknife+ which can be applied to any method:

Fortunately nonconformist has some implementations of ACP. So I tried to use acp module with same dataset.

For convenience, shared my gist link below instead of pasting code on the blog post.

In the post, I tested AggregatedCp and BootstrapConfomralClassifier. These methods internally make train and calibration set with defined sampling method. In the default setting, aggregatedCp uses bootstrapsampler. So both method is almost same I think. And results are almost same.

By using acp method, used don’t need to separate train data to train and calibration data before training. It seems natural and useful way to use conformal prediction against chemoinformatics problems.

BTW, in the these area, there are many skillful people in the world and they have an open mind. I feel writing the post is useful for me because it keeps my idea, tips code(this is main reason for keeping the blog) but also gives opportunity for having a discussion about chemoinformatics and drug discovery. ;)

Compare the view point of different QSAR models #RDKit #visualize #chemoinformatics

Some days ago, I posted how to visualize SVG images horizontally and ngboost for QSAR problem. It worked well. And I found that different models showed different performance. So my question is that which point each model detects important for molecular properties.

Fortunately rdkit has GetSimilarityMapForModel method which can render the probe molecule with model’s feature importance.

Today I tried to combine previous method and this useful rdkit method to compare the models feature. At first import required packages.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
import numpy as np
import io
from PIL import Image
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from ngboost import NGBClassifier
from ngboost.ngboost import NGBoost
from ngboost.learners import default_tree_learner
from ngboost.distns import Normal, LogNormal
from ngboost.distns import k_categorical
from sklearn.metrics import classification_report
from functools import partial
from rdkit.Chem import Draw
from IPython.display import SVG

Then read train and test data, calculate fingerprint and get target values.

train = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
train_mols = [m for m in Chem.SDMolSupplier(train)]
test = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
test_mols = [m for m in Chem.SDMolSupplier(test)]

val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}


def mol2arr(mol, radi=2, nBits=1024):
    arr = np.zeros((1,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radi, nBits)
    DataStructs.ConvertToNumpyArray(fp, arr)

    return arr

trainX = np.array([mol2arr(m) for m in train_mols])
trainY = np.array([val_dict[m.GetProp('SOL_classification')] for m in train_mols])

testX = np.array([mol2arr(m) for m in test_mols])
testY = np.array([val_dict[m.GetProp('SOL_classification')] for m in test_mols])

Next, define classifier models. And train them.

adbcls = AdaBoostClassifier()
svc = SVC(probability=True)
ngbc = NGBClassifier(Dist=k_categorical(3))
clsset = [adbcls, svc, ngbc]
for clsfier in clsset:
    clsfier.fit(trainX, trainY)

Then I made subgroup which is grouped by solubility class.

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']

Then I defined drawmol function, the function makes SVG text of similaritymap for model. I passed MolDraw2DSVG to draw2d options, it means that GetSimilarityMapForModel draw the output to SVG drawer. And defined HorizontalDisplay class as same as previous post.

def drawmol(mols, idx, predictfunc):
    d = Draw.MolDraw2DSVG(200,250)
    fig, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, predictfunc.predict_proba),
                                                           draw2d = d,
                                                           colorMap='coolwarm', #this option wasn't applied why?? 
                                                           size=(100,100),
                                                           contourLines=0,
                                                           step=0.001,
                                                           alpha=0.2,
                                                           )
    d.FinishDrawing()
    txt = d.GetDrawingText()
    return txt

class HorizontalDisplay:
    def __init__(self, svgtxtlist):
        self.svgtxtlist = svgtxtlist

    def _repr_html_(self):
        template = '<div style="float:left;padding:10px;">{0}</div>'
        return "\n".join(template.format(svgtxt)
                         for svgtxt in self.svgtxtlist)

Now ready. Let’s display molecules with these methods.

Following example are selected from high/low test molecules.

#Adaboost, SVC, NGBOOST
svgset = []
for cls in clsset:
svgset.append(drawmol(high_test_mols, 3, cls))
HorizontalDisplay(svgset)

Adaboost classifier strongly responded tert-methyl group compared to other models. Ngboost classifier says methyl groups are not good for solubility.

Next example, Adaboost couldn’t phenol OH groups are positive for solubility but SVC and NGBoost said OH is important for solubility. But these two modes showed different response to aromatic carbon.

By using the method, user can reveal that where is important point for the model. So it will be good material for discussion with chemists I think.

But the method is open to some debate. Coloring, normalize atom features, etc…

QSAR model should not be used as just a black box which return positive or negative, it should be used for discussion and good decision making.

Today’s code can be found following URL. Thanks for reading.

https://github.com/iwatobipen/playground/blob/master/ngboost2.ipynb

Draw molecules as SVG in horizontal layout #Drawing #RDKit #memo

As you know, Greg posted cool code about new drawing code options of rdkit 202003. You can read details of them in following URL
http://rdkit.blogspot.com/2020/04/new-drawing-options-in-202003-release.html

It’s really cool! New version of rdkit can render molecule with many options in high quarity.

In the post, molecules are rendered as SVG image one molecule per one cell.

I would like to draw molecules in one cells like Draw.MolsToGridImage.

So I traced the blog post and try to render molecules in one cell. I found good solution to do it, the strategy is that get svg text of molecules and embed them in to the div tag and then draw it as HTML.

OK let’s try it. Following code is almost same but example molecules are different. Defined HorizontalDisplay class for displaying multiple molecules at later.

import os
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdDepictor
import rdkit
from IPython.display import SVG

from rdkit.Chem import ChemicalFeatures
fdef = os.path.join(RDConfig.RDContribDir, 'M_Kossner/BaseFeatures_DIP2_NoMicrospecies.fdef')
ffact = ChemicalFeatures.BuildFeatureFactory(fdef)
rdDepictor.SetPreferCoordGen(True)

# the code is bollowed from following url.
# https://bit.ly/3aTUBCU
class HorizontalDisplay:
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        template = '<div data-carousel-extra='{"blog_id":39198697,"permalink":"https:\/\/iwatobipen.wordpress.com\/2020\/05\/01\/draw-molecules-as-svg-in-horizontal-layout-drawing-rdkit-memo\/"}' style="float:left;padding:10px;">{0}</div>'
        return "\n".join(template.format(arg)
                         for arg in self.args)

Then draw molecules one by one and get SVGdrawing text of each drawing setting. First two codes are draw molecule with / without atom indices. And 3rd, 4th codes are draw molecules with atom features.

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.DrawMolecule(mol1)
d2d.FinishDrawing()
text1 = d2d.GetDrawingText()

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.drawOptions().addAtomIndices = True
d2d.DrawMolecule(mol1)
d2d.FinishDrawing()
text2 = d2d.GetDrawingText()

from collections import defaultdict
feats = ffact.GetFeaturesForMol(mol1)
colors = {'SingleAtomDonor':(1,0.5,0.5),
          'SingleAtomAcceptor':(0.5,0.5,1),
          'BasicGroup':(0,1,1),
          'Arom6':(1,0.8,0.5),
          'Hphobe':(0.7,0.7,0.3)}

atomHighlighs = defaultdict(list)
highlightRads = {}
for feat in feats:
    if feat.GetType() in colors:
        clr = colors[feat.GetType()]
        for aid in feat.GetAtomIds():
            atomHighlighs[aid].append(clr)
            highlightRads[aid] = 0.4

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
d2d.DrawMoleculeWithHighlights(mol1, '', dict(atomHighlighs), {}, highlightRads, {})
d2d.drawOptions().addAtomIndices = True
d2d.FinishDrawing()
text3 = d2d.GetDrawingText()

d2d = rdMolDraw2D.MolDraw2DSVG(250, 200)
dos = d2d.drawOptions()
dos.atomHighlightsAreCricle = True
dos.fillHighlights = False
d2d.DrawMoleculeWithHighlights(mol1, '', dict(atomHighlighs), {}, highlightRads, {})
d2d.drawOptions().addAtomIndices = True
d2d.FinishDrawing()
text4 = d2d.GetDrawingText()

Now I had 4 svg texts. Finally call HorizontalDisplay.

HorizontalDisplay(text1, text2, text3, text4)

Now I could get multiple molecules in one cell ;).

It can’t control the number of the molecules in one row, so it seems there is room for improvement. Any comments and suggestions will be greatly appreciated.

I uploaded the code my github repo and gist. And hope reader will have a nice weekend. (gist couldn’t draw molecules per row)

https://github.com/iwatobipen/playground/blob/master/Horizontal_display_SVGmols.ipynb

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/Horizontal_display_SVGmols.ipynb

One liner command tool for LillyMedChemRules #Chemoinformatics #memo

There are many substructure files are available in these days. And LillyMedChem Rules is one of useful and famous filter. It works very fast and provides reasonable results.

However the implementation returns the result as multiple files. So user need to marge files after filtration.

So I wrote small script to conduct filter the molecules and marge all generated files.

I uploaded the code to my repository. ;)
https://github.com/iwatobipen/lillymcrules

To use the code, user need to install LillyMedChemRules at first. And modify the FILTERPATH of the lillymcf.py to where you installed the filter.

Then install my code with pip.

$ git clone https://github.com/iwatobipen/lillymcrules.git
$ cd lillymcrules
# edit lillymcf.py
$ pip install .

Now ready to use the script.

After installing the code, ‘lillyfilter’ command will be available.

To use the script, it is very simple. Let me show….

tests$ ls
example_molecules.smi
tests$ lillyfilter example_molecules.smi 
Done
tests$ ls
bad0.smi  bad1.smi  bad2.smi  bad3.smi  example_molecules.smi  marged.csv  ok0.log  ok1.log  ok2.log  ok3.log  ok.smi

The script call Lilly_Medchem_Rules.rb script of original code so most of output file is same. But marged.csv is generated additionally.

The file has all molecules which are generated by the filter.

tests$ head -n 5 marged.csv 
ClC1=CC=CC=C1C=CC1=NC2=CC=CC3=C2C(=CC=C3)N1C,PBCHM1719983,passed,
S(C)C(=N(=O)C)CC,PBCHM6386587,passed,
ON(N(CCNCC)CC)N=O,PBCHM4516,passed,
S(C)CCC(=O)C(=CO)O,PBCHM562,passed,
ClCCN(CN1NC(=O)C=CC1=O)CCCl,PBCHM260812,passed,

tests$ tail -n5 marged.csv 
C1C2=C[C+](CC(=CC1=CC=CC=C2)C=C)C=CC,PBCHM2751984,bad0,TP1 no_interesting_atoms
FC(F)(C(F)(F)C(F)(F)F)C(F)(F)F,PBCHM9638,bad0,TP1 no_interesting_atoms
OC(CO)C=O,PBCHM751,bad0,TP1 not_enough_atoms
C(CCC)C1=CC(=C1)C,PBCHM11832205,bad0,TP1 no_interesting_atoms
S(=O)(=O)(O)OC1C(OC(=O)CC(C)C)C(OC2CC3(C)C4C5(CC(CC4)C(=C)C5O)CCC3C(C2)C(=O)O)OC(C1OS(=O)(=O)O)CO,PBCHM2255,bad0,TP1 too_many_atoms

The file has molid, smiles and description and ready to use rdkit and pandas.

By using click package, it is easy to develop command line tools. ;)

Any comments, suggestions Pull requests will be greatly appreciated.

Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometime chemists analyze molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition RECAP, BRICS and rdMMPA. It’s useful functions but these functions can’t extract directly extract core and linker from molecule.

I had interested how to do it and tried it.

Following code, Core is defined Rings in the molecule and Linker is defined chain which connects two rings.

I defined main function named ‘getLinkerbond’ which extracts bonds that connects atom in ring and atom not in ring.

Then call Chem.FragmentOnBonds function to get fragments by cutting linker bond.

Let’s see the code. First, import some libraries. And define test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define main function. I wrote helper function for the main function. is_in_samering function check the atoms which construct a bond are in same ring or not.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

Following code, I set atom prop at first to keep original atom index. Because GetScaffoldForMol function returns new molecule with new atom index. I would like to use original atom idex in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
        
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
            ring_bonds.add(idx)
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
    return res

Check the function.


for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

The function cut murckoscaffold structure as default. So bonds not connect rings is not cut.

Next, case of useScaffold False. In this case all bonds outside of rings are cut.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

Works fine.

It is useful for molecular decomposition at linker site I think.

Today’s whole code is uploaded my gist. Any comments and advices will be high appreciated. ;)

New trial of AttentiveFP with new atom feature #DGL #RDKit #Chemoinformatics

Recently I posted an example of AttentiveFP and I found that atom weights doesn’t directly reflect functional groups. And I could get useful suggestion via comment from DGL developper!

And I wonder that how about to use functional group feature to train the model.

But how can I detect functional groups in the molecule? Because functional group is human defined feature.

…. Fortunately, as you know! RDKit has useful function to extract functional group automatically!

Original article is below.
An algorithm to identify functional groups in organic molecules Peter Ertl https://jcheminf.springeropen.com/articles/10.1186/s13321-017-0225-z

And the implementation was found in following URL.
https://github.com/rdkit/rdkit/tree/master/Contrib/IFG

So I used the function to define new atom featurizer. The code is below. The util function can detect functional group of molecule and add the type to atom property. It can use for atom featurizer.

#ifgutil.py
import sys
import os
from rdkit import Chem
from rdkit import RDPaths
from collections import defaultdict
from dgl.data.chem.utils import one_hot_encoding

ifg_path = os.path.join(RDPaths.RDContribDir, 'IFG')
sys.path.append(ifg_path)

import ifg


def map_fgs(mol):
    atoms = list(mol.GetAtoms())
    for atom in atoms:
        atom.SetProp("IFG_TYPE", "")
    fgs = ifg.identify_functional_groups(mol)
    for fg in fgs:
        for atmid in fg.atomIds:
            atom = mol.GetAtomWithIdx(atmid)
            atom.SetProp('IFG_TYPE', fg.type)
    return mol

def make_ifg_list(mols):
    res = set()
    for mol in mols:
        for atom in mol.GetAtoms():
            ifg = atom.GetProp('IFG_TYPE')
            res.add(ifg)
    return list(res)

def atom_ifg_one_hot(atom, allowable_set=None, encode_unknown=False):
    if allowable_set is None:
        raise Exception
    try:
        ifg = atom.GetProp('IFG_TYPE')
    except:
        print('get IFG TYPE at First')
        
    return one_hot_encoding(ifg, allowable_set, encode_unknown=encode_unknown)

And I used the featurize for AttentiveFP training.

Whole code is uploaded to my gist. ;)

AttentiveFP with IFG.

In this case, atom weights does not reflect functional group but seems model can catch up some feature of functional group I think.

AttentiveFP uses GRU so learning process is complex. I would like to apply the featurizer more simple algorithm such as GCN.

Molecular property regression with Attentive FP #RDKit #Chemoinformatics #DGL #DeepGraphLibrary

Recently Molecular Graph based deep learning is hot are in chemoinformatics.
Some months ago, Zhaoping et al. published new graph based QSAR model named ‘Attentive FP’ in JMC.

As its name suggests, Attentive FP uses attention mechanism for its architecture.

The authors disclosed their code. And fortunately, recent version of DGL is also Attentive FP!
Its repository provides an example of molecular property regression with attentive fp. However it is difficult to understand if I would like to use the FP against my dataset.
So I updated DGL and tried to use attentive FP. In the following code I used solubility data that is provided from rdkit for my practice.

First, import several packages for deep learning. DGL has many function for chemoinformatics task. Used doesn’t need implement functions which are required for chemo to graph conversion.

%matplotlib inline 
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDPaths

import dgl
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from dgl import model_zoo

from dgl.data.chem.utils import mol_to_complete_graph, mol_to_bigraph

from dgl.data.chem.utils import atom_type_one_hot
from dgl.data.chem.utils import atom_degree_one_hot
from dgl.data.chem.utils import atom_formal_charge
from dgl.data.chem.utils import atom_num_radical_electrons
from dgl.data.chem.utils import atom_hybridization_one_hot
from dgl.data.chem.utils import atom_total_num_H_one_hot
from dgl.data.chem.utils import one_hot_encoding
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from dgl.data.chem import ConcatFeaturizer
from dgl.data.chem import BaseAtomFeaturizer
from dgl.data.chem import BaseBondFeaturizer

from dgl.data.chem import one_hot_encoding
from dgl.data.utils import split_dataset

from functools import partial
from sklearn.metrics import roc_auc_score

Then I defined some helper function for the task. Almost of the codes are borrowed from original dgl/example. Thanks for sharing the nice code!

def chirality(atom):
    try:
        return one_hot_encoding(atom.GetProp('_CIPCode'), ['R', 'S']) + \
               [atom.HasProp('_ChiralityPossible')]
    except:
        return [False, False] + [atom.HasProp('_ChiralityPossible')]
    
def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader.
    Parameters
    ----------
    data : list of 3-tuples or 4-tuples.
        Each tuple is for a single datapoint, consisting of
        a SMILES, a DGLGraph, all-task labels and optionally
        a binary mask indicating the existence of labels.
    Returns
    -------
    smiles : list
        List of smiles
    bg : BatchedDGLGraph
        Batched DGLGraphs
    labels : Tensor of dtype float32 and shape (B, T)
        Batched datapoint labels. B is len(data) and
        T is the number of total tasks.
    masks : Tensor of dtype float32 and shape (B, T)
        Batched datapoint binary mask, indicating the
        existence of labels. If binary masks are not
        provided, return a tensor with ones.
    """
    assert len(data[0]) in [3, 4], \
        'Expect the tuple to be of length 3 or 4, got {:d}'.format(len(data[0]))
    if len(data[0]) == 3:
        smiles, graphs, labels = map(list, zip(*data))
        masks = None
    else:
        smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    bg.set_n_initializer(dgl.init.zero_initializer)
    bg.set_e_initializer(dgl.init.zero_initializer)
    labels = torch.stack(labels, dim=0)
    
    if masks is None:
        masks = torch.ones(labels.shape)
    else:
        masks = torch.stack(masks, dim=0)
    return smiles, bg, labels, masks


def run_a_train_epoch(n_epochs, epoch, model, data_loader,loss_criterion, optimizer):
    model.train()
    total_loss = 0
    losses = []
    
    for batch_id, batch_data in enumerate(data_loader):
        batch_data
        smiles, bg, labels, masks = batch_data
        if torch.cuda.is_available():
            bg.to(torch.device('cuda:0'))
            labels = labels.to('cuda:0')
            masks = masks.to('cuda:0')
        
        prediction = model(bg, bg.ndata['hv'], bg.edata['he'])
        loss = (loss_criterion(prediction, labels)*(masks != 0).float()).mean()
        #loss = loss_criterion(prediction, labels)
        #print(loss.shape)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        losses.append(loss.data.item())
        

    total_score = np.mean(losses)
    print('epoch {:d}/{:d}, training {:.4f}'.format( epoch + 1, n_epochs,  total_score))
    return total_score

After that, I defined atom and bond featurizer functions. Their settings are same as original repository but it is easy to modify the featurizer.

atom_featurizer = BaseAtomFeaturizer(
                 {'hv': ConcatFeaturizer([
                  partial(atom_type_one_hot, allowable_set=[
                          'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At'],
                    encode_unknown=True),
                  partial(atom_degree_one_hot, allowable_set=list(range(6))),
                  atom_formal_charge, atom_num_radical_electrons,
                  partial(atom_hybridization_one_hot, encode_unknown=True),
                  lambda atom: [0], # A placeholder for aromatic information,
                    atom_total_num_H_one_hot, chirality
                 ],
                )})
bond_featurizer = BaseBondFeaturizer({
                                     'he': lambda bond: [0 for _ in range(10)]
    })

If you would like to the featurizer as same as DeepChem, you can use CanonicalAtom/BondFeaturizer.
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalAtomFeaturizer.html
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalBondFeaturizer.html

DGL seems friendly for chemoinformatitian I think.

OK, let’s load dataset. mol_to_bigraph method with featurizer converts rdkit mol object to graph object. Also, smiles_to_bigraph method can convert smiles to graph! Cool ;)

train=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = Chem.SDMolSupplier(train)
train_smi =[Chem.MolToSmiles(m) for m in train_mols]
train_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in train_mols]).reshape(-1,1)

test_mols =  Chem.SDMolSupplier(test)
test_smi = [Chem.MolToSmiles(m) for m in test_mols]
test_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in test_mols]).reshape(-1,1)


train_graph =[mol_to_bigraph(mol,
                           atom_featurizer=atom_featurizer, 
                           bond_featurizer=bond_featurizer) for mol in train_mols]

test_graph =[mol_to_bigraph(mol,
                           atom_featurizer=atom_featurizer, 
                           bond_featurizer=bond_featurizer) for mol in test_mols]

AttentivFp model is provided from model_zoo. And define dataloader for training and test.

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
                                  edge_feat_size=10,
                                  num_layers=2,
                                  num_timesteps=2,
                                  graph_feat_size=200,
                                  output_size=1,
                                  dropout=0.2)
model = model.to('cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
edge_feat_size=10,
num_layers=2,
num_timesteps=2,
graph_feat_size=200,
output_size=1,
dropout=0.2)
model = model.to('cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

Dataloader is pytorch native class. It generates iterator of butch of dataset.
Now almost there! Let’s go to learning process.

loss_fn = nn.MSELoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=10 ** (-2.5), weight_decay=10 ** (-5.0),)
n_epochs = 100
epochs = []
scores = []
for e in range(n_epochs):
    score = run_a_train_epoch(n_epochs, e, model, train_loader, loss_fn, optimizer)
    epochs.append(e)
    scores.append(score)

>>>output is below.
epoch 1/100, training 8.8096
----snip---
epoch 98/100, training 0.3706
epoch 99/100, training 0.3915
epoch 100/100, training 0.3003
plt.plot(epochs, scores)

It seems that learning process goes well ;).

OK let’s validate the model!

model.eval()
all_pred = []
for test_data in test_loader:
    smi_lst, bg, labels, masks = test_data
    if torch.cuda.is_available():
            bg.to(torch.device('cuda:0'))
            labels = labels.to('cuda:0')
            masks = masks.to('cuda:0')
    pred = model(bg, bg.ndata['hv'], bg.edata['he'])
    all_pred.append(pred.data.cpu().numpy())
res = np.vstack(all_pred)
plt.clf()
plt.scatter(res, test_sol)
plt.xlabel('pred')
plt.ylabel('exp')
from sklearn.metrics import r2_score
print(r2_score(test_sol, res))
> 0.9098691301661277

Let’s compare to RandomForest.

from sklearn.ensemble import RandomForestRegressor
from rdkit import Chem
from rdkit.Chem import AllChem
train_fp = [AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in train_mols]
test_fp = [AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in test_mols]
# make RF regressor and train it.
rfr = RandomForestRegressor()
rfr.fit(train_fp, train_sol)

Check the performance.

rfr_pred = rfr.predict(test_fp)
r2_score(test_sol, rfr_pred)
plt.clf()
plt.scatter(rfr_pred, test_sol)

AttentiveFP model showed high performance for solubility prediction in this case.(The my code of RandomForest is not optimized.) DGL example code is very useful for beginner of DGL but it is difficult to apply to my own dataset. So I need to rewrite the code with my dataset.

Any way, I would like to buy Beer to DGL developper. DGL is very nice package for chemoinformatics and ‘RDKitds’. RDKits is new nickname of rdkit user, it is proposed in RDKit UGM 2019 ;)
Today’s code is below.

my whole code.

Thanks,