Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometime chemists analyze molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition RECAP, BRICS and rdMMPA. It’s useful functions but these functions can’t extract directly extract core and linker from molecule.

I had interested how to do it and tried it.

Following code, Core is defined Rings in the molecule and Linker is defined chain which connects two rings.

I defined main function named ‘getLinkerbond’ which extracts bonds that connects atom in ring and atom not in ring.

Then call Chem.FragmentOnBonds function to get fragments by cutting linker bond.

Let’s see the code. First, import some libraries. And define test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define main function. I wrote helper function for the main function. is_in_samering function check the atoms which construct a bond are in same ring or not.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

Following code, I set atom prop at first to keep original atom index. Because GetScaffoldForMol function returns new molecule with new atom index. I would like to use original atom idex in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
    return res

Check the function.

for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

The function cut murckoscaffold structure as default. So bonds not connect rings is not cut.

Next, case of useScaffold False. In this case all bonds outside of rings are cut.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

Works fine.

It is useful for molecular decomposition at linker site I think.

Today’s whole code is uploaded my gist. Any comments and advices will be high appreciated. ;)

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
def show_png(data):
    bio = io.BytesIO(data)
    img =
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794), trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

Any comments suggestions and requests will be greatly appreciated. ;)


New visualization way for SAR #RDKit #chemoinformatics

Recently I found very cool visualization code for SAR that is provided by @simonduerr. It seems cool!

You can get the code from his repository.

I’ve never seen such a cool visualization. Current version of the package can fix the position of the rgroup automatically.

So I tried to use the code for my own dataset.

I got SAR data from ChEMBL, its target is MPS1. At first import packages and load data.

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import PandasTools
from rdkit.Chem import Draw
from radialscope import RadialScope as rs
from radialscope import draw_with_indeces
import matplotlib.pyplot as plt
from import ScalarMappable
import svgutils.compose as sc
from IPython.display import SVG, HTML
import numpy as np
import pandas as pd
df = pd.read_csv('./data/mps.csv', sep=';')

Then define core and template for data extraction and visualization. Next retrieve molecules and R group information which has defined core structure.

core = Chem.MolFromSmiles('c2cn(C1CCCCC1)c3ncncc23')
templ =  Chem.MolFromSmiles('c2cn(C1CCCCC1)c3nc(C)ncc23')
matchmols = []
AlogP = []
std_val = []
for idx, mol in enumerate(mols):
    if mol.HasSubstructMatch(core):
        pIC50 = np.round(9-np.log10(df['Standard Value'][idx]),2)

rg = []
for mol in matchmols:
    sc = Chem.ReplaceCore(mol,core)
    sc = Chem.MolToSmiles(sc).replace('[1*]', '~')

Then, define drawing settings, I defined white_cutoff to zero because I would like to parameter with white.

settings={ 'SMILESSTRING':Chem.MolToSmiles(templ),  # use methyl groups for your rests, smiles strings can't use labels
  'use_bw_atom_theme':False,  # draw all atoms black
 'use_bold_font': True, # replace all fonts wherever possible with bold text
 'white_cutoff': 0,  #make text of labels white if greater than this number
  'scalefactor':0.7  #scale the total plot in the end, for large molecule you will need to decrease this or make the viewbox bigger in a vector software such as Inkscape

Next, define R group information. I used pIC50 and AlogP as SAR information. And other setting is same as original example.

Attach_atom_id is key parameter, it defines where SAR information place.

atom index 0 is dummy methyl group of the template.

radial_scope_setup_R1= {
    'rest_label':"R$_1$", # Label of the atom in the radial scope plot
    'no_wedges':len(rg), # Number of wedges in the Radial scope Plot
    'coverangle_wedges':180, # Degrees of a full circle the Rscope should cover
    'startangle':-30, #Start angle of the Rscope 
    'CMAPINNER':"Blues",  # Colormap of the inner circle. Max value has full color, options see below
    'CMAPOUTER':"RdPu", # Colormap of the outer circle. Max value has full color
    'OUTERLABEL':"pIC50", # Label of the outer circle
    'INNERLABEL':"AlogP", # Label of the inner circle
    'rounding':True, # ENABLE rounding if you want something like >99 
    'rounding_boundary':10,  # cutoff for displaying > 
    'value_groups':rg, # Labels for the outer circle, you can use math using $, any ~ will be interpreted as smiles string
    'attach_atom_id': 0,

Now almost there, type following code.

scope_plot=rs(settings, radial_scope_setup_R1) # add all radial_scope dictionaries to this call

It works fine!

The package can visualize multiple substituents information. However I think it should be visualize one R group information per image. Because this way is difficult to understand combination of R groups.

I don’t understand the details of the code (it’s too hacky for me) but really useful for MedChem I think.

Thanks again for sharing the nice coode.

Today’s code was uploaded my gist and github repo.

Clustering molecules with HDBSCAN and predict chemical space with scikit-learn #chemoinformatics #RDKit

The contents of the post is almost same as yesterday’s one that was for souyaku-advent calendar 2019.

In drug discovery area we often discuss about compound chemical space. Chemist should be explore chemical space to find new IP space etc. It is defined with descriptors, fingerprint and biological response etc… It is often high dimensional space so we would like to reduce the dimension with some mathematical approach such as PCA, TSNE and UMAP.

And also we use clustering method for grouping molecules.

There are many clustering method for example k-means, hierarchical clustering. However k-means need to define k value and hierarchical method need to define threshold for doing that. So try and error is required. If you would like to get clustering label conveniently, I recommend to try to use HDBSCAN. HDBSCAN clusters data with their density. Ref URL is below.

Fortunately we can use HDBSCAN from python. It is easy to install via pip or conda. Please check original document to install. ;)

I tried to demo with HDBSCAN. Following data is bollowed from chaper8 in py4chemoinformatics (Thanks@fmkz___ for providing data and code!)

Let’s write code! Import packages at first. I’ll use mlinsght later.

%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit import RDLogger
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import TSNE
from mlinsights.mlmodel import PredictableTSNE
from hdbscan import HDBSCAN

plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}
seed = 794

I defined tanimoto-dis tfunction for HDBSCAN. HDBSCAN requires metric arg. for measuring distances of each data point. And there isn’t tanimoto distance as a default metric. If user would like to original metric function, pass ‘pyfunc’ for metric arugment and your own function is passed to func argument. Func should be took two numpy array and return distance.
My tanimoto dist function is tooooooooooooooooooooooooo slow it has room for improvement I fill improve it later. ;)

oxrs = [("CHEMBL3098111", "Merck" ),  ("CHEMBL3867477", "Merck" ),  ("CHEMBL2380240", "Rottapharm" ),
             ("CHEMBL3352684", "Merck" ),  ("CHEMBL3769367", "Merck" ),  ("CHEMBL3526050", "Actelion" ),
             ("CHEMBL3112474", "Actelion" ),  ("CHEMBL3739366", "Heptares" ),  ("CHEMBL3739395", "Actelion" ), 
             ("CHEMBL3351489", "Eisai" )]

fps = []
docs = []
companies = []
mol_list = []
for cid, company in oxrs:
    sdf_file = os.path.join("data", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
fps = np.array(fps)
companies = np.array(companies)def tanimoto_dist(ar1, ar2):
    a =, ar2)
    b = ar1 + ar2 - ar1*ar2
    return 1 - a/np.sum(b)

clusterer = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
docs = np.array(docs)

trainIDX, testIDX = train_test_split(range(len(fps)), random_state=seed)

Now ready, let’s check data without and with HDBSCAN. I made two plots one is coloured with company name and the other is coloured with HDBCAN label.

tsne = TSNE(random_state=seed)
res = tsne.fit_transform(fps)
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=companies, **plot_kwds)

plt.figure(figsize=(12, 6))
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                 if col >= 0 else (0.5, 0.5, 0.5) for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(res[:,0], res[:,1], c=cluster_colors, **plot_kwds)
coloured with company name
coloured with hdbscan label.

Coloured with Gray color point is not labelled data. The result shows HDBSCAN could do almost reasonable labelling in this case.

Plot new compound with defined chemical space

Next topic is how to plot new compound in pre-defined chemical space.
Theoretically, PCA and tSNE can’t add new data to pre-defined latent space. Re calculation is required if user would like to add new data. But we would like to map new designed compounds to current chemical space. Hmm…. how to do it?

The answer for the question is provided from mlinghts. The package has PredictableTSNE class. The class take transformer and estimator object. And it makes predictive model for transformer output (such as PCA, TSNE output) with estimator. It will be clear if you read source code. After learning, PredictableTSNE can predict latent space from data features.

Following code, I used random sampling and random forest and GP regressor for estimator.

trainFP = [fps[i] for i in trainIDX]
train_mol = [mol_list[i] for i in trainIDX]

testFP = [fps[i] for i in testIDX]
test_mol = [mol_list[i] for i in testIDX]
allFP = trainFP + testFP
tsne_ref = TSNE(random_state=seed)
res = tsne_ref.fit_transform(allFP)
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=['train' for i in range(len(trainFP))] + ['test' for i in range(len(testFP))])
blue is training data, orange is test data.

Use predictiveTSNE for new compound chemical space prediction. Following examples are in RF case and GP case.

rfr = RandomForestRegressor(random_state=seed)
tsne1 = TSNE(random_state=seed)
pred_tsne_rfr = PredictableTSNE(transformer=tsne1, estimator=rfr, keep_tsne_outputs=True), list(range(len(trainFP))))

pred1 = pred_tsne_rfr.transform(testFP)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred1[:,0], pred1[:,1], c='red', alpha=0.5)

gbr = GaussianProcessRegressor(random_state=seed)
tsne2 = TSNE(random_state=seed)
pred_tsne_gbr = PredictableTSNE(transformer=tsne2, estimator=gbr, keep_tsne_outputs=True), list(range(len(trainFP)))pred2 = pred_tsne_gbr.transform(testFP)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_gbr.tsne_outputs_[:,0], pred_tsne_gbr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred2[:,0], pred2[:,1], c='red', alpha=0.5))
RF as an estimator
GP as an estimator

In above case, random forest seems better performance for new compound prediction.

The method of estimator has a big influence for preditiveTSNE performance. It is useful for plot new data but it has risk for miss-leading for design.

At the end, I checked cluster label and preditiveTSNE data.

allmol = train_mol + test_mol
fps2 = []
clusterer2 = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
for mol in allmol:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
fps2 = np.array(fps2)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c=clusterer2.labels_[:len(trainFP)], alpha=0.5, marker='x')
plt.scatter(pred1[:,0], pred1[:,1], c=clusterer2.labels_[len(trainFP):], alpha=0.5, marker='o')

x training data, o test data, colour : class label.

Not only training data but also test data which has same colour are plotted almost same area. It indicates that the approach works well in this case.

Finally we often discuss chemicals pace and use the term ‘chemical space’ but it is very difficult to define, use and compare it.

All code is uploaded in my gist. Any comments and/or suggestions are greatly appreciated.

TIPS of rdkit #RDKit #memorandum #chemoinformatics

As you know, rdkit is very attractive and active project for chemoinformatics. Recent version of rdkit has heteroshuffle enumerator. It is useful for generate new molecules. You can find the details of the method below.

However in the drug discovery project, it is not needed enumerate all aromatic rings in the molecule I think. So enumeration of targeted ring is useful I think.

Fortunately EnumerateHeterocyles uses rdkit Reaction method by using defined reaction rules. It means that specific atoms can protect with setting atom property. Let’s write code! Following code I use omeprazole as example. And I got ring information to protect specific atoms.

from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem import EnumerateHeterocycles
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import copy
omeprazole = Chem.MolFromSmiles('CC1=CN=C(C(=C1OC)C)CS(=O)C2=NC3=C(N2)C=C(C=C3)OC')
ringinfo = omeprazole.GetRingInfo()
rings = ringinfo.AtomRings()

At first enumerate heterocycles without any restriction.

res = EnumerateHeterocycles.EnumerateHeterocycles(omeprazole)
res = [m for m in res]
> 384

Check generated molecules.

Draw.MolsToGridImage(res[:20], molsPerRow=3)

Next, protect pyridine ring and generate heterocycle of bicyclic moiety. As expected, number of generated molecules are reduced.

protected_omeprazole = copy.deepcopy(omeprazole)
for atmidx in rings[0]:
    atom = protected_omeprazole.GetAtomWithIdx(atmidx)
    atom.SetProp('_protected', '1')
res2 = EnumerateHeterocycles.EnumerateHeterocycles(protected_omeprazole)
res2 = [m for m in res2]
> 96

Check the structures.

Draw.MolsToGridImage(res2[:50], molsPerRow=5)

Protect works fine!

And I checked whether the function keep 3D structure of the molecule.

Also I worked fine.

So it is useful for generate specific focused library.

I uploaded the code to gist.

Molecular property regression with Attentive FP #RDKit #Chemoinformatics #DGL #DeepGraphLibrary

Recently Molecular Graph based deep learning is hot are in chemoinformatics.
Some months ago, Zhaoping et al. published new graph based QSAR model named ‘Attentive FP’ in JMC.

As its name suggests, Attentive FP uses attention mechanism for its architecture.

The authors disclosed their code. And fortunately, recent version of DGL is also Attentive FP!
Its repository provides an example of molecular property regression with attentive fp. However it is difficult to understand if I would like to use the FP against my dataset.
So I updated DGL and tried to use attentive FP. In the following code I used solubility data that is provided from rdkit for my practice.

First, import several packages for deep learning. DGL has many function for chemoinformatics task. Used doesn’t need implement functions which are required for chemo to graph conversion.

%matplotlib inline 
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDPaths

import dgl
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from import DataLoader
from import Dataset
from dgl import model_zoo

from import mol_to_complete_graph, mol_to_bigraph

from import atom_type_one_hot
from import atom_degree_one_hot
from import atom_formal_charge
from import atom_num_radical_electrons
from import atom_hybridization_one_hot
from import atom_total_num_H_one_hot
from import one_hot_encoding
from import CanonicalAtomFeaturizer
from import CanonicalBondFeaturizer
from import ConcatFeaturizer
from import BaseAtomFeaturizer
from import BaseBondFeaturizer

from import one_hot_encoding
from import split_dataset

from functools import partial
from sklearn.metrics import roc_auc_score

Then I defined some helper function for the task. Almost of the codes are borrowed from original dgl/example. Thanks for sharing the nice code!

def chirality(atom):
        return one_hot_encoding(atom.GetProp('_CIPCode'), ['R', 'S']) + \
        return [False, False] + [atom.HasProp('_ChiralityPossible')]
def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader.
    data : list of 3-tuples or 4-tuples.
        Each tuple is for a single datapoint, consisting of
        a SMILES, a DGLGraph, all-task labels and optionally
        a binary mask indicating the existence of labels.
    smiles : list
        List of smiles
    bg : BatchedDGLGraph
        Batched DGLGraphs
    labels : Tensor of dtype float32 and shape (B, T)
        Batched datapoint labels. B is len(data) and
        T is the number of total tasks.
    masks : Tensor of dtype float32 and shape (B, T)
        Batched datapoint binary mask, indicating the
        existence of labels. If binary masks are not
        provided, return a tensor with ones.
    assert len(data[0]) in [3, 4], \
        'Expect the tuple to be of length 3 or 4, got {:d}'.format(len(data[0]))
    if len(data[0]) == 3:
        smiles, graphs, labels = map(list, zip(*data))
        masks = None
        smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    labels = torch.stack(labels, dim=0)
    if masks is None:
        masks = torch.ones(labels.shape)
        masks = torch.stack(masks, dim=0)
    return smiles, bg, labels, masks

def run_a_train_epoch(n_epochs, epoch, model, data_loader,loss_criterion, optimizer):
    total_loss = 0
    losses = []
    for batch_id, batch_data in enumerate(data_loader):
        smiles, bg, labels, masks = batch_data
        if torch.cuda.is_available():
            labels ='cuda:0')
            masks ='cuda:0')
        prediction = model(bg, bg.ndata['hv'], bg.edata['he'])
        loss = (loss_criterion(prediction, labels)*(masks != 0).float()).mean()
        #loss = loss_criterion(prediction, labels)

    total_score = np.mean(losses)
    print('epoch {:d}/{:d}, training {:.4f}'.format( epoch + 1, n_epochs,  total_score))
    return total_score

After that, I defined atom and bond featurizer functions. Their settings are same as original repository but it is easy to modify the featurizer.

atom_featurizer = BaseAtomFeaturizer(
                 {'hv': ConcatFeaturizer([
                  partial(atom_type_one_hot, allowable_set=[
                          'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At'],
                  partial(atom_degree_one_hot, allowable_set=list(range(6))),
                  atom_formal_charge, atom_num_radical_electrons,
                  partial(atom_hybridization_one_hot, encode_unknown=True),
                  lambda atom: [0], # A placeholder for aromatic information,
                    atom_total_num_H_one_hot, chirality
bond_featurizer = BaseBondFeaturizer({
                                     'he': lambda bond: [0 for _ in range(10)]

If you would like to the featurizer as same as DeepChem, you can use CanonicalAtom/BondFeaturizer.

DGL seems friendly for chemoinformatitian I think.

OK, let’s load dataset. mol_to_bigraph method with featurizer converts rdkit mol object to graph object. Also, smiles_to_bigraph method can convert smiles to graph! Cool ;)

train=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test=os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = Chem.SDMolSupplier(train)
train_smi =[Chem.MolToSmiles(m) for m in train_mols]
train_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in train_mols]).reshape(-1,1)

test_mols =  Chem.SDMolSupplier(test)
test_smi = [Chem.MolToSmiles(m) for m in test_mols]
test_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in test_mols]).reshape(-1,1)

train_graph =[mol_to_bigraph(mol,
                           bond_featurizer=bond_featurizer) for mol in train_mols]

test_graph =[mol_to_bigraph(mol,
                           bond_featurizer=bond_featurizer) for mol in test_mols]

AttentivFp model is provided from model_zoo. And define dataloader for training and test.

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
model ='cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
model ='cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

Dataloader is pytorch native class. It generates iterator of butch of dataset.
Now almost there! Let’s go to learning process.

loss_fn = nn.MSELoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=10 ** (-2.5), weight_decay=10 ** (-5.0),)
n_epochs = 100
epochs = []
scores = []
for e in range(n_epochs):
    score = run_a_train_epoch(n_epochs, e, model, train_loader, loss_fn, optimizer)

>>>output is below.
epoch 1/100, training 8.8096
epoch 98/100, training 0.3706
epoch 99/100, training 0.3915
epoch 100/100, training 0.3003
plt.plot(epochs, scores)

It seems that learning process goes well ;).

OK let’s validate the model!

all_pred = []
for test_data in test_loader:
    smi_lst, bg, labels, masks = test_data
    if torch.cuda.is_available():
            labels ='cuda:0')
            masks ='cuda:0')
    pred = model(bg, bg.ndata['hv'], bg.edata['he'])
res = np.vstack(all_pred)
plt.scatter(res, test_sol)
from sklearn.metrics import r2_score
print(r2_score(test_sol, res))
> 0.9098691301661277

Let’s compare to RandomForest.

from sklearn.ensemble import RandomForestRegressor
from rdkit import Chem
from rdkit.Chem import AllChem
train_fp = [AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in train_mols]
test_fp = [AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in test_mols]
# make RF regressor and train it.
rfr = RandomForestRegressor(), train_sol)

Check the performance.

rfr_pred = rfr.predict(test_fp)
r2_score(test_sol, rfr_pred)
plt.scatter(rfr_pred, test_sol)

AttentiveFP model showed high performance for solubility prediction in this case.(The my code of RandomForest is not optimized.) DGL example code is very useful for beginner of DGL but it is difficult to apply to my own dataset. So I need to rewrite the code with my dataset.

Any way, I would like to buy Beer to DGL developper. DGL is very nice package for chemoinformatics and ‘RDKitds’. RDKits is new nickname of rdkit user, it is proposed in RDKit UGM 2019 ;)
Today’s code is below.

my whole code.


RDKit based CATS Descriptor #RDKit #Chemoinformatics

Chemically Advanced Template Search (CATS) is developed by Prof. Gisbert Schneider. CATS descriptor is ligand pharmacophore based fuzzy descriptor. So it is suitable for Scaffold hopping of virtual screening. Last week, I attended his lecture and had interest the descriptor again. Fortunately I could find some implementation of the CATS2D descriptor in github repo.

Arthuc’s work (original work is Rajarshi Guha) is nice because the code is written with python and rdkit is used in chemical engine.

I modified the work and made package for CATS2D descriptor calculation with RDKit.

Let’s see an example for distance calculation. At first import packages for calculation.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from scipy.spatial.distance import euclidean
from cats2d.rd_cats2d import CATS2D
import numpy as np

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

Then load two test molecules. These molecules are well known drugs.

mol1 = Chem.MolFromSmiles('CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C')
mol2 = Chem.MolFromSmiles('CCCC1=NC(=C2N1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)CC)OCC)C')

OK let’s calculate distance of these molecules with Morgan FP and CASTS2D descriptors. CATS2D descriptor is not bit fingerprint, so I used eucdean distance. Surprisingly Tanimoto distance (1.0 – Tanimoto similarity) is very low even if these molecules looks similar.

fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2)
dist = 1.0 - DataStructs.TanimotoSimilarity(fp1, fp2)
> 0.41025641025641024

On the other hand, cats2d descriptor based distance is 0.0. It indicates that the two molecules are almost same based on their pharmacophore features.

cats = CATS2D()
cats1 = cats.getCATs2D(mol1)
cats2 = cats.getCATs2D(mol2)
euclidean(cats1, cats2)
> 0.0

Also the package can provide information of pharmacophore.

['', ['L'], '', '', ['A'], '', '', '', ['A'], '', ['D'], '', ['A'], '', '', '', '', '', '', '', ['A'], ['A'], ['A'], '', '', ['A'], '', '', '', ['A'], '', '', '']

['', ['L'], '', '', ['A'], '', '', '', ['A'], '', ['D'], '', ['A'], '', '', '', '', '', '', '', ['A'], ['A'], ['A'], '', '', ['A'], '', '', '', '', ['A'], '', '', '']

Scaffold hopping is very useful strategy of drug discovery for not only improving compound properties but also expanding IP space.

I would like to improve the package because the package is still under development.

Any comments a/o suggestions are greatly appreciated.

My code can be found following URL.

Thanks for developing and sharing CATS2D descriptor implementation!