Molecular property regression with Attentive FP #RDKit #Chemoinformatics #DGL #DeepGraphLibrary

Recently, molecular-graph-based deep learning has become a hot area in chemoinformatics.
A few months ago, Zhaoping Xiong et al. published a new graph-based QSAR model named ‘Attentive FP’ in JMC.

As its name suggests, Attentive FP uses an attention mechanism in its architecture.

The authors disclosed their code, and fortunately a recent version of DGL also implements Attentive FP!
The DGL repository provides an example of molecular property regression with Attentive FP. However, it was difficult for me to see how to apply the model to my own dataset.
So I updated DGL and tried Attentive FP. In the following code I used the solubility data that is provided with RDKit for practice.

First, import several packages for deep learning. DGL has many functions for chemoinformatics tasks, so users don't need to implement the functions required for molecule-to-graph conversion themselves.

%matplotlib inline 
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDPaths

import dgl
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from dgl import model_zoo

from dgl.data.chem.utils import mol_to_complete_graph, mol_to_bigraph

from dgl.data.chem.utils import atom_type_one_hot
from dgl.data.chem.utils import atom_degree_one_hot
from dgl.data.chem.utils import atom_formal_charge
from dgl.data.chem.utils import atom_num_radical_electrons
from dgl.data.chem.utils import atom_hybridization_one_hot
from dgl.data.chem.utils import atom_total_num_H_one_hot
from dgl.data.chem.utils import one_hot_encoding
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from dgl.data.chem import ConcatFeaturizer
from dgl.data.chem import BaseAtomFeaturizer
from dgl.data.chem import BaseBondFeaturizer

from dgl.data.utils import split_dataset

from functools import partial
from sklearn.metrics import roc_auc_score

Then I defined some helper functions for the task. Most of the code is borrowed from the original dgl examples. Thanks for sharing the nice code!

def chirality(atom):
    # GetProp raises a KeyError when the atom has no assigned CIP code
    try:
        return one_hot_encoding(atom.GetProp('_CIPCode'), ['R', 'S']) + \
               [atom.HasProp('_ChiralityPossible')]
    except KeyError:
        return [False, False] + [atom.HasProp('_ChiralityPossible')]
    
def collate_molgraphs(data):
    """Batching a list of datapoints for dataloader.
    Parameters
    ----------
    data : list of 3-tuples or 4-tuples.
        Each tuple is for a single datapoint, consisting of
        a SMILES, a DGLGraph, all-task labels and optionally
        a binary mask indicating the existence of labels.
    Returns
    -------
    smiles : list
        List of smiles
    bg : BatchedDGLGraph
        Batched DGLGraphs
    labels : Tensor of dtype float32 and shape (B, T)
        Batched datapoint labels. B is len(data) and
        T is the number of total tasks.
    masks : Tensor of dtype float32 and shape (B, T)
        Batched datapoint binary mask, indicating the
        existence of labels. If binary masks are not
        provided, return a tensor with ones.
    """
    assert len(data[0]) in [3, 4], \
        'Expect the tuple to be of length 3 or 4, got {:d}'.format(len(data[0]))
    if len(data[0]) == 3:
        smiles, graphs, labels = map(list, zip(*data))
        masks = None
    else:
        smiles, graphs, labels, masks = map(list, zip(*data))

    bg = dgl.batch(graphs)
    bg.set_n_initializer(dgl.init.zero_initializer)
    bg.set_e_initializer(dgl.init.zero_initializer)
    labels = torch.stack(labels, dim=0)
    
    if masks is None:
        masks = torch.ones(labels.shape)
    else:
        masks = torch.stack(masks, dim=0)
    return smiles, bg, labels, masks


def run_a_train_epoch(n_epochs, epoch, model, data_loader, loss_criterion, optimizer):
    model.train()
    losses = []

    for batch_id, batch_data in enumerate(data_loader):
        smiles, bg, labels, masks = batch_data
        if torch.cuda.is_available():
            bg.to(torch.device('cuda:0'))
            labels = labels.to('cuda:0')
            masks = masks.to('cuda:0')

        prediction = model(bg, bg.ndata['hv'], bg.edata['he'])
        # mask out datapoints without labels before averaging the loss
        loss = (loss_criterion(prediction, labels) * (masks != 0).float()).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.data.item())

    total_score = np.mean(losses)
    print('epoch {:d}/{:d}, training {:.4f}'.format(epoch + 1, n_epochs, total_score))
    return total_score

After that, I defined the atom and bond featurizer functions. The settings are the same as in the original repository, but it is easy to modify the featurizers.

atom_featurizer = BaseAtomFeaturizer(
                 {'hv': ConcatFeaturizer([
                  partial(atom_type_one_hot, allowable_set=[
                          'B', 'C', 'N', 'O', 'F', 'Si', 'P', 'S', 'Cl', 'As', 'Se', 'Br', 'Te', 'I', 'At'],
                    encode_unknown=True),
                  partial(atom_degree_one_hot, allowable_set=list(range(6))),
                  atom_formal_charge, atom_num_radical_electrons,
                  partial(atom_hybridization_one_hot, encode_unknown=True),
                  lambda atom: [0], # A placeholder for aromatic information,
                    atom_total_num_H_one_hot, chirality
                 ],
                )})
bond_featurizer = BaseBondFeaturizer({
                                     'he': lambda bond: [0 for _ in range(10)]
    })

If you would like to use the same featurizers as DeepChem, you can use CanonicalAtomFeaturizer and CanonicalBondFeaturizer.
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalAtomFeaturizer.html
https://docs.dgl.ai/en/latest/generated/dgl.data.chem.CanonicalBondFeaturizer.html
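For reference, a minimal sketch of swapping in the canonical featurizers (assuming, as in the DGL docs, that the constructors take atom_data_field/bond_data_field arguments; the 'hv'/'he' field names are just my choice to match the code above):

canonical_atom_featurizer = CanonicalAtomFeaturizer(atom_data_field='hv')
canonical_bond_featurizer = CanonicalBondFeaturizer(bond_data_field='he')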

DGL seems quite friendly for chemoinformaticians, I think.

OK, let's load the dataset. The mol_to_bigraph method with featurizers converts an RDKit mol object to a DGL graph object. Also, the smiles_to_bigraph method can convert SMILES directly to a graph! Cool ;)
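For example, a minimal sketch of smiles_to_bigraph (assuming it accepts the same featurizer arguments as mol_to_bigraph):

from dgl.data.chem.utils import smiles_to_bigraph
g = smiles_to_bigraph('c1ccccc1O', atom_featurizer=atom_featurizer, bond_featurizer=bond_featurizer)

Now load the solubility SDF files shipped with RDKit: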

train = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
test = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = Chem.SDMolSupplier(train)
train_smi = [Chem.MolToSmiles(m) for m in train_mols]
train_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in train_mols]).reshape(-1, 1)

test_mols = Chem.SDMolSupplier(test)
test_smi = [Chem.MolToSmiles(m) for m in test_mols]
test_sol = torch.tensor([float(mol.GetProp('SOL')) for mol in test_mols]).reshape(-1, 1)


train_graph = [mol_to_bigraph(mol,
                              atom_featurizer=atom_featurizer,
                              bond_featurizer=bond_featurizer) for mol in train_mols]

test_graph = [mol_to_bigraph(mol,
                             atom_featurizer=atom_featurizer,
                             bond_featurizer=bond_featurizer) for mol in test_mols]

The AttentiveFP model is provided in model_zoo. Then define the dataloaders for training and test.

model = model_zoo.chem.AttentiveFP(node_feat_size=39,
                                  edge_feat_size=10,
                                  num_layers=2,
                                  num_timesteps=2,
                                  graph_feat_size=200,
                                  output_size=1,
                                  dropout=0.2)
model = model.to('cuda:0')

train_loader = DataLoader(dataset=list(zip(train_smi, train_graph, train_sol)), batch_size=128, collate_fn=collate_molgraphs)
test_loader = DataLoader(dataset=list(zip(test_smi, test_graph, test_sol)), batch_size=128, collate_fn=collate_molgraphs)

DataLoader is a native PyTorch class. It generates an iterator over batches of the dataset.
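As an optional sanity check (my own addition, not in the original example), you can pull one batch from the loader and inspect what collate_molgraphs returns:

smiles, bg, labels, masks = next(iter(train_loader))
print(len(smiles), bg.batch_size, labels.shape, masks.shape)

Now almost there! Let's go to the learning process.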

loss_fn = nn.MSELoss(reduction='none')
optimizer = torch.optim.Adam(model.parameters(), lr=10 ** (-2.5), weight_decay=10 ** (-5.0),)
n_epochs = 100
epochs = []
scores = []
for e in range(n_epochs):
    score = run_a_train_epoch(n_epochs, e, model, train_loader, loss_fn, optimizer)
    epochs.append(e)
    scores.append(score)

The output is below.
epoch 1/100, training 8.8096
----snip---
epoch 98/100, training 0.3706
epoch 99/100, training 0.3915
epoch 100/100, training 0.3003
plt.plot(epochs, scores)

It seems that the learning process went well ;)

OK, let's validate the model!

model.eval()
all_pred = []
for test_data in test_loader:
    smi_lst, bg, labels, masks = test_data
    if torch.cuda.is_available():
        bg.to(torch.device('cuda:0'))
        labels = labels.to('cuda:0')
        masks = masks.to('cuda:0')
    pred = model(bg, bg.ndata['hv'], bg.edata['he'])
    all_pred.append(pred.data.cpu().numpy())
res = np.vstack(all_pred)
plt.clf()
plt.scatter(res, test_sol)
plt.xlabel('pred')
plt.ylabel('exp')
from sklearn.metrics import r2_score
print(r2_score(test_sol, res))
> 0.9098691301661277

Let’s compare to RandomForest.

from sklearn.ensemble import RandomForestRegressor
from rdkit import Chem
from rdkit.Chem import AllChem
train_fp = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in train_mols]
test_fp = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in test_mols]
# make RF regressor and train it
# flatten the (N, 1) torch tensor into a 1D numpy array for scikit-learn
rfr = RandomForestRegressor()
rfr.fit(train_fp, train_sol.numpy().ravel())

Check the performance.

rfr_pred = rfr.predict(test_fp)
r2_score(test_sol, rfr_pred)
plt.clf()
plt.scatter(rfr_pred, test_sol)

The AttentiveFP model showed high performance for solubility prediction in this case (my RandomForest code is not optimized, though). The DGL example code is very useful for DGL beginners, but it was difficult to apply to my own dataset, so I needed to rewrite parts of the code for my data.

Anyway, I would like to buy a beer for the DGL developers. DGL is a very nice package for chemoinformatics and for ‘RDKitds’. RDKitds is a new nickname for RDKit users, proposed at the RDKit UGM 2019 ;)
Today's code is below.

my whole code.

Thanks,

RDKit based CATS Descriptor #RDKit #Chemoinformatics

Chemically Advanced Template Search (CATS) was developed by Prof. Gisbert Schneider. The CATS descriptor is a ligand-pharmacophore-based fuzzy descriptor, so it is suitable for scaffold hopping in virtual screening. Last week, I attended his lecture and became interested in the descriptor again. Fortunately, I could find some implementations of the CATS2D descriptor in GitHub repos.

Arthuc’s work (the original work is by Rajarshi Guha) is nice because the code is written in Python and RDKit is used as the chemistry engine.

I modified the work and made a package for CATS2D descriptor calculation with RDKit.

Let's see an example of distance calculation. First, import the packages for the calculation.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from scipy.spatial.distance import euclidean
from cats2d.rd_cats2d import CATS2D
import numpy as np

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

Then load two test molecules. These molecules are well-known drugs (sildenafil and vardenafil).

mol1 = Chem.MolFromSmiles('CCCC1=NN(C2=C1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C')
mol2 = Chem.MolFromSmiles('CCCC1=NC(=C2N1N=C(NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)CC)OCC)C')

OK, let's calculate the distance between these molecules with Morgan FP and the CATS2D descriptor. The CATS2D descriptor is not a bit fingerprint, so I used the Euclidean distance for it. Surprisingly, the Tanimoto distance (1.0 – Tanimoto similarity) is fairly high even though these molecules look similar.

fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2)
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2)
dist = 1.0 - DataStructs.TanimotoSimilarity(fp1, fp2)
print(dist)
> 0.41025641025641024

On the other hand, the CATS2D-descriptor-based distance is 0.0. It indicates that the two molecules are almost the same in terms of their pharmacophore features.

cats = CATS2D()
cats1 = cats.getCATs2D(mol1)
cats2 = cats.getCATs2D(mol2)
euclidean(cats1, cats2)
> 0.0

The package can also provide pharmacophore information.

print(cats.getPcoreGroups(mol1))
>
['', ['L'], '', '', ['A'], '', '', '', ['A'], '', ['D'], '', ['A'], '', '', '', '', '', '', '', ['A'], ['A'], ['A'], '', '', ['A'], '', '', '', ['A'], '', '', '']

print(cats.getPcoreGroups(mol2))
>
['', ['L'], '', '', ['A'], '', '', '', ['A'], '', ['D'], '', ['A'], '', '', '', '', '', '', '', ['A'], ['A'], ['A'], '', '', ['A'], '', '', '', '', ['A'], '', '', '']
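For example, a minimal sketch that pairs each atom with its pharmacophore labels (assuming getPcoreGroups returns one entry per atom in atom order, as the output above suggests):

for atom, pcore in zip(mol1.GetAtoms(), cats.getPcoreGroups(mol1)):
    if pcore:
        print(atom.GetIdx(), atom.GetSymbol(), pcore)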

Scaffold hopping is a very useful strategy in drug discovery, not only for improving compound properties but also for expanding IP space.

I would like to improve the package because it is still under development.

Any comments and/or suggestions are greatly appreciated.

My code can be found at the following URL.
https://github.com/iwatobipen/CATS2D

Thanks for developing and sharing CATS2D descriptor implementation!

Python package for Ensemble learning #Chemoinformatics #Scikit learn

Ensemble learning is a technique in machine learning. I wrote a post about blending before; the URL is below.
https://iwatobipen.wordpress.com/2018/11/11/ensemble-learning-with-scikit-learn-and-xgboost-machine-learning/
I implemented the code by myself at that time.

Ensemble learning sometimes outperforms a single model, so the method is worth trying. Fortunately, we can now do ensemble learning very easily with a Python package named ‘ML-Ens‘. Installation is very easy: just use pip, the common way for pythonistas ;)
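For example (the package name on PyPI is mlens):

$ pip install mlens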

After installing the package, users can build and train ensemble learning models with a few lines of code. I would like to introduce two examples: one is the stacking method and the other is the blending method. OK, let's go to the code.

At first, load the dataset and make input features. I used Morgan fingerprints as the input data.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
import numpy as np
import pandas as pd
import os
from IPython.display import HTML
traindf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf'))
testdf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf'))
# Check data
HTML(traindf.head(2).to_html())

cls2lab = {'(A) low':0, '(B) medium':1, '(C) high':2}

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in traindf.ROMol]
testfp =  [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in testdf.ROMol]
trainX = np.array([fp2np(fp) for fp in trainfp])
testX = np.array([fp2np(fp) for fp in testfp])
trainY = np.array([cls2lab[i] for i in traindf.SOL_classification.to_list()])
testY =  np.array([cls2lab[i] for i in testdf.SOL_classification.to_list()])

Then import several packages for ensemble learning. SuperLearner is the class for stacking, and BlendEnsemble is the class for blending.

Making an ensemble model is easy. Just use the add method to add a layer, and finally call the add_meta method to add the final prediction layer.

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn.svm import SVR, SVC

# For stacking
ensemble = SuperLearner(scorer=accuracy_score, random_state=794, verbose=2)
ensemble.add([RandomForestClassifier(n_estimators=100, random_state=794), SVC(gamma='auto', C=1000)])
ensemble.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto'))

ensemble.fit(trainX, trainY)
pred = ensemble.predict(testX)
accuracy_score(testY, pred)

# Blending
from mlens.ensemble import BlendEnsemble
ensemble2 = BlendEnsemble(scorer=accuracy_score, test_size=0.2, verbose=2)
ensemble2.add([RandomForestClassifier(n_estimators=794, random_state=794),
                 SVC(gamma='auto')])
ensemble2.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto'))
ensemble2.fit(trainX, trainY)
pred_b = ensemble2.predict(testX)
accuracy_score(pred_b, testY)

More models can also be added with the add method. I uploaded the whole code to my gist. After calling fit, it is easy to access the results via the data attribute.
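For example, a minimal check (per the mlens docs, the fitted ensemble exposes its cross-validated fit times and scores on the data attribute):

print(ensemble.data)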

code example

Unfortunately, the ensemble models described in this post don't outperform a single random forest model, but mlens is a nice tool for ensemble learning, and there is still room to improve model performance, e.g., the kinds of base models, hyperparameters, etc.

The original documentation gives more information. Please go to the link if you are interested.
http://ml-ensemble.com/docs/

Calculate free solvent accessible surface area #RDKit #Chemoinformatics

A recent version of RDKit has a method to calculate FreeSASA.
I had never used the function, so I tried it.

I calculated FreeSASA for two very simple molecules, phenol and hydroxypyridine.

from rdkit import Chem
from rdkit.Chem import rdFreeSASA
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
mol1 = Chem.MolFromSmiles('Oc1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccncc1')
hmol1 = Chem.AddHs(mol1)
hmol2 = Chem.AddHs(mol2)
AllChem.EmbedMolecule(hmol1)
AllChem.EmbedMolecule(hmol2)

To calculate FreeSASA, preparing the atom radii is needed first.

radii1 = rdFreeSASA.classifyAtoms(hmol1)
radii2 = rdFreeSASA.classifyAtoms(hmol2)

Now we are ready; let's calculate FreeSASA.

rdFreeSASA.CalcSASA(hmol1, radii1)
> 137.43293375181904
rdFreeSASA.CalcSASA(hmol2, radii2)
> 128.34398350646256

At first I expected that the FreeSASA of hydroxypyridine would be larger than that of phenol, but the result is the opposite. So I wanted to understand the reason in detail.

After calculating FreeSASA, each atom has a SASA property.

atoms1 = hmol1.GetAtoms()
atoms2 = hmol2.GetAtoms()
for atom in atoms1:
    print(atom.GetSymbol(), atom.GetProp('SASAClassName'), atom.GetProp('SASA'))

sum(float(a.GetProp('SASA')) for a in atoms1)


O Unclassified 10.276248749137361
C Unclassified 5.6117335908330768
C Unclassified 4.8812286399274658
C Unclassified 4.9178986731131236
C Unclassified 4.923259125887407
C Unclassified 4.8241215955112828
C Unclassified 4.8595021375180254
H Unclassified 16.645522512291386
H Unclassified 16.254710140190241
H Unclassified 15.866400115020539
H Unclassified 16.022421230036539
H Unclassified 16.089713316178983
H Unclassified 16.260173926173582
137.43293375181904

for atom in atoms2:
    print(atom.GetSymbol(), atom.GetProp('SASAClassName'), atom.GetProp('SASA'))

sum(float(a.GetProp('SASA')) for a in atoms2)

O Unclassified 10.443721296042458
C Unclassified 5.5711494477882848
C Unclassified 4.7609239637426501
C Unclassified 4.9640112698257193
N Unclassified 11.64593971756287
C Unclassified 4.6638358234073181
C Unclassified 5.0220733873889731
H Unclassified 16.474508728534751
H Unclassified 16.091035411384464
H Unclassified 16.409635462176684
H Unclassified 16.194368688350266
H Unclassified 16.102780310258137
128.34398350646256

The reason is that phenol has one more hydrogen atom, and it occupies more surface area than the aromatic nitrogen.

I think this per-atom property could be used as an input feature for a GCN.
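For example, a minimal sketch that collects the per-atom SASA values into a numpy feature array (my own illustration of the idea, not an RDKit API):

import numpy as np
# one extra node feature per atom: its SASA contribution
sasa_feat = np.array([[float(a.GetProp('SASA'))] for a in hmol1.GetAtoms()],
                     dtype=np.float32)
print(sasa_feat.shape)  # (num_atoms, 1)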
Today’s sample code is here.

A new function of rdkit201909 #RDKit #Chemoinformatics

Last week, I got the cool news that a new version of RDKit is available via conda!

You can check the new features in the ReleaseNotes in the original repo:
https://github.com/rdkit/rdkit/blob/master/ReleaseNotes.md

I installed RDKit 2019.09 and am reading the documentation now.

The MolHash function, originally developed by NextMove Software, is implemented in this version.
[RDKit documentation] https://www.rdkit.org/docs/source/rdkit.Chem.rdMolHash.html
[NextMove Software] https://nextmovesoftware.github.io/molhash/commandline.html#usage

MolHash can generate several kinds of hashes for a molecule. I tried the module with a molecule. The code is below.

from rdkit import Chem
from rdkit.Chem import rdBase
print(rdBase.rdkitVersion)
> 2019.09.1

from rdkit.Chem import rdMolHash
# read sample molecule from SMILES
tofa = Chem.MolFromSmiles('C[C@@H]1CCN(C[C@@H]1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')
# Generate molhash
print(rdMolHash.GenerateMoleculeHashString(tofa))
> 100-23-25-Hx54Xg-jb4lRQ-VbT4fg-F92gVQ-wedGAQ-s3wCEg

The rdMolHash module provides several hash functions. They can be retrieved as a dictionary from the names attribute, and by calling MolHash with one of these functions I could get the corresponding hash string.

molhashf = rdMolHash.HashFunction.names

{'AnonymousGraph': rdkit.Chem.rdMolHash.HashFunction.AnonymousGraph,
 'ElementGraph': rdkit.Chem.rdMolHash.HashFunction.ElementGraph,
 'CanonicalSmiles': rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles,
 'MurckoScaffold': rdkit.Chem.rdMolHash.HashFunction.MurckoScaffold,
 'ExtendedMurcko': rdkit.Chem.rdMolHash.HashFunction.ExtendedMurcko,
 'MolFormula': rdkit.Chem.rdMolHash.HashFunction.MolFormula,
 'AtomBondCounts': rdkit.Chem.rdMolHash.HashFunction.AtomBondCounts,
 'DegreeVector': rdkit.Chem.rdMolHash.HashFunction.DegreeVector,
 'Mesomer': rdkit.Chem.rdMolHash.HashFunction.Mesomer,
 'HetAtomTautomer': rdkit.Chem.rdMolHash.HashFunction.HetAtomTautomer,
 'HetAtomProtomer': rdkit.Chem.rdMolHash.HashFunction.HetAtomProtomer,
 'RedoxPair': rdkit.Chem.rdMolHash.HashFunction.RedoxPair,
 'Regioisomer': rdkit.Chem.rdMolHash.HashFunction.Regioisomer,
 'NetCharge': rdkit.Chem.rdMolHash.HashFunction.NetCharge,
 'SmallWorldIndexBR': rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBR,
 'SmallWorldIndexBRL': rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBRL,
 'ArthorSubstructureOrder': rdkit.Chem.rdMolHash.HashFunction.ArthorSubstructureOrder}

Generate MolHash with all defined hash functions.

for k, v in molhashf.items(): 
    print(k, rdMolHash.MolHash(tofa, v)) 

>                                                                                             
AnonymousGraph ****(*)*1**[*@@](*)[*@@](*(*)*2****3****23)*1
ElementGraph C[C@@H]1CCN(C(O)CCN)C[C@@H]1N(C)C1NCNC2NCCC12
CanonicalSmiles C[C@@H]1CCN(C(=O)CC#N)C[C@@H]1N(C)c1ncnc2[nH]ccc12
MurckoScaffold c1nc(N[C@@H]2CCCNC2)c2cc[nH]c2n1
ExtendedMurcko *[C@@H]1CCN(*)C[C@@H]1N(*)c1ncnc2[nH]ccc12
MolFormula C16H20N6O
AtomBondCounts 23,25
DegreeVector 0,8,11,4
Mesomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2N[CH][CH][C]12_0
HetAtomTautomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2[N][CH][CH][C]21_1_0
HetAtomProtomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2[N][CH][CH][C]21_1
RedoxPair C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2N[CH][CH][C]12
Regioisomer *C.*C(=O)CC#N.*N(*)*.C.C1CNC[C@H2][C@H2]1.c1ncc2cc[nH]c2n1
NetCharge 0
SmallWorldIndexBR B25R3
SmallWorldIndexBRL B25R3L11
ArthorSubstructureOrder 001700190100100007000092000000

Details of these functions are described on the NextMove web page.
https://nextmovesoftware.github.io/molhash/introduction.html

It is interesting to get these different kinds of information with the hash functions. I'll think about applications of the method for chemoinformatics tasks.
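For example, a minimal sketch of one possible application: grouping tautomers by their HetAtomTautomer hash (the SMILES pair is my own toy example; tautomers should share the same hash according to the NextMove docs):

from collections import defaultdict
groups = defaultdict(list)
for smi in ['Oc1ccccn1', 'O=c1cccc[nH]1']:  # 2-hydroxypyridine / 2-pyridone
    m = Chem.MolFromSmiles(smi)
    key = rdMolHash.MolHash(m, rdMolHash.HashFunction.HetAtomTautomer)
    groups[key].append(smi)
print(groups)  # both SMILES should end up under a single key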

Virtual screening with quantum computing! #Quantum_computing #chemoinformatics #memorandum

Quantum computing is one of the hottest areas these days. Recently, Google reported an exciting article in Nature.

Quantum computing research is also potentially very useful for drug discovery. Some days ago I found an interesting article published by researchers at 1QBit, Accenture Labs, and Biogen. The URL is below.
A Quantum-Inspired Method for Three-Dimensional Ligand-Based Virtual Screening

The authors used a quantum computer for Ligand-Based Virtual Screening (LBVS). Their strategy is below.

They made a molecular graph that is like a pharmacophore representation of the molecule, and then searched for the maximum common subgraph (MCS) with quantum computing.

Finding the MCS is an NP-hard problem, so it requires a huge computational cost. But they tackled the issue with a quantum annealer.

To find the MCS, they used a conflict graph. I'm not familiar with the concept, but according to the publication, the graph is constructed from the two molecular graphs.

From Fig. 3 in the article (please check the original article because I can't share the figure):

Construction of the conflict graph Gc from two given graphs G1 and G2. A vertex (v1, va) is added to Gc because at least one of the labels in the set of labels associated with v1 from G1 and with va from G2 match. In this case, the sets of labels match exactly, so we designate the new vertex (v1, va) as an exact match. The rest of the vertices in the conflict graph are added in the same way. Edges are added according to two conditions: bijective mapping and distance violations. Bijective mapping is violated if one of the nodes has been matched twice (represented by a red edge). Distance violation aims to incorporate 3D molecular information (represented by a green edge). An edge between two vertices (e.g., between (v1, vb) and (v2, vc)) is added if the Euclidean distance between v1 and v2 is not comparable to the Euclidean distance between vb and vc. Formally, an edge is added if |d(v1, v2) − d(vb, vc)| > ε (ε = 0.4 in this example).

Given the concept described above, a conflict graph with fewer edges is preferred when searching for the maximum common substructure, since a valid atom mapping must avoid all conflict edges.
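Here is a minimal sketch of the conflict-graph construction as I understand it from the paper (toy code with my own function names, not the authors' implementation):

import itertools
import networkx as nx

def conflict_graph(labels1, labels2, dist1, dist2, eps=0.4):
    # labels1/labels2: {node: set of pharmacophore labels} for each molecule
    # dist1/dist2: callables (u, v) -> 3D Euclidean distance within a molecule
    g = nx.Graph()
    # one conflict-graph vertex per label-matching pair of nodes
    pairs = [(a, b) for a in labels1 for b in labels2 if labels1[a] & labels2[b]]
    g.add_nodes_from(pairs)
    for (a1, b1), (a2, b2) in itertools.combinations(pairs, 2):
        if a1 == a2 or b1 == b2:
            g.add_edge((a1, b1), (a2, b2))  # bijective mapping violated
        elif abs(dist1(a1, a2) - dist2(b1, b2)) > eps:
            g.add_edge((a1, b1), (a2, b2))  # distance violation
    return g

# toy usage: two compatible matches produce no conflict edge
gc = conflict_graph({1: {'A'}, 2: {'D'}}, {'a': {'A'}, 'b': {'D'}},
                    lambda u, v: 1.0, lambda u, v: 1.2)
print(gc.number_of_nodes(), gc.number_of_edges())  # 2 0

A maximum set of mutually non-conflicting vertices (an independent set of this graph) then corresponds to the MCS, which is the part the annealer solves.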

By using this method, the graph-based approach outperformed Morgan-fingerprint-based LBVS against several targets.

It indicates that quantum computers could be useful for drug discovery.

Unfortunately, the computational performance of the quantum approach is not described in this article.
I would like to see a comparison between current classical computation and quantum computation.

Electrostatic Potential Surface (ESP) calculation with GCNN #RDKit #chemoinformatics

ESP is a key feature in drug discovery, and there are many publications discussing ESP in drug design. However, getting an accurate ESP is time-consuming because it needs high-level QM calculations. To reduce the cost of QM, predicting quantum properties with deep learning has been researched.

Some days ago, I found an interesting article published by researchers at Astex. The URL is below.
https://pubs.acs.org/doi/10.1021/acs.jmedchem.9b01129

They built a GCNN model on a QM calculation dataset (B3LYP/6-31G* basis set) of more than 100,000 diverse drug-like molecules. Their model showed good performance in ESP calculation, pKa prediction, and binding affinity prediction.

GCNN-based calculation is also much faster than QM-based calculation: Fig. 3 shows that their approach is four orders of magnitude faster than conventional DFT-based calculation.

Luckily, the authors disclosed the code on GitHub ;)
https://github.com/AstexUK/ESP_DNN

…But the code is written for Python 2.x.
Hmm… I would like to use the code in Python 3.x.

So I forked ESP_DNN and modified the original code. You can find the code here:
https://github.com/iwatobipen/ESP_DNN

My env for the ESP_DNN is below.
Python 3.7
numpy 1.17.2
rdkit 2019.03.4
xarray 0.14.0
tensorflow-gpu 1.14.0
keras-base 2.2.4

Installation is very easy.

$ git clone https://github.com/iwatobipen/ESP_DNN.git
$ cd ESP_DNN
$ pip install -e .  # or: pip install .

Then run the example code that is introduced in README.md.

ESP_DNN$ python -m esp_dnn.predict -m ligand -i examples/ligands/

After running the code above, pqr files are generated in the -i folder.
They can be visualized at http://nglviewer.org/ngl/, which is recommended in the documentation.

Open the URL in a web browser and load a pqr file.
Then adjust the visualization settings according to the document.
Finally, I could get an image with the ESP. The following ESP was generated with the provided GCNN model.

The QM-based approach is powerful for drug discovery, but its calculation cost is high. On the other hand, the deep-learning-based approach is very fast, but it requires high-quality training data.

So I think the authors' work is very useful because they provided not only their code but also a high-quality trained model. Awesome work!

If I have to say something, it seems that the training dataset is not provided in the original repo. I would like to check the chemical space of the training data.

Anyway, calculating an accurate ESP in a short time is very useful for me. I would like to apply it to some prediction tasks ASAP.