RDKit UGM 2019 in Hamburg! #RDKit #Chemoinformatics

Last week I participated in the RDKit UGM 2019! This year there were more than a hundred participants, and the meeting is growing year by year. The meeting repo is below, and I think some presentations will be uploaded in a few days to weeks.

https://github.com/rdkit/UGM_2019

The Twitter hashtag is #RDKitUGM2019.

The following document is a memorandum for myself. It does not fully cover the meeting topics and doesn't describe details, because some of the work is unpublished.

Agenda https://github.com/rdkit/UGM_2019/blob/master/Info/Draft_Agenda.pdf

Day 1

Floriane Montanari and Robin Winter: Utilizing in silico models in both directions: prediction and optimizing the properties of small molecules

Mahendra Awale: SAR Transfer via Matched Molecular Series

  • I thought MMS/MMP was useful for ADMET SAR transfer but difficult for bioactivity SAR transfer.
  • But he presented some examples of bioactivity SAR transfer.
  • MMS is easier for medicinal chemists to understand than deep learning.
  • It is a very interesting approach to me.
  • https://www.ncbi.nlm.nih.gov/pubmed/30108724

Martin Vogt: Systematic extraction of analogue series from large compound collections

Paul Czodrowski: Is bigger always better? Comparing two strategies for the generation of predictive models based on different computational resources

Chaya Stern: Improving molecular models by generating high-quality quantum chemistry data

Esben Bjerrum: Molecular De Novo Design – using Deep Learning Encoders and Generators together with RDKit

  • Some years ago, the AZ group published an RNN-based molecular generator named REINVENT. I think it is quite a nice tool. In this presentation, a conditional RNN-based generator was described.
  • You can find his nice talk material at the following URL: link

Day 2

Dominique Sydow and Jaime Rodríguez-Guerra: TeachOpenCADD: An open source teaching platform for computer-aided drug design

Jan Halborg Jensen: Quantum chemistry meets cheminformatics

Brian Kelley: "Learned" Molecule Representations – a technical comparison with data from real projects

  • This talk was very useful for me. Different ways of representing molecules were described.

Christoph Bauer: Generation of Bimolecular 3D Complex Structures with RDKit

Suliman Sharif: Cocktail Shaker: An open source drug expansion and enumeration library using Python and RDKit

Lightning talks and poster sessions

  • There were many interesting topics in these sessions, such as rdkit-neo4j integration, RDKit-QM integration, mmpdb crowdfunding, etc.

And on day 3, I participated in the KNIME workshop, because recent KNIME versions have lots of useful nodes. I learned how to use KNIME in chemoinformatics.

I had useful discussions with participants and enjoyed not only the meeting but also the food (of course beer too!) and the views of Hamburg.

I would like to say thank you to all the participants. I got a lot of energy and motivation from the UGM.

KNIME settings for Ubuntu 18.04 #Knime

I'm waiting for my flight at Hamburg Airport. I will post about the UGM after getting back to Japan. I would like to thank all the participants.

BTW, yesterday I participated in the KNIME workshop and learned KNIME basics. Recently I have started to use not only Python but also KNIME. The new version of KNIME has lots of nodes for data analysis, chemoinformatics and data visualization (interactive JS plots!).

I installed KNIME ver. 4.0.1 on my laptop, but when I launched KNIME I got some error messages and the GUI layout was broken... Oops...

I searched the web and found a solution. The following document describes the solution for an old version of KNIME, but it worked for me.

When I run KNIME with GTK 2 instead of GTK 3, the error is gone. The problem seems to come from Eclipse 4.5(?).

https://www.knime.com/faq#q32
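For reference, here is the GTK 2 workaround as I applied it (treat this as a sketch; the option names follow the FAQ above, and your knime.ini path may differ):

# Option 1: force SWT to use GTK 2 via an environment variable before launching
export SWT_GTK3=0
./knime
# Option 2: add these two lines to knime.ini, above the -vmargs line
# --launcher.GTK_version
# 2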

Now my KNIME works very well. I'm not sure whether this approach is the correct way... Any comments or suggestions are appreciated.

I'll make some chemoinformatics workflows during my flight.

Python package for machine learning with imbalanced data #machine_learning #chemoinformatics

Recently I have been struggling with imbalanced data. I didn't have any idea how to handle it, so my predictive model showed poor performance.

Some days ago, I found a useful package for learning from imbalanced data, named 'imbalanced-learn'.

It can be installed from conda. The package provides methods for over-sampling and under-sampling.
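For the record, a one-line install could look like this (assuming the conda-forge channel):

conda install -c conda-forge imbalanced-learn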

I was interested in it and tried to use it on a drug discovery dataset.

The following example uses two methods: SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic sampling approach).

Both methods are over-sampling approaches, so they generate artificial data points.
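As a conceptual sketch (this is the idea, not imbalanced-learn's actual implementation), SMOTE makes a synthetic minority point by interpolating between a minority sample and one of its minority-class nearest neighbors:

import numpy as np
# hypothetical minority sample and one of its minority-class neighbors
x_i = np.array([0.0, 1.0])
x_nn = np.array([1.0, 0.0])
lam = np.random.rand()            # lambda drawn from [0, 1]
x_new = x_i + lam * (x_nn - x_i)  # synthetic point on the segment between them

ADASYN follows the same interpolation idea but adaptively draws more synthetic points near minority samples that are harder to learn.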

At first, import packages and load the dataset.


%matplotlib inline
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
from sklearn.decomposition import PCA
# load a ChEMBL 5-HT dataset and drop rows with missing values
df = pd.read_csv('chembl_5HT.csv')
df = df.dropna()

Make an imbalanced label.

# define classes: pChEMBL value > 9 is active (1), otherwise inactive (0)
df['CLS'] = np.array(df.pchembl_value > 9, dtype=int)
df.CLS.hist()

Then calculate fingerprints with RDKit.

mols = [Chem.MolFromSmiles(smi) for smi in df.canonical_smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]

def fp2np(fp):
    # convert an RDKit bit vector into a numpy array
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([fp2np(fp) for fp in fps])
Y = df.CLS.to_numpy()
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state=123, test_size=0.2)

Finally, apply the data to three approaches: default (no resampling), SMOTE and ADASYN.

print(train_X.shape)
print(train_Y.shape)
print(sum(train_Y)/len(train_Y))
>(11340, 2048)
>(11340,)
>0.08686067019400352
rf = RandomForestClassifier(n_estimators=10)
rf.fit(train_X, train_Y)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      2586
           1       0.57      0.42      0.48       250

    accuracy                           0.92      2836
   macro avg       0.76      0.69      0.72      2836
weighted avg       0.91      0.92      0.92      2836

[[2506   80]
 [ 145  105]]

Then try resampling. After resampling, the ratio of negatives to positives is fifty-fifty.

X_resampled, Y_resampled = SMOTE().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20710, 2048)
>(20710,)
>0.5

rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      2586
           1       0.47      0.48      0.48       250

    accuracy                           0.91      2836
   macro avg       0.71      0.72      0.71      2836
weighted avg       0.91      0.91      0.91      2836

[[2451  135]
 [ 129  121]]
X_resampled, Y_resampled = ADASYN().fit_resample(train_X, train_Y)
print(X_resampled.shape)
print(Y_resampled.shape)
print(sum(Y_resampled)/len(Y_resampled))
>(20884, 2048)
>(20884,)
>0.5041658686075464
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_resampled, Y_resampled)
pred_Y = rf.predict(test_X)
print(classification_report(test_Y, pred_Y))
print(confusion_matrix(test_Y, pred_Y))
----out

              precision    recall  f1-score   support

           0       0.95      0.94      0.95      2586
           1       0.44      0.47      0.46       250

    accuracy                           0.90      2836
   macro avg       0.70      0.71      0.70      2836
weighted avg       0.90      0.90      0.90      2836

[[2437  149]
 [ 132  118]]

Hmm, resampling did not improve the results...

Let's check the chemical space with PCA.

pca = PCA(n_components=3)
res = pca.fit_transform(X)

col = {0: 'blue', 1: 'yellow'}
color = [col[int(i)] for i in Y]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

The negative (blue) and positive (yellow) points are located very close together.

In my example, the generated fingerprints do not correspond to real molecules. Finally, I did PCA on the resampled data.

pca = PCA(n_components=3)
res = pca.fit_transform(X_resampled)
col = {0: 'blue', 1: 'yellow'}
color = [col[int(i)] for i in Y_resampled]
plt.figure(figsize=(10,7))
plt.scatter(res[:,0], res[:,1], c=color, alpha=0.5)

Wow, both classes are located in the same chemical space... It seems difficult to classify to me, but the random forest still showed reasonable performance.

I need to learn more and more about how to handle imbalanced data in drug discovery...

Imbalanced-learn also has under-sampling methods. I would like to try them if I have time.
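For example, a minimal under-sampling sketch on the same training data might look like this (RandomUnderSampler is the simplest method; the random_state here is just an assumption for reproducibility):

from imblearn.under_sampling import RandomUnderSampler
# randomly discard majority-class samples until both classes are balanced
rus = RandomUnderSampler(random_state=123)
X_under, Y_under = rus.fit_resample(train_X, train_Y)
print(X_under.shape, sum(Y_under) / len(Y_under))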

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/imbalanced_learn.ipynb

Transfer learning of DGMG for focused library generation #DGL #Chemoinformatics

Transfer learning is a very useful method in deep learning, because it can reuse a pre-trained model and retrain only a small number of parameters.

I think it is useful for molecular generators too. If so, it can be used for focused library generation. I have posted about molecular generation with DGL before, so I tried to apply transfer learning with DGL.

At first, import several packages.

import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem import Draw
from dgl.data.chem import utils
from dgl.model_zoo.chem import pretrain
from dgl.model_zoo.chem.dgmg import MoleculeEnv
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
import copy
mols = Chem.SDMolSupplier(f"{RDPaths.RDDocsDir}/Book/data/cdk2.sdf")
model = pretrain.load_pretrained('DGMG_ChEMBL_canonical')

Then make three copies of the DGMG model.

model1 = copy.deepcopy(model)
model2 = copy.deepcopy(model)
model3 = copy.deepcopy(model)

Download utility functions for chemical structure handling from the original DGL repository.

!wget https://raw.githubusercontent.com/dmlc/dgl/master/examples/pytorch/model_zoo/chem/generative_models/dgmg/utils.py
!wget https://raw.githubusercontent.com/dmlc/dgl/master/examples/pytorch/model_zoo/chem/generative_models/dgmg/sascorer.py

Freeze the upper layers, and set requires_grad to True only for the last choose_dest_agent layer.

from utils import MoleculeDataset

# freeze all parameters, then unfreeze only the choose_dest_agent layer
for m in [model1, model2, model3]:
    for param in m.parameters():
        param.requires_grad = False
    for param in m.choose_dest_agent.parameters():
        param.requires_grad = True

For convenience, each model is trained for only 10 epochs before checking the generated structures. At first, check the default output.

genmols = []
i = 0
while i < 20:
    SMILES = model(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
Draw.MolsToGridImage(genmols)
(Image: molecules generated by the default pre-trained model.)

I defined the CDK2 molecules, plus a cyclic and a linear molecule, as data for additional learning.

atom_types = ['O', 'Cl', 'C', 'S', 'F', 'Br', 'N']
bond_types = [Chem.rdchem.BondType.SINGLE,
              Chem.rdchem.BondType.DOUBLE,
              Chem.rdchem.BondType.TRIPLE]
env = MoleculeEnv(atom_types, bond_types)
from utils import Subset
from utils import Optimizer
subs1 = Subset([Chem.MolToSmiles(mol) for mol in mols], 'canonical', env)
subs2 = Subset(['C1NCOC1' for _ in range(10)], 'canonical', env)
subs3 = Subset(['CNCOCC' for _ in range(10)], 'canonical', env)
loader1  = DataLoader(subs1, 1)
loader2  = DataLoader(subs2, 1)
loader3  = DataLoader(subs3, 1)

The first trial uses the CDK2 molecules.

optimizer = Optimizer(0.1, Adam(model1.parameters(), 0.1))
model1.train()
for i in range(10):
    for data in loader1:
        optimizer.zero_grad()
        logp = model1(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model1.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model1(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

Hmm, it seems more bicyclic compounds are generated...?

Then train on the cyclic molecule.

optimizer = Optimizer(0.1, Adam(model2.parameters(), 0.1))
model2.train()
for i in range(10):
    for data in loader2:
        optimizer.zero_grad()
        logp = model2(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model2.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model2(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

As expected, cyclic molecules are generated.

Finally, let's check the linear molecule as training data.

optimizer = Optimizer(0.1, Adam(model3.parameters(), 0.1))
model3.train()
for i in range(10):
    for data in loader3:
        optimizer.zero_grad()
        logp = model3(data, compute_log_prob=True)
        loss_averaged = - logp
        optimizer.backward_and_step(loss_averaged)
model3.eval()
genmols = []
i = 0
while i < 20:
    SMILES = model3(rdkit_mol=True)
    if Chem.MolFromSmiles(SMILES) is not None:
        genmols.append(Chem.MolFromSmiles(SMILES))
        i += 1
from rdkit.Chem import Draw
Draw.MolsToGridImage(genmols)

Wow, many linear molecules are generated.

This is a very simple example of transfer learning. I think it is a useful method, because the user doesn't need to retrain a huge number of parameters.
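To see how few parameters are actually retrained, a quick check like the following should work (a sketch using standard PyTorch; it is not part of the original notebook):

# compare trainable vs. total parameter counts after freezing
n_total = sum(p.numel() for p in model1.parameters())
n_trainable = sum(p.numel() for p in model1.parameters() if p.requires_grad)
print(f'trainable parameters: {n_trainable} / {n_total}')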

My code is not so efficient, because I don't fully understand how DGMG learns.

I would like to read the documentation more deeply.

Readers who are interested in the code can find the whole notebook at the following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/transf.ipynb

Graph-based generative model of molecules #DGL #RDKit #chemoinformatics

Recently, graph-based predictive and generative models have become attractive topics in chemoinformatics, because a graph-based model does not need to learn a grammar the way a SMILES-based model does. A graph seems a more primitive representation of a molecule. Of course, to use a graph-based model, the user needs to convert molecules to graph objects.
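For illustration, a minimal molecule-to-graph conversion can be done with RDKit alone (DGL's featurizers do much more; this sketch only extracts atoms as nodes and bonds as edges):

from rdkit import Chem

mol = Chem.MolFromSmiles('CCO')
nodes = [atom.GetSymbol() for atom in mol.GetAtoms()]
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
print(nodes, edges)  # ['C', 'C', 'O'] [(0, 1), (1, 2)]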

PyTorch Geometric (PyG) and the Deep Graph Library (DGL) are very useful packages for graph-based deep learning. Today, I got a comment about my post from a DGL developer. It is an honor for me to get a comment, and I learned that the new version of DGL supports many methods for chemistry. It's awesome work, isn't it!!!!

I tried to use it. If you can read Japanese (or can use a translation tool), there is a nice article describing the Junction Tree VAE at the URL below.
https://qiita.com/shionhonda/items/4a29121ea50cb8efd235

So today, I used a different model from DGL.

The following example is molecular generation with DGMG.

Fortunately, DGL provides pre-trained models, so users can use the generative model without training it themselves. Let's start!!

At first, import several packages and define the splitsmi function. The generative model sometimes generates SMILES which contain '.', so I would like to retrieve the largest fragment from the generated SMILES.

from rdkit import Chem
from rdkit.Chem import QED
from dgl.model_zoo.chem import load_pretrained
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import os
import math
import numpy as np
def splitsmi(smiles):
    smiles_list = smiles.split('.')
    length = [len(s) for s in smiles_list]
    return smiles_list[np.argmax(length)]
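
As a quick sanity check, splitsmi keeps only the longest fragment of a dot-disconnected SMILES (the input here is just a hypothetical example):

print(splitsmi('c1ccccc1.CCO'))  # -> 'c1ccccc1'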
    

Then load the pre-trained models. It is very easy: just call the load_pretrained function! The following code loads two models, one trained on ChEMBL and the other on ZINC. I picked 30 molecules which are sanitizable and whose QED is over 0.6.

chembl_model = load_pretrained('DGMG_ChEMBL_canonical')
chembl_model.eval()
chembl_mols = []
chembl_qeds = []
while len(chembl_mols) < 30:
    try:
        # keep the largest fragment, sanitize, and filter by QED
        smiles = splitsmi(chembl_model(rdkit_mol=True))
        mol = Chem.MolFromSmiles(smiles)
        qed = QED.qed(mol)
        if qed > 0.6:
            chembl_mols.append(mol)
            chembl_qeds.append(str(np.round(qed, 2)))
    except:
        pass
Draw.MolsToGridImage(chembl_mols, legends=chembl_qeds, molsPerRow=5)

zinc_model = load_pretrained('DGMG_ZINC_canonical')
zinc_model.eval()
zinc_mols = []
zinc_qeds = []
while len(zinc_mols) < 30:
    try:
        smiles = splitsmi(zinc_model(rdkit_mol=True))
        mol = Chem.MolFromSmiles(smiles)
        qed = QED.qed(mol)
        if qed > 0.6:
            zinc_mols.append(mol)
            zinc_qeds.append(str(np.round(qed, 2)))
    except:
        pass
Draw.MolsToGridImage(zinc_mols, legends=zinc_qeds, molsPerRow=5)

The generated molecules are...

(Image: molecules generated from the ChEMBL model.)
(Image: molecules generated from the ZINC model.)

The generated molecules are diverse, but not such undruggable structures, I think.

Users can also build their own generative model from their own dataset.

Recently we can access lots of information from many sources: Twitter, GitHub, arXiv, blogs, etc...

Much code is freely available. That is worthwhile, because you can evaluate the code if you want, and you have a chance to make new findings.

I really respect all the developers and feel I have to learn more and more...

Anyway, DGL is a very useful package for chemoinformaticians who are interested in graph-based DL, I think. ;-)

Today's code can be checked at the following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/MODEL_ZOO_DGMG_EXAMPLE.ipynb