Recently, graph-based predictive and generative models have become an attractive topic in chemoinformatics, because a graph-based model does not need to learn a grammar the way a SMILES-based model does. A graph seems a more primitive representation of a molecule. Of course, to use a graph-based model, the user needs to convert each molecule to a graph object.
PyTorch Geometric (PyG) and Deep Graph Library (DGL) are very useful packages for graph-based deep learning. Today I got a comment on my post from a DGL developer. It is an honor for me to receive a comment, and I learned that the new version of DGL supports many methods for chemistry. It's awesome work, isn't it!
I tried it out. If you can read Japanese (or can use a translation tool), there is a nice article at the URL below.
https://qiita.com/shionhonda/items/4a29121ea50cb8efd235
That post describes the Junction Tree VAE, so today I used a different model from DGL.
The following example is molecular generation with DGMG (Deep Generative Models of Graphs).
Fortunately, DGL provides pre-trained models, so you can use a generative model without training one yourself. Let's start!
First, import several packages and define the splitsmi function. The generative model sometimes produces SMILES containing '.', so I would like to retrieve the largest fragment from a generated SMILES string.
from rdkit import Chem
from rdkit.Chem import QED
from dgl.model_zoo.chem import load_pretrained
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import os
import math
import numpy as np

def splitsmi(smiles):
    # Return the longest '.'-separated fragment of a SMILES string
    smiles_list = smiles.split('.')
    length = [len(s) for s in smiles_list]
    return smiles_list[np.argmax(length)]
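As a quick check of the fragment-picking logic, splitsmi keeps only the longest '.'-separated component; the input below is a made-up example, not model output:

```python
import numpy as np

def splitsmi(smiles):
    # Keep the longest '.'-separated fragment of a SMILES string
    smiles_list = smiles.split('.')
    length = [len(s) for s in smiles_list]
    return smiles_list[np.argmax(length)]

# 'CCO.c1ccccc1' has two fragments; the aromatic ring is the longer one
print(splitsmi('CCO.c1ccccc1'))  # -> c1ccccc1
```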
Then load the pre-trained models. It is very easy: just call the load_pretrained function! The following code loads two models, one trained on ChEMBL and the other trained on ZINC. I picked 30 molecules that are sanitizable and have QED over 0.6.
chembl_model = load_pretrained('DGMG_ChEMBL_canonical')
chembl_model.eval()
chembl_mols = []
chembl_qeds = []
while len(chembl_mols) < 30:
    try:
        # Sample a SMILES string from the pre-trained model
        smiles = chembl_model(rdkit_mol=True)
        mol = Chem.MolFromSmiles(splitsmi(smiles))
        qed = QED.qed(mol)
        if qed > 0.6:
            chembl_mols.append(mol)
            chembl_qeds.append(str(np.round(qed, 2)))
    except:
        pass
Draw.MolsToGridImage(chembl_mols, legends=chembl_qeds, molsPerRow=5)
zinc_model = load_pretrained('DGMG_ZINC_canonical')
zinc_model.eval()
zinc_mols = []
zinc_qeds = []
while len(zinc_mols) < 30:
    try:
        smiles = zinc_model(rdkit_mol=True)
        mol = Chem.MolFromSmiles(splitsmi(smiles))
        qed = QED.qed(mol)
        if qed > 0.6:
            zinc_mols.append(mol)
            zinc_qeds.append(str(np.round(qed, 2)))
    except:
        pass
Draw.MolsToGridImage(zinc_mols, legends=zinc_qeds, molsPerRow=5)
Generated molecules are….


The generated molecules are diverse, and the structures are not so undruggable, I think.
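As background on the QED filter used above: QED combines several per-property desirability scores (molecular weight, logP, and so on) into one drug-likeness value via a geometric mean. A minimal sketch of that aggregation, using made-up desirability values rather than RDKit's fitted desirability functions:

```python
import math

def qed_like(desirabilities):
    # Unweighted geometric mean of per-property desirability scores in (0, 1];
    # RDKit's actual QED additionally applies a weight to each property term.
    return math.exp(sum(math.log(d) for d in desirabilities) / len(desirabilities))

# Made-up desirability values for illustration only
print(round(qed_like([0.8, 0.9, 0.7, 0.85]), 3))
```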
Users can also build their own generative model from their own dataset.
Recently we can access lots of information from many sources: Twitter, GitHub, arXiv, blogs, etc. Much code is freely available. That is valuable, because you can evaluate the code if you want, and you have a chance of making new findings.
I really respect all the developers and feel I have to learn more and more…
Anyway, DGL is a very useful package for chemoinformaticians who are interested in graph-based DL, I think. ;-)
Today's code can be checked at the following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/MODEL_ZOO_DGMG_EXAMPLE.ipynb
The QED scores of the generated compounds are interesting.
I wonder if it is possible to build a model trained on protein-ligand complexes? The model would then have to generate molecules based on the binding-site structures only.
I think it's difficult, because the DGMG model can learn from ligand structures only. So if you would like to build a model that generates ligand-like molecules, how about using a dataset retrieved from the PDB or BindingDB?
Thanks.
Yes, PDBbind has 16,151 protein-ligand complexes (2018, http://webcache.googleusercontent.com/search?q=cache:http://www.pdbbind-cn.org/download/pdbbind_2018_intro.pdf).
But I think the ideal training set would be all ligands for a single target, so you can see how the generated compounds perform against that target. The ChEMBL API (https://github.com/chembl/chembl_webresource_client) would be a good starting point to look for targets with many active ligand structures.
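To sketch that idea: once activity records for one target have been fetched (for example via the ChEMBL web resource client), building the target-focused training set is just a filter on an activity cutoff. The records and the pchembl_cutoff below are made-up placeholders, not real ChEMBL data:

```python
# Hypothetical records as (SMILES, pChEMBL value) pairs for a single target;
# in practice these would come from a ChEMBL activity query.
records = [
    ('CCO', 4.2),
    ('c1ccccc1O', 6.8),
    ('CC(=O)Nc1ccc(O)cc1', 7.5),
]

def active_ligands(records, pchembl_cutoff=6.5):
    # Keep only ligands above the activity cutoff as a target-focused training set
    return [smi for smi, p in records if p >= pchembl_cutoff]

print(active_ligands(records))  # the two more-active placeholder ligands
```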
An interesting related talk here: https://www.youtube.com/watch?v=c77B60_QIYI
Hi Bakary, there might be some efforts in that direction, given that the topic has been quite popular. Deep learning methods are mostly effective with a large amount of data. The question is then whether there is enough public data with paired molecules and binding-site structures.
How large should the training set be for a generative model, for example?
Thanks for the kind words. I learned a lot from your blogs when I got started with RDKit, and this effort would not be possible without generous sharing like yours :)
Thanks. It's really great work. It is very useful not only for many chemoinformaticians but also for medicinal chemists, I think.