Graph based generative model of molecule #DGL #RDKit #chemoinformatics

Recently, graph-based predictive and generative models have become an attractive topic in the chemoinformatics area, because a graph-based model does not need to learn a grammar the way a SMILES-based model does. A graph seems a more primitive representation of a molecule. Of course, to use a graph-based model, the user needs to convert each molecule to a graph object.

PyTorch Geometric (PyG) and Deep Graph Library (DGL) are very useful packages for graph-based deep learning. Today I got a comment about my post from a DGL developer. It is an honor for me to get a comment. And I learned that the new version of DGL supports many methods for chemistry. It's awesome work, isn't it!

I tried to use it. If you can read Japanese (or can use a translation tool), there is a nice article; the URL is below.
That post describes the Junction Tree VAE.

So today, I used a different model from DGL.

The following example is molecular generation with DGMG.

Fortunately, DGL provides pre-trained models, so users can run the generative model without training it themselves. Let's start!!

First, import several packages and define the splitsmi function. The generative model sometimes produces a SMILES string that contains a '.', so I would like to retrieve the largest fragment from the generated SMILES.

from rdkit import Chem
from rdkit.Chem import QED
from dgl.model_zoo.chem import load_pretrained
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import os
import math
import numpy as np
def splitsmi(smiles):
    smiles_list = smiles.split('.')
    length = [len(s) for s in smiles_list]
    return smiles_list[np.argmax(length)]
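As a quick sanity check, the same fragment selection can also be written without numpy. The helper below (largest_fragment is my own name, not from the post) is a minimal, equivalent sketch:

```python
# Minimal, numpy-free equivalent of splitsmi: keep the largest
# '.'-separated fragment of a SMILES string.
def largest_fragment(smiles: str) -> str:
    return max(smiles.split('.'), key=len)

# e.g. a generated SMILES carrying a counter-ion fragment
print(largest_fragment('CC(=O)Oc1ccccc1C(=O)O.Cl'))  # -> CC(=O)Oc1ccccc1C(=O)O
```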

Then load the pre-trained models. It is very easy: just call the load_pretrained function! The following code loads two models, one trained on ChEMBL and the other trained on ZINC. I picked 30 molecules per model that are sanitizable and have a QED over 0.6.

chembl_model = load_pretrained('DGMG_ChEMBL_canonical')
chembl_model.eval()
chembl_mols = []
chembl_qeds = []
while len(chembl_mols) < 30:
    # calling the model with rdkit_mol=True returns a generated SMILES string
    smiles = splitsmi(chembl_model(rdkit_mol=True))
    mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES can't be sanitized
    if mol is not None:
        qed = QED.qed(mol)
        if qed > 0.6:
            chembl_mols.append(mol)
            chembl_qeds.append(str(np.round(qed, 2)))
Draw.MolsToGridImage(chembl_mols, legends=chembl_qeds, molsPerRow=5)

zinc_model = load_pretrained('DGMG_ZINC_canonical')
zinc_model.eval()
zinc_mols = []
zinc_qeds = []
while len(zinc_mols) < 30:
    smiles = splitsmi(zinc_model(rdkit_mol=True))
    mol = Chem.MolFromSmiles(smiles)
    if mol is not None:
        qed = QED.qed(mol)
        if qed > 0.6:
            zinc_mols.append(mol)
            zinc_qeds.append(str(np.round(qed, 2)))
Draw.MolsToGridImage(zinc_mols, legends=zinc_qeds, molsPerRow=5)

Generated molecules are….

from ZINC

The generated molecules are diverse, but not such undruggable structures, I think.

Users can also build their own generative model from their own dataset.
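As far as I know, the DGMG training examples in the DGL repository consume a plain text file with one SMILES per line (my assumption; check the examples for the exact format). A minimal sketch of preparing such a file, with the file name my_dataset.smi chosen for illustration:

```python
# Hypothetical dataset preparation: deduplicate SMILES and write them
# one per line, the plain-text format the DGMG training examples read
# (assumption; verify against the DGL examples repository).
smiles_raw = ['CCO', 'c1ccccc1', 'CCO', 'CC(=O)O']
unique_smiles = sorted(set(smiles_raw))
with open('my_dataset.smi', 'w') as f:
    f.write('\n'.join(unique_smiles) + '\n')
```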

Recently we can access lots of information from many sources: Twitter, GitHub, arXiv, blogs, etc.

Much code is freely available. It's worthwhile because you can evaluate the code if you want, and you have a chance to make new findings.

I really respect all the developers and feel I have to learn more and more…

Anyway, DGL is a very useful package for chemoinformaticians who are interested in graph-based DL, I think. ;-)

Today's code can be checked at the following URL.


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.

8 thoughts on “Graph based generative model of molecule #DGL #RDKit #chemoinformatics”

  1. Generated compounds' QED scores are interesting.
    I wonder if it is possible to build a model trained on protein-ligand complexes? The model would then have to generate molecules based on the binding site structures only.

    1. I think it’s difficult, because the DGMG model can learn from ligand structures only. So if you would like to build a model which generates ligand-like molecules, how about using a dataset retrieved from the PDB or BindingDB?

      1. Yes, PDBbind has 16,151 protein-ligand complexes (2018).
        But I think the ideal training set would be all ligands for a single target, to see how the generated compounds perform on such ligands. The ChEMBL API would be a good starting point to look for targets with many active ligand structures.

    2. Hi Bakary, there might exist some efforts in this direction, given that the topic has been quite popular. Deep learning methods are mostly effective with a large amount of data. The question is then whether there is enough public data with paired molecules and binding site structures.

      1. How large should the training sample be for a generative model, for example?

  2. Thanks for the kind words. I learned a lot from your blogs when I got started with RDKit, and this effort would not be possible without generous sharing like yours :)
