Compound Generator with Graph Networks, GraphINVENT #chemoinformatics #RDKit #PyTorch

Here is a new article from Esben et al. about a de novo compound generator built on graph networks, named GraphINVENT.


The graph-based approach has an advantage for compound generation compared to the SMILES-based approach: it doesn't need to learn the grammar of SMILES. The graph approach represents a molecule as a graph, where atoms are nodes and bonds are edges. However, SMILES-based approaches currently work very well, even though they have the additional task of learning SMILES syntax on top of compound structure. So I was interested in this work as a way to check the current status of graph-based compound generation.
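As a toy illustration of the graph view (plain Python, with a hypothetical ethanol example; real generators use much richer node and edge features):

```python
# Represent ethanol (CCO) as a simple molecular graph:
# atoms are nodes, bonds are edges carrying a bond order.
atoms = {0: "C", 1: "C", 2: "O"}    # node id -> element
bonds = [(0, 1, 1.0), (1, 2, 1.0)]  # (begin atom, end atom, bond order)

# Build an adjacency list, the structure a GNN message-passes over.
adjacency = {i: [] for i in atoms}
for begin, end, order in bonds:
    adjacency[begin].append((end, order))
    adjacency[end].append((begin, order))

print(adjacency[1])  # the central carbon is bonded to both neighbors
```

A SMILES-based model instead has to learn that the string "CCO" encodes this same connectivity, including ring-closure digits and branch parentheses for larger molecules.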

Their proposed platform, named 'GraphINVENT', uses six different GNNs! Wow, it seems very complex to me. And the implementation uses RDKit and PyTorch without relying on any graph NN framework such as PyG or DGL.

They compared several compound generation algorithms with MOSES and discussed the advantages and disadvantages of a GGNN-based compound generator. GraphINVENT can generate valid molecules, but tuning its hyperparameters is difficult.

Fortunately the authors disclosed the source code, so I tried GraphINVENT myself.

At first I made a new conda env for the test. The details are described in the original repository.

$ git clone
$ cd GraphINVENT
$ conda env create -f environments/GraphINVENT-env.yml
$ conda activate GraphINVENT-env

Now I had a test env for GraphINVENT. I then modified the default settings, because my PC's GPU (GeForce GTX 1650) doesn't have enough memory for GPU-based learning with the default settings. I also edited the python paths shown below.

GraphINVENT$ vim graphinvent/parameters/
# edit batch_size and block_size to smaller values (around line 96)
model_common_hp_dict = {
    "batch_size": 100,   # 1000 -> 100
    "block_size": 1000,  # 100000 -> 1000

GraphINVENT$ vim 
# set paths here
python_path = f"/home/iwatobipen/miniconda3/envs/GraphINVENT-env/bin/python"
graphinvent_path = f"./graphinvent/"
data_path = f"./data/"

Now we are ready to run. Move to the graphinvent folder and run:

GraphINVENT$ cd graphinvent
GraphINVENT/graphinvent$ python
* Running job using HDF datasets located at /home/iwatobipen/dev/GraphINVENT/data/gdb13_1K/
* Checking that the relevant parameters match those used in preprocessing the dataset.
-- Job parameters match preprocessing parameters.
* Run mode: 'train'
* Loading preprocessed training set.
-- time elapsed: 0.00075 s
* Loading preprocessed validation set.
-- time elapsed: 0.00123 s
* Loading training set properties.
* Touching output files.
* Defining model and optimizer.
-- Initializing model from scratch.
-- Defining optimizer.
* Beginning training.
* Training epoch 1.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.05it/s]
* Training epoch 2.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.10it/s]
* Training epoch 3.
 99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 115/116 [00:06<00:00, 18.78it/s]Learning rate has reached minimum lr.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 116/116 [00:06<00:00, 17.93it/s]
* Generating 100 molecules.
Batch 0 of 0
102it [00:00, 325.18it/s]                                                                                                                                   
Generated 102 molecules in 0.3228 s
--316.01 molecules/s
* Evaluating model.
-- Calculating NLL statistics for validation set.
-- Calculating NLL statistics for training set.
* Saving model state at Epoch 100.
-- time elapsed: 653.27597 s

After training, the generated molecules are stored in the GraphINVENT/output/generation folder. OK, let's check the structures.
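Each epoch writes its own `.smi` file, so picking up the latest one can be done programmatically rather than hard-coding the epoch number (a sketch; the `epoch<N>_<batch>.smi` naming is assumed from my run, and a plain lexicographic sort would wrongly rank `epoch50` after `epoch100`):

```python
import re
from pathlib import Path

def latest_generation_file(gen_dir):
    """Return the .smi file from the highest epoch, or None if none exist."""
    def epoch_num(p):
        # Extract N from a filename like epoch100_0.smi.
        m = re.match(r"epoch(\d+)_", p.name)
        return int(m.group(1)) if m else -1
    files = list(Path(gen_dir).glob("epoch*_*.smi"))
    return max(files, key=epoch_num, default=None)

# e.g. latest_generation_file("output/generation")
```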

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdDepictor
## At first, I loaded the training set molecules.
trainsmi = []
with open('data/gdb13_1K/test.smi', 'r') as train:
    for i, l in enumerate(train):
        if i == 0:
            continue  # skip the header line
        trainsmi.append(l.split(' ')[0])

## These compounds were generated by the GGNN
gensmi = []
with open('output/generation/epoch100_0.smi', 'r') as gen:
    for i, l in enumerate(gen):
        if i == 0:
            continue  # skip the header line
        gensmi.append(l.split(' ')[0])

trainmol = [Chem.MolFromSmiles(smi) for smi in trainsmi]
genmol = [Chem.MolFromSmiles(smi) for smi in gensmi]

Draw.MolsToGridImage(trainmol[:20], molsPerRow=5)
Draw.MolsToGridImage(genmol[50:], molsPerRow=5)
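Beyond eyeballing the grids, simple set arithmetic on the SMILES lists gives two MOSES-style statistics, uniqueness and novelty (a sketch with hypothetical toy SMILES; validity would additionally need RDKit parsing, and canonical SMILES should be used for a fair comparison):

```python
def uniqueness(generated):
    """Fraction of generated SMILES that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training):
    """Fraction of distinct generated SMILES not present in the training set."""
    gen_set, train_set = set(generated), set(training)
    return len(gen_set - train_set) / len(gen_set)

# Toy example with hypothetical SMILES strings.
gen = ["CCO", "CCO", "CCN", "c1ccccc1"]
train = ["CCO", "CCC"]
print(uniqueness(gen))      # 3 distinct / 4 generated = 0.75
print(novelty(gen, train))  # 2 of the 3 distinct ones are unseen
```

The same calls work on the `trainsmi` and `gensmi` lists loaded above.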

Here are some test molecules. The GDB13 molecules seem to have undruggable structures.

The next compounds are generated molecules.

Hmm…. Some compounds have large rings (7-8 membered) and unstable substructures. The example molecules in the article show more druggable structures, so the result depends on the training dataset and which algorithm is used within GraphINVENT. I could confirm that the code works well on my PC, but I should train with a more suitable dataset and better GPUs for a proper evaluation.
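The large-ring observation can be checked systematically with RDKit's ring info instead of by eye (a sketch; cyclooctane here is just a hypothetical example, and the commented line shows how it would apply to the `genmol` list above):

```python
from rdkit import Chem

def max_ring_size(mol):
    """Largest ring size in a molecule (0 if acyclic)."""
    rings = mol.GetRingInfo().AtomRings()
    return max((len(ring) for ring in rings), default=0)

# Hypothetical example: cyclooctane has an 8-membered ring.
mol = Chem.MolFromSmiles("C1CCCCCCC1")
print(max_ring_size(mol))  # 8

# Flag generated molecules with 7+ membered rings:
# large_ring_mols = [m for m in genmol if max_ring_size(m) >= 7]
```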

Which do you like, RNN-based compound generation or GGNN-based compound generation? ;)


Published by iwatobipen

I'm a medicinal chemist in a mid-size pharmaceutical company. I love chemoinfo, coding, organic synthesis, and my family.
