Think about de novo molecule generation #memo #journal #RDKit #CReM

Recently there have been many publications about de novo molecular generators, most of which use deep learning. One problem with this approach is that the generated molecules are not systematic, so it’s difficult to synthesize them with parallel chemistry. I think that is why chemists sometimes dislike the proposals generated by these methods.

Rule-, reaction- (Rxn) or MMP-based molecule generation is another approach. It’s based on more chemist-friendly rules. These methods are not new, but they are useful, and related approaches are still being reported these days.

A few days ago I found a new article in J. Cheminform. The title is ‘CReM: chemically reasonable mutation framework for structure generation’. URL is below.

The author proposes a new workflow for mutating molecular frameworks, and it looks like the MMP approach. It breaks molecules down into fragments together with their local context (radius 1-5) to build a database of interchangeable fragments, stored as MMP-like key-value structures. The data is then used by the ‘MUTATE’, ‘GROW’ and ‘LINK’ operations to generate new structures.
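To make the key-value idea concrete, here is a toy sketch in plain Python. It is not CReM's actual schema or code: the dictionary layout, the context strings and the fragment SMILES are all made up for illustration. The only point is that two fragments count as interchangeable when they were observed in the same context at the chosen radius.

```python
from collections import defaultdict

# Hypothetical observations: (radius, context, fragment).
# In a real database the context would be a canonical SMILES of the atoms
# within `radius` bonds of the attachment point; these are hand-written
# placeholder strings, not data from CReM.
OBSERVATIONS = [
    (1, "c[*:1]", "[*:1]C"),
    (1, "c[*:1]", "[*:1]OC"),
    (1, "c[*:1]", "[*:1]Cl"),
    (1, "C[*:1]", "[*:1]O"),
    (2, "cc([*:1])c", "[*:1]C"),
    (2, "cc([*:1])c", "[*:1]OC"),
    (2, "cc([*:1])n", "[*:1]Cl"),
]


def build_db(observations):
    """Group fragments by the (radius, context) key they were seen in."""
    db = defaultdict(set)
    for radius, context, fragment in observations:
        db[(radius, context)].add(fragment)
    return db


def replacements(db, radius, context, fragment):
    """Fragments seen in the same context, excluding the fragment itself."""
    return sorted(db.get((radius, context), set()) - {fragment})


db = build_db(OBSERVATIONS)
print(replacements(db, 1, "c[*:1]", "[*:1]C"))      # → ['[*:1]Cl', '[*:1]OC']
print(replacements(db, 2, "cc([*:1])c", "[*:1]C"))  # → ['[*:1]OC']
```

In this toy data the radius 2 key is more specific than the radius 1 key, so fewer replacements match it.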

I felt that the article is very similar to the following ACS article reported by Kawai et al.

They proposed a similar approach to molecular generation with a fragment database.

Comparing these approaches, I think the main difference is that CReM lets you set the context radius. This setting affects the features of the generated molecules.

In Fig. 4 and Fig. 5 of the J. Cheminform article, the author shows the properties of molecules generated with different radii. For example, the novelty and diversity scores decrease when a large radius (5) is used. It means that compounds with more similar contexts are generated with that setting.
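For readers who have not met these two scores, here is a minimal sketch of how they are commonly computed. The definitions are assumptions on my side (novelty as the fraction of generated structures absent from a reference set, diversity as 1 minus the mean pairwise Tanimoto similarity) and may differ in detail from the ones used in the article. Fingerprints are represented as plain Python sets of "on" bits, and the bit sets below are made up.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def novelty(generated, reference):
    """Fraction of unique generated SMILES that are absent from the reference."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(reference)) / len(unique)


def diversity(fingerprints):
    """1 - mean pairwise Tanimoto similarity (internal diversity)."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Made-up bit sets: a tight cluster of analogues and a scattered set.
tight = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}]
loose = [{1, 2}, {3, 4}, {5, 6}]

print(round(diversity(tight), 2))              # → 0.4
print(round(diversity(loose), 2))              # → 1.0
print(novelty(["CCO", "CCN", "CCO"], ["CCO"]))  # → 0.5
```

In practice the bit sets would come from real fingerprints, for example RDKit Morgan fingerprints.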

As you know, the CReM author has released the implementation, so let’s use it. It is easy to install because crem has very few dependencies, just rdkit and guacamol (optional). First I installed CReM with pip and downloaded the ready-to-use DB.

$ git clone
$ cd crem
$ pip install .
# get data set Thanks for providing the db!
$ wget
$ gzip -d replacements02_sc2.5.db.gz

I uploaded example code to my gist.


With this dataset, structure generation took a few minutes. After generating the molecules, RDKit can render them with its Draw functions.
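Since the gist itself is not shown here, the following is only a sketch of what such a run might look like, not the original code. `mutate_mol` is CReM's mutation function and `Draw.MolsToGridImage` is RDKit's grid renderer; the query SMILES, the radii, `max_replacements=50` and the output file names are assumptions chosen for illustration.

```python
from pathlib import Path

DB_PATH = Path("replacements02_sc2.5.db")  # the file unpacked with gzip above
QUERY_SMILES = "c1ccccc1C(=O)NC"           # placeholder query (assumption)


def mutate_by_radius(smiles, db_path, radii=(1, 3), max_replacements=50):
    """Return {radius: sorted unique SMILES} generated by CReM's mutate_mol."""
    # Imported lazily so this file can be loaded without RDKit/CReM installed.
    from rdkit import Chem
    from crem.crem import mutate_mol

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")

    results = {}
    for radius in radii:
        # mutate_mol yields SMILES strings by default.
        generated = mutate_mol(mol, db_name=str(db_path), radius=radius,
                               max_replacements=max_replacements)
        results[radius] = sorted(set(generated))
    return results


def draw_grid(smiles_list, out_file, mols_per_row=5):
    """Render the structures as a grid image and save it to out_file."""
    from rdkit import Chem
    from rdkit.Chem import Draw

    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    mols = [m for m in mols if m is not None]
    # When run as a plain script this returns a PIL image.
    img = Draw.MolsToGridImage(mols, molsPerRow=mols_per_row,
                               subImgSize=(200, 200))
    img.save(out_file)


if __name__ == "__main__":
    if DB_PATH.exists():
        for radius, smiles in mutate_by_radius(QUERY_SMILES, DB_PATH).items():
            print(f"radius={radius}: {len(smiles)} unique structures")
            if smiles:
                draw_grid(smiles[:20], f"crem_radius{radius}.png")
    else:
        print(f"{DB_PATH} not found; download and unpack it first")
```

Running the same query at a small and a large radius and comparing the two grids side by side is the easiest way to see the effect of the context size.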

It seems that radius=1 generates a more diverse set of compounds. CReM is easy to use for molecular generation.

OK, now we can use both deep learning-based and rule-based structure generators. Each method has pros and cons. As the author says, CReM can generate chemically reasonable structures, but it can’t generate new rings that aren’t in the fragment DB.

Which is the better proposal for medchem: a new structure built from known fragments, or a new structure with novel fragments?

It depends on the situation, but novel fragments require new chemistry or many wet experiments. AI-driven drug discovery can’t replace all wet experiments with dry ones. Which molecules do you make first, and which next? Experimental design is key for many projects.

Have a nice weekend. ;)


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.

8 thoughts on “Think about de novo molecule generation #memo #journal #RDKit #CReM”

  1. Just found your blog!

    Agree with you on Rules Based Design, how about using reinforcement learning?

    Also regarding validity, are you aware of SELFIES? It’s an alternative to SMILES that’s apparently 100% valid at all times as an encoding approach.

    1. Sure! I read about SELFIES but I haven’t tested it. I haven’t tested DeepSMILES in an actual project either. Do you have any experience with them?

      Reinforcement learning will work for de novo molecule generation. However, it is difficult to set the reward in a real drug discovery project. The reward changes as the project progresses. As you know, drug discovery is not a closed environment.

      1. I’ve spoken to the creator of SELFIES and it seems very watertight – they’re of the opinion that the issue of uncertainty is solved, and I am keen to agree. In the paper, the VAE-GAN seems to work fine.

        There are a few papers on the topic (you’ve followed my GitHub so you should see some of the repositories) by Popova et al. or Zhou et al. The methodology is sound, but you are completely correct – rewards are hard to identify. QED seems to just serve as a sanity check!

      2. I installed SELFIES on my PC and am now checking the related article. It seems fun.
        If I have the chance and time, I’ll write a post about it.
