My summer vacation starts today. Because of the pandemic, we don’t have any travel plans for this summer ;( I hope the situation gets better soon….
As readers know, SMILES-based de novo design has recently been used not only for material design but also for drug discovery projects. Some years ago, this approach generated many invalid molecules because it is difficult to learn the grammar of SMILES. Recently, however, RNN-based approaches work very well, and other approaches such as GANs, graph-based methods and image-based methods(???) work well too. Chemoinformaticians can now generate focused compound sets with an RNN generator and the transfer learning technique.
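To give a feeling for what "SMILES grammar" means, here is a small sketch of my own (it is not from the article). It only checks two rules that a generator has to learn: branch parentheses must be balanced and ring-closure digits must come in pairs. The function name `looks_syntactically_valid` is hypothetical, and this is a toy check, not a real parser. In practice a toolkit such as RDKit should be used to validate generated SMILES.

```python
def looks_syntactically_valid(smiles: str) -> bool:
    """Toy check: balanced branches and paired ring-closure digits.

    Assumption: this is NOT a full SMILES parser. It ignores valence,
    aromaticity and atom symbols, and it does not treat two-digit
    ring closures (%nn) specially.
    """
    depth = 0            # current nesting level of branch parentheses
    open_rings = set()   # ring-closure digits that are opened but not yet closed
    in_bracket = False   # inside [...] digits mean isotope/charge, not rings
    for ch in smiles:
        if ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif in_bracket:
            continue
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:        # a branch is closed before it was opened
                return False
        elif ch.isdigit():
            # first occurrence opens the ring, second occurrence closes it
            open_rings ^= {ch}
    return depth == 0 and not open_rings and not in_bracket


for smi in ["c1ccccc1", "c1ccccc", "CC(C)C", "CC(C"]:
    print(smi, looks_syntactically_valid(smi))
```

A string like `c1ccccc` fails because ring 1 is never closed, and `CC(C` fails because the branch is never closed. A character-by-character generator has to remember these open rings and branches over long distances, which is one reason why early models produced many invalid strings.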
I would like to introduce a nice article that gives guidelines for SMILES-based generators.
The authors investigated the effect of the data set and the number of epochs on transfer learning. They used REINVENT (an RNN-based generator) and built a base model with a ChEMBL data set. Then they performed transfer learning with several kinds of specific data sets, such as target-focused data, patent data, etc.
I won’t describe the details of the article here. If you are interested in it, please check it out ;)
Their results are interesting to me. They indicate that a model trained on a large, general compound data set can generate diverse, valid molecules, and that it can also learn the features (distribution) of specific compounds from a small compound set, for example macrocyclic compounds or compounds containing a spirocyclic block.
It means that to build a focused library generator, users don’t need to prepare a large focused training data set. What they need is a general data set for learning the SMILES grammar and a small data set for transfer learning.
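The two-step idea (general data first, then a small focused set) can be sketched with a toy model. Please note the assumptions: this is not REINVENT and not an RNN. I swapped in a simple character-bigram count model so the example runs with the standard library only, and the SMILES lists, `update` and `prob` are made up for illustration. It only shows how a few passes over a small focused set shift the learned distribution.

```python
from collections import defaultdict

START, END = "^", "$"  # assumed markers for the start and end of a SMILES string


def update(counts, smiles_list, epochs=1):
    """Add character-bigram counts; one epoch is one pass over the data."""
    for _ in range(epochs):
        for smi in smiles_list:
            chars = [START, *smi, END]
            for prev, nxt in zip(chars, chars[1:]):
                counts[prev][nxt] += 1
    return counts


def prob(counts, prev, nxt):
    """Estimated probability that character `nxt` follows `prev`."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


# Hypothetical data: a "general" set and a small fluorine-rich "focused" set.
general = ["CCO", "CCN", "CCCC", "c1ccccc1", "CC(=O)O", "CCOC"]
focused = ["FC(F)F", "CC(F)(F)F"]

model = defaultdict(lambda: defaultdict(int))
update(model, general)                # step 1: base model on general data
before = prob(model, "(", "F")
update(model, focused, epochs=5)      # step 2: continue training on focused data
after = prob(model, "(", "F")
print(f"P(F after '(') before: {before:.3f}, after: {after:.3f}")
```

In this toy case the base model has never seen fluorine in a branch, and after the focused passes that transition dominates. A real RNN generator behaves in a much more subtle way, and the number of epochs is exactly the kind of setting that the article studies.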
Now we can use many open source de novo compound generator algorithms and techniques. Is there a best way to do de novo design? No, it depends on our situation and requirements ;)
…. Many publications and codes are available these days…. I need to keep studying and keep my eyes open……