In 2012, Lilly's researchers published the Lilly MedChem Rules in J. Med. Chem. and released their code on GitHub. Since then, the rules have been used in many papers and chemoinformatics applications. Open-source tools have made a big impact on chemoinformatics. Several hours ago I found an interesting tweet from @jcheminf.
They reported an algorithm for retrosynthesis. Data-driven retrosynthetic analysis is a hot topic in chemoinformatics, I think.
The article comes from Lilly, and the authors uploaded the source code to GitHub. The URL is below. https://github.com/EliLillyCo/LillyMol
Their implementation is different from Segler's approach, 'Learning to Plan Chemical Syntheses'. They do not use machine learning; instead they use a Reverse Reaction Template (RRT) based approach. An RRT defines a reaction rule, and it is extracted from atom-mapped reaction data such as Lowe's US patent dataset.
First they built an RRT repository. Once the repository exists, a researcher inputs a query structure, and the system searches for RRTs that are applicable to the query and records each one that matches.
The key point is how to make the RRTs, I think. More details are described in the article.
They benchmarked their system with 919 known drug structures from DrugBank. The performance seems to depend on the RRT settings, radius and support. A radius-0 RRT seems more general than a radius-2 RRT, which is similar to how ECFP behaves (Table 2).
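To get a feeling for why a smaller radius gives a more general template, here is a minimal sketch in plain Python. It is not the LillyMol implementation: the `environment` function, the toy atom/bond lists and the idea of comparing sorted element symbols per shell are all my own assumptions for illustration, and bond orders are ignored.

```python
from collections import deque


def environment(atoms, bonds, center, radius):
    """Signature of the atoms within `radius` bonds of `center`.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Returns one sorted tuple of symbols per shell (distance 0..radius).
    Hypothetical helper for illustration only.
    """
    neighbors = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    # Breadth-first search from the center, not expanding past `radius`.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        a = queue.popleft()
        if dist[a] == radius:
            continue
        for b in neighbors[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)

    shells = [[] for _ in range(radius + 1)]
    for idx, d in dist.items():
        shells[d].append(atoms[idx])
    return tuple(tuple(sorted(s)) for s in shells)


# Toy heavy-atom graphs: N-methylacetamide (carbonyl C is index 1)
# and N-methylbenzamide (carbonyl C is index 6).
acetamide = (["C", "C", "O", "N", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)])
benzamide = (
    ["C", "C", "C", "C", "C", "C", "C", "O", "N", "C"],
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
     (0, 6), (6, 7), (6, 8), (8, 9)],
)

for radius in (0, 1, 2):
    same = environment(*acetamide, 1, radius) == environment(*benzamide, 6, radius)
    print(f"radius {radius}: same environment = {same}")
```

In this toy case the two carbonyl carbons look identical at radius 0 and 1 and only differ at radius 2, so a template keyed on the radius-2 environment would apply to fewer molecules. That is the same trade-off as choosing the radius of an ECFP fingerprint.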
After reading the article, I wanted to use the code. OK, let's try it!
I checked the repo last year, but at that time the code supported Linux only. Now it supports not only Linux but also OS X. ;-)
It is easy to install the toolkit. OK, let's install it and use it.
For installation, gcc >= 6.2.0 and zlib >= 1.2.11 are required, so I installed them with Homebrew.
iwatobipen$ brew install zlib
iwatobipen$ brew install gcc
Then clone the repository and change the ZLIB line in makefile.public.OSX-gcc-8. I installed zlib via Homebrew, so I changed ZLIB to '/usr/local/Cellar/zlib….'.
All of the code is implemented in C++, and it does not use any chemoinformatics packages such as RDKit, Open Babel or CDK!! @_@
iwatobipen$ git clone https://github.com/EliLillyCo/LillyMol.git
iwatobipen$ cd LillyMol
iwatobipen$ vim makefile.public.OSX-gcc-8
###
-- ZLIB = /usr/local/opt/zlib/lib
++ ZLIB = /usr/local/Cellar/zlib/1.2.11/lib
###
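If you prefer not to edit the makefile by hand, the same change can be scripted. The following is only a sketch: `set_zlib_path` is a helper name I made up, and the three-line makefile text is a stand-in, not the real content of makefile.public.OSX-gcc-8.

```python
import re


def set_zlib_path(makefile_text, zlib_lib_dir):
    """Return makefile_text with the first 'ZLIB = ...' line pointing at zlib_lib_dir.

    Hypothetical helper; only the value after 'ZLIB =' is replaced.
    """
    pattern = re.compile(r"^([ \t]*ZLIB[ \t]*=[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in the path.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + zlib_lib_dir, makefile_text, count=1
    )
    if count == 0:
        raise ValueError("no 'ZLIB =' line found")
    return new_text


# Stand-in makefile text (assumed, for illustration only).
example = "CXX = g++\nZLIB = /usr/local/opt/zlib/lib\nLIBS = -lz\n"
patched = set_zlib_path(example, "/usr/local/Cellar/zlib/1.2.11/lib")
print(patched)
```

To apply it for real, read makefile.public.OSX-gcc-8 into a string, pass it through the function with the path of your own zlib installation, and write the result back.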
Now we are ready! After changing the makefile, run makeall.sh. After waiting several minutes, the build will finish. All commands are generated in ./bin/OSX-gcc-8/. Many commands are provided.
iwatobipen$ cd bin/OSX-gcc-8/
iwatobipen$ ls
activity_consistency  iwcut               msort             ring_extraction  rxn_substructure_search  trxn
common_names          iwdemerit           preferred_smiles  ring_trimming    smiles_mutation          tsubstructure
concat_files          mol2qry             random_smiles     rotatable_bonds  sp3_filter               unique_molecules
fetch_smiles_quick    molecular_scaffold  retrosynthesis    rxn_signature    tautomer_generation      unique_rows
fileconv              molecule_subset     rgroup            rxn_standardize  tp_first_pass
Details of the commands are described in the wiki page.
I checked the retrosynthesis code with the example data. Setting the options was a little difficult for me.
iwatobipen$ cd ./example/retrosynthesis
iwatobipen$ cat 1Cmpds.smi
> C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O
iwatobipen$ ../../bin/OSX-gcc-8/retrosynthesis -Y all -X kg -X kekule -X ersfrm -a 2 -q f -v -R 1 -I CentroidRxnSmi_1 -P UST:AZUCORS 1Cmpds.smi >log.txt 2>err.txt
Check log.txt and err.txt.
iwatobipen$ cat log.txt
C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O PARENT
O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1 via US03992389_NA CentroidRxnSmi_1 R 1 ALL
Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1 via US03992389_NA CentroidRxnSmi_1 R 1 SPFRM.1
Oc1ccc(NC(=O)C)cc1 via US03992389_NA CentroidRxnSmi_1 R 1
O=C via US03992389_NA CentroidRxnSmi_1 R 1
O=C(NC1CCNCC1)c1ccccc1 via US03992389_NA CentroidRxnSmi_1 R 1
iwatobipen$ cat err.txt
Will not write product fragments with fewer than 2 atoms
Will keep going after an individual test failure
Will preserve Kekule forms
Will use the reaction file name as the reaction name
Reading reactions took 0 seconds
read mol smi eof
Read 1 molecules, 1 deconstructed
1 molecules deconstructed at radius 1
0 deconstructions done at radius 0
1 deconstructions done at radius 1
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
2 molecules deconstructed at radius 1
2 molecules deconstructed
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
1 US03947458_NA 1 searches, 0 matches found
1 US03947473_NA 1 searches, 0 matches found
1 US03989717_NA 1 searches, 0 matches found
----- snip ;-) ------
1 US20160002218A1_0322 1 searches, 0 matches found
1 US20160200725A1_0864 1 searches, 0 matches found
2 molecules deconstructed at radius 1
2 molecules deconstructed
163 reactions had 0 hits
1 reactions had 1 hits
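The log lines look regular enough to parse, so here is a small sketch that turns them into Python dictionaries. The layout `<SMILES> via <reaction id> <reaction set> R <radius> [tag]` is only my reading of the output above, not a documented format, and the key names (`precursors`, `tag` and so on) are my own. I do not know the official meaning of trailing tokens such as ALL or SPFRM.1, so the parser just keeps them as-is.

```python
def parse_retro_log(text):
    """Parse retrosynthesis log lines into a list of dicts.

    Assumed line layout (inferred from one example run):
        <SMILES> via <reaction_id> <reaction_set> R <radius> [tag]
    Lines without ' via ' (such as the query line) are skipped.
    """
    records = []
    for line in text.splitlines():
        if " via " not in line:
            continue
        smiles, rest = line.split(" via ", 1)
        fields = rest.split()
        # Expect at least: reaction_id, reaction_set, 'R', radius
        if len(fields) < 4 or fields[2] != "R":
            continue
        records.append({
            "precursors": smiles.strip().split("."),  # dot-separated fragments
            "reaction_id": fields[0],
            "reaction_set": fields[1],
            "radius": int(fields[3]),
            "tag": fields[4] if len(fields) > 4 else None,
        })
    return records


log = """\
C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O PARENT
O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1 via US03992389_NA CentroidRxnSmi_1 R 1 ALL
Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1 via US03992389_NA CentroidRxnSmi_1 R 1 SPFRM.1
O=C via US03992389_NA CentroidRxnSmi_1 R 1
"""

for rec in parse_retro_log(log):
    print(rec["reaction_id"], rec["radius"], rec["tag"], rec["precursors"])
```

With the records in this shape it is easy to group precursors by reaction ID, or to pass each fragment SMILES on to another tool.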
It is difficult for me to read SMILES strings directly. OK, let's visualize them with RDKit!
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

# Query molecule and the precursor sets reported in log.txt
parent = Chem.MolFromSmiles('C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O')
mol1 = Chem.MolFromSmiles('O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol3 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1')
mol4 = Chem.MolFromSmiles('O=C')
mol5 = Chem.MolFromSmiles('O=C(NC1CCNCC1)c1ccccc1')

# Draw everything in one grid image (shown inline in a Jupyter notebook)
Draw.MolsToGridImage([parent, mol1, mol2, mol3, mol4, mol5])
The example worked without any problems. But it failed when I used my own molecule as a query. One reason is that I used very limited training data: I used the default data for my test, which has only 164 reactions.
I would like to try making RRTs from a larger data set.