I attended the RDKit UGM last week, and I think it was well worth going. At the meeting I picked up some useful information about de novo molecular design. The slide deck is available at the following URL: https://github.com/rdkit/UGM_2019/blob/master/Presentations/Montanari_Winter_Utilizing_in_silico_models.pdf
The authors used Particle Swarm Optimization (PSO) for de novo molecular design. PSO is a very simple method for parameter optimization. The details are described on Wikipedia.
The update rule of PSO loosely reminded me of Q-learning. PSO tries to improve an objective function by updating the velocity, and with it the position, of each particle at every iteration.
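To make the velocity update concrete, here is a minimal sketch of a textbook global-best PSO in plain Python. It is not the mso implementation: the function name `pso_minimize`, the parameter names and their default values, and the `sphere` test function are all assumptions chosen for illustration. This sketch minimises its objective, whereas a molecular optimizer would maximise a score.

```python
import random


def pso_minimize(objective, dim, n_particles=30, n_steps=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimise `objective` with a basic global-best particle swarm (illustrative only)."""
    rng = random.Random(seed)
    low, high = bounds
    # Random start positions, zero start velocities.
    pos = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Best position and value seen so far by each particle.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    # Best position and value seen so far by the whole swarm.
    g = min(range(n_particles), key=best_val.__getitem__)
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]

    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia + pull towards own best + pull towards swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (swarm_pos[d] - pos[i][d]))
                # Move the particle and keep it inside the search box.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
    return swarm_pos, swarm_val


def sphere(point):
    """Toy objective with its minimum (0.0) at the origin."""
    return sum(x * x for x in point)


best_x, best_value = pso_minimize(sphere, dim=2)
print(best_x, best_value)
```

In the molecular setting, each particle position would be a point in the latent space of the encoder rather than a 2D coordinate, and the objective would be a score computed on the decoded molecule.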
Fortunately, the PSO algorithm for molecular generation is available on GitHub. ;)
So I installed mso from GitHub and tried it out.
First I installed cddd, because mso depends on it. cddd encodes molecules into a latent space and decodes points in that latent space back into molecules.
The cddd README asks for tensorflow==1.10, but it also worked for me with tensorflow==1.13. Then I installed mso. Both libraries can be installed from GitHub.
After installing the packages, I was ready to use mso. For convenience, I used the pretrained model provided by the cddd repo. I downloaded the default model from the following URL, stored it in the ./cddd/data folder and unzipped it. URL: https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h
Now everything is ready, so let’s try it.
MSO optimizes particle positions against objective functions. The package provides many of them, such as QED, substructure matching, structural alerts, etc.
Multiple scoring functions can also be combined.
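As a rough illustration of what combining scoring functions can mean, the sketch below merges several per-objective scores into one value with a weighted mean. This is an assumption made for illustration and is not taken from the mso source, whose actual aggregation may differ. The helper name `combine_scores` and the example weights are hypothetical.

```python
def combine_scores(scores, weights):
    """Weighted mean of per-objective scores (hypothetical helper, not part of mso).

    scores  -- dict mapping objective name to a score, assumed to lie in [0, 1]
    weights -- dict mapping objective name to a positive weight
    """
    total_weight = sum(weights[name] for name in scores)
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Example: a molecule with a QED score of 0.8 that contains the wanted substructure.
example_scores = {"qed": 0.8, "subst": 1.0}
equal = combine_scores(example_scores, {"qed": 1.0, "subst": 1.0})
favour_subst = combine_scores(example_scores, {"qed": 1.0, "subst": 3.0})
print(equal, favour_subst)
```

With a scheme like this, raising the weight of one objective pulls the combined value towards that objective's score.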
The following example is simple. First, I used the substructure and QED functions. The substructure function returns 1 if the generated molecule contains a user-defined substructure. This is useful because RNN-based generators often produce molecules more or less at random, which makes it difficult to keep a specific substructure.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from mso.optimizer import BasePSOptimizer
from mso.objectives.scoring import ScoringFunction
from mso.objectives.mol_functions import qed_score
from mso.objectives.mol_functions import sa_score
from mso.objectives.mol_functions import substructure_match_score
from functools import partial
from cddd.inference import InferenceModel

sub_match = partial(substructure_match_score, query=Chem.MolFromSmiles('c1ccncc1'))
init_smiles = "c1c(C)cccc1"
scoring_functions = [ScoringFunction(func=qed_score, name="qed", is_mol_func=True),
                     ScoringFunction(func=sub_match, name='subst', is_mol_func=True)]
The substructure_match_score function requires several arguments, including the query substructure. To use it as a scoring function, I used functools.partial to freeze the query argument, and then passed the resulting functions to the ScoringFunction class.
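For readers who have not used `functools.partial` before, here is a small standalone sketch of the same pattern. The scorer `contains_fragment_score` is a toy stand-in invented for this example: it only checks for a text fragment in the SMILES string, which is not a real chemical substructure match.

```python
from functools import partial


def contains_fragment_score(smiles, fragment):
    """Toy stand-in scorer: 1.0 if `fragment` occurs in the SMILES text, else 0.0.

    This is plain string matching, NOT a chemical substructure search.
    """
    return 1.0 if fragment in smiles else 0.0


# Freeze the `fragment` argument so the result takes a single argument,
# which is the shape an optimizer expects from a scoring function.
has_ncc_text = partial(contains_fragment_score, fragment="ncc")

print(has_ncc_text("Cc1ccncc1"))
print(has_ncc_text("Cc1ccccc1"))
```

The frozen function can be called with only the remaining argument, just as `sub_match` is called with only the molecule.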
infermodel = InferenceModel()
opt = BasePSOptimizer.from_query(
    init_smiles=init_smiles,
    num_part=200,
    num_swarms=1,
    inference_model=infermodel,
    scoring_functions=scoring_functions)
This defines the optimizer instance. num_part is the number of particles, which corresponds to the number of molecules. Finally, I ran the optimization.
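res = opt.run(20)
# The index appears to have been lost in the original listing; with num_swarms=1
# the first (and only) swarm is assumed here.
res0 = res[0]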
res0 holds the best-scoring SMILES along with the other generated SMILES. I retrieved them and drew the molecules. The fitness is a normalized score.
mols = []
for idx, smi in enumerate(res0.smiles):
    # MolFromSmiles sanitizes by default and returns None for invalid SMILES
    mol = Chem.MolFromSmiles(smi)
    if mol is not None and res0.fitness[idx] > 0.7:
        print(smi)
        mols.append(mol)
OK, the generated molecules contain the user-defined substructure. However, the compounds look quite similar to each other. Does that depend on the training set?
The opt instance keeps the fitness history as a pandas DataFrame. Fortunately, RDKit has PandasTools. ;)
from IPython.display import HTML
from rdkit.Chem import PandasTools

PandasTools.AddMoleculeColumnToFrame(opt.best_fitness_history, smilesCol='smiles')
HTML(opt.best_fitness_history.to_html())
It’s very easy. The MSO optimizer does not need a GPU, so it should run quickly on a machine with many CPU cores.
MSO seems to be a useful package for molecular generation. Users can also define their own objective functions. I will dig into it more deeply.
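As a hedged sketch of what a user-defined objective might look like, the function below scores a molecule by how close its heavy atom count is to a target. The name `heavy_atom_score` and the parameters `target` and `width` are hypothetical and not part of mso. The only RDKit call it relies on is `Mol.GetNumHeavyAtoms()`, and it assumes a molecule-level scorer simply takes an RDKit molecule and returns a float, as `qed_score` does in the example above.

```python
def heavy_atom_score(mol, target=25, width=10):
    """Hypothetical objective: 1.0 at `target` heavy atoms, falling linearly to 0.0.

    mol    -- an RDKit molecule (anything with a GetNumHeavyAtoms() method)
    target -- preferred number of heavy atoms (illustrative default)
    width  -- distance from `target` at which the score reaches 0.0
    """
    n_heavy = mol.GetNumHeavyAtoms()
    return max(0.0, 1.0 - abs(n_heavy - target) / width)
```

Following the pattern used earlier in this post, such a function could presumably be wrapped as ScoringFunction(func=heavy_atom_score, name="heavy_atoms", is_mol_func=True), with functools.partial used to change target or width.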
Today’s code is uploaded to gist.