Particle Swarm Optimization for molecular design #RDKit #Chemoinformatics

I participated in the RDKit UGM last week. I think it was worth going. At the meeting I got useful information about de novo molecular design. You can find the slide deck at the following URL: https://github.com/rdkit/UGM_2019/blob/master/Presentations/Montanari_Winter_Utilizing_in_silico_models.pdf

They used Particle Swarm Optimization (PSO) for de novo molecular design. PSO is a very simple method for parameter optimization. Details are described on Wikipedia.

The update rule of PSO reminds me a little of Q-learning. PSO tries to improve the objective function by updating the velocity, and therefore the position, of each particle at every iteration.
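To make the velocity update concrete, here is a minimal sketch of a textbook global-best PSO in plain Python. It is not the mso implementation: the function name pso_minimize, the coefficient values and the toy sphere objective are all assumptions chosen for illustration. It also minimizes, whereas a molecular optimizer would maximize a fitness score.

```python
import random


def pso_minimize(objective, dim, n_particles=30, n_iter=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5,
                 bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a basic global-best particle swarm (illustrative only)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random starting positions, zero starting velocities.
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers the best position it has visited so far.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    # The swarm shares the best position found by any particle.
    g = min(range(n_particles), key=lambda i: best_val[i])
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # New velocity = inertia + pull to personal best + pull to swarm best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (swarm_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < swarm_val:
                    swarm_pos, swarm_val = pos[i][:], val
    return swarm_pos, swarm_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_position, best_value = pso_minimize(sphere, dim=3)
print(best_position, best_value)
```

In the molecular setting the position of a particle would be a point in the latent space of the encoder instead of a plain coordinate list, and the objective would score the decoded molecule.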

Fortunately, the PSO implementation for molecular generation is available on GitHub. ;)

So I installed mso from GitHub and tried it.

First I installed cddd, because mso depends on it. cddd encodes molecules into a latent space and decodes latent vectors back into molecules. The cddd README requires tensorflow==1.10, but it also worked with tensorflow==1.13. Then I installed mso. Both libraries can be installed from GitHub.

$ git clone https://github.com/jrwnter/cddd.git
$ cd cddd
$ pip install -e .
$ cd ../
$ git clone https://github.com/jrwnter/mso.git
$ cd mso
$ pip install -e .

After installing the packages I tried mso. For convenience, I used the pretrained model provided by the cddd repo. I downloaded the default model from the following URL, stored it in the ./cddd/data folder and unzipped it. URL: https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Now we are ready, so let’s try.

MSO optimizes particle positions against objective functions. The package provides many scoring functions, such as QED, substructure match and structural alerts.
Combinations of multiple scoring functions are also available.
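One common way to combine several objectives into a single fitness value is a weighted average of scores that have each been scaled to the range 0 to 1. The sketch below shows only that idea. The helper combine_scores and the example numbers are hypothetical, and I have not confirmed that mso computes its fitness exactly this way.

```python
def combine_scores(scores, weights):
    """Weighted average of per-objective scores that are already scaled to [0, 1]."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must use the same objective names")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical scores for one molecule: a QED value and a substructure hit.
scores = {"qed": 0.8, "subst": 1.0}
weights = {"qed": 1.0, "subst": 1.0}
fitness = combine_scores(scores, weights)
print(fitness)
```

With equal weights this is just the mean of the two scores. Raising one weight makes that objective dominate the fitness.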

The following example is simple. First, I used the substructure and QED functions. The substructure function returns 1 if the generated molecule contains the user-defined substructure. This is useful because RNN-based generators often generate molecules randomly, so it is difficult to keep a specific substructure.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

from mso.optimizer import BasePSOptimizer
from mso.objectives.scoring import ScoringFunction
from mso.objectives.mol_functions import qed_score
from mso.objectives.mol_functions import sa_score
from mso.objectives.mol_functions import substructure_match_score
from functools import partial
from cddd.inference import InferenceModel

sub_match = partial(substructure_match_score, query=Chem.MolFromSmiles('c1ccncc1'))


init_smiles = "c1c(C)cccc1"
scoring_functions = [
    ScoringFunction(func=qed_score, name="qed", is_mol_func=True),
    ScoringFunction(func=sub_match, name='subst', is_mol_func=True),
]

The substructure_match_score function requires several arguments, including the substructure query. To use it, I froze those arguments with functools.partial and then passed the resulting functions to the ScoringFunction class.
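If functools.partial is new to you, the following toy example shows what freezing an argument means. The function keyword_score is a made-up stand-in that works on plain strings, not the real substructure_match_score, so that it runs without RDKit or mso.

```python
from functools import partial


def keyword_score(text, query, hit=1.0, miss=0.0):
    """Toy stand-in for a scoring function that needs extra arguments."""
    return hit if query in text else miss


# Freeze `query` so the result takes only the object to be scored.
has_pyridine_word = partial(keyword_score, query="pyridine")

print(has_pyridine_word("3-methylpyridine"))
print(has_pyridine_word("toluene"))
```

The frozen function has the one-argument shape that a scoring wrapper expects, which is why partial is a convenient fit here.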

infermodel = InferenceModel()
opt = BasePSOptimizer.from_query(
    init_smiles=init_smiles,
    num_part=200,
    num_swarms=1,
    inference_model=infermodel,
    scoring_functions=scoring_functions)

Then I defined the optimizer instance. num_part is the number of particles, that is, the number of molecules. Finally, I ran the optimization.

res = opt.run(20)
res0 = res[0]

res0 holds the best-scoring SMILES and the other generated SMILES. I retrieved them and drew them. Fitness is the normalized score.


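mols = []
for idx, smi in enumerate(res0.smiles):
    # MolFromSmiles sanitizes by default and returns None for an invalid SMILES
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue
    # keep only molecules with a high normalized fitness
    if res0.fitness[idx] > 0.7:
        print(smi)
        mols.append(mol)
# draw the selected molecules as a grid
Draw.MolsToGridImage(mols, molsPerRow=5)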

OK, the generated molecules have the user-defined substructure. However, they all look similar to each other. Does that depend on the training set?

The opt instance keeps the fitness history as a pandas DataFrame. Fortunately, RDKit has PandasTools. ;)


from IPython.display import HTML
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(opt.best_fitness_history, smilesCol='smiles')
HTML(opt.best_fitness_history.to_html())

It’s very easy. The MSO optimizer does not use a GPU, so the process runs very fast if you have many CPU cores.

MSO seems to be a useful package for molecular generation. Users can also define their own objective functions. I will dig into it more deeply.

Today’s code is uploaded to a gist.

New version of RDKit

A few days ago, a new version of RDKit (2015.09.1) was released.
I had been looking forward to this release.
It includes a lot of bug fixes and new features. I really thank the developers.
Mac users can install RDKit using Homebrew! (It is not available from Anaconda yet.)
I installed RDKit using Homebrew and had no trouble on El Capitan.
One of the new functions is a PAINS filter implementation. (There are many more new functions, and I’ll check the API ASAP.)

Sample code follows.

from __future__ import print_function
from rdkit import rdBase
print(rdBase.rdkitVersion)
# 2015.09.1
from rdkit import Chem
from rdkit.Chem import FilterCatalog
params = FilterCatalog.FilterCatalogParams()
# select the PAINS filters (PAINS_A, PAINS_B, PAINS_C)
params.AddCatalog(params.FilterCatalogs.PAINS)
# build the catalog from the params
catalog = FilterCatalog.FilterCatalog(params)
# contains an azo group
mol = Chem.MolFromSmiles("c1ccccc1N=Nc1ccccc1")
# clean
mol2 = Chem.MolFromSmiles("c1ccccc1OC")
# check each mol
print(catalog.HasMatch(mol))
# True
print(catalog.HasMatch(mol2))
# False

# get the description of the first matching filter
ent = catalog.GetFirstMatch(mol)
print(ent.GetDescription())
# azo_A(324)

It works fine.
I think that RDKit is one of the best choices for chemoinformatics.

Coooooooool.