Psikit update/Draw ESP, HOMO LUMO #RDKit #Chemoinformatics #quantumchemistry

I just updated psikit which is package for quantum-chemoinformatics ;)

It can be installed from conda / pypi :)

I added and updated new function for molecular property rendering.

Current version of psikit can draw not only frontier orbital but also ESP and dual descriptor. Dual descriptor is calculated by psi4. What is dual descriptor? From original ducment.

Calculates the dual descriptor from frontier orbitals: 𝑓2(𝐫)=𝜌LUMO(𝐫)−𝜌HOMO(𝐫)f2(r)=ρLUMO(r)−ρHOMO(r). The dual descriptor is a good measure of nucleophilicity and electrophilicity, containing information essentially equivalent to both Fukui functions combined. More details on the dual descriptor itself can be found in [Morell:2005:205], while the current implementation is described in [Martinez-Araya:2015:451]. This feature is currently only supported for closed shell systems.

It is very easy to get these images!

Let’s test the functions with acetic acid as an example.

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first. For convenience, I recommend to install ipymol at first.

 
import ipymol
from psikit import Psikit
pk = Psikit()
v = ipymol.viewer
v.start() # pymol will launch

Calculate energy of acetic acid.

Then call getMOview() for getting some cube files.


pk.getMOview()

Now data preparation is finished. I can get several views just call view_on_pymol with target option which I would like to draw in pymol.

 
pk.view_on_pymol(target='ESP')
pk.view_on_pymol(target='DUAL', maprange=0.001)
pk.view_on_pymol('FRONTIER')

Now I could get 3 views in pymol. Following images are results.

ESP
DUAL
LUMO
HOMO

Visualization of quantum chemistry properties are useful for medchem I think. Any comments, requests and suggestions are greatly appreciated.

code example

Today’s code is uploaded my repo and URL is below.

https://github.com/iwatobipen/playground/blob/master/new_function_psikit.ipynb

Advertisements

Enumerate partial heteroaromatic rings in a molecule #RDKit #Chemoinformatics

I posted hetero shuffling before. It worked well but redundant. There is a nice code in RDKit UGM2017 material. URL is below.

https://github.com/rdkit/UGM_2017/blob/master/Notebooks/Cole-Enumerate-Heterocycles.ipynb

The code defined transformation with hard coding and seems nice.

In case of real project, we sometime would like to do enumeration against partial substructure not all structure. I thought how to do it.

Fortunately RDKit can do it by setting “_protected” property of Atoms. It is worth to know (you know, the approach is described in RDKit document of course!).

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first.

 
from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
import copy
import numpy as np

import pandas as pd
csvfile = './data/heterocycle_reactions.csv'

import csv
smarts_reader = csv.DictReader(open(csvfile))
REACTIONS = []
for row in smarts_reader:
    smarts = row['SMARTS']
    if not smarts:
        continue

    for product in row['CONVERT_TO'].split(','):
        reaction = smarts + '>>' + product
        REACTIONS.append(AllChem.ReactionFromSmarts(reaction))

Then define some functions. I used mol object as an input directly instead of SMILES.

, isomericSmiles=True)
if isosmi in unique:
continue
unique.add(isosmi)
Chem.SanitizeMol(newmol[0])
yield newmol[0]

def enumerate_heterocycles(mol):
start = mol
starting_points = [start]
seen = set()
while starting_points:
for newmol in get_unique_products(starting_points.pop()):
newmol_smiles = Chem.MolToSmiles(newmol)
if newmol_smiles in seen:
continue
starting_points.append(newmol)
seen.add(newmol_smiles)
yield newmol

Now ready to check it.

I used capivasertib which is kinase inhibitor as an example.

 
rwmol = Chem.RWMol(mcs_mol)

rwconf = Chem.Conformer(rwmol.GetNumAtoms())
matches = rwmol.GetSubstructMatch(mcs_mol)

ref_conf = mol1.GetConformer()
for i, match in enumerate(matches):
    print(ref_conf.GetAtomPosition(ref_match[i]).x)
    # Added atom position information from reference molecule
    rwconf.SetAtomPosition(match, ref_conf.GetAtomPosition(ref_match[i]))
rwmol.AddConformer(rwconf)

Check reference molecule and query molecule structure. I made two molobjects one is non protected and the other is protected atom excepting phenyl ring.

 
capivasertib = Chem.MolFromSmiles('c1cc(ccc1[C@H](CCO)NC(=O)C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)Cl')

protected_capivasertib = copy.deepcopy(capivasertib)
atoms = protected_capivasertib.GetAtoms()
phenyl = Chem.MolFromSmiles('c1ccccc1')
mactches = protected_capivasertib.GetSubstructMatches(phenyl)
arr = np.array(mactches)
matches = arr.flatten()
for atom in atoms:
    if atom.GetIdx() not in matches:
        atom.SetProp('_protected', '1')
capivasertib
capivasertib

Let’s check it.

Enumerated hetero shuffled derivative from non protected and protected molecules. Then use ConstrainEmbed method.

Lots of molecules are generated from non protected molecule!

 
enume1 = list(enumerate_heterocycles(capivasertib))
enume2 = list(enumerate_heterocycles(protected_capivasertib))
print(len(enume1), len(enume2))
> 2592 9

And following results shows effect of “_protected” prop. It is very useful I think. RDKit has many cool features for chemoinformatics.

sourcecode language=”python” wraplines=”false” collapse=”false”]
Draw.MolsToGridImage(enume1[:10], molsPerRow=5)
Draw.MolsToGridImage(enume2[:10], molsPerRow=5)
[/sourcecode]

NON PROTECTED
PROTECTED

Lower figure shows hetero shuffled molecules at only phenyl rings.

I uploaded today’s code to my gist and repo.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/Protect%20and%20enumerate%20heterocycles.ipynb

example

Current version Openforcefield supports rdkit #RDKit #Openforcefield #chemoinformatics

I posted about openforcefield(OpenFF) before. You know, old version of openff supports only OpenEyeTK but current version supports RDKit too.

It is worth to know that we can use openff with open source tool kit. I really appreciate developer’s work! It is great. Today I use the package and ipymol which can control pymol in ipython session. It means that you can control pymol from jupyter notebook! My example code is below.

import openforcefield as off
from rdkit import Chem
from rdkit.Chem import AllChem
from simtk import openmm, unit
from simtk.openmm import app
from openforcefield.topology import Topology
from openforcefield.topology import Molecule
from openforcefield.typing.engines.smirnoff import ForceField
from rdkit.Chem.Draw import IPythonConsole

Load SMIRNOF99FROSS forcefield.

ff = ForceField('test_forcefields/smirnoff99Frosst.offxml')

Then define get_energy function.

def get_energy(system, positions):
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Make sample mol object with rdkit and convert openff mol object. It is easy, just call from_rdkit!.

rdmol = Chem.MolFromSmiles('c1c(c2sccc2)c(c3c[nH]cc3)oc1')
rdmol = Chem.AddHs(rdmol)
rdmol
ofmol = Molecule.from_rdkit(rdmol)

Then generate conformer with openff function. And get topology.

ofmol.generate_conformers(n_conformers=10)
topology = ofmol.to_topology()

Next, create openMM system with generated topology and get position of one conformer.

org_system = ff.create_openmm_system(topology)
pos = ofmol.conformers[0]

At first, calculate energy with default conformation.

get_energy(org_system, pos)
> Quantity(value=80.93719044789302, unit=kilocalorie/mole)

Next I tried to minimize energy with openMM method.

new_system = ff.create_openmm_system(topology)
new_energy = get_energy(new_system, pos)
from sys import stdout
def minimizeOpenMM(Topology, System, Positions):
    integrator = openmm.LangevinIntegrator(
                                        300.0 * unit.kelvin,
                                        1.0 / unit.picosecond,
                                        2.0 * unit.femtosecond)
                                        #.002 * unit.picoseconds)
    simulation = app.Simulation(Topology, System, integrator)
    simulation.context.setPositions(Positions)
    simulation.minimizeEnergy(tolerance=5.0E-9, maxIterations=2000)
    state =  simulation.context.getState(getPositions=True, getEnergy=True)
    positions =state.getPositions(asNumpy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    
    simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True))
    simulation.step(30000)
    print(energy)
    positions = positions / unit.angstroms
    coordlist = list()
    for atom_coords in positions:
        coordlist += [i for i in atom_coords]
    return coordlist, positions

cl, pos=minimizeOpenMM(topology, org_system, pos)

Now I could get minimized position by calling minimizeOpenMM.

Next I updated atom position with optimized atom geometries.

from rdkit.Chem import rdGeometry
from rdkit.Chem.rdchem import Conformer
AllChem.EmbedMolecule(rdmol, useExpTorsionAnglePrefs = True , useBasicKnowledge = True)

conf = rdmol.GetConformer()
w=Chem.SDWriter("test1.sdf")
w.write(rdmol)
w.close()
for i in range(rdmol.GetNumAtoms()):
    conf.SetAtomPosition(i, rdGeometry.Point3D(pos[i][0], pos[i][1],pos[i][2],))
w=Chem.SDWriter("test2.sdf")
w.write(rdmol)
w.close()

I made two SDF one is default conformer and another is minimized conformer.

OK, I use ipymol to communicate pymol and visualize in jupyter notebook.

from ipymol import viewer
viewer.start() # this method launches pymol
viewer.load("test1.sdf")
viewer.load("test2.sdf")
viewer.align("test1","test2")
from ipymol import viewer
viewer.start() # this method launches pymol
viewer.load("test1.sdf")
viewer.load("test2.sdf")
viewer.align("test1","test2")

viewer.ray()
viewer.display()
viewer.deleteAll()
viewer.load("test2.sdf")
viewer.ray()
viewer.display()
viewer.deleteAll()
viewer.load("test1.sdf")
viewer.ray()
viewer.display()

The result seems different to original article. https://www.biorxiv.org/content/early/2018/07/13/286542.full.pdf

Hmm why???

BTW, OpenFF is very attractive package for chemoinformatics I think.

My code can access below.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/openff.ipynb

Adjustment bond length and align molecule to scaffold #RDKit #chemoinformatics

To align rdkitmol object to given scaffold, GenerateDepictionMatching2DStructure is useful to do that. But, somedays ago I asked my colleague that the function could not adjust bond length when scaffold which is read from sdf and it has different bond length of RDKit’s default settings.

I found useful information in rdkit-discuss about rdMolTransforms.

rdMolTransforms.TransformConformer method can change bond length, but all bonds length are changed.

So I considered how to solve the problem. At first, I tried to use rdTransform.SetBondlength method, however the function could not apply the bond in the ring.

So I defined original function to do it. Here is my example code.

At first, I made simple scaffold and changed bond length of the scaffold.

= 0.1
rdMolTransforms.TransformConformer(scaf.GetConformer(), tm)
scaf

Now I could get pretty scaffold. ;)

Write the molecule to SD File.

writer = Chem.SDWriter('scaf.sdf')
writer.write(scaf)
writer.close()

Then I made new molecule to align.

mol = Chem.MolFromSmiles('Nc1ccc(C)cc1')

Next, load scaffold and align without bond length adjustment.

scaf = Chem.SDMolSupplier('scaf.sdf')[0]
cpmol = copy.deepcopy(mol)
AllChem.GenerateDepictionMatching2DStructure(cpmol, scaf)
cpmol

Now I could get strange image…

To fix the issue I defined following function.

fdef bondnormalize(mol, temp):
    AllChem.Compute2DCoords(mol)
    ref_bond = mol.GetBonds()[0]
    ref_length = rdMolTransforms.GetBondLength(mol.GetConformer(), ref_bond.GetBeginAtomIdx(),ref_bond.GetEndAtomIdx() )
    prob_bond = temp.GetBonds()[0]
    prob_length = rdMolTransforms.GetBondLength(temp.GetConformer(), prob_bond.GetBeginAtomIdx(),prob_bond.GetEndAtomIdx() )
    ratio = ref_length / prob_length
    tm = np.zeros((4,4), np.double)
    for i in range(3):
        tm[i,i] = ratio
    rdMolTransforms.TransformConformer(temp.GetConformer(), tm)
    AllChem.GenerateDepictionMatching2DStructure(mol, temp)
    return mol

Check the function.

m=bondnormalize(mol, scaf)

Wow it worked well.

It is rare case that reader would like to do like that. RDKit has many function for chemoinformatics. It really fun!.

Today’s code can read from following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/set_same_bond_length.ipynb

Any comments and advices are highly appreciate.


Open Source Lilly’s Chemoinformatics Package

In 2012, lilly’s researchers published Lilly-MedChem Rules in J. Med. Chem. and disclosed their code on github. After the publication, the rules are used in many applications, papers and chemoinformatics applications. Open source tool made a big impact on chemoinformatics. Several hours ago I found an interesting tweet from @jcheminf.

They reported an algorithm of retro-synthesis. Data driven retrosynthetic analysis is hot topics in chemoinformatics area I think.

The article is published from Lilly and the author uploaded source code on github. URL is below. https://github.com/EliLillyCo/LillyMol

Their implementation is different from Segler’s approach ‘Learning plan to chemical synthesis‘. They do not use machine learning approach but use Reverse Reaction Template (RRT) based approach. RRT defines reaction rules and it extracted from mapped reaction data such as Lowe’s US patent dataset.

At first they made RRT repository and used it for analysis. After making the repository. Researcher inputs query structure to the system, the system will search RRT which is applicable for the query and recode it when matched.
Key point is how to make RRT I think. More details are described in the article.
They benchmarked their system with 919 known drug structures from drug bank. The performance of results seems depends on settings of RRT, radius and support. radi-0 RRT seems more general than radi-2 RRT, it likes ECFP.(Table 2)

After reading the article, I would like to use the code. OK let’s try it!

I have checked the repo last year but the code supports linux only at that time. However now, the code supports not only linux but also OSx. ;-)
It is easy to install the tool-kit. OK let’s install the TK and use it.
For installation, gcc >= 6.2.0 and zlib>=1.2.11 are required, so I installed them with home brew.

iwatobipen$ brew install zlib
iwatobipen$ brew install gcc

Then clone the repository and change ZLIB part in makefile.public.OSX-gcc-8. I installed zlib via Homebrew, so I changed ZLIB to ‘/usr/loca/Cellar/zlib….’.

All code are implemented in C++ and the code does not use any chemoinformatics packages such as RDKit, openbabel and CDK!! @_@

iwatobipen$ git clone https://github.com/EliLillyCo/LillyMol.git
iwatobipen$ cd LillyMol
iwatobipen$ vim makefile.public.OSX-gcc-8
###
-- ZLIB =  /usr/local/opt/zlib/lib
++ ZLIB =  ZLIB = /usr/local/Cellar/zlib/1.2.11/lib
###

Now ready! After makefile change, run the makeall.sh. After wait several minutes, installation will finish. All commands are generated in ./bin/OSX-gcc-8/. There are many commands are provided.

iwatobipen$ cd bin/OSX-gcc-8/
iwatobipen$ ls
activity_consistency	iwcut			msort			ring_extraction		rxn_substructure_search	trxn
common_names		iwdemerit		preferred_smiles	ring_trimming		smiles_mutation		tsubstructure
concat_files		mol2qry			random_smiles		rotatable_bonds		sp3_filter		unique_molecules
fetch_smiles_quick	molecular_scaffold	retrosynthesis		rxn_signature		tautomer_generation	unique_rows
fileconv		molecule_subset		rgroup			rxn_standardize		tp_first_pass

Details of the commands are described in the wiki page.

I checked retrosynthesis code with example data. It is a little difficult to set options for me.

iwatobipen$ cd ./example/retrosynthesis
iwatobipen$ cat 1Cmpds.smi 
> C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O
iwatobipen$  ../../bin/OSX-gcc-8/retrosynthesis -Y all -X kg -X kekule -X ersfrm -a 2 -q f -v -R 1 -I CentroidRxnSmi_1 -P UST:AZUCORS 1Cmpds.smi >log.txt 2>err.txt

Check log.txt and err.txt.

iwatobipen$ cat log.txt
C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O  PARENT
O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 ALL
Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 SPFRM.1
Oc1ccc(NC(=O)C)cc1  via US03992389_NA CentroidRxnSmi_1 R 1
O=C  via US03992389_NA CentroidRxnSmi_1 R 1
O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1

iwatobipen$ cat err.txt
Will not write product fragments with fewer than 2 atoms
Will keep going after an individual test failure
Will preserve Kekule forms
Will use the reaction file name as the reaction name
Reading reactions took 0 seconds
read mol smi eof
Read 1 molecules, 1 deconstructed
1 molecules deconstructed at radius 1
0 deconstructions done at radius 0
1 deconstructions done at radius 1
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
2 molecules deconstructed at radius 1
2 molecules deconstructed
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
 1 US03947458_NA 1 searches, 0 matches found
 1 US03947473_NA 1 searches, 0 matches found
 1 US03989717_NA 1 searches, 0 matches found
 ----- snip ;-) ------
 1 US20160002218A1_0322 1 searches, 0 matches found
 1 US20160200725A1_0864 1 searches, 0 matches found
2 molecules deconstructed at radius 1
2 molecules deconstructed
163 reactions had 0 hits
1 reactions had 1 hits

It is difficult to understand smiles strings directly for me, OK let’s visualize with RDKit!

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
parent = Chem.MolFromSmiles('C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O')
mol1 = Chem.MolFromSmiles('O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol3 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1')
mol4 = Chem.MolFromSmiles('O=C')
mol5 = Chem.MolFromSmiles('O=C(NC1CCNCC1)c1ccccc1')
Draw.MolsToGridImage([parent, mol1, mol2, mol3, mol4, mol5])
Single step retro synthesis with Liily’s TK

Example code worked without any problems. But it failed when I used original molecule as a query. One of the reason is that I used very limited training data. I used default data for my test. It has only 164 rxns.

I would like to try to make RRT with large data set.

Make interactive chemical space plot in jupyter notebook #cheminformatics #Altair

I often use seaborn for data visualization. With the library, user can make beautiful visualization.
BTW, today I tried to use another library that can make interactive plot in jupyter notebook.
Name of the library is ‘altair’.
https://altair-viz.github.io/index.html
The library can be installed from pip or conda and this package based vega and vega-lite. Vega is a python package for data visualization.

Regarding the tutorial, Altair can make beautiful plow with very simple code. I wrote an example that plot chemical space of cdk2.sdf which is provided by RDKit.

Following code is conducted in Google colab.
At first install rdkit in my environment. Fortunately Altair can call directly without installation to colab.

!wget https://repo.anaconda.com/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.1-Linux-x86_64.sh
!time bash ./Miniconda3-4.5.1-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

After installation of RDKit, append rdkit path to sys path,

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')

Now ready. Let’s import some libraries.

import pandas as pd
import numpy as np
import altair as alt

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load dataset

moldf = PandasTools.LoadSDF(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))
moldf['SMILES'] = moldf.ROMol.apply(Chem.MolToSmiles)
def mol2fparr(mol):
    arr = np.zeros((0,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

The conduct PCA with molecular fingerprint for plot chemical space.
And make new dataset. cdk2.sdf has Cluster number, so I used the annotation for coloring.

pca = PCA(n_components=2)
X = np.asarray([mol2fparr(mol) for mol in moldf.ROMol])
print(X.shape)
res = pca.fit_transform(X)
print(res.shape)
moldf['PCA1'] = res[:,0]
moldf['PCA2'] = res[:,1]
moldf2 = moldf[['ID', 'PCA1', 'PCA2', 'SMILES' ]]
moldf2['Cluster'] = ["{:0=2}".format(int(cls)) for cls in moldf.loc[:,'Cluster']]

To make scatter plot in Altair, it is easy just call ‘alt.Cahrt.mark_point(pandas data frame)’
mark_* is the method which can access many kids of plots.
From the document, following plots are provided.

Mark Name Method Description Example
area mark_area() A filled area plot. Simple Stacked Area Chart
bar mark_bar() A bar plot. Simple Bar Chart
circle mark_circle() A scatter plot with filled circles. One Dot Per Zipcode
geoshape mark_geoshape() A geographic shape Choropleth Map
line mark_line() A line plot. Simple Line Chart
point mark_point() A scatter plot with configurable point shapes. Multi-panel Scatter Plot with Linked Brushing
rect mark_rect() A filled rectangle, used for heatmaps Simple Heatmap
rule mark_rule() A vertical or horizontal line spanning the axis. Candlestick Chart
square mark_square() A scatter plot with filled squares. N/A
text mark_text() A scatter plot with points represented by text. Bar Chart with Labels
tick mark_tick() A vertical or horizontal tick mark. Simple Strip Plot

Now I would like to make scatter plot, so I call mark_point.
“interactive()” method returns interactive plot in jupyter.
So After run the code, I can see interactive plot in notebook the plot returns tooltip when mouse over the point.

alt.Chart(moldf2).mark_point().encode(
           x = 'PCA1',
           y = 'PCA2',
           color = 'Cluster',
           tooltip = ['ID', 'SMILES']).interactive()

This library is interesting for me because it is easy to implement tooltip. I tried to embed SVG image to tooltip but it did not work. I would like to know how to embed image to the tooltip if it possible.

How to visualize your data is important because it receives different impression from different visualization even if data is same.

Reader who is interested in the post can found whole code from google colab or github. ;-)
https://colab.research.google.com/drive/1hKcWRBcQG51eGsbpDBF2gl6CoZsmVvTs
https://github.com/iwatobipen/playground/blob/master/plot_chemicalspace.ipynb

Build stacking Classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posed about the ML method named ‘blending’ somedays ago. And reader recommended me that how about try to use “mlxtend”.
When I learned ensemble learning package in python I had found it but never used.
So try to use the library to build model.
Mlxtend is easy to install and good document is provided from following URL.
http://rasbt.github.io/mlxtend/

Following code is example for stacking.
In ipython notebook…
Use base.csv for test and load some functions.

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 for objective value.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of compounds is below.


Then calculate molecular fingerprint. And make binary activity array as y_bin.

X = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
y_bin = np.asarray(y>7, dtype=np.int)

Then load some classifier model from sklearn and split data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

To make stacking classifier, it is very simple just call StackingClassifier and set classifier and meta_classifier as arguments.
I use SVC as meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)

Then let’s learn the data!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx,test_idx) in enumerate(skf.split(x_train, y_train)):
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):
        clf.fit(x_train[train_idx],y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print(clflist[i])
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))
        print("="*20)

> output
round 0
RF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 0
GNB
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
====================
round 0
XGB
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
====================
round 0
SVC
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
====================
round 0
SCLF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 1
RF
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
====================
round 1
GNB
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
====================
round 1
XGB
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
====================
round 1
SVC
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106
===================
.....

Reader who is interested in stacking, you can find nice document here
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#example-1-simple-stacked-classification

And my all code is uploaded to myrepo on github.
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/mlxtend_test.ipynb

Mlxtend has many functions for building, analyzing and visualizing machine learning model and data. I will use the package more and more.