Python package for Ensemble learning #Chemoinformatics #Scikit learn

Ensemble learning is a technique for machine learning. I wrote post about blending learning before. URL is below.
I implemented the code by myself at that time.

Ensemble learning sometime outperform than single model. So it is useful for try to use the method. Fortunately now we can use ensemble learning very easily by using a python package named ‘ML-Ens‘ Installation is very easy, only use pip command common way for pythonista I think ;)

After installing the package user can build and train ensemble learning model with few lines. I would like to introduce two example of them one is stacking method and the other is a blending method. OK let’s go to code.

At first, load dataset and make input features. I used morgan fingerprint as input data.

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import PandasTools
import numpy as np
import pandas as pd
from IPython.display import HTML
traindf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf'))
testdf = PandasTools.LoadSDF(os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf'))
# Chek data

cls2lab = {'(A) low':0, '(B) medium':1, '(C) high':2}

def fp2np(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
trainfp = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in traindf.ROMol]
testfp =  [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in testdf.ROMol]
trainX = np.array([fp2np(fp) for fp in trainfp])
testX = np.array([fp2np(fp) for fp in testfp])
trainY = np.array([cls2lab[i] for i in traindf.SOL_classification.to_list()])
testY =  np.array([cls2lab[i] for i in testdf.SOL_classification.to_list()])

Then import several package for ensemble learning. SuperLearner is class for stacking and BlendEnsemble is class for blending.

Making ensemble model is easy. Just use add method to layer addition and finally call add_meta method for adding final prediction layer.

from mlens.ensemble import SuperLearner
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, accuracy_score
from sklearn.svm import SVR, SVC

# For stacnking
ensemble = SuperLearner(scorer=accuracy_score, random_state=794, verbose=2)
ensemble.add([RandomForestClassifier(n_estimators=100, random_state=794), SVC(gamma='auto', C=1000)])
ensemble.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto')), trainY)
pred = ensemble.predict(testX)
accuracy_score(testY, pred)

# Blending
from mlens.ensemble import BlendEnsemble
ensemble2 = BlendEnsemble(scorer=accuracy_score, test_size=0.2, verbose=2)
ensemble2.add([RandomForestClassifier(n_estimators=794, random_state=794),
ensemble2.add_meta(LogisticRegression(solver='lbfgs', multi_class='auto')), trainY)
pred_b = ensemble2.predict(testX)
accuracy_score(pred_b, testY)

Also more models can added with add method. I uploaded whole code on my gist. After calling fit, it is easy to access result data by using data method.

code example

Unfortunately the ensemble models described in the post don’t outperform single random forest model but mlens is nice tool for ensemble learning there is still room of improvement for model performance such as kind of models, hyper parameters etc.

Original document give more informations. Please go to link if reader has interest.

Small molecule MD with openMM #MD #Openforcefield

I updated openforcefield from ver 0.5 to ver 0.6. ForceField of SMIRNOFF is also updated.

I tried to use new version of OpenFF.
At first, I calculated partial charge with semi empirical method ‘AM1-BCC’. Ambertools is used for the calculation, it is easy.

from openforcefield.topology import Molecule
from openforcefield.utils.toolkits import RDKitToolkitWrapper, AmberToolsToolkitWrapper
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
biar = Molecule.from_smiles('c1ccccc1-c1c(C)ccnc1')
#Gerates conformers, default number of generated conformers is 10.

Just finished, check the result. Nitrogen has the most negative charge and neighbor aromatic carbons has positive charges.

for i, atm in enumerate(biar.atoms):
    print(pc[i], atm)
-0.1175 e 
-0.1305 e 
-0.125 e 
-0.1305 e 
-0.1175 e 
-0.036 e 
-0.1543 e 
-0.0243 e 
-0.0648 e 
-0.2513 e 
0.3952 e 
-0.668 e 
0.4062 e 
0.136 e 
0.1335 e 
0.133 e 
0.1335 e 
0.136 e 
0.0527 e 
0.0527 e 
0.0527 e 
0.143 e 
0.0221 e 
0.0251 e 

It seems work fine. OK let’s try to MD calculation.

For convenience, I wrote simple script and config file for calculation.
Following code calculate MD with SMILES as sys.argv[1]
import yaml
import sys
import os
import time
import matplotlib.pyplot as plt
from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.typing.engines.smirnoff import ForceField
from openforcefield.utils.toolkits import RDKitToolkitWrapper
from openforcefield.utils.toolkits import AmberToolsToolkitWrapper
from simtk import openmm
from simtk import unit
from rdkit import Chem

def run_md(molecule, confId=0):
    off_topology = molecule.to_topology()
    omm_topology = off_topology.to_openmm()
    system = forcefield.create_openmm_system(off_topology)

    time_step = config["time_step"] * unit.femtoseconds
    temperature = config["temperature"] * unit.kelvin
    friction = 1 / unit.picosecond
    integrator = openmm.LangevinIntegrator(temperature, friction, time_step)
    conf = molecule.conformers[confId]
    simulation =,
    if not os.path.isdir('./log'):
    pdb_reporter ='./log/trj.pdb', config["trj_freq"])
    state_data_reporter ="./log/data.csv",
    start = time.process_time()
    end = time.process_time()
    print(f"Elapsed time {end-start:.2f} sec")

if __name__=="__main__":
    forcefield = ForceField("openff-1.0.0.offxml")
    config = yaml.load(open("mdconf.yml", "r"), yaml.Loader)
    molecule = Molecule.from_smiles(sys.argv[1])

And calculation configuration is below.

time_step: 2
temperature: 300
friction: 1
trj_freq: 1
data_freq: 1
num_steps: 1000

Run calculation.
$ python ‘c1ccc(C)cc1-c2c(OC)nccc2’

After the calculation, I could get pdb and csv file.
Pdb file has 1000 states. And CSV file has calculated data.

blue shows energy and red shows temperature

It took ~10 sec for the molecule, it will take long time for large scale calculation.

MD calculation requires many parameters. I’m not familiar for the calculation so started to learn it. Now I installed GROMACS in my PC.

There are lots of things what I would like to learn….

Useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task for chemoinfomatics. I often use rdkit to do it. And today I found very useful package for descriptor calculation which name is descriptorus. URL is below.

It is very easy to install the package. Just following command.

pip install git+

After did it, I could use the package.

By using the package, following descriptors are calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried to use the package. Very simple example.

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
gen1 = MakeGenerator((‘rdkit2d’,))
smi = ‘c1ccncc1’

Of course it is easy to get column names.

for col in gen1.GetColumns():


The first row indicates whether the calculation is successful or not.

And the package provides normalized descriptors which are useful for machine learning. It is easy to get it.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
for i in range(len(data2)):
print(data2[i], data3[i])
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759

Descriptorus can make a DescriptaStore. I failed it because I couldn’t kyotocabinet in my env….

In summary, the package is very useful for chemoinformatician because user can get many rdkit’s descriptors only typing few lines.

It is amazing for me that we can use such an useful open source packages very easily.

Thanks for OSS developers!

Today’s gist is below.

Calculate solvent effect in Psi4 #psi4 #quantumchemistry

Recently I use not only chemoinformatics tools but also quantum chemistry tool, my favorite is Psi4. Psi4 has many options and plug-ins for quantum calculation. Most setting of calculation is vacuum, but it actually true. So considering the solvent around the molecules is important.

Can psi4 perform calculation with solvent effect?


PCMSolver is plugin for PCM calculation. It can be installed via conda.

$ conda install -c psi4 pcmsolver

PCM means ‘polarizable continuum model’. PCM describes solute-solvent interactions only by means of electrostatics and polarization between the solute molecule and solvent. The solvent is modeled as a dielectric continuum with a given permittivity ε.

To use PCMsolver in psi4, user need to make specific input file for the calculation which should has kind of solvent, basis function, CPCM or IEFPCM, etc.

I made to input file from toulene.mol file which is retrieved from ChEMBL.

input files are below. PCMsolver can not call from python interpreter so I need to make input.dat.

# toluene in water
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
units angstrom


set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Water

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit

energy_pte, wfn = energy(‘scf’, return_wfn=True)

# toluene in benzene
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
units angstrom


set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Benzene

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit

energy_pte, wfn = energy(‘scf’, return_wfn=True)

PCMsolver can has many kinds of solvents listed below.

    2:'Propylene Carbonate',
    18:'Carbon Tetrachloride',

After making the input files, following step is as same as default psi4 calculation.

$ psi4 input.dat -o output.dat

After waiting few seconds, I could get two output results. Picked up line 355-363.

## toulene in water
@RHF Final Energy: -269.75544369636918

=> Energetics Energetics <=

Nuclear Repulsion Energy = 268.6733324038011688
One-Electron Energy = -895.7789138809855558
Two-Electron Energy = 357.3548941675961714
PCM Polarization Energy = -0.0019460266422257
Total Energy = -269.7526333362304172

Computation Completed

There is difference in PCM Polarization Energy between in water and in toluene, in water it is -0.0052558561183017A.U.(-3.298kcal/mol) and in tolunene, it is -0.0019460266422257A.U.(-1.22kcal/mol)
*1 a.u. =627.50 kcal/mol

I would like to test some molecules. Are there freely available experimental data set? I’ll try to find them.

Psikit update/Draw ESP, HOMO LUMO #RDKit #Chemoinformatics #quantumchemistry

I just updated psikit which is package for quantum-chemoinformatics ;)

It can be installed from conda / pypi :)

I added and updated new function for molecular property rendering.

Current version of psikit can draw not only frontier orbital but also ESP and dual descriptor. Dual descriptor is calculated by psi4. What is dual descriptor? From original ducment.

Calculates the dual descriptor from frontier orbitals: 𝑓2(𝐫)=𝜌LUMO(𝐫)−𝜌HOMO(𝐫)f2(r)=ρLUMO(r)−ρHOMO(r). The dual descriptor is a good measure of nucleophilicity and electrophilicity, containing information essentially equivalent to both Fukui functions combined. More details on the dual descriptor itself can be found in [Morell:2005:205], while the current implementation is described in [Martinez-Araya:2015:451]. This feature is currently only supported for closed shell systems.

It is very easy to get these images!

Let’s test the functions with acetic acid as an example.

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first. For convenience, I recommend to install ipymol at first.

import ipymol
from psikit import Psikit
pk = Psikit()
v = ipymol.viewer
v.start() # pymol will launch

Calculate energy of acetic acid.

Then call getMOview() for getting some cube files.


Now data preparation is finished. I can get several views just call view_on_pymol with target option which I would like to draw in pymol.

pk.view_on_pymol(target='DUAL', maprange=0.001)

Now I could get 3 views in pymol. Following images are results.


Visualization of quantum chemistry properties are useful for medchem I think. Any comments, requests and suggestions are greatly appreciated.

code example

Today’s code is uploaded my repo and URL is below.

Enumerate partial heteroaromatic rings in a molecule #RDKit #Chemoinformatics

I posted hetero shuffling before. It worked well but redundant. There is a nice code in RDKit UGM2017 material. URL is below.

The code defined transformation with hard coding and seems nice.

In case of real project, we sometime would like to do enumeration against partial substructure not all structure. I thought how to do it.

Fortunately RDKit can do it by setting “_protected” property of Atoms. It is worth to know (you know, the approach is described in RDKit document of course!).

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first.

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
import copy
import numpy as np

import pandas as pd
csvfile = './data/heterocycle_reactions.csv'

import csv
smarts_reader = csv.DictReader(open(csvfile))
for row in smarts_reader:
    smarts = row['SMARTS']
    if not smarts:

    for product in row['CONVERT_TO'].split(','):
        reaction = smarts + '>>' + product

Then define some functions. I used mol object as an input directly instead of SMILES.

, isomericSmiles=True)
if isosmi in unique:
yield newmol[0]

def enumerate_heterocycles(mol):
start = mol
starting_points = [start]
seen = set()
while starting_points:
for newmol in get_unique_products(starting_points.pop()):
newmol_smiles = Chem.MolToSmiles(newmol)
if newmol_smiles in seen:
yield newmol

Now ready to check it.

I used capivasertib which is kinase inhibitor as an example.

rwmol = Chem.RWMol(mcs_mol)

rwconf = Chem.Conformer(rwmol.GetNumAtoms())
matches = rwmol.GetSubstructMatch(mcs_mol)

ref_conf = mol1.GetConformer()
for i, match in enumerate(matches):
    # Added atom position information from reference molecule
    rwconf.SetAtomPosition(match, ref_conf.GetAtomPosition(ref_match[i]))

Check reference molecule and query molecule structure. I made two molobjects one is non protected and the other is protected atom excepting phenyl ring.

capivasertib = Chem.MolFromSmiles('c1cc(ccc1[C@H](CCO)NC(=O)C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)Cl')

protected_capivasertib = copy.deepcopy(capivasertib)
atoms = protected_capivasertib.GetAtoms()
phenyl = Chem.MolFromSmiles('c1ccccc1')
mactches = protected_capivasertib.GetSubstructMatches(phenyl)
arr = np.array(mactches)
matches = arr.flatten()
for atom in atoms:
    if atom.GetIdx() not in matches:
        atom.SetProp('_protected', '1')

Let’s check it.

Enumerated hetero shuffled derivative from non protected and protected molecules. Then use ConstrainEmbed method.

Lots of molecules are generated from non protected molecule!

enume1 = list(enumerate_heterocycles(capivasertib))
enume2 = list(enumerate_heterocycles(protected_capivasertib))
print(len(enume1), len(enume2))
> 2592 9

And following results shows effect of “_protected” prop. It is very useful I think. RDKit has many cool features for chemoinformatics.

sourcecode language=”python” wraplines=”false” collapse=”false”]
Draw.MolsToGridImage(enume1[:10], molsPerRow=5)
Draw.MolsToGridImage(enume2[:10], molsPerRow=5)


Lower figure shows hetero shuffled molecules at only phenyl rings.

I uploaded today’s code to my gist and repo.


Current version Openforcefield supports rdkit #RDKit #Openforcefield #chemoinformatics

I posted about openforcefield(OpenFF) before. You know, old version of openff supports only OpenEyeTK but current version supports RDKit too.

It is worth to know that we can use openff with open source tool kit. I really appreciate developer’s work! It is great. Today I use the package and ipymol which can control pymol in ipython session. It means that you can control pymol from jupyter notebook! My example code is below.

import openforcefield as off
from rdkit import Chem
from rdkit.Chem import AllChem
from simtk import openmm, unit
from simtk.openmm import app
from openforcefield.topology import Topology
from openforcefield.topology import Molecule
from openforcefield.typing.engines.smirnoff import ForceField
from rdkit.Chem.Draw import IPythonConsole

Load SMIRNOF99FROSS forcefield.

ff = ForceField('test_forcefields/smirnoff99Frosst.offxml')

Then define get_energy function.

def get_energy(system, positions):
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Make sample mol object with rdkit and convert openff mol object. It is easy, just call from_rdkit!.

rdmol = Chem.MolFromSmiles('c1c(c2sccc2)c(c3c[nH]cc3)oc1')
rdmol = Chem.AddHs(rdmol)
ofmol = Molecule.from_rdkit(rdmol)

Then generate conformer with openff function. And get topology.

topology = ofmol.to_topology()

Next, create openMM system with generated topology and get position of one conformer.

org_system = ff.create_openmm_system(topology)
pos = ofmol.conformers[0]

At first, calculate energy with default conformation.

get_energy(org_system, pos)
> Quantity(value=80.93719044789302, unit=kilocalorie/mole)

Next I tried to minimize energy with openMM method.

new_system = ff.create_openmm_system(topology)
new_energy = get_energy(new_system, pos)
from sys import stdout
def minimizeOpenMM(Topology, System, Positions):
    integrator = openmm.LangevinIntegrator(
                                        300.0 * unit.kelvin,
                                        1.0 / unit.picosecond,
                                        2.0 * unit.femtosecond)
                                        #.002 * unit.picoseconds)
    simulation = app.Simulation(Topology, System, integrator)
    simulation.minimizeEnergy(tolerance=5.0E-9, maxIterations=2000)
    state =  simulation.context.getState(getPositions=True, getEnergy=True)
    positions =state.getPositions(asNumpy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True))
    positions = positions / unit.angstroms
    coordlist = list()
    for atom_coords in positions:
        coordlist += [i for i in atom_coords]
    return coordlist, positions

cl, pos=minimizeOpenMM(topology, org_system, pos)

Now I could get minimized position by calling minimizeOpenMM.

Next I updated atom position with optimized atom geometries.

from rdkit.Chem import rdGeometry
from rdkit.Chem.rdchem import Conformer
AllChem.EmbedMolecule(rdmol, useExpTorsionAnglePrefs = True , useBasicKnowledge = True)

conf = rdmol.GetConformer()
for i in range(rdmol.GetNumAtoms()):
    conf.SetAtomPosition(i, rdGeometry.Point3D(pos[i][0], pos[i][1],pos[i][2],))

I made two SDF one is default conformer and another is minimized conformer.

OK, I use ipymol to communicate pymol and visualize in jupyter notebook.

from ipymol import viewer
viewer.start() # this method launches pymol
from ipymol import viewer
viewer.start() # this method launches pymol


The result seems different to original article.

Hmm why???

BTW, OpenFF is very attractive package for chemoinformatics I think.

My code can access below.