Useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task for chemoinfomatics. I often use rdkit to do it. And today I found very useful package for descriptor calculation which name is descriptorus. URL is below.

It is very easy to install the package. Just following command.

pip install git+

After did it, I could use the package.

By using the package, following descriptors are calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried to use the package. Very simple example.

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
gen1 = MakeGenerator((‘rdkit2d’,))
smi = ‘c1ccncc1’

Of course it is easy to get column names.

for col in gen1.GetColumns():


The first row indicates whether the calculation is successful or not.

And the package provides normalized descriptors which are useful for machine learning. It is easy to get it.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
for i in range(len(data2)):
print(data2[i], data3[i])
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759

Descriptorus can make a DescriptaStore. I failed it because I couldn’t kyotocabinet in my env….

In summary, the package is very useful for chemoinformatician because user can get many rdkit’s descriptors only typing few lines.

It is amazing for me that we can use such an useful open source packages very easily.

Thanks for OSS developers!

Today’s gist is below.


Calculate solvent effect in Psi4 #psi4 #quantumchemistry

Recently I use not only chemoinformatics tools but also quantum chemistry tool, my favorite is Psi4. Psi4 has many options and plug-ins for quantum calculation. Most setting of calculation is vacuum, but it actually true. So considering the solvent around the molecules is important.

Can psi4 perform calculation with solvent effect?


PCMSolver is plugin for PCM calculation. It can be installed via conda.

$ conda install -c psi4 pcmsolver

PCM means ‘polarizable continuum model’. PCM describes solute-solvent interactions only by means of electrostatics and polarization between the solute molecule and solvent. The solvent is modeled as a dielectric continuum with a given permittivity ε.

To use PCMsolver in psi4, user need to make specific input file for the calculation which should has kind of solvent, basis function, CPCM or IEFPCM, etc.

I made to input file from toulene.mol file which is retrieved from ChEMBL.

input files are below. PCMsolver can not call from python interpreter so I need to make input.dat.

# toluene in water
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
units angstrom


set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Water

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit

energy_pte, wfn = energy(‘scf’, return_wfn=True)

# toluene in benzene
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
units angstrom


set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Benzene

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit

energy_pte, wfn = energy(‘scf’, return_wfn=True)

PCMsolver can has many kinds of solvents listed below.

    2:'Propylene Carbonate',
    18:'Carbon Tetrachloride',

After making the input files, following step is as same as default psi4 calculation.

$ psi4 input.dat -o output.dat

After waiting few seconds, I could get two output results. Picked up line 355-363.

## toulene in water
@RHF Final Energy: -269.75544369636918

=> Energetics Energetics <=

Nuclear Repulsion Energy = 268.6733324038011688
One-Electron Energy = -895.7789138809855558
Two-Electron Energy = 357.3548941675961714
PCM Polarization Energy = -0.0019460266422257
Total Energy = -269.7526333362304172

Computation Completed

There is difference in PCM Polarization Energy between in water and in toluene, in water it is -0.0052558561183017A.U.(-3.298kcal/mol) and in tolunene, it is -0.0019460266422257A.U.(-1.22kcal/mol)
*1 a.u. =627.50 kcal/mol

I would like to test some molecules. Are there freely available experimental data set? I’ll try to find them.

Psikit update/Draw ESP, HOMO LUMO #RDKit #Chemoinformatics #quantumchemistry

I just updated psikit which is package for quantum-chemoinformatics ;)

It can be installed from conda / pypi :)

I added and updated new function for molecular property rendering.

Current version of psikit can draw not only frontier orbital but also ESP and dual descriptor. Dual descriptor is calculated by psi4. What is dual descriptor? From original ducment.

Calculates the dual descriptor from frontier orbitals: 𝑓2(𝐫)=𝜌LUMO(𝐫)−𝜌HOMO(𝐫)f2(r)=ρLUMO(r)−ρHOMO(r). The dual descriptor is a good measure of nucleophilicity and electrophilicity, containing information essentially equivalent to both Fukui functions combined. More details on the dual descriptor itself can be found in [Morell:2005:205], while the current implementation is described in [Martinez-Araya:2015:451]. This feature is currently only supported for closed shell systems.

It is very easy to get these images!

Let’s test the functions with acetic acid as an example.

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first. For convenience, I recommend to install ipymol at first.

import ipymol
from psikit import Psikit
pk = Psikit()
v = ipymol.viewer
v.start() # pymol will launch

Calculate energy of acetic acid.

Then call getMOview() for getting some cube files.


Now data preparation is finished. I can get several views just call view_on_pymol with target option which I would like to draw in pymol.

pk.view_on_pymol(target='DUAL', maprange=0.001)

Now I could get 3 views in pymol. Following images are results.


Visualization of quantum chemistry properties are useful for medchem I think. Any comments, requests and suggestions are greatly appreciated.

code example

Today’s code is uploaded my repo and URL is below.

Enumerate partial heteroaromatic rings in a molecule #RDKit #Chemoinformatics

I posted hetero shuffling before. It worked well but redundant. There is a nice code in RDKit UGM2017 material. URL is below.

The code defined transformation with hard coding and seems nice.

In case of real project, we sometime would like to do enumeration against partial substructure not all structure. I thought how to do it.

Fortunately RDKit can do it by setting “_protected” property of Atoms. It is worth to know (you know, the approach is described in RDKit document of course!).

Following code is almost borrowed form the UGM material. Thanks for sharing nice code. Import packages, read Reaction data and reaction objects at first.

from __future__ import print_function
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
import copy
import numpy as np

import pandas as pd
csvfile = './data/heterocycle_reactions.csv'

import csv
smarts_reader = csv.DictReader(open(csvfile))
for row in smarts_reader:
    smarts = row['SMARTS']
    if not smarts:

    for product in row['CONVERT_TO'].split(','):
        reaction = smarts + '>>' + product

Then define some functions. I used mol object as an input directly instead of SMILES.

, isomericSmiles=True)
if isosmi in unique:
yield newmol[0]

def enumerate_heterocycles(mol):
start = mol
starting_points = [start]
seen = set()
while starting_points:
for newmol in get_unique_products(starting_points.pop()):
newmol_smiles = Chem.MolToSmiles(newmol)
if newmol_smiles in seen:
yield newmol

Now ready to check it.

I used capivasertib which is kinase inhibitor as an example.

rwmol = Chem.RWMol(mcs_mol)

rwconf = Chem.Conformer(rwmol.GetNumAtoms())
matches = rwmol.GetSubstructMatch(mcs_mol)

ref_conf = mol1.GetConformer()
for i, match in enumerate(matches):
    # Added atom position information from reference molecule
    rwconf.SetAtomPosition(match, ref_conf.GetAtomPosition(ref_match[i]))

Check reference molecule and query molecule structure. I made two molobjects one is non protected and the other is protected atom excepting phenyl ring.

capivasertib = Chem.MolFromSmiles('c1cc(ccc1[C@H](CCO)NC(=O)C2(CCN(CC2)c3c4cc[nH]c4ncn3)N)Cl')

protected_capivasertib = copy.deepcopy(capivasertib)
atoms = protected_capivasertib.GetAtoms()
phenyl = Chem.MolFromSmiles('c1ccccc1')
mactches = protected_capivasertib.GetSubstructMatches(phenyl)
arr = np.array(mactches)
matches = arr.flatten()
for atom in atoms:
    if atom.GetIdx() not in matches:
        atom.SetProp('_protected', '1')

Let’s check it.

Enumerated hetero shuffled derivative from non protected and protected molecules. Then use ConstrainEmbed method.

Lots of molecules are generated from non protected molecule!

enume1 = list(enumerate_heterocycles(capivasertib))
enume2 = list(enumerate_heterocycles(protected_capivasertib))
print(len(enume1), len(enume2))
> 2592 9

And following results shows effect of “_protected” prop. It is very useful I think. RDKit has many cool features for chemoinformatics.

sourcecode language=”python” wraplines=”false” collapse=”false”]
Draw.MolsToGridImage(enume1[:10], molsPerRow=5)
Draw.MolsToGridImage(enume2[:10], molsPerRow=5)


Lower figure shows hetero shuffled molecules at only phenyl rings.

I uploaded today’s code to my gist and repo.


Current version Openforcefield supports rdkit #RDKit #Openforcefield #chemoinformatics

I posted about openforcefield(OpenFF) before. You know, old version of openff supports only OpenEyeTK but current version supports RDKit too.

It is worth to know that we can use openff with open source tool kit. I really appreciate developer’s work! It is great. Today I use the package and ipymol which can control pymol in ipython session. It means that you can control pymol from jupyter notebook! My example code is below.

import openforcefield as off
from rdkit import Chem
from rdkit.Chem import AllChem
from simtk import openmm, unit
from simtk.openmm import app
from openforcefield.topology import Topology
from openforcefield.topology import Molecule
from openforcefield.typing.engines.smirnoff import ForceField
from rdkit.Chem.Draw import IPythonConsole

Load SMIRNOF99FROSS forcefield.

ff = ForceField('test_forcefields/smirnoff99Frosst.offxml')

Then define get_energy function.

def get_energy(system, positions):
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Make sample mol object with rdkit and convert openff mol object. It is easy, just call from_rdkit!.

rdmol = Chem.MolFromSmiles('c1c(c2sccc2)c(c3c[nH]cc3)oc1')
rdmol = Chem.AddHs(rdmol)
ofmol = Molecule.from_rdkit(rdmol)

Then generate conformer with openff function. And get topology.

topology = ofmol.to_topology()

Next, create openMM system with generated topology and get position of one conformer.

org_system = ff.create_openmm_system(topology)
pos = ofmol.conformers[0]

At first, calculate energy with default conformation.

get_energy(org_system, pos)
> Quantity(value=80.93719044789302, unit=kilocalorie/mole)

Next I tried to minimize energy with openMM method.

new_system = ff.create_openmm_system(topology)
new_energy = get_energy(new_system, pos)
from sys import stdout
def minimizeOpenMM(Topology, System, Positions):
    integrator = openmm.LangevinIntegrator(
                                        300.0 * unit.kelvin,
                                        1.0 / unit.picosecond,
                                        2.0 * unit.femtosecond)
                                        #.002 * unit.picoseconds)
    simulation = app.Simulation(Topology, System, integrator)
    simulation.minimizeEnergy(tolerance=5.0E-9, maxIterations=2000)
    state =  simulation.context.getState(getPositions=True, getEnergy=True)
    positions =state.getPositions(asNumpy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    simulation.reporters.append(app.StateDataReporter(stdout, 1000, step=True, potentialEnergy=True, temperature=True))
    positions = positions / unit.angstroms
    coordlist = list()
    for atom_coords in positions:
        coordlist += [i for i in atom_coords]
    return coordlist, positions

cl, pos=minimizeOpenMM(topology, org_system, pos)

Now I could get minimized position by calling minimizeOpenMM.

Next I updated atom position with optimized atom geometries.

from rdkit.Chem import rdGeometry
from rdkit.Chem.rdchem import Conformer
AllChem.EmbedMolecule(rdmol, useExpTorsionAnglePrefs = True , useBasicKnowledge = True)

conf = rdmol.GetConformer()
for i in range(rdmol.GetNumAtoms()):
    conf.SetAtomPosition(i, rdGeometry.Point3D(pos[i][0], pos[i][1],pos[i][2],))

I made two SDF one is default conformer and another is minimized conformer.

OK, I use ipymol to communicate pymol and visualize in jupyter notebook.

from ipymol import viewer
viewer.start() # this method launches pymol
from ipymol import viewer
viewer.start() # this method launches pymol


The result seems different to original article.

Hmm why???

BTW, OpenFF is very attractive package for chemoinformatics I think.

My code can access below.

Adjustment bond length and align molecule to scaffold #RDKit #chemoinformatics

To align rdkitmol object to given scaffold, GenerateDepictionMatching2DStructure is useful to do that. But, somedays ago I asked my colleague that the function could not adjust bond length when scaffold which is read from sdf and it has different bond length of RDKit’s default settings.

I found useful information in rdkit-discuss about rdMolTransforms.

rdMolTransforms.TransformConformer method can change bond length, but all bonds length are changed.

So I considered how to solve the problem. At first, I tried to use rdTransform.SetBondlength method, however the function could not apply the bond in the ring.

So I defined original function to do it. Here is my example code.

At first, I made simple scaffold and changed bond length of the scaffold.

= 0.1
rdMolTransforms.TransformConformer(scaf.GetConformer(), tm)

Now I could get pretty scaffold. ;)

Write the molecule to SD File.

writer = Chem.SDWriter('scaf.sdf')

Then I made new molecule to align.

mol = Chem.MolFromSmiles('Nc1ccc(C)cc1')

Next, load scaffold and align without bond length adjustment.

scaf = Chem.SDMolSupplier('scaf.sdf')[0]
cpmol = copy.deepcopy(mol)
AllChem.GenerateDepictionMatching2DStructure(cpmol, scaf)

Now I could get strange image…

To fix the issue I defined following function.

fdef bondnormalize(mol, temp):
    ref_bond = mol.GetBonds()[0]
    ref_length = rdMolTransforms.GetBondLength(mol.GetConformer(), ref_bond.GetBeginAtomIdx(),ref_bond.GetEndAtomIdx() )
    prob_bond = temp.GetBonds()[0]
    prob_length = rdMolTransforms.GetBondLength(temp.GetConformer(), prob_bond.GetBeginAtomIdx(),prob_bond.GetEndAtomIdx() )
    ratio = ref_length / prob_length
    tm = np.zeros((4,4), np.double)
    for i in range(3):
        tm[i,i] = ratio
    rdMolTransforms.TransformConformer(temp.GetConformer(), tm)
    AllChem.GenerateDepictionMatching2DStructure(mol, temp)
    return mol

Check the function.

m=bondnormalize(mol, scaf)

Wow it worked well.

It is rare case that reader would like to do like that. RDKit has many function for chemoinformatics. It really fun!.

Today’s code can read from following URL.

Any comments and advices are highly appreciate.

Open Source Lilly’s Chemoinformatics Package

In 2012, lilly’s researchers published Lilly-MedChem Rules in J. Med. Chem. and disclosed their code on github. After the publication, the rules are used in many applications, papers and chemoinformatics applications. Open source tool made a big impact on chemoinformatics. Several hours ago I found an interesting tweet from @jcheminf.

They reported an algorithm of retro-synthesis. Data driven retrosynthetic analysis is hot topics in chemoinformatics area I think.

The article is published from Lilly and the author uploaded source code on github. URL is below.

Their implementation is different from Segler’s approach ‘Learning plan to chemical synthesis‘. They do not use machine learning approach but use Reverse Reaction Template (RRT) based approach. RRT defines reaction rules and it extracted from mapped reaction data such as Lowe’s US patent dataset.

At first they made RRT repository and used it for analysis. After making the repository. Researcher inputs query structure to the system, the system will search RRT which is applicable for the query and recode it when matched.
Key point is how to make RRT I think. More details are described in the article.
They benchmarked their system with 919 known drug structures from drug bank. The performance of results seems depends on settings of RRT, radius and support. radi-0 RRT seems more general than radi-2 RRT, it likes ECFP.(Table 2)

After reading the article, I would like to use the code. OK let’s try it!

I have checked the repo last year but the code supports linux only at that time. However now, the code supports not only linux but also OSx. ;-)
It is easy to install the tool-kit. OK let’s install the TK and use it.
For installation, gcc >= 6.2.0 and zlib>=1.2.11 are required, so I installed them with home brew.

iwatobipen$ brew install zlib
iwatobipen$ brew install gcc

Then clone the repository and change ZLIB part in makefile.public.OSX-gcc-8. I installed zlib via Homebrew, so I changed ZLIB to ‘/usr/loca/Cellar/zlib….’.

All code are implemented in C++ and the code does not use any chemoinformatics packages such as RDKit, openbabel and CDK!! @_@

iwatobipen$ git clone
iwatobipen$ cd LillyMol
iwatobipen$ vim makefile.public.OSX-gcc-8
-- ZLIB =  /usr/local/opt/zlib/lib
++ ZLIB =  ZLIB = /usr/local/Cellar/zlib/1.2.11/lib

Now ready! After makefile change, run the After wait several minutes, installation will finish. All commands are generated in ./bin/OSX-gcc-8/. There are many commands are provided.

iwatobipen$ cd bin/OSX-gcc-8/
iwatobipen$ ls
activity_consistency	iwcut			msort			ring_extraction		rxn_substructure_search	trxn
common_names		iwdemerit		preferred_smiles	ring_trimming		smiles_mutation		tsubstructure
concat_files		mol2qry			random_smiles		rotatable_bonds		sp3_filter		unique_molecules
fetch_smiles_quick	molecular_scaffold	retrosynthesis		rxn_signature		tautomer_generation	unique_rows
fileconv		molecule_subset		rgroup			rxn_standardize		tp_first_pass

Details of the commands are described in the wiki page.

I checked retrosynthesis code with example data. It is a little difficult to set options for me.

iwatobipen$ cd ./example/retrosynthesis
iwatobipen$ cat 1Cmpds.smi 
> C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O
iwatobipen$  ../../bin/OSX-gcc-8/retrosynthesis -Y all -X kg -X kekule -X ersfrm -a 2 -q f -v -R 1 -I CentroidRxnSmi_1 -P UST:AZUCORS 1Cmpds.smi >log.txt 2>err.txt

Check log.txt and err.txt.

iwatobipen$ cat log.txt
O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 ALL
Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1 SPFRM.1
Oc1ccc(NC(=O)C)cc1  via US03992389_NA CentroidRxnSmi_1 R 1
O=C  via US03992389_NA CentroidRxnSmi_1 R 1
O=C(NC1CCNCC1)c1ccccc1  via US03992389_NA CentroidRxnSmi_1 R 1

iwatobipen$ cat err.txt
Will not write product fragments with fewer than 2 atoms
Will keep going after an individual test failure
Will preserve Kekule forms
Will use the reaction file name as the reaction name
Reading reactions took 0 seconds
read mol smi eof
Read 1 molecules, 1 deconstructed
1 molecules deconstructed at radius 1
0 deconstructions done at radius 0
1 deconstructions done at radius 1
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
2 molecules deconstructed at radius 1
2 molecules deconstructed
Set_of_Reactions: CentroidRxnSmi_1 with 164 reactions
 1 US03947458_NA 1 searches, 0 matches found
 1 US03947473_NA 1 searches, 0 matches found
 1 US03989717_NA 1 searches, 0 matches found
 ----- snip ;-) ------
 1 US20160002218A1_0322 1 searches, 0 matches found
 1 US20160200725A1_0864 1 searches, 0 matches found
2 molecules deconstructed at radius 1
2 molecules deconstructed
163 reactions had 0 hits
1 reactions had 1 hits

It is difficult to understand smiles strings directly for me, OK let’s visualize with RDKit!

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
parent = Chem.MolFromSmiles('C(=O)(C)NC1=CC(=C(O)C=C1)CN1CCC(NC(=O)C2=CC=CC=C2)CC1.O')
mol1 = Chem.MolFromSmiles('O.Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol2 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1.C=O.O=C(NC1CCNCC1)c1ccccc1')
mol3 = Chem.MolFromSmiles('Oc1ccc(NC(=O)C)cc1')
mol4 = Chem.MolFromSmiles('O=C')
mol5 = Chem.MolFromSmiles('O=C(NC1CCNCC1)c1ccccc1')
Draw.MolsToGridImage([parent, mol1, mol2, mol3, mol4, mol5])
Single step retro synthesis with Liily’s TK

Example code worked without any problems. But it failed when I used original molecule as a query. One of the reason is that I used very limited training data. I used default data for my test. It has only 164 rxns.

I would like to try to make RRT with large data set.