Draw scaffold tree as network with molecular image #RDKit #Cytoscape

I posted new function about scaffold tree which is implemented in rdkit 2020 03 before.

In previous my post, I showed example to draw scaffold tree with networkx. It could draw the scaffold tree as a network but molecular structures are not shown on the node. For chemist, structure image is important so I tried to draw network with structure image on node.

Fortunately cyjupyter is very useful package to do it, it can call cytoscape.js internally so user can draw network with many options.

OK let’s try. At first import required packages. And set example strcutures.

from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
import networkx as nx
from networkx.readwrite import cytoscape_data
import cyjupyter
from cyjupyter import Cytoscape
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import rdScaffoldNetwork
from urllib import parse
smiles_lsit = [

Then create scaffold tree following code is almost as same as previous post.

params = rdScaffoldNetwork.ScaffoldNetworkParams()
netwks = rdScaffoldNetwork.CreateScaffoldNetwork([mols[1]], params)

Then define smiles to image function by using MolDraw2DSVG which is implemented in RDKit. The function generates svg from SMIELS. It is used for rendering image on the network node.

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    drawer = rdMolDraw2D.MolDraw2DSVG(690, 400)
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg
def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Finally draw network with cytoscape.js!

g = nx.graph.Graph()
for idx, node in enumerate(netwks.nodes):
    g.add_node(idx, smiles=node, img=smi2image(node), hac=Chem.MolFromSmiles(node).GetNumAtoms())
g.add_edges_from([(e.beginIdx,e.endIdx) for e in netwks.edges])
cy_g = cytoscape_data(g)
  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            'selector': 'edge'}
cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='breadthfirst')

Now I could get scaffold tree with molecular image. ;)


I tried to change atom font size but failed. I’m trying to do that now.

I uploaded today’s code on my gist. RDKit has many functions for chemoinformatics. I would like to buy a beer to the developers! Thanks!

Test new method of rdkit:2020

Now beta version of rdkit is available from anaconda! So I would like to try it. However I would like to test without contaminating current my environment. So I tried new version of rdkit with Docker.

Fortunately rdkit can be installed via conda, so I made Dockerfie based on miniconda3. Following dockerfile used continuumio/miniconda3. By using the image I could install required packages via conda command. ;)

FROM continuumio/miniconda3
MAINTAINER iwatobipen 

RUN pip install --upgrade pip
RUN conda install -c rdkit/label/beta rdkit -y
RUN conda install -c conda-forge seaborn pandas scikit-learn -y
RUN conda install -c conda-forge networkx jupyter jupyterlab -y
RUN pip install py2cytoscape

WORKDIR /workdir


ENTRYPOINT ["jupyter-lab", "--ip=", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

CMD ["--notebook-dir=/workdir"]

Next, build image and run!

Type following command.

$ sudo docker build -t rdkit-beta .
$ sudo docker run -it -p 8888:8888 --rm --name rdkit-beta --mount type=bind,src=`pwd`,dst=/workdir rdkit-beta

Then jupyter-lab will launch and I can access notebook the following URL. ‘localhost:8888’

And I tried to use new function named rdScaffoldNetwork. It is the implementation of scaffold tree. If reader would like to know what is the scaffold tree, please check the original article.

It is very useful for analyze compounds scaffolds. Following code is only just used it with simple case. So it is not practical example for real drug discovery project. I would like to apply the approach against real project near the feature.

I pushed dockerfile and notebook to my repo.


Any comments, suggestion will be greatly appreciated. And thank developer of RDKit!!! ;)

Use ORBKIT for rendering MO #orbkit #rdkit #psikit #quantum_chemistry

As you know, there many packages for quantum chemistry not only commercial software but also free tools. And each soft has own output format. So user need to understand how to use it. But it is very time consuming step I think.

ORBIKIT is a modular python toolbox for cross-platform post processing of quantum chemical wavefunction data.

The reference URL is below.

It seems nice isn’t it? Fortunately the source code is disclosed on github.

So I tried to use orbkit with psikit. At first, I installed orbkit while referring the original site.

At first, I installed cython, numpy, scipy, h5py with conda and mayavi installed with pip (mayavi couldn’t install with conda due to package conflicts).

Then install orbkit.

$ cd $HOME
$ git clone https://github.com/orbkit/orbkit.git
$ export ORBKITPATH=$HOME/orbkit
$ python setup.py build_ext --inplace clean

And also I added environment variables to my .bashrc.

Now ready, let’s use orbkit. I used psikit for making fchk file for orbkit. orbkit supports many kinds of input format. If reader has interest the package, please check the original document.

Following code shows how to use psikit and visualize MO with orbkit.

After running the code, mayavi viewer launched and could get MO view, the image is shown below.

Of course current psikit has some functions for communicate pymol for rendering MO but orbkit is also useful package for quantum chemistry.

I uploaded today’s code on my github repo.


New molecular fingerprint for chemoinformatics #map4 #RDKit #memo #chemoinformatics

Molecular fingerprint(FP) is a very important for chemoinformatics because it is used for building many predictive models not only ADMET but also biological activities.

As you know, ECFP (Morgan Fingerprint) is one of golden standard FP of chemoinformatics. Because it shows stable performance against any problems. After ECFP is reported, many new fingerprint algorithm is proposed but still ECFP is used in many case.

It means that more improvement of fingerprint is required but it’s too difficult task.

Recently new fingerprint algorithm named MAP4 is proposed from Prof. Reymond lab. URL is below.

MAP of MAP4 means “MinHashed atom-pair fingerprint up to a diameter of four bonds“.

The calculation process is below.

First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, and rooted SMILES string .

Second, the minimum topological distance TPj,k separating each atom pair(j, k) in the input molecule is calculated.

Third,all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.

Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and its corresponding transposed vector is finally MinHashed to form the MAP4 vector.

It seems similar approach to ECFP but this approach uses minhashing techniques. It works very fast and this fingerprint outperform compared to their developed MHFP, ECPF and other fingerprints which are impremented RDKit.

In their article Fig2 showed performance of virtual screening against various targets and MAP4FP outperformed in many targets.

In the Fig8 shows Jaccard distance between set of molecules. MAP4 shows better performance against not only small molecule but also large molecule such as peptide, compared to other finger print such as atom pair FP, ECFP etc.

So I would like to use the FP on my notebook, so I tried to use it.

The author disclosed source code so I could get code from github.


git clone https://github.com/reymond-group/map4
cd map4

I think original repo has bug for folded fingerprint calculation so I maed PR to original repo.

And following code used modified version of original code.

I compared the FP against morganFP came from rdkit.

Today’s code was uploaded my gist and github.

MAP4Calculator provides calculate and calculate_many methods. The first one calculate MAP4FP of molecule and second one calculates MAP4FP of list of molecules.

is_folded option is false in default, so to get folded(fixed length of) finger print, user need to change the option from False to True.

The test data shown above, Moragn FP seems more sensitive to difference of compound structure. Because similarity is lower to MAP4FP. and folded and unfolded MAP4FP showed almost similar performance.

Today I showed how to calculate MAP4FP, so I would like to check the FP performance with real drug discovery project with any machine learning algorithms. :-)

I also uploaded the code to my github repo.


Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #guacamol

Yesterday I posted benchmarking platform named ‘moses’ and found it worked for test data. And then I could get comment from @Mufei Li, developer of DGL that how about to try guacamol. I checked guacamol before but didn’t try it. So I installed guacamol and used it.

From original repo, GuacaMol is an open source Python package for benchmarking of models for de novo molecular design.

This package is developed by BenevolentAI. This package can assess de novo molecular generator of following metrics.

  1. Validity
  2. Uniqueness
  3. Novelty
  4. Fr´echet ChemNet Distance
  5. KL divergence

5 is different point to moses. KL divergence metrics is based on molecular descriptors. It is not fingerprint base. So different structure but similar molecular property sets will show high score.

For test the package I installed guacamol and used it.

Fortunately, guacamol can install via pip ;).

And I tested two metrics one was KLDivBenchmark and the other was UniquenessBenchmark. As you can see most of the following code is same as test code of original repository.

I uploaded my code to my gist.

Moses and Guacamol are both useful package for benchmarking.

It is important that we should get benchmark data compared to same baseline. However in the real drug discovery project, molecular properties which is required in the projects are depends on their situation. So it is not big problem of generator performance because current models can generate reasonable structures I think.

We need to go beyond that… ;)

Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #moses

There are lots of publications about molecular generators. Each publication implements novel algorithms so we need tool for comparing these models that which is better for us.

I often use PCA, tSNE for chemical space visualization and calculate some scores such as QED, SA/SC Score and molecular properties. However I need the unified metrics. So I think Molecular Sets(MOSES) is nice tool to do it.

MOSES provides useful metrics shown below.

  1. Fragment similarity (FRAG) which is defined as the cosine distance between vectors of fragment frequencies.
  2. Scaffold similarity (SCAFF) which is defined as similar as FRAG but the metrics uses frequency of scaffolds instead of fragments.
  3. Nearest neighbor similarity (SNN) which is the average Tanimoto similarity between test molecules and generated molecules.
  4. Internal diversity (IntDiv_p) assesses the chemical diversity within the generated set of molecules. p=1 or 2
    IntDiv_p(G)=1-p\sqrt{\frac{1}{|G2|}-\sum_T(m_1, m_2)^p}
  5. Freched ChemNet Distance (FCD) which uses the penultimate layer of ChemNet and measure distance of reference molecules and generated molecules.

Fortunately source code of MOSES are freely available. I installed moses in my PC and test it.

MOSES repository provides install script, so it is easy to install moses and required packages. I modified setup.py and install it because I use torch=1.2.0 but original setup.py requires ver 1.1.0. So I commented out the line 16 of setup.py.

After installed the package, I used moses from jupyter notebook. All molecules are borrowed from test script of the repository.

Following code is an example. It is easy to get metrics, just calling metrics.get_all_metrics(testmolecules, generatedmolecules). As shown below the method calculate all metrics which are implemented MOSES and show some additional properties such as ratio of valid molecule, qed, molwt etc.

import pandas as pd
from rdkit import Chem

from moses import metrics

test = ['Oc1ccccc1-c1cccc2cnccc12','COc1cccc(NC(=O)Cc2coc3ccc(OC)cc23)c1']
test_sf = ['COCc1nnc(NC(=O)COc2ccc(C(C)(C)C)cc2)s1',
gen = ['CNC', 'Oc1ccccc1-c1cccc2cnccc12',

metrics.get_all_metrics(test, gen, k=3)

>> out

{'valid': 0.8,
 'unique@3': 1.0,
 'FCD/Test': 52.58373533265676,
 'SNN/Test': 0.3152585653588176,
 'Frag/Test': 0.30000000000000004,
 'Scaf/Test': 0.5,
 'IntDiv': 0.7189187332987785,
 'IntDiv2': 0.49790709357032226,
 'Filters': 0.75,
 'logP': 4.9581881764518005,
 'SA': 0.5086898026154574,
 'QED': 0.045033731661603064,
 'NP': 0.2902816615644048,
 'weight': 14761.927533455337}

I think moses is very useful tool for checking performance of molecular generator. Thanks the author for developing such as a nice tool for in silico drug discovery!

Today’s my code uploaded gist.

Enjoy chemoinformatics! ;)

Example usage of psi4-openmm-interface #Psi4 #OpenMM #RDKit

Molecular dynamics and Quantum Chemistry are important tools for CADD. I have interested in these topics and OpenMM and Psi4 are nice tool to handing MD and QM.

Today I tried to use psi4-openmm-interface which allows passing of molecular systems between each program.

I reviewed test script and found that the package pass the molecule which optimized with openmm to psi4 as xyz format. And also the package can conduct MD calculation very simply.

I used it, following code is almost same as original repo.

At first, I install the package you can find how to install it in following URL.

Install psi4, openmm and set PYTHONPASS. It was simple. And I need to modify some code to import the package correctly.

  1. It uses PeriodicTable class of qcelemental, but the module can’t import _temp_symbol attribute from periodictable.py, so I added _temp_symbol line to qcelemental/periodic_table.py
# qcelemental/periodic_table.py 
class PeriodicTable:
    """Nuclear and mass data about chemical elements from NIST.


    A : list of int
        Mass number, number of protons and neutrons, starting with 0 for dummies.
    Z : list of int
        Atomic number, number of protons, starting with 0 for dummies.
    E : list of str
        Element symbol from periodic table, starting with "X" for dummies. "Fe" capitalization.
    EA : list of str
        Nuclide symbol in E + A form, e.g., "Li6".
        List `EA` is a superset of `E`; that is, both "Li6" and "Li" present.
        For hydrogen, "D" and "T" also included.
    mass : list of :py:class:`decimal.Decimal`
        Atomic mass [u].
        For nuclides (e.g., "Li6"), the reported mass.
        For stable elements (e.g., "Li"), the mass of the most abundant isotope ("Li7").
        For unstable elements (e.g., "Pu"), the mass of the longest-lived isotope ("Pu244").
    name : list of str
        Element name from periodic table, starting with "Dummy". "Iron" capitalization.


    def __init__(self):

        from . import data

        # Of length number of elements
        self.Z = data.nist_2011_atomic_weights["Z"]
        self.E = data.nist_2011_atomic_weights["E"]
        self.name = data.nist_2011_atomic_weights["name"]

        self._el2z = dict(zip(self.E, self.Z))
        self._z2el = collections.OrderedDict(zip(self.Z, self.E))
        self._element2el = dict(zip(self.name, self.E))
        self._el2element = dict(zip(self.E, self.name))

        # Of length number of isotopes
        self._EE = data.nist_2011_atomic_weights["_EE"]
        self.EA = data.nist_2011_atomic_weights["EA"]
        self.A = data.nist_2011_atomic_weights["A"]
        self.mass = data.nist_2011_atomic_weights["mass"]

        self._eliso2mass = dict(zip(self.EA, self.mass))
        self._eliso2el = dict(zip(self.EA, self._EE))
        self._eliso2a = dict(zip(self.EA, self.A))
        self._el2a2mass = collections.defaultdict(dict)
        for EE, m, A in zip(self._EE, self.mass, self.A):
            self._el2a2mass[EE][A] = float(m)
       # following code was added
        self._temp_symbol =  [
    "X", "H", "HE", "LI", "BE", "B", "C", "N", "O", "F", "NE", "NA", "MG", "AL", "SI", "P", "S", "CL", "AR", "K", "CA",
    "SC", "TI", "V", "CR", "MN", "FE", "CO", "NI", "CU", "ZN", "GA", "GE", "AS", "SE", "BR", "KR", "RB", "SR", "Y",
    "ZR", "NB", "MO", "TC", "RU", "RH", "PD", "AG", "CD", "IN", "SN", "SB", "TE", "I", "XE", "CS", "BA", "LA", "CE",
    "PR", "ND", "PM", "SM", "EU", "GD", "TB", "DY", "HO", "ER", "TM", "YB", "LU", "HF", "TA", "W", "RE", "OS", "IR",
    "PT", "AU", "HG", "TL", "PB", "BI", "PO", "AT", "RN", "FR", "RA", "AC", "TH", "PA", "U", "NP", "PU", "AM", "CM",
    "BK", "CF", "ES", "FM", "MD", "NO", "LR", "RF", "DB", "SG", "BH", "HS", "MT", "DS", "RG", "UUB", "UUT", "UUQ",
    "UUP", "UUH", "UUS", "UUO"

2. Line 81 of psiomm/molecule.py should be change below.

            # from

3. gaff2.xml which is forcefield parameters is placed in openmm installed path openmm/app/data/. Because for my env, gaff2.xml file couldn’t find in the folder.

4. Got cov_radii.py from qcdb/qcdb and placed in psi4/driver/qcdb folder.

Now ready. Let’s write code.

Import some packages at first I used psikit for molecular geometry generation.

import numpy as np

from psikit import Psikit
from psiomm import molecule
from psiomm import psi_omm as po

from simtk.openmm.app import Simulation
from simtk.openmm.app import ForceField
from simtk.openmm.app import Modeller
from simtk.openmm.app import PDBReporter

from simtk.openmm import Platform
from simtk.openmm import LangevinIntegrator

from simtk.openmm import Vec3

from simtk.unit import picosecond
from simtk.unit import kelvin
from simtk.unit import kilocalorie_per_mole
from simtk.unit import angstrom

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole

Work flow

  • make molecule from xyz string => solute

  • generate bonds

  • generate atom types

  • generate charges

  • make topology

  • define forcefield

  • generate template with FF, topology and solute

  • make modeller

  • perform simulation to get trajectory

pk = Psikit()

Psikit generate xyz strings with charge and multiplicity information but psiomm can’t read the line so I remove them and passed the string to psiomm. Then generate some molecular properties.

# remove mutiplicity, formal charge
xyzstring = '\n'.join(pk.mol2xyz().split('\n')[2:])
solute = molecule.Molecule.from_xyz_string(xyzstring)
# check generated atoms and charges
for i, at in enumerate(solute.atom_types):
    print(i, at, solute.charges[i])

0 c3 -0.16240415191041357
1 c3 0.03679384783580364
2 os -0.263056064493707
17 h4 0.07500144690962385

Following workflow is based on openmm. GAFF2 is implemented in AMBER and TIP3P is water model which is implemented in CHARMM.

topology = po.make_topology(solute)
forcefield = ForceField('gaff2.xml', 'tip3p.xml')
po.generateTemplate(forcefield, topology, solute)
coords = []
for i in range(len(solute.xyz)):
    coords.append(Vec3(solute.xyz[i][0], solute.xyz[i][1], solute.xyz[i][2]))
positions = coords * angstrom

Make modeller with topology and positions. I added 50 water molecules to the model. And defined the simulation system.

modeller = Modeller(topology, positions)
modeller.addSolvent(forcefield, numAdded=50)
omm_sys = forcefield.createSystem(modeller.topology)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picosecond)

simulation = Simulation(modeller.topology,

I used PDBReporter module to store the simulation. Simulation step set 400 and reporter object save data each 10 step. So I could get output pdb with 40 (400 /10 ) states.

simulation.reporters.append(PDBReporter('output.pdb', 10))
MD movie

It worked fine. following code didn’t use psi4 based calculation, to do it I need to modify input file because last data has not only target molecule but also 50 water molecules. I’ll check the package and psi4 more.

Today’s code can find from my gist/github repo.