Draw scaffold tree as network with molecular image #RDKit #Cytoscape

I posted new function about scaffold tree which is implemented in rdkit 2020 03 before.

In previous my post, I showed example to draw scaffold tree with networkx. It could draw the scaffold tree as a network but molecular structures are not shown on the node. For chemist, structure image is important so I tried to draw network with structure image on node.

Fortunately cyjupyter is very useful package to do it, it can call cytoscape.js internally so user can draw network with many options.

OK let’s try. At first import required packages. And set example strcutures.

from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
import networkx as nx
from networkx.readwrite import cytoscape_data
import cyjupyter
from cyjupyter import Cytoscape
from rdkit.Chem import AllChem
from rdkit.Chem.Scaffolds import rdScaffoldNetwork
from urllib import parse
smiles_lsit = [

Then create scaffold tree following code is almost as same as previous post.

params = rdScaffoldNetwork.ScaffoldNetworkParams()
netwks = rdScaffoldNetwork.CreateScaffoldNetwork([mols[1]], params)

Then define smiles to image function by using MolDraw2DSVG which is implemented in RDKit. The function generates svg from SMIELS. It is used for rendering image on the network node.

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    drawer = rdMolDraw2D.MolDraw2DSVG(690, 400)
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg
def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Finally draw network with cytoscape.js!

g = nx.graph.Graph()
for idx, node in enumerate(netwks.nodes):
    g.add_node(idx, smiles=node, img=smi2image(node), hac=Chem.MolFromSmiles(node).GetNumAtoms())
g.add_edges_from([(e.beginIdx,e.endIdx) for e in netwks.edges])
cy_g = cytoscape_data(g)
  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            'selector': 'edge'}
cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='breadthfirst')

Now I could get scaffold tree with molecular image. ;)


I tried to change atom font size but failed. I’m trying to do that now.

I uploaded today’s code on my gist. RDKit has many functions for chemoinformatics. I would like to buy a beer to the developers! Thanks!

Visualize missing data with ipython notebook #visualization

Combination of Pandas and jupyter notebook is useful for data analysis. In the case of handling real world data, there many data set which has missing value.

So fist step of data analysis is checking these missing data. Pandas_profiling is the one of useful package to do it. It’s able to make rich report with very short code. However it isn’t suitable for large scale dataset.

To handling large data set, missingno is another candidate to analyze data.

Both tool can be installed with pip command.

I used missingno with small data and compared to pandas_profiling today.

I borrowed data set from this blog post. The data is about diabetes.
Both package could make nice visualization with few lines. pandas_profiling provides very information rich view as show below.

data overview
missing data view

And missingno can make not only same missing data view but also bar chart, heat map which shows nullity correlation: how strongly the presence or absence of one variable affects the presence of another.

Today’s simple code uploaded my gist.


It’s very important and difficult problem that how to handle missing data. To visualize these data will be first step of data analysis. So missingno is useful package for data analyst I think.

Test new method of rdkit:2020

Now beta version of rdkit is available from anaconda! So I would like to try it. However I would like to test without contaminating current my environment. So I tried new version of rdkit with Docker.

Fortunately rdkit can be installed via conda, so I made Dockerfie based on miniconda3. Following dockerfile used continuumio/miniconda3. By using the image I could install required packages via conda command. ;)

FROM continuumio/miniconda3
MAINTAINER iwatobipen 

RUN pip install --upgrade pip
RUN conda install -c rdkit/label/beta rdkit -y
RUN conda install -c conda-forge seaborn pandas scikit-learn -y
RUN conda install -c conda-forge networkx jupyter jupyterlab -y
RUN pip install py2cytoscape

WORKDIR /workdir


ENTRYPOINT ["jupyter-lab", "--ip=", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''"]

CMD ["--notebook-dir=/workdir"]

Next, build image and run!

Type following command.

$ sudo docker build -t rdkit-beta .
$ sudo docker run -it -p 8888:8888 --rm --name rdkit-beta --mount type=bind,src=`pwd`,dst=/workdir rdkit-beta

Then jupyter-lab will launch and I can access notebook the following URL. ‘localhost:8888’

And I tried to use new function named rdScaffoldNetwork. It is the implementation of scaffold tree. If reader would like to know what is the scaffold tree, please check the original article.

It is very useful for analyze compounds scaffolds. Following code is only just used it with simple case. So it is not practical example for real drug discovery project. I would like to apply the approach against real project near the feature.

I pushed dockerfile and notebook to my repo.


Any comments, suggestion will be greatly appreciated. And thank developer of RDKit!!! ;)

Use ORBKIT for rendering MO #orbkit #rdkit #psikit #quantum_chemistry

As you know, there many packages for quantum chemistry not only commercial software but also free tools. And each soft has own output format. So user need to understand how to use it. But it is very time consuming step I think.

ORBIKIT is a modular python toolbox for cross-platform post processing of quantum chemical wavefunction data.

The reference URL is below.

It seems nice isn’t it? Fortunately the source code is disclosed on github.

So I tried to use orbkit with psikit. At first, I installed orbkit while referring the original site.

At first, I installed cython, numpy, scipy, h5py with conda and mayavi installed with pip (mayavi couldn’t install with conda due to package conflicts).

Then install orbkit.

$ cd $HOME
$ git clone https://github.com/orbkit/orbkit.git
$ export ORBKITPATH=$HOME/orbkit
$ python setup.py build_ext --inplace clean

And also I added environment variables to my .bashrc.

Now ready, let’s use orbkit. I used psikit for making fchk file for orbkit. orbkit supports many kinds of input format. If reader has interest the package, please check the original document.

Following code shows how to use psikit and visualize MO with orbkit.

After running the code, mayavi viewer launched and could get MO view, the image is shown below.

Of course current psikit has some functions for communicate pymol for rendering MO but orbkit is also useful package for quantum chemistry.

I uploaded today’s code on my github repo.


New molecular fingerprint for chemoinformatics #map4 #RDKit #memo #chemoinformatics

Molecular fingerprint(FP) is a very important for chemoinformatics because it is used for building many predictive models not only ADMET but also biological activities.

As you know, ECFP (Morgan Fingerprint) is one of golden standard FP of chemoinformatics. Because it shows stable performance against any problems. After ECFP is reported, many new fingerprint algorithm is proposed but still ECFP is used in many case.

It means that more improvement of fingerprint is required but it’s too difficult task.

Recently new fingerprint algorithm named MAP4 is proposed from Prof. Reymond lab. URL is below.

MAP of MAP4 means “MinHashed atom-pair fingerprint up to a diameter of four bonds“.

The calculation process is below.

First, the circular substructures surrounding each non-hydrogen atom in the molecule at radii 1 to r are written as canonical, non-isomeric, and rooted SMILES string .

Second, the minimum topological distance TPj,k separating each atom pair(j, k) in the input molecule is calculated.

Third,all atom-pair shingles are written for each atom pair (j, k) and each value of r, placing the two SMILES strings in lexicographical order.

Fourth, the resulting set of atom-pair shingles is hashed to a set of integers using the unique mapping SHA-1, and its corresponding transposed vector is finally MinHashed to form the MAP4 vector.

It seems similar approach to ECFP but this approach uses minhashing techniques. It works very fast and this fingerprint outperform compared to their developed MHFP, ECPF and other fingerprints which are impremented RDKit.

In their article Fig2 showed performance of virtual screening against various targets and MAP4FP outperformed in many targets.

In the Fig8 shows Jaccard distance between set of molecules. MAP4 shows better performance against not only small molecule but also large molecule such as peptide, compared to other finger print such as atom pair FP, ECFP etc.

So I would like to use the FP on my notebook, so I tried to use it.

The author disclosed source code so I could get code from github.


git clone https://github.com/reymond-group/map4
cd map4

I think original repo has bug for folded fingerprint calculation so I maed PR to original repo.

And following code used modified version of original code.

I compared the FP against morganFP came from rdkit.

Today’s code was uploaded my gist and github.

MAP4Calculator provides calculate and calculate_many methods. The first one calculate MAP4FP of molecule and second one calculates MAP4FP of list of molecules.

is_folded option is false in default, so to get folded(fixed length of) finger print, user need to change the option from False to True.

The test data shown above, Moragn FP seems more sensitive to difference of compound structure. Because similarity is lower to MAP4FP. and folded and unfolded MAP4FP showed almost similar performance.

Today I showed how to calculate MAP4FP, so I would like to check the FP performance with real drug discovery project with any machine learning algorithms. :-)

I also uploaded the code to my github repo.


Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #guacamol

Yesterday I posted benchmarking platform named ‘moses’ and found it worked for test data. And then I could get comment from @Mufei Li, developer of DGL that how about to try guacamol. I checked guacamol before but didn’t try it. So I installed guacamol and used it.

From original repo, GuacaMol is an open source Python package for benchmarking of models for de novo molecular design.

This package is developed by BenevolentAI. This package can assess de novo molecular generator of following metrics.

  1. Validity
  2. Uniqueness
  3. Novelty
  4. Fr´echet ChemNet Distance
  5. KL divergence

5 is different point to moses. KL divergence metrics is based on molecular descriptors. It is not fingerprint base. So different structure but similar molecular property sets will show high score.

For test the package I installed guacamol and used it.

Fortunately, guacamol can install via pip ;).

And I tested two metrics one was KLDivBenchmark and the other was UniquenessBenchmark. As you can see most of the following code is same as test code of original repository.

I uploaded my code to my gist.

Moses and Guacamol are both useful package for benchmarking.

It is important that we should get benchmark data compared to same baseline. However in the real drug discovery project, molecular properties which is required in the projects are depends on their situation. So it is not big problem of generator performance because current models can generate reasonable structures I think.

We need to go beyond that… ;)

Benchmarking platform for generative models. #RDKit #Chemoinformatics #DeepLearning #moses

There are lots of publications about molecular generators. Each publication implements novel algorithms so we need tool for comparing these models that which is better for us.

I often use PCA, tSNE for chemical space visualization and calculate some scores such as QED, SA/SC Score and molecular properties. However I need the unified metrics. So I think Molecular Sets(MOSES) is nice tool to do it.

MOSES provides useful metrics shown below.

  1. Fragment similarity (FRAG) which is defined as the cosine distance between vectors of fragment frequencies.
  2. Scaffold similarity (SCAFF) which is defined as similar as FRAG but the metrics uses frequency of scaffolds instead of fragments.
  3. Nearest neighbor similarity (SNN) which is the average Tanimoto similarity between test molecules and generated molecules.
  4. Internal diversity (IntDiv_p) assesses the chemical diversity within the generated set of molecules. p=1 or 2
    IntDiv_p(G)=1-p\sqrt{\frac{1}{|G2|}-\sum_T(m_1, m_2)^p}
  5. Freched ChemNet Distance (FCD) which uses the penultimate layer of ChemNet and measure distance of reference molecules and generated molecules.

Fortunately source code of MOSES are freely available. I installed moses in my PC and test it.

MOSES repository provides install script, so it is easy to install moses and required packages. I modified setup.py and install it because I use torch=1.2.0 but original setup.py requires ver 1.1.0. So I commented out the line 16 of setup.py.

After installed the package, I used moses from jupyter notebook. All molecules are borrowed from test script of the repository.

Following code is an example. It is easy to get metrics, just calling metrics.get_all_metrics(testmolecules, generatedmolecules). As shown below the method calculate all metrics which are implemented MOSES and show some additional properties such as ratio of valid molecule, qed, molwt etc.

import pandas as pd
from rdkit import Chem

from moses import metrics

test = ['Oc1ccccc1-c1cccc2cnccc12','COc1cccc(NC(=O)Cc2coc3ccc(OC)cc23)c1']
test_sf = ['COCc1nnc(NC(=O)COc2ccc(C(C)(C)C)cc2)s1',
gen = ['CNC', 'Oc1ccccc1-c1cccc2cnccc12',

metrics.get_all_metrics(test, gen, k=3)

>> out

{'valid': 0.8,
 'unique@3': 1.0,
 'FCD/Test': 52.58373533265676,
 'SNN/Test': 0.3152585653588176,
 'Frag/Test': 0.30000000000000004,
 'Scaf/Test': 0.5,
 'IntDiv': 0.7189187332987785,
 'IntDiv2': 0.49790709357032226,
 'Filters': 0.75,
 'logP': 4.9581881764518005,
 'SA': 0.5086898026154574,
 'QED': 0.045033731661603064,
 'NP': 0.2902816615644048,
 'weight': 14761.927533455337}

I think moses is very useful tool for checking performance of molecular generator. Thanks the author for developing such as a nice tool for in silico drug discovery!

Today’s my code uploaded gist.

Enjoy chemoinformatics! ;)