Plot Chemical space with d3js based library #RDKit #Chemoinformatics

Recently I posted making interactive plot on jupyter notebook.
I used altair for doing it. Today, I used d3js and matplotlib based package to make scatter plot.
mlpd3 is another tool for making interactive plot with python.

In the original site, many examples are provided. It seems easy to make any kinds of plot with tooltip. ;) So, I want to try whether the module can SVG image as tooltip.
From original site document mlpd3 supports following version of python. The mpld3 project is compatible with Python 2.6-2.7 and 3.3-3.4. But I want to run my code on python3.6. So I installed mpld3 to py3.6 env.
And current version of mpld3 causes error. I referred SOF and installed bug fixed version from github repo.

python -m pip install --user "git+"

Now ready. Let’s write code! To run the code on jupyter notebook, mpld3.enable_notebook() option is recommended.

%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpld3
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from mpld3 import plugins

Then perform PCA with morgan fingerprint. This is not so difficult. And make moltosvg function for tooltip. I browed the function from rdkit blog post. I used svg strings as tooltip.

def fp2arr(fp):
    arr = np.zeros((1,))
    return arr

# Original code is described in the rdkit blog post.

def moltosvg(mol,molSize=(450,15),kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    svg = drawer.GetDrawingText()
    return svg.replace('svg:','')

fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for m in mols:
fps = [fpgen.GetFingerprint(m) for m in mols]
X = np.asarray([fp2arr(fp) for fp in fps])

pca = PCA(n_components=3)
res = pca.fit_transform(X)
# make set of SVG
svgs = [moltosvg(m) for m in mols]

Finally make scatter plot. This can do with almost same way to matplotlib. To use mlpd3, user do not need to write Javascript for making tooltip. Of course you can write your own javascript for implementation of more complex features .

fig, ax = plt.subplots()
ax.set_title('Viz chemical space!')
points = ax.scatter(res[:,0], res[:,1])
# This is key point for making tooltip!
tooltip = plugins.PointHTMLTooltip(points, svgs)
plugins.connect(fig, tooltip)

After running the code, I could get interactive plot which has chemical structures as a tooltip. ;)
I would like to make frame work for chemical space visualizer.

Scatter plot

Whole code can view from nb viewer.


Generate possible molecules from a dataset #Chemoinformatics #RDKit

Recently @dr_greg_landrum  shared very cool Knime work flow which can enumerate possible molecules from given dataset. You can get the Work flow from following URL.
This work flow is really useful and elegant. If you have interested in, I recommend to use it!!!
And Patrick Walters disclosed nice code for Free-Wilson analysis on github.

Inspired these nice code, I want to write code for molecular enumeration from given molecules in python. ;)
At frist, import libraries.

import os
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdFMCS
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem import rdmolops
from rdkit.Chem import RDConfig
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Scaffolds import MurckoScaffold

from collections import defaultdict
from itertools import product
import igraph
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import re
IPythonConsole.ipython_useSVG = True

Then define functions. To do retrieve set of similar compounds from the dataset, I used molecular similarity graph. I used python-igraph. To use igraph, I can pick up the largest subgraph from parent graph.
In this post, I used cdk2.sdf as demo data.

def gengraph(mols,fpgen , threshold=0.7):
    fps = [fpgen.GetFingerprint(m) for m in mols]
    num_v = len(mols)
    graph = igraph.Graph()
    for i in range(num_v):
        for j in range(i):
            if DataStructs.TanimotoSimilarity(fps[i], fps[j]) >= threshold:
                graph.add_edge(i, j)
    return graph
fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for mol in mols:
fps = [fpgen.GetFingerprint(m) for m in mols]
graph = gengraph(mols, fpgen, 0.4)
simmols_idx = sorted(list(blks), key=lambda x: len(x), reverse=True)
simmols = [mols[i] for i in simmols_idx[0]]
scaff = [MurckoScaffold.GetScaffoldForMol(m) for m in simmols]

Then, get MCS. I investigated some conditions and I selected mcs2.

mcs1 = rdFMCS.FindMCS(scaff, threshold=0.7)
mcs2 = rdFMCS.FindMCS(scaff, threshold=0.7, 
mcs3 = rdFMCS.FindMCS(scaff, threshold=0.7,

Next, select molecule which has mcs2 as substructure from dataset.
And there is a hacky method, I define getMCSSmiles because following process needs to molecule which can sanitize as core.
The idea came from rdkit cook book.

mols_has_core = []
core = Chem.MolFromSmarts(mcs2.smartsString)
for mol in mols:
    if mol.HasSubstructMatch(core):
def getMCSSmiles(mol, mcs):
    mcsp = Chem.MolFromSmarts(mcs.smartsString)
    match = mol.GetSubstructMatch(mcsp)
    smi = Chem.MolFragmentToSmiles(mol, atomsToUse=match)
    return smi
mcs_smi = getMCSSmiles(mols_has_core[0], mcs2)
core = Chem.MolFromSmiles(mcs_smi)

Now I could get core for RGroup decomposition. So I did RGDecomposition with RDKit function. And make dataset and core for molecule generation.

rgp = rdRGroupDecomposition.RGroupDecompositionParameters()
rgp.removeHydrogensPostMatch = True
rgp.alignment =True
rg = rdRGroupDecomposition.RGroupDecomposition(core, rgp)
for mol in mols_has_core:
dataset = rg.GetRGroupsAsColumns()
core =  Chem.RemoveHs(dataset["Core"][0])

Almost there. To generate possible combination of molecules I need following steps. 1) RGroup decomposition, 2) retrieve information of side chains, 3) Generate new molecules from all possible combination of side chains.
To do that I define two functions named makebond and enumeratemol.

def makebond(target, chain):
    newmol = Chem.RWMol(rdmolops.CombineMols(target, chain))
    atoms = newmol.GetAtoms()
    mapper = defaultdict(list)
    for idx, atm in enumerate(atoms):
        atom_map_num = atm.GetAtomMapNum()
    for idx, a_list in mapper.items():
        if len(a_list) == 2:
            atm1, atm2 = a_list
            rm_atoms = [newmol.GetAtomWithIdx(atm1),newmol.GetAtomWithIdx(atm2)]
            nbr1 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm1)) for x in newmol.GetAtomWithIdx(atm1).GetBonds()][0]
            nbr2 = [x.GetOtherAtom(newmol.GetAtomWithIdx(atm2)) for x in newmol.GetAtomWithIdx(atm2).GetBonds()][0]
    newmol.AddBond(nbr1.GetIdx(), nbr2.GetIdx(), order=Chem.rdchem.BondType.SINGLE)
    newmol = newmol.GetMol()
    return newmol

def enumeratemol(core,rg, maxmol=10000):
    dataset = rg.GetRGroupsAsColumns()
    labels = list(dataset.keys())
    pat = re.compile("R\d+")
    labels = [label for label in labels if pat.match(label)]
    rgs = np.asarray([dataset[label] for label in labels])
    i, j = rgs.shape
    combs = [k for k in product(range(j), repeat=i)]
    res = []
    for i in combs:
        mol = core
        for idx,j in enumerate(i):
            mol = makebond(mol, rgs[idx][j])
        mol = Chem.RemoveHs(mol)
    return res

Now ready! Let’s go generation step. And check structures.

res = enumeratemol(core,rg)

It seems work fine. At last, check number of generated molecules.

print(14**3, len(res))
> 2744 2744

Yah! The code seems work fine. It is not so efficient way because if data has many RGroups, huge number of molecules are generated.
So I think combination of predictive model or / and molecular filter is useful for undesired molecule cutting.
I could learn useful function of RDKit today. Really useful toolkit for chemoinformatics!
Whole code of the post can view from following URL.

Any suggestion and advices are appreciated. ;)

Industrial ADME data Sets and Deep Learning #Chemoinformatcs

In pharma, predictive models are widely used such as activity, ADME, phychem and Tox. And recently many articles which use Deep Learning are published. Deep Learning is powerful tool for prediction but it is difficult to find appropriate hyper parameters. Today I read informative article published researcher from Lilly.

They investigated performance of DNN and SVM with in-house ADMET datasets such as solubility, CYP inhibition, p-gp efflux, MDCK, Metabolic stability and fu.

The author used ECFP6 as both input and DNN is implemented by using pytorch (code and dataset is not disclosed). And investigated hyper parameters are learning rate, weight decay, dropout rate, activation function and batch normalization. And they evaluated their performance with AUROC, MCC for classification and RMSE for regression.

It is worth for me that batch normalization shows big impact for reducing the potential of poor performance. It could find in Fig1.

Next they analyzed the effect of the hyper parameters on regression models. In regression models, they found that interaction of activation function and learning rate is most important. And many other results are described in the article. If reader who is interested the article I recommend to read it.

Finally it is interesting for me that DNN which has optimized hyper parameters showed marginal improvement over the SVM models.

For prediction DNN does not big advantage compared to traditional machine learning approach such as RF, SVM etc. But it has attractive feature like auto encoding, transfer learning, generator etc etc.

I need study and practice not only DNN but also traditional ML approach.

Visualize pharmacophore with RDKit #RDKit #Pymol #Chemoinformatics

Some years ago, I wrote a post about how to communicate pymol and RDKit. In the post, I demonstrated how to visualize Phamarcophore in rdkit.

And recently I got a query about the post and think about more efficient way.
What I want to write in the post is new approach to visualize pharmacophore with RDKit.

To do it, I ran pymol in server mode. The environment of pymol session is not need to same in rdkit env. I installed pymol via homebrew. So this version is python 3.7. And my environment of rdkit is python3.6.

iwatobipen$ pymol -R

Then write code on jupyter notebook. I picked up several mols from cdk2.sdf which is provided from rdkit Docs/Book/data folder.

from IPython import display
import os
import subprocess
from rdkit import Chem
from rdkit import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdShapeHelpers
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import PyMol
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import copy
import pprint
mols = Chem.SDMolSupplier('./before_align_cdk2.sdf', removeHs=False)
cdk2mol = [m for m in mols]
for m in cdk2mol:
    AllChem.EmbedMolecule(m, AllChem.ETKDGv2())

Then aligned molecules with crippenO3A method. I used first molecule as reference. And after align I saved molecule to align_cdk2.sdf file.

cdk2mol2 = copy.deepcopy(cdk2mol)
crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in cdk2mol2]
ref = cdk2mol2[0]
ref_contrib = crippen_contribs[0]
targets = cdk2mol2[1:]
targets_contrib = crippen_contribs[1:]

for i, target in enumerate(targets):
    crippenO3A = rdMolAlign.GetCrippenO3A(target, ref, targets_contrib[i], ref_contrib)

w = Chem.SDWriter('./align_cdk2.sdf')
for mol in targets:

Key function of ShowFeatures is ‘rdkit/Chem/Features/’
To use the function, it is needed to run the code like below.

iwatobipen$ python 

I want to call the function in jupyter notebook. So I used subprocess.
Following example shows before alignment and after alignment.

showfeatpath = os.path.join(RDConfig.RDCodeDir, 'Chem/Features/')

At first visualize molecules before align. ‘display’ function can display png image on notebook.

# Before align
v = PyMol.MolViewer()
process = subprocess.Popen(['python', showfeatpath, '--writeFeats','./before_align_cdk2.sdf'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]

Hmm, the function can detect pharmacophores of each molecule but not aligned form. Next visualize molecules

#After align
v = PyMol.MolViewer()
process = subprocess.Popen(['python', showfeatpath,'--writeFeats','./align_cdk2.sdf'], stdout=subprocess.PIPE)
stdout = process.communicate()[0]

Now I could get following image.

And I set –writeFeats option. The option provides position of each pharmacophores.

res = stdout.decode('utf-8').replace('\t', ' ').split('\n')

['# Family    X     Y    Z    Radius  # Atom_ids',
 'Donor -4.447 -0.580 -2.127 1.0 # 11',
 'Donor -2.369 -0.730 -2.637 1.0 # 13',
 'Donor -4.121 -0.009 0.275 1.0 # 14',
 'Donor -3.602 0.536 2.585 1.0 # 17',
 'Acceptor 2.288 -0.534 -1.549 1.0 # 5',
 'Acceptor -0.161 -0.294 -0.684 1.0 # 7',
 'Acceptor -2.369 -0.730 -2.637 1.0 # 13',
 'Acceptor -4.121 -0.009 0.275 1.0 # 14',
 'Acceptor -1.919 0.112 0.908 1.0 # 16',
 'PosIonizable -3.325 -0.576 -2.046 1.0 # 9 13 12 11 10',
 'Aromatic -3.325 -0.576 -2.046 1.0 # 9 10 11 12 13',
 'Aromatic -2.826 -0.102 -0.038 1.0 # 8 9 10 14 15 16',
 'Hydrophobe 3.473 -0.083 0.406 1.0 # 2',
 'LumpedHydrophobe 3.899 0.272 0.319 1.0 # 2 1 3',
 'Donor -4.491 -0.607 -2.118 1.0 # 2',
 'Donor -2.390 -0.682 -2.613 1.0 # 5',
 'Donor -4.131 -0.045 0.283 1.0 # 9',
 'Donor -3.541 0.496 2.576 1.0 # 10',
 'Acceptor -2.390 -0.682 -2.613 1.0 # 5',
 'Acceptor -1.879 0.136 0.883 1.0 # 7',
 'Acceptor -4.131 -0.045 0.283 1.0 # 9',
 'Acceptor -0.163 -0.211 -0.758 1.0 # 11',
 'Acceptor 2.410 -1.330 -1.015 1.0 # 17',
 'PosIonizable -3.355 -0.566 -2.032 1.0 # 3 2 1 5 4',
 'Aromatic -3.355 -0.566 -2.032 1.0 # 1 2 3 4 5',
 'Aromatic -2.828 -0.097 -0.047 1.0 # 3 4 6 7 8 9',
 'Donor -4.513 -0.557 -2.079 1.0 # 2',
 'Donor -2.426 -0.726 -2.623 1.0 # 5',
 'Donor -4.148 -0.000 0.288 1.0 # 9',
 'Donor -3.550 0.529 2.554 1.0 # 10',
 'Donor 2.549 0.697 -1.275 1.0 # 18',
 'Acceptor -2.426 -0.726 -2.623 1.0 # 5',
 'Acceptor -1.881 0.097 0.872 1.0 # 7',
 'Acceptor -4.148 -0.000 0.288 1.0 # 9',
 'Acceptor -0.137 -0.319 -0.735 1.0 # 11',
 'Acceptor 4.371 2.116 -1.804 1.0 # 17',
 'PosIonizable -3.375 -0.565 -2.021 1.0 # 3 2 1 5 4',
 'Aromatic -3.375 -0.565 -2.021 1.0 # 1 2 3 4 5',
 'Aromatic -2.824 -0.106 -0.053 1.0 # 3 4 6 7 8 9',

I want to do with Features.ShowFeats function directly from ipython notebook but it was difficult. So I call this function from other process.

BTW, it is interesting and cool code for visualize pharmacophores. RDKit is nice tool for chemoinformatics.

All code is uploaded to my repo.

New scalable Flow photo Chemistry Platform New Flow Chemistry Platform #photochemistry #flowchemistry

Recently visible light used organic reaction is attractive area for me. Because it can make difficult bond in short steps under mild reaction conditions.
But scale up is difficult because key factor of photo reaction is light. It means photo absorbance is important and so it is difficult to scale up. One approach is Plug Flow Reactor. It can increase illumination of the reaction solution by increasing surface-area-to volume ratio. But kilo gram scale up is still difficult.

Today I read an exciting article to overcome the issue. The article is published from researcher of AbbVie.

They proposed to use high intensity lasers for light source and Continuously stirred-tank reactor(CSTR) with Beam Expander. From freely available supporting information, reactor looks like below.

They optimized reaction condition and succeeded to conduct 1.54kg synthesis of substituted benzene derivative with C-N cross coupling in 100ml CSTR. It took only 32h!

CSTR has advantage for solid handling compared to flow reactor. I think this reactor is interesting and attractive for large scale photo chemical reaction.

It is worth to know that big pharma try to develop not only drug but also science and technology I think.

Cryo-EM data analysis with deep learning #arXiv

Recently number of publications with cryo-EM is growing. And also number of data storage is growing.
From EMDB,
I could get following statistics.

And now, Cryo-EM is collecting a lot of attention in drug discovery area because the method has possibility of determination for difficult targets that can not accessible to X-ray analysis. I’m not sure about cryo-EM but have interest the technology.

One of the challenges is molecular model building. For Cryo-EM, 2D projection images need to convert volumetric data and model the atomic coordinates of each amino acid. The modeling process is still time consuming step.

Today I found exciting article.
The authors developed new approach for molecular modeling named ‘A2-Net’ which is used Neural network (3D convolutional network) and MCTS.

Neural network is used in step one, which determines the 3D coordinates of atoms in each amino acid. And MCTS is used in step two, which prune the candidate amino acids in main chain.

It is interesting for me that, first step of A2 net is prediction of amino acids category and their coordinates from the volumetric data!

After getting the proposal, they used 3D stacked hourglass network for further pose estimation. Then MCTS is used.

Finally their method is outperform to ROSETTA-denovo.

The article indicates that Deep learning is powerful method for 3D detection. I would like to learn 3D detection and Cryo-EM.





1. Githubにもっとコードあげる
2. 英語
3. pytorch
 なんとなく使ったことはあるけどあまり深く勉強はしてなかったので今年はコレでなんか作りたいですね。今まで僕はTensorflowまたはKerasメインでしたので一度Define by runの方も触ってみようと思います