Embed interactive plot in jupyter notebook with panel #chemoinformatics #RDKit #memo #panel

As you know Jupyter notebook is very useful tool for data scientist. It can analyze scientific data with nice view. And there are lots of packages for data visualization. And I often use matplotlib and seaborn for my task. However few days ago, I found an interesting package named Panel which is high level app and dashbording app for python. I posted another package dash before but I’ve never used panel. So I tried to use panel with rdkit.

Panel can install via conda from pyviz channel or pip. I installed panel by using conda.

After installed panel I tested it.

At first I tried to visualize rdkit mol object. Import packages and molecules.

import numpy as np
import pandas as pd
import os
import panel as pn
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import matplotlib.pyplot as plt

sdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for m in mols:

Then made IntSlider to set index of molecule which would like to render. Following example uses depends decorator for making interactive view.

slider = pn.widgets.IntSlider(start=0, end=len(mols), step=1, value=0)
def callback(value):
    return Draw.MolToImage(mols[value])
row = pn.Column(slider, callback)

Now molecule was rendered with slider widget.

It seems work well, next I tried to add range slider to draw molecules. Code is almost same as above.

rangeslider = pn.widgets.IntRangeSlider(start=0, end=len(mols), step=1)
def callback(value):
    return Draw.MolsToGridImage(mols[value[0]: value[1]], molsPerRow=5)
pn.Column(rangeslider, callback)

IntRangeSlider returns tuple of user defined value. So I could select range which would like to render on the notebook.

Next example, I tried to make interactive scatter plot of molecular properties.

To do it, I calculated molecular descriptors with rdkit Descriptors class.

from rdkit.Chem import Descriptors
from collections import defaultdict
dlist = Descriptors._descList
desc_dec = defaultdict(list)
for mol in mols:
    for k, f  in dlist:
df = pd.DataFrame(desc_dec)

Following example used Select widget to select x and y axis and FloatSlider for setting alpha of scatter plot.

from matplotlib.figure import Figure, FigureCanvasBase
columns = df.columns.to_list()
x = pn.widgets.Select(options=columns, name='x_', value='MolWt')
y = pn.widgets.Select(options=columns, name='y_', value='qed')
alpha = pn.widgets.FloatSlider(name='alpha', value=0.5)

@pn.depends(x.param.value, y.param.value, alpha.param.value)
def get_plot(x_v, y_v, a):
    with plt.xkcd():
        fig = Figure()
        ax = fig.subplots()
        ax.scatter(df[x_v], df[y_v], c='blue', alpha = a)
        #fig = df.plot.scatter(x_v, x_v)
        return fig
pn.Column(pn.Row(x, y, alpha), get_plot)

I updated matplotlib version to 3.1.3. From version 3.1.2 matplotlib can make plot with xkcd taste. ;) It’s easy to do it just make plot in with plt.xkcd() line.

Now the scatter plot can interactively select axis and alpha params. I would like to know how to get index of each point and get the value. If I can it I could render molecule when I mouse over the point.

I uploaded today’s code my repo.


Interactive plot can’t see from githubrepo or nbvier so if reader has interest the package I recommend to try it your own PC. ;)

Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph based approach for chemoinformatics area. I can’t cover all of them but still have interest these area. I think pytorch_geometric (PyG) and deep graph library (DGL) are very attractive and useful package for chemoinformaticians.

I wrote some posts about DGL and PyG. Recent DGL is more chemoinformatics friendly so I used DGL for GCN model building today.

At first I tried to use DGL from Skorch but it failed. So I use DGL only for model building. As usual, I used solubility data for the test. And used RDKit for compound structure handling. I used GCNClassifier.
To use DGL for chemoinformatics area, user should install mdtraj at first. Because mol_to_graph try to import the package but if the code can’t find mdtraj. User can’t get any error message and mol_to_graph related module can’t use.

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    print('use GPU')
    print('use CPU')

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem.utils import mol_to_graph
from dgl.data.chem.utils import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2

After data loaded, rdkit mol objects are converted to graph objects. I used canonical atom featurizer for the task. The features used in the module is the same as deepchem.

It is easy to make graph object with DGL. Just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILE strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
> 74

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

OK let’s define GCN model. GCNClassifier takes gcn_hidden_feats argument as list object. If user would like to add n GCN layers, user should pass list with n hidden layers parameters. I added 2 GCN layers with 60 and 20 hidden layers to the following model.

# define GCN NET with 4 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
gcn_net = gcn_net.to(device)

Then define the collate function for original dataloader object.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is multi label classification task so I used CrossEntropyLoss for loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
epoch_losses = []
epoch_accuracies = []
for epoch in range(1,201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        labels = labels.to(device)
        atom_feats = bg.ndata.pop('h').to(device)
        atom_feats, labels = atom_feats.to(device), labels.to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label==pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot learning process with matplot lib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss')
plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc')

The learning curve indicates that the process works fine.

In summary DGL is useful package for chemoinformatics. So I hope more chemistry related example codes will be provided from original repository. ;)

Today’s code can be found in my repository and gist.


Make molecule mesh data #RDKit #chemoinformatics #meshlab

I have an interest to predictive model build with 3D compound information. Pytorch3d and open3d seems attractive package for me. However, to use the package, I need to convert molecular information to 3D data such as pointcloud etc.

At first I tried it to use openbabel because recent version of openbabel can convert molecule from MDL molformat to VDW pointcloud xyz format. It worked well but it was difficult to add color information such as atom kind.

So I gave up to use openbabel and shifted pymol.

I installed open source pymol from source code to my python3.7 env. And tried it. Fortunately pymol can save surface 3D data as wrl format. OK sounds good ….. Wait! Unfortunately open3d can’t read wrl format file so I need to convert wrl file to pyl (point cloud format) format.

Finally I found the solution. Use meshlab! It is easy to install meshlab to linux. Just use apt. ;)

$sudo apt -yV install meshlab

Type meshlab in the terminal meshlab application will launch. And the app can convert file format. But I would like to it from python. So I used meshlabserver from subprocess.

meshlabserver is useful for batch process. Following code is an example for reading 3d mesh data generated from rdkit.

At first, load molecule from SMILES and generate 3D conformer. The save it as MDL mol format. Then load file from pymol from library mode.

Then call ‘show surface’ and change quality of surface.

Finally save surface as wrl format. It is easy. Just call ‘save hogehoge.wrl’ in pymol. ;)

import subprocess
import open3d as o3d
from rdkit import Chem
from rdkit.Chem import AllChem
import pymol
mol = Chem.MolFromSmiles('c1ccncc1C')
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=794) # to generate same coordination of molecule
Chem.MolToMolFile(mol, 'mol.mol')
pymol.cmd.set('surface_quality', '3')

Now I could get molecular surface data as wrl. Then convert the file format. Call meshserver command with -i inputfile -o outputfile and -om vc option.

-om vc means add vertex color information to the outputfile. Following code I would like to get ply format. So -o is surface.ply.

subprocess.call('meshlabserver -i surface.wrl -o surface.ply -om vc'.split())

Readly. Open3d can read ply format. (pytorch3d can read the file format too!)

> True

Open3d has visualization method.


After calling the command above, another window will open and I can see interactive 3d pointcloud image of query molecule.

I tried several condition of surface quality settings. By using default pymol setting, very sparse data is generated.

Now ready to dive into 3D information based deep learning. ;)

Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometime chemists analyze molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition RECAP, BRICS and rdMMPA. It’s useful functions but these functions can’t extract directly extract core and linker from molecule.

I had interested how to do it and tried it.

Following code, Core is defined Rings in the molecule and Linker is defined chain which connects two rings.

I defined main function named ‘getLinkerbond’ which extracts bonds that connects atom in ring and atom not in ring.

Then call Chem.FragmentOnBonds function to get fragments by cutting linker bond.

Let’s see the code. First, import some libraries. And define test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define main function. I wrote helper function for the main function. is_in_samering function check the atoms which construct a bond are in same ring or not.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

Following code, I set atom prop at first to keep original atom index. Because GetScaffoldForMol function returns new molecule with new atom index. I would like to use original atom idex in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
    return res

Check the function.

for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

The function cut murckoscaffold structure as default. So bonds not connect rings is not cut.

Next, case of useScaffold False. In this case all bonds outside of rings are cut.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

Works fine.

It is useful for molecular decomposition at linker site I think.

Today’s whole code is uploaded my gist. Any comments and advices will be high appreciated. ;)

Draw RDKit mol/reaction object on HTML without static png image #RDKit #memo

I think this post might not useful for many people. Just for my memorandum.

When I make web app, I use draw SVG function for rdkit object rendering. But from last my post and Greg’s gist, I could learn draw png image without writing static png files. So I wonder that can I draw PNG image on HTML without png file.

Fortunately I could find the solution!

To do that, at first draw rdkit object and convert it png byte text.
Then encode base64 and decode UTF8, that’s all!

Generated text can render in HTML with <img src=’data:image/png;base64> tag.

Today’s code is below. ;)

Edit reaction image and draw it #RDKit #memo

I often use rdkit on jupyter notebook because notebook can render molecules very conveniently. However I couldn’t find way to edit font size of reaction some days ago.

Following code is an example. Changing atom font size of MolToImage is easy but reaction image can’t apply same manner.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdChemReactions as Reactions
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image

mol = Chem.MolFromSmiles('c1ccncc1')
rxn = Reactions.ReactionFromSmarts('CC(=O)C>>CC(O)C', useSmiles=True)

Image from default settings

OK, next tried to change font size.

Draw.DrawingOptions.atomLabelFontSize = 30


mol object was drawn with changed font size but reaction object wasn’t.

rxn object was drawn with small font size

I found the answer to solve the issue. It was enable by using MolDraw2DCairo and change its settings. It is not efficient way because png file was required. But I could not find any other solution now.

drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
im = Image.open('test.png')

drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
im = Image.open('test2.png')

As you see, atom font size was changed. ;)

Today’s code was uploaded to my gist.

Any advice will be great fully appreciated. Have a nice weekend!!!


Greg showed me how to render PNG image without drawing temporal file.
Thanks so much!

I have some idea to using the method. I’ll post it after tried it.

Process development of fluorinated-pyrrolidin analogue #memo #organic_chemistry

Here is an interesting article for efficient synthesis of fluorinated pyrrolidin synthesis from pfizer.

Fluorine containing building blocks are often used medicinal chemistry. So efficient synthetic route is very useful for us.

Some years ago, I synthesized similar compounds 1-fluoro-2-amino cyclic amine derivatives. I tried to use almost same synthetic scheme described in scheme1. 1) Epoxydation 2) Epoxy opening with sodium azide, 3) DAST-Fluorination, 4) Azide reduction.

The scheme is stereo selective if epoxyde is chiral. I like substrate controlled synthetic rote like this.

The scheme 1 is first generation which uses mCPBA, NaN3 and DAST it is not suitable for large scale production. Then the authors group developed very efficient synthetic route. They used modified burgess type reagent. The reaction with the reagent and 1,2-trans-diol makes cis-cyclic sulfamate intermediate. Key point of the strategy transform trans diol to sys amino alcohol as protected form. And also the its hydroxyl group protected SO2NR, it means the hydroxyl group is leaving group.

So the cyclic sulfamate reacts with TBAF and give trans-Fluoro protected amine. You can check the over view of the synthesis in the online abstract.

The merit of the scheme is that switching fluorination reagent from DAST to TBAF, switching nitrogen source from sodium azide to burgess reagent I think (of course there are more advantages and it described in the article).

Thanks the authors for publishing nice synthetic approach.