Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometime chemists analyze molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition RECAP, BRICS and rdMMPA. It’s useful functions but these functions can’t extract directly extract core and linker from molecule.

I had interested how to do it and tried it.

Following code, Core is defined Rings in the molecule and Linker is defined chain which connects two rings.

I defined main function named ‘getLinkerbond’ which extracts bonds that connects atom in ring and atom not in ring.

Then call Chem.FragmentOnBonds function to get fragments by cutting linker bond.

Let’s see the code. First, import some libraries. And define test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define main function. I wrote helper function for the main function. is_in_samering function check the atoms which construct a bond are in same ring or not.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

Following code, I set atom prop at first to keep original atom index. Because GetScaffoldForMol function returns new molecule with new atom index. I would like to use original atom idex in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
    return res

Check the function.

for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

The function cut murckoscaffold structure as default. So bonds not connect rings is not cut.

Next, case of useScaffold False. In this case all bonds outside of rings are cut.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)

Works fine.

It is useful for molecular decomposition at linker site I think.

Today’s whole code is uploaded my gist. Any comments and advices will be high appreciated. ;)

Make report with rdkit and matplotlib #RDKit #memo #chemoinformatics

Recently jupyternotebook is being very powerful and useful tool for chemoinformatitian. It can be able to not only analysis but also visualize data. Of course I love it.

However sometime it is not so friendly tool to medchem I think. So I think that it is nice to having reporting function for medchem.

I found some useful code in github the coude use matplotlib for making PDF. And rdkit has drawing function for MolToImage, hummmm it seems fun! I tried to make PDF report maker with rdkit and matplotlib.

Following code is just a simple example but by using the same approach, we can make report with many kinds of plot / table and compound structure.

OK, let’s dive to code! At first import library.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import SimilarityMaps
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pgf import PdfPages
import os
from PIL import Image
import io
import numpy as np

Then off the matplotlib interactive plot function and load sample molecules. And define simmap function which calculates similarity map and weights and return the structure as PIL image object. The function makes similarity map of refmol and probmol.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))]
for m in mols:
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
refmol = mols[0]
probmol = mols[1]
def simmap(refmol, probmol):
    im, score = SimilarityMaps.GetSimilarityMapForFingerprint(refmol, 
    bio = io.BytesIO()
    im.savefig(bio, bbox_inches='tight', dpi=200)
    im =
    return im, score

Next get similarity map image. I use subplt2grid function later. So I need to get image object with GetSimilarityMapForFingerprint function at first, because GetSimilarityMapForFingerprint use matplotlib.pyplot object. So plt information is reseted at each calling the function.

im1, _ = simmap(mols[0], mols[0])
im2s, scores = [], []
for i, m in enumerate(mols[:5]):
    im2, score = simmap(mols[0], m)

Almost there, let’s make report! Following example make pdf report with similarity map and some molecular property table. And save it as PDF format.

fig = plt.figure(figsize=(8.27, 11.69)) #A4 size
pdf_pages = PdfPages('report.pdf')

for i, m in enumerate(mols[:5]):
    ax1 = plt.subplot2grid((5,3),(i,0))
    im1 = im1.resize((400,400))
    ax1.imshow(im1, interpolation="catrom")
    ax2 = plt.subplot2grid((5,3),(i,1))
    ax2.imshow(im2s[i], interpolation="catrom")
    ax3 = plt.subplot2grid((5,3),(i,2))
    ax3.table(cellText=[["SimScore",f"{scores[i]:.2}"]],bbox=(0,0,1,1), cellLoc="center")
pdf_pages.savefig(fig, dpi=200)

By using same manner, it is easy to make table which has more molecular information such as biological activity, physchem prop etc. And also easy to add different type of plot such as scatter, bar, radar etc.

rdkit and matplotlib will be nice combi for chemoinformatics.

I pushed today’s code on following URL ;).

Draw RDKit mol/reaction object on HTML without static png image #RDKit #memo

I think this post might not useful for many people. Just for my memorandum.

When I make web app, I use draw SVG function for rdkit object rendering. But from last my post and Greg’s gist, I could learn draw png image without writing static png files. So I wonder that can I draw PNG image on HTML without png file.

Fortunately I could find the solution!

To do that, at first draw rdkit object and convert it png byte text.
Then encode base64 and decode UTF8, that’s all!

Generated text can render in HTML with <img src=’data:image/png;base64> tag.

Today’s code is below. ;)

Edit reaction image and draw it #RDKit #memo

I often use rdkit on jupyter notebook because notebook can render molecules very conveniently. However I couldn’t find way to edit font size of reaction some days ago.

Following code is an example. Changing atom font size of MolToImage is easy but reaction image can’t apply same manner.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem import rdChemReactions as Reactions
from rdkit.Chem.Draw import IPythonConsole
from PIL import Image

mol = Chem.MolFromSmiles('c1ccncc1')
rxn = Reactions.ReactionFromSmarts('CC(=O)C>>CC(O)C', useSmiles=True)

Image from default settings

OK, next tried to change font size.

Draw.DrawingOptions.atomLabelFontSize = 30


mol object was drawn with changed font size but reaction object wasn’t.

rxn object was drawn with small font size

I found the answer to solve the issue. It was enable by using MolDraw2DCairo and change its settings. It is not efficient way because png file was required. But I could not find any other solution now.

drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
im ='test.png')

drawer = rdMolDraw2D.MolDraw2DCairo(800, 200)
im ='test2.png')

As you see, atom font size was changed. ;)

Today’s code was uploaded to my gist.

Any advice will be great fully appreciated. Have a nice weekend!!!


Greg showed me how to render PNG image without drawing temporal file.
Thanks so much!

I have some idea to using the method. I’ll post it after tried it.

Partial charge visualization with RDKit #RDKit #QuantumChemistry #psikit

Last post, I showed how to visualize each fingerprint contribution for predictive model results with RDKit.

The function also can visualize atomic properties such as a partial charge of atoms.

Dr. Thomas Evangelidis shared me an example of visualization of quantum chemistry based atom property. He showed an example of conformation effect to the QM calculation.

He tried to calculate PM-6 semi empirical partial charge and draw it with rdkit. He used MOPAC as QM engine. Following figure shows partial charge of same molecule with different conformation. Red arrow shows difference of each molecule.

And also by using partial change from 100 conformers data, he showed how to visualize std and mean of the partial charge.

Mean of partial charge
std of two conformers partial charge

From above image, conjugated aromatic system shows large difference between two conformers. It seems reasonable I think.
You can check whole his code on following URL.

Back to the latest rdkit blog post, Greg showed an example of visualization of partial charge with Extended Huckel method which is newly implemented on RDKit!

I had interest the approach so I tried to use psikit for partial charge calculation.

Following code is almost same as original blog post but little difference, just using psikit.

Current version of psikit can calculate mulliken charge, esp, resp charge and lowdin charge very conveniently. But it took longer time for calculation compared to rdEHTools method.

code on my gist

psikit seems more sensitive to aromatic atoms. I think it is needed to select method in case by case. But rdkit’s rdEHTools works very fast so it is useful tool of light weight QM calc.

Finally thanks Dr. Thomas again for sharing nice code example!

Model interporation with new drawing code of RDKit #RDKit #Machine learning #chemoinformatics

Following code does not use new drawing code but it revised one of my old post. :)

I think, everyone who visits my blog has already read Gregs nice blog post about the drawing similarity map with new code. If you don’t read it I recommend to read it soon. URL is below. ;)

Now I like the drawing code because it can draw only atom property of own molecule but also compare molecular similarity and fingerprint contribution of predictive model.

Interpolation of the machine learning model is very important because it is use for next design of the new molecule.

So let’s write code with new drawing code. Following code is almost same as RDKit cook book and Greg’s blog post. Import some packages and check the version of rdkit. I used the newest version.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from functools import partial
from rdkit import Chem
from rdkit import rdBase
from rdkit import RDPaths
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem.Draw import SimilarityMaps, IPythonConsole
from IPython.display import SVG
import io
from PIL import Image
import rdkit
import numpy as np
from sklearn.ensemble import RandomForestClassifier

Load solubility data which is provided from rdkit data folder for example.

trainpath = os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.train.sdf')
testpath =  os.path.join(RDPaths.RDDocsDir,'Book/data/solubility.test.sdf')
train_mols = [m for m in Chem.SDMolSupplier(trainpath) if m is not None]
test_mols = [m for m in Chem.SDMolSupplier(testpath) if m is not None]
val_dict = {'(A) low':0,
           '(B) medium':1,
           '(C) high':2}

Then define some helper function.

def mol2arr(mol, fpfunc):
    fp = fpfunc(mol)
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr
def show_png(data):
    bio = io.BytesIO(data)
    img =
    return img
fpfunc = partial(SimilarityMaps.GetMorganFingerprint, nBits=1024, radius=2)

Then calculate fingerprint and get solubility class. And make randomforest classifier object and fit it. It’s very common way for sunday chemoinformatician. ;)

trainX = [fpfunc(m) for m in train_mols]
trainY = [val_dict[m.GetProp('SOL_classification')] for m in train_mols]
testX = [fpfunc(m) for m in test_mols]
testY = [val_dict[m.GetProp('SOL_classification')] for m in test_mols]
rfc = RandomForestClassifier(random_state=794), trainY)

Next define getProba function the code is borrowed from rdkit cookbook. One difference to cookbook is I used third probability because I would like to visualize effect of the each fingerprint to the high solubility. And defined draw function with some optional arguments. ‘alpha’ is useful because it can control the blending value for the contour lines. It is key for beautiful visualization.

# Following example I would like to get probability of High solubility
def getProba(fp, predctionFunction):
    return predctionFunction((fp,))[0][2]
def drawmol(mols, idx):
    d = Draw.MolDraw2DCairo(1,1)
    _, maxWeight = SimilarityMaps.GetSimilarityMapForModel(mols[idx],fpfunc, 
                                                           lambda x: getProba(x, rfc.predict_proba),

Now almost there. Let’s visualize High and Low solubility molecules with the drawing code!

high_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(C) high']
low_test_mols = [mol for mol in test_mols if mol.GetProp('SOL_classification') == '(A) low']
drawmol(high_test_mols, 0)

Hmm good, the model detects hydroxyl group is positive effect to solubility. How about second example?

drawmol(high_test_mols, 10)

OK, carbonyl group detected as positive group for solubility.

drawmol(low_test_mols, 5)

In this case it is difficult because all atoms are lipophilic. So remove any atoms are positive effect to solubility in my understanding.

The code investigates the effect of the each atom by removing one by one. And calculate difference of probability of prob molecule and atom remove molecule.

So, the magnitude of color shows relative effect of the each atom not absolute effect. This is reason why low solubility molecule shows many positive (red) colors in this example.

Any way, model interporation with 2D color mapping on the molecule is quite nice tool for QSAR analysis.

I would like to write more example codes.

Today’s code can be found on my gist and github.

Any comments suggestions and requests will be greatly appreciated. ;)


Handling solid-containing photochemical reactions #memo #organic_chemistry

Happy new year! This is the first post of 2020. ;-)
End of the last year I found some interesting articles and this is one of them. The URL is below.

This research describes new reaction device of photohemical reactions. As you know, photochemical reaction is useful for not only organic chemistry but also medicinal chemistry. Because it can be able to make sp3-sp3, sp2-sp3 and C-N bond efficiently. However it is difficult to conduct the reaction which requires insoluble reagents and/or generates insoluble materials during the reaction. Clogging is major problem of flow chemistry, so heterogeneous reaction is challenging chemistry.

To solve the issue, the author developed new continuous stirred-tank reactor(CSTR) cascade. Fig2 in the article shows image of the reactor. They placed slurry pump on top of the reactor to use gravity for efficient slurry pumping.

And they analyzed residence time of two CSTR which has different reactor diameter.

At first their trial of Ni-Ir catalyzed cross coupling showed low yield with long residence time even if they found that yield correlates residence time in pre-study.

But finally the author optimized reaction condition and achieved compatible yield to batch reaction! The reaction solvent is key factor.

In summary they developed new CSTR platform for handling heterogeneous photochemical reaction in flow.

I surprised that I found one of the author in article of artificial intelligence. I think it’s an ideal research style. I love wet and dry combination of chemistry!