(new?) medchem tool box for compound synthesis

This mini perspective shows recent progress of the direct C-H alkylation with Alkyl Sulfinates.

There are many heterocyclic moieties such as pyridine, pyrrole etc. in drug like molecules.
The C-H alkylation reaction of heterocycles is useful but difficult to conduct it under mild reaction conditions.
So mild and universal reaction is very attractive for chemists I think.

In the article, Phil S. Baran’s group reported many examples of their developed reaction.
They used alkyl sulfinates as a radical precursors and developed many kind of commercial sulfinates, fluoroalkyl, heterocyclic, alkyl, aromatic and linker-type.

Surprisingly most of the reactions proceed room temperature. Also they shows reactivity guid lines, it is very practical!

And also this reaction can apply not only early stage of synthesis but also late stage of synthesis.
It means that this is very specific reaction.
I had few successful cases C-H alkylation with sulfinate…. I would like to use the reaction condition when I have chance.


Draw similarity network #RDKit #Cyjupyter

Recently Kei Ono who is developer of cytoscape developed cyjupyter.
It seems attractive for me because the library can draw network diagram on jupyter notebook.
There are many network structured data in chemoinformatics. For example molecule, molecular similarity map and MMP etc… I used the library to draw similarity map of molecules today.
I am newbie of the library, so following code is very simple but there are several useful examples are provided in official repository.
At first, load modules.

import os
import numpy as np
import igraph
from py2cytoscape import util
from cyjupyter import Cytoscape
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

I used CDK2.sdf as a sample dataset

filedir = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
mols = [mol for mol in Chem.SDMolSupplier(filedir) if mol != None]
for mol in mols:
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
smiles_list = [Chem.MolToSmiles(mol) for mol in mols]

Then make graph object, node as an each molecule and make edge if tanimoto similarity more 0.5.

g = igraph.Graph()
for smiles in smiles_list:
for i in range(len(mols)):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        if tc >= 0.5:
            g.add_edge(smiles_list[i], smiles_list[j])

Finally convert graph object to json by using py2cytoscape and Draw graph with default settings.

graph_data = util.from_igraph(g)

I could get following image.

And check network

Cyjupyter is useful network drawing tool for jupyter notebook user. I would like to check the way to control visualization.
My example code is uploaded my repository. URL is below.

New fingerprint/MinHash FingerPrint #RDKit #Chemoinformatics

Recently I found an article that describe new method for fast fingerprint calculation.
You can read the article from chemrxiv, URL is below.
They used MinHash method.
MinHash method is the way to estimate jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared the performance with ECFP4.
? MinHash ?
for example..
They discussed the performance of MFHP6 (6 means radius 3) and the FP generally outperforms MHFP4, ECFPxs.
In fig6. shows performance analysis of k-nearest neighbor search and MHFP6 works very nice and fast.

Fortunately, author disclosed source code on github. You can use it if you would like to use it.

Now I tried to use it and compared similarity between ECFP and MHFP.
Code is below.

@jupyter notebook
Load packages.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
nmols = len(mols)
#Calc morgan fp
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
#Calc min hash fp
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]

Check them!

tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
y2 = np.int64(a) * tanimoto_sim + np.int64(b)
print(a, b)
> 1.033917242502858 -0.031604772419224866

This results shows ECFP6 and MHFP6 has good correlation I think.
Finally I made a simple scatter plot.

plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.ylabel('mhfp sim')

All code is pushed to my repo.

In summary, I tried to use MHFP and it shows good correlation with ECFP.
I used very small dataset(47 molecules), so it can not check speed for large dataset.
I would like to check it near the future.

Last week, I participated CBI and a software UGM.
I am happy that I could have fruitful discussions. I could get many ideas for next challenge!


I like my town. This town is comfortable for me to live in, because it is not too urban like Tokyo or rural.

There are many beautiful place and following pictures are my favorite place.  The water in this river is very clean. I can see firefly in summer around here.

I want to  this scenery to continue forever.

After walk, I went to see a doctor, my finger is getting well. I hope my finger get well soon…

3D Alignment function of RDKit #RDKit

During the UGM, I was interested in Ben Tehan & Rob Smith’s great work.
They showed me a nice example of molecular alignment with RDKit.
RDKit has several function to perform 3D alignment. In the Drug Discovery 3D alignment of ligands is important not only Comp Chem but also Med Chem. After their presentation, I talked them and they told me that GetCrippenO3A is useful for 3D alignment.
Hmm, that’s sounds interesting.
I tried to use the function.
My example code is below. Following code run on ipython notebook. To visualize 3D structure of molecules, I used py3Dmol. It can visualize multiple 3D molecules and easy to use!

import py3Dmol
import copy
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdBase
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdMolDescriptors
import numpy as np
p = AllChem.ETKDGv2()
p.verbose = True

If p.verbose set True, user can get SMARTS patterns which hit the definition of ETKDG like below.

[cH0:1]c:2!@;-[O:3][C:4]: 8 7 6 5, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 3 5 6 7, (0, 0, 4, 0, 0, 0)
[O:1][CX4:2]!@;-[CX3:3]=[O:4]: 6 5 3 4, (0, 3, 0, 0, 0, 0)
[C:1][CX4:2]!@;-[CX3:3]=[O:4]: 0 1 3 4, (0, 0, 1, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][OX2:4]: 10 11 12 16, (0, 0, 3, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][N:4]: 10 11 12 17, (0, 0, 4, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)

Next load molecules and generate conformers. I used cdk2.sdf which is provided in rdkit as sample.

mols = [m for m in Chem.SDMolSupplier('cdk2.sdf') if m != None][:6]
for mol in mols:
hmols_1 = [Chem.AddHs(m) for m in mols]
hmols_2 = copy.deepcopy(hmols_1)
# Generate 100 conformers per each molecule
for mol in hmols_1:
    AllChem.EmbedMultipleConfs(mol, 100, p)

for mol in hmols_2:
    AllChem.EmbedMultipleConfs(mol, 100, p)
# for Ipython notebook

To conduct GetCrippenO3A and GetO3A, I calculate crippen_contrib of each atom and MMFF params of molecules.

crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in hmols_1]
crippen_ref_contrib = crippen_contribs[0]
crippen_prob_contribs = crippen_contribs[1:]
ref_mol1 = hmols_1[0]
prob_mols_1 = hmols_1[1:]

mmff_params = [AllChem.MMFFGetMoleculeProperties(mol) for mol in hmols_2]
mmff_ref_param = mmff_params[0]
mmff_prob_params = mmff_params[1:]
ref_mol2 = hmols_2[0]
prob_mols_2 = hmols_2[1:]

OK Let’s align molecules and visualize them!
I retrieved the best score index from multi conformers of each molecule and added viewer.
For crippenO3A…

p_crippen = py3Dmol.view(width=600, height=400)
p_crippen.addModel(Chem.MolToMolBlock(ref_mol1), 'sdf')
crippen_score = []
for idx, mol in enumerate(prob_mols_1):
    tempscore = []
    for cid in range(100):
        crippenO3A = rdMolAlign.GetCrippenO3A(mol, ref_mol1, crippen_prob_contribs[idx], crippen_ref_contrib, cid, 0)
    best = np.argmax(tempscore)
    p_crippen.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')

For O3A…

p_O3A = py3Dmol.view(width=600, height=400)
p_O3A.addModel(Chem.MolToMolBlock(ref_mol2), 'sdf')
pyO3A_score = []
for idx, mol in enumerate(prob_mols_2):
    tempscore = []
    for cid in range(100):
        pyO3A = rdMolAlign.GetO3A(mol, ref_mol2, mmff_prob_params[idx], mmff_ref_param, cid, 0)
    best = np.argmax(tempscore)
    p_O3A.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')

In my example, both methods shows good results. To check the details, I will calculate ShapeTanimoto and/or RMSD etc.

In summary, rdkit has many useful functions not only 2D but also 3D. I would like to use this function in my project.

All code of the post, I uploaded my github repo.

Enjoyed RDKitUGM2018 #RDKit

I got back Japan from Cambridge today. Time flies when you’re having fun.
This is the first time I participate RDKit UGM and RDKit Hackathon. It was amazing experience for me.
Twitter Hash Tag was very attractive. If reader who is interested in, I recommend to search #RDKitUGM2018 in twitter.
I could talk face to face many people who I’m following twitter. Thank you for talking with me. ;-)
There were many exiting topics in the meeting. Especially I enjoyed Dr. Segler’s talk about computer aided synthesis planning. It was nice work! His system can analyze retrosynthetic route very efficiently.

And Gregs’s presentation gave me very important message.
Pros of open source is good community and contribution of users. I agree his opinion. I would like to have ab internal discussion about it.

I really surprised that I could meet my blog reader and could get many positive comment about the blog.
I was so happy that I cried….

This blog is memo for myself, but it make me happy that my blog be someone’s help.

I really thank the meeting organizer and participants.
I hope I can meet everyone next year