Make predictive models with small data and visualize it #Chemoinformatics

I enjoyed chemoinformatics conference held in Kumamoto in this week.
The first day of the conference, I could hear about very interesting lecture. That was very basic data handling and visualization tutorial but useful for newbie of chemoinformatics.
I would like to reproduce the code example, so I tried it.

First, visualize training data. It important to know the properties of training data.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9,  10)]
label = [1, -1, 1, 1, -1, -1, 1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'
train_color = list(map(color_map, label))
# check data
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)

Hmm, it seems linear relationship between x and y.

Next, made test data and helper function for visualize data.

test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()
test_y = yy.ravel()
n_data = len(test_x)
test_data = [(test_x[i], test_y[i]) for i in range(n_data)]

def makeplot(test_x, test_y, predict_data):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(train_x, train_y, c=train_color)
    color = list(map(color_map, predict_data))
    ax.scatter(test_x, test_y, c = color, alpha=0.3)

OK let’s build model!

# Linear Regression
model = LinearRegression()
predictor =, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

Oh, Simple linear regressor works very well. ;-)
OK, next how about random forest ?

#Random forest
model = RandomForestClassifier(random_state=np.random.seed(794))
predictor =, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

The result is quite different from linear regressor.

Next I checked non linear data.

points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9,  10), (11,2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)

In this case, linear model does not work well.

model = Ridge()
predictor =, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

How about RF and SVR?

model = RandomForestRegressor(random_state=np.random.seed(794))
predictor =, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

model = SVR()
predictor =, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)



Non linear regressor can fit non linear data but shows different output.
Model selection is important and to select model, it is needed check training data carefully.
All my experiments can check from google colab.

Any comments and suggestions are appreciated.


standardization of tautomers #RDKit

One of the hot topic of new version of RDKit is an integration of MolVS which is tool for molecular standardization.
Molecular standardization is important for not only chemist but also chemoinformatist. Because tautomer shows different representation of molecule and it will be affect accuracy of QSAR models.
I wrote molecular standardization tools named ‘MolVS’ before and MolVS is an another library at the time. Now we can call molvs from native RDKit.
I used 2-hydroxy prydine as an example.

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw


from rdkit.Chem import MolStandardize

smi1 = 'c1cccc(O)n1'
mol1 = Chem.MolFromSmiles(smi1)
smi2 = 'C1=CC(=O)NC=C1'
mol2 = Chem.MolFromSmiles(smi2)

Draw.MolsToGridImage([mol1, mol2])

Same formula but different structure.

Standardization method is very simple.

stsmi1 = MolStandardize.canonicalize_tautomer_smiles(smi1)
stsmi2 = MolStandardize.canonicalize_tautomer_smiles(smi2)

Draw.MolsToGridImage([Chem.MolFromSmiles(stsmi1), Chem.MolFromSmiles(stsmi2)])

Also it is easy to get possible tautomers from a smiles. And MolStandarize class has many method. It is very useful for data preprocessing I think.

tautomers = MolStandardize.enumerate_tautomers_smiles(smi1)
>{'O=c1cccc[nH]1', 'Oc1ccccn1'}

I uploaded the snippet to my repo. It can read from following URL.

I will go to Kumamoto to participate chemoinformatics conference tomorrow. I hope I can have many fruitful discussions.

New finger print calculation method in RDKit #RDKit

It’s a good news for RDKitters!
New version of rdkit is released and it can be installed with Anaconda!
There are many implementations and enhancements. You can find details of that from URL below.

One of interesting feature is a fingerprint bit information rendering function.
There is nice blog post about that.

And I am interested in new finger print calculation method. This is a project of GSoc.
rdFingerprintGenerator can provide finger print generators. The generator can access directory to fingerprint methods.
I used the method and wrote some code…

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdFingerprintGenerator
from rdkit import rdBase
from rdkit import RDPaths
import os

Load molecules.

mols = Chem.SDMolSupplier(os.path.join(RDPaths.RDDocsDir,'Book/data/cdk2.sdf'))
mols = [ mol for mol in mols if mol != None]

Get FP generator and calculate FP.

# Default radi of molecule is 3! (ECFP6 like)
morgan_fp_gen = rdFingerprintGenerator.GetMorganGenerator()
fp01 = morgan_fp_gen.GetFingerprint(mols[0])
fp02 = AllChem.GetMorganFingerprintAsBitVect(mols[0],3)

The generator returns molecular finger print as bit vect.
And the generator has several useful method listed below.

> 'GetFingerprint',
> 'GetInfoString',
> 'GetSparseCountFingerprint',

This example is limited to Morgan FP but the method can get other fingerprints as same manner, AtomPair FP, RDKit FP etc.

The official repository provides example code.
I think the code is written in python2 environments. If reader who want to run the code, I recommend replace ‘print xxx’ to ‘print(xxx)’

And I uploaded my snippet to my repo, ULR is below.

(new?) medchem tool box for compound synthesis

This mini perspective shows recent progress of the direct C-H alkylation with Alkyl Sulfinates.

There are many heterocyclic moieties such as pyridine, pyrrole etc. in drug like molecules.
The C-H alkylation reaction of heterocycles is useful but difficult to conduct it under mild reaction conditions.
So mild and universal reaction is very attractive for chemists I think.

In the article, Phil S. Baran’s group reported many examples of their developed reaction.
They used alkyl sulfinates as a radical precursors and developed many kind of commercial sulfinates, fluoroalkyl, heterocyclic, alkyl, aromatic and linker-type.

Surprisingly most of the reactions proceed room temperature. Also they shows reactivity guid lines, it is very practical!

And also this reaction can apply not only early stage of synthesis but also late stage of synthesis.
It means that this is very specific reaction.
I had few successful cases C-H alkylation with sulfinate…. I would like to use the reaction condition when I have chance.

Draw similarity network #RDKit #Cyjupyter

Recently Kei Ono who is developer of cytoscape developed cyjupyter.
It seems attractive for me because the library can draw network diagram on jupyter notebook.
There are many network structured data in chemoinformatics. For example molecule, molecular similarity map and MMP etc… I used the library to draw similarity map of molecules today.
I am newbie of the library, so following code is very simple but there are several useful examples are provided in official repository.
At first, load modules.

import os
import numpy as np
import igraph
from py2cytoscape import util
from cyjupyter import Cytoscape
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

I used CDK2.sdf as a sample dataset

filedir = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
mols = [mol for mol in Chem.SDMolSupplier(filedir) if mol != None]
for mol in mols:
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
smiles_list = [Chem.MolToSmiles(mol) for mol in mols]

Then make graph object, node as an each molecule and make edge if tanimoto similarity more 0.5.

g = igraph.Graph()
for smiles in smiles_list:
for i in range(len(mols)):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        if tc >= 0.5:
            g.add_edge(smiles_list[i], smiles_list[j])

Finally convert graph object to json by using py2cytoscape and Draw graph with default settings.

graph_data = util.from_igraph(g)

I could get following image.

And check network

Cyjupyter is useful network drawing tool for jupyter notebook user. I would like to check the way to control visualization.
My example code is uploaded my repository. URL is below.

New fingerprint/MinHash FingerPrint #RDKit #Chemoinformatics

Recently I found an article that describe new method for fast fingerprint calculation.
You can read the article from chemrxiv, URL is below.
They used MinHash method.
MinHash method is the way to estimate jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared the performance with ECFP4.
? MinHash ?
for example..
They discussed the performance of MFHP6 (6 means radius 3) and the FP generally outperforms MHFP4, ECFPxs.
In fig6. shows performance analysis of k-nearest neighbor search and MHFP6 works very nice and fast.

Fortunately, author disclosed source code on github. You can use it if you would like to use it.

Now I tried to use it and compared similarity between ECFP and MHFP.
Code is below.

@jupyter notebook
Load packages.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
nmols = len(mols)
#Calc morgan fp
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
#Calc min hash fp
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]

Check them!

tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
y2 = np.int64(a) * tanimoto_sim + np.int64(b)
print(a, b)
> 1.033917242502858 -0.031604772419224866

This results shows ECFP6 and MHFP6 has good correlation I think.
Finally I made a simple scatter plot.

plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.ylabel('mhfp sim')

All code is pushed to my repo.

In summary, I tried to use MHFP and it shows good correlation with ECFP.
I used very small dataset(47 molecules), so it can not check speed for large dataset.
I would like to check it near the future.

Last week, I participated CBI and a software UGM.
I am happy that I could have fruitful discussions. I could get many ideas for next challenge!


I like my town. This town is comfortable for me to live in, because it is not too urban like Tokyo or rural.

There are many beautiful place and following pictures are my favorite place.  The water in this river is very clean. I can see firefly in summer around here.

I want to  this scenery to continue forever.

After walk, I went to see a doctor, my finger is getting well. I hope my finger get well soon…