Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

I caught the flu last Monday, so I am staying home and resting this week… 😦
I have a fever and hope to get better soon.
BTW, I recently found a new dimension-reduction technique called Uniform Manifold Approximation and Projection (UMAP). It was also a topic on my Twitter timeline.
URL links to the original paper and the GitHub repository are below.
The repository provides example notebooks showing how UMAP works, and it is very easy to run the analysis. I tried UMAP with almost default settings and then performed clustering on the result.
To perform the clustering I used HDBSCAN, which is based on DBSCAN. DBSCAN is a density-based clustering method that does not require the number of clusters as input; this is its main difference from many other clustering methods.
DBSCAN is implemented in scikit-learn, so it is easy to run.
One difficulty of DBSCAN is parameter selection, and also the handling of clusters of variable density.
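As a quick illustration of the "no number of clusters" point, here is a toy sketch with scikit-learn (made-up blob data, not the blog's dataset). DBSCAN only needs a neighborhood radius (`eps`) and a minimum neighbor count:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs; DBSCAN finds them without being told "2"
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(3.0, 0.1, (20, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels) - {-1})  # -1 marks noise points
print(n_clusters)  # 2
```

The catch, as noted above, is that `eps` must suit the data's density, which is exactly what HDBSCAN relaxes.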
Fig. 1 of the original paper[*] shows the difference between k-means, DBSCAN and HDBSCAN. DBSCAN cannot detect some of the clusters while HDBSCAN can, which I find interesting.
* [DBSCAN original paper]
HDBSCAN is a hierarchical clustering algorithm, but it runs very fast, which is a good point of the algorithm. It uses 'runt pruning' with a parameter m, the 'minimum cluster size': any branch of the cluster tree that represents a cluster smaller than that size is pruned out of the tree.

Good news! HDBSCAN is available as a Python package for unsupervised clustering, so you can install it via pip or conda. ⭐️

Now let's move to the code. I used a GSK3β inhibitor set as the dataset, and each fingerprint was calculated with the RDKit Morgan FP.
Then I performed tSNE and UMAP with a custom metric, Tanimoto dissimilarity.
The @numba.njit() decorator accelerates the metric calculation.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

@numba.njit()
def tanimoto_dist(a, b):
    dotprod =, b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0 - tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)

Next, run HDBSCAN. It is simple: just call hdbscan.HDBSCAN and fit the data! 😉

import hdbscan
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)

After clustering, the fitted object can produce a minimum-spanning-tree plot with very little code.



Then make scatter plots of the tSNE and UMAP results, colored by cluster label.

plt.scatter(tsne_X.T[0], tsne_X.T[1], c = cluster_tsne.labels_, cmap='plasma')
# image below
plt.scatter(umap_X.T[0], umap_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below



I think the results of UMAP and HDBSCAN depend on their parameters, but both libraries are easy to use with only a few parameters. Nice examples are provided in the original repositories.

And my whole code is pushed to my repository.


mol2vec analogy of word2vec #RDKit

Last year a researcher at BioMed X published the following article, and recently the article was published by ACS.
The concept of mol2vec is the same as that of word2vec.
Word2vec converts words to vectors using a large corpus and has shown success in NLP. Mol2vec converts molecules to vectors using ECFP information.
Fortunately, the Mol2vec source code is available on GitHub.
I played with mol2vec, following the original repository.
mol2vec wraps gensim, a Python library for topic modeling, so users can build a corpus and a model very easily.
At first I built a corpus and a model from a very small dataset.
*The GitHub repo provides a model built from the ZINC DB. The following code shows how to build a model with mol2vec.

from mol2vec.features import DfVec, mol2alt_sentence, mol2sentence, generate_corpus, insert_unk
from mol2vec.features import train_word2vec_model
generate_corpus("cdk2.sdf", "cdk2.corp", 1 )
#I do not need to make unk file because I used same data for analysis.
#insert_unk('cdk2.corp', 'unk_cdk2.corp')
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)

Cool! It's easy to make a model.
Next I ran a tSNE analysis and visualized the data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

#from supplied data (skip-gram, window size of 10, radius 1, UNK replacing all identifiers that appear fewer than 4 times)
#model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get feature vec of 2246728737

mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf")]

# encode molecules as sentences (represented by Morgan FP identifiers)
sentences = [ mol2alt_sentence(mol, 1) for mol in mols ]
flat_list = [ item for sublist in sentences for item in sublist ]
mol_identifiers_unique = list(set( flat_list ))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique) ))
# make projection of 300D vectors to 2D using PCA/TSNE combination
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_model = PCA(n_components=30)
#tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# I use sklearn v0.19; metric='cosine' did not work, see the following URL / SO post.
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')

tsne_pca = tsne_model.fit_transform( pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))

df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
projections = df_vec.to_dict()
# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],projections['PCA-tSNE-c2'][str(identifier)]))

# substructure vectors are plotted as green and molecule vectors as cyan.

mol_vals = [ get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors( mol_vals )

plot_2D_vectors can visualize the vector of each fingerprint identifier and of the whole molecule.
I often use PCA to analyze chemical space, but PCA results depend on the dataset. Mol2vec, by contrast, builds the model first and then applies data to it, so different datasets can be analyzed on the same basis (model).
My example code is very simple and shows only a small part of Mol2vec.
BTW, word2vec supports vector arithmetic on words, for example tokyo – japan + france = paris.
How about Mol2vec? mol A – mol B + mol C = mol D???? It is probably difficult, I think, because the fingerprint algorithm uses a hash function.
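The word-arithmetic idea can be sketched with toy 2D vectors (made-up numbers, just to illustrate the arithmetic, not a real embedding):

```python
import numpy as np

# toy 2D "embeddings"; capitals sit at a fixed offset from their countries
vec = {'japan':  np.array([1.0, 1.0]), 'tokyo': np.array([1.0, 3.0]),
       'france': np.array([4.0, 1.0]), 'paris': np.array([4.0, 3.0])}

query = vec['tokyo'] - vec['japan'] + vec['france']
nearest = min(vec, key=lambda w: np.linalg.norm(vec[w] - query))
print(nearest)  # paris
```

In a real word2vec model one would also exclude the query words themselves from the nearest-neighbor search.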
Readers who are interested in the package: please visit the official repository and enjoy. 😉

Simple way for making SMILES file #RDKit

To convert an SDF to SMILES, I used to write code like the following.

sdf = Chem.SDMolSupplier( 'some.sdf' )
with open('smiles.smi', 'w') as f:
    for mol in sdf:
        smi = Chem.MolToSmiles(mol)
        f.write( smi + '\n' )

Written this way, getting SMILES strings together with properties means calling GetProp("some prop") for each property, so with several properties the code tends to get long.
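A minimal sketch of that manual approach (the molecule and the property names here are made up for illustration):

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccccc1O')
mol.SetProp('_Name', 'phenol')   # hypothetical name
mol.SetProp('Cluster', '1')      # hypothetical property

# one line per molecule: SMILES, then each property fetched by hand
line = '{} {} {}'.format(Chem.MolToSmiles(mol),
                         mol.GetProp('_Name'),
                         mol.GetProp('Cluster'))
print(line)
```

With more properties, this formatting code grows quickly, which is the pain point the writer class solves.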
Greg, who is a developer of RDKit, advised me to use SmilesWriter. 😉

I had never used this class, so I tried it and found it very useful.
Let me show an example (copy & paste from a Jupyter notebook).

import pprint
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
# make writer object with a file name.
writer = SmilesWriter('cdk2smi1.smi')
# Check prop names.
props = list(mols[0].GetPropNames())
print(props)

# The SetProps method sets the properties that will be written to the file with each SMILES.
# Molecules are then written in the usual way.
writer.SetProps(['Cluster'])
for mol in mols:
    writer.write( mol )
writer.close()

Then check the file.

!head -n 10 cdk2smi1.smi
SMILES Name Cluster
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 2
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 2

How about setting all of the props?

writer = SmilesWriter('cdk2smi2.smi')
writer.SetProps(props)
for mol in mols:
    writer.write( mol )
writer.close()
!head -n 10 cdk2smi2.smi
SMILES Name id Cluster MODEL.SOURCE MODEL.CCRATIO r_mmffld_Potential_Energy-OPLS_2005 r_mmffld_RMS_Derivative-OPLS_2005 b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 ZINC03814457 1 CORINA 3.44 0027  09.01.2008 1 -78.6454 0.000213629 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 ZINC03814459 2 CORINA 3.44 0027  09.01.2008 1 -67.4705 9.48919e-05 1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 ZINC03814460 2 CORINA 3.44 0027  09.01.2008 1 -89.4303 5.17485e-05 1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1 ZINC00023543 ZINC00023543 3 CORINA 3.44 0027  09.01.2008 1 -70.2463 6.35949e-05 1

Wow, it works fine. This method is useful and easy to use.
Do most RDKiters already use this function?
I'm totally out of fashion!!!

I pushed my fast clustering code to the rdkit/Contrib repository. That was big news for me.

Build QSAR model with pytorch and rdkit #RDKit

There are many deep-learning frameworks in Python, for example Chainer, Keras, Theano, TensorFlow and PyTorch.
I have tried Keras, Chainer and TensorFlow for QSAR modeling, and this time I tried to build a QSAR model using PyTorch and RDKit.
As you may know, PyTorch supports dynamic neural networks ("define-by-run"), like Chainer.
I used the solubility data provided with RDKit; I have used this dataset before.

Let's start coding.
First I imported the packages needed for QSAR and defined some utility functions.

import pprint
import argparse
import torch
import torch.optim as optim
from torch import nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
#from sklearn import preprocessing

def base_parser():
    parser = argparse.ArgumentParser("This is simple test of pytorch")
    parser.add_argument("trainset", help="sdf for train")
    parser.add_argument("testset", help="sdf for test")
    parser.add_argument("--epochs", type=int, default=150)
    return parser

parser = base_parser()
args = parser.parse_args()
traindata = [mol for mol in Chem.SDMolSupplier(args.trainset) if mol is not None]
testdata = [mol for mol in Chem.SDMolSupplier(args.testset) if mol is not None]

def molsfeaturizer(mols):
    fps = []
    for mol in mols:
        arr = np.zeros((0,))
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    fps = np.array(fps, dtype = np.float)
    return fps

classes = {"(A) low":0, "(B) medium":1, "(C) high":2}
#classes = {"(A) low":0, "(B) medium":1, "(C) high":1}

trainx = molsfeaturizer(traindata)
testx = molsfeaturizer(testdata)
# for pytorch, y must be long type!!
trainy = np.array([classes[mol.GetProp("SOL_classification")] for mol in traindata], dtype=np.int64)
testy = np.array([classes[mol.GetProp("SOL_classification")] for mol in testdata], dtype=np.int64)

The torch.from_numpy function converts a numpy array to a torch tensor; it is very convenient.
Then I defined the neural network. This part feels unique to me because I mostly use Keras for deep learning.
To build a model in PyTorch, I need to define each layer and the whole structure.
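A tiny check of the conversion (toy array, independent of the QSAR data): torch.from_numpy keeps the numpy dtype (float64 here), so the model code below casts to float32 with .float().

```python
import numpy as np
import torch

arr = np.array([[0.0, 1.0], [1.0, 0.0]])
t = torch.from_numpy(arr)   # shares memory with arr, dtype float64
print(t.size())
print(t.float().dtype)      # float32, the usual dtype for model input
```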

X_train = torch.from_numpy(trainx)
X_test = torch.from_numpy(testx)
Y_train = torch.from_numpy(trainy)
Y_test = torch.from_numpy(testy)
print(X_test.size(), Y_train.size())

class QSAR_mlp(nn.Module):
    def __init__(self):
        super(QSAR_mlp, self).__init__()
        self.fc1 = nn.Linear(2048, 524)
        self.fc2 = nn.Linear(524, 10)
        self.fc3 = nn.Linear(10, 10)
        self.fc4 = nn.Linear(10,3)
    def forward(self, x):
        x = x.view(-1, 2048)
        h1 = F.relu(self.fc1(x))
        h2 = F.relu(self.fc2(h1))
        h3 = F.relu(self.fc3(h2))
        output = F.sigmoid(self.fc4(h3))
        return output

After defining the model I trained it and made predictions.
The following code shows the training and prediction parts.

model = QSAR_mlp()

losses = []
optimizer = optim.Adam( model.parameters(), lr=0.005)
for epoch in range(args.epochs):
    data, target = Variable(X_train).float(), Variable(Y_train).long()
    optimizer.zero_grad()
    y_pred = model(data)
    loss = F.cross_entropy(y_pred, target)
    print("Loss: {}".format([0]))
    loss.backward()
    optimizer.step()

pred_y = model(Variable(X_test).float())
predicted = torch.max(pred_y, 1)[1]

for i in range(len(predicted)):
    print("pred:{}, target:{}".format([i], Y_test[i]))

print( "Accuracy: {}".format(sum(p==t for p,t in zip(, Y_test))/len(Y_test)))
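One step worth unpacking: torch.max(pred_y, 1) returns both the maximum values and their indices along dimension 1, so [1] gives the argmax class per sample. A minimal check on made-up logits:

```python
import torch

logits = torch.tensor([[0.1, 0.7, 0.2],
                       [0.6, 0.3, 0.1]])
values, indices = torch.max(logits, 1)   # max over the class dimension
print(indices)  # the predicted class for each row
```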

Run the script and check the output.

iwatobipen$ python solubility.train.sdf solubility.test.sdf 
torch.Size([1025, 2048]) torch.Size([1025])
torch.Size([257, 2048]) torch.Size([1025])
QSAR_mlp(
  (fc1): Linear(in_features=2048, out_features=524)
  (fc2): Linear(in_features=524, out_features=10)
  (fc3): Linear(in_features=10, out_features=10)
  (fc4): Linear(in_features=10, out_features=3)
)
Loss: 1.1143544912338257
Loss: 0.6231405735015869
pred:1, target:0
pred:1, target:0
pred:0, target:0
Accuracy: 0.642023346303502

Hmm, the accuracy is not so high. I believe there is still room for improvement. I am a newbie at PyTorch and will keep practicing it next year.

This is my last code of this year. I would like to post to this blog more often next year.
If you find a mistake in my code, please let me know.

Have a happy new year !!!!

Ultra fast clustering script with RDKit #RDKit

Some years ago, I found a very useful tool for molecular clustering: Bayon, an ultra-fast clustering tool. The author provides not only a Japanese tutorial but also an English one.
The tool is easy to use, but to use bayon in the chemoinformatics area some data preparation is needed.

I wrote a simple script that converts SMILES to the bayon input format, performs the clustering, and parses the output.
The main code follows.

import argparse
import subprocess
import pickle
import os
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser( "Fast clustering for chemoinformatics" )
parser.add_argument( "input" )
parser.add_argument( "n", help = "the number of clusters" )
#parser.add_argument( "-l", help = "limit value of cluster bisection" )
#parser.add_argument( "-p", help = "output similarity points" )
parser.add_argument( "--output", default = "clustered.tsv" )
parser.add_argument( "-c", help = "filename of centroid", default = "centroid.tsv" )

def smi2fp( molid, smiles ):
    mol = Chem.MolFromSmiles( smiles )
    onbits = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format( bit )
    row += "\n"
    return row

if __name__ == "__main__":
    args = parser.parse_args()
    inputf = open( args.input, "r" )
    n = args.n
    c = args.c
    output = args.output
    tempf = open( "fp.tsv", "w" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        tempf.write( smi2fp( line[0], line[1] ))
    tempf.close()
    res = "time bayon -p -c {} -n {} fp.tsv > {}".format( c, n, output ), shell=True )

    #parse results
    parsefile = open( output.split(".")[0]+"_parse.tsv", "w" )
    inputf = open( output, "r" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        cluster_id = line[0]
        for i in range( 1, len( line )-1, 2 ) :
            molid = line[ i ]
            point = line[ i + 1 ]
            parsefile.write("{}\t{}\tCLS_ID_{}\n".format( molid, point, cluster_id ))

The code clusters the molecules, outputs each cluster with its point (similarity), and parses the default bayon output format.
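For reference, one bayon input row produced by the smi2fp helper above looks like this (sketched with a hypothetical molecule ID and ethanol as input, trailing newline omitted):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# same helper as in the script above, without the trailing newline
def smi2fp(molid, smiles):
    mol = Chem.MolFromSmiles(smiles)
    onbits = AllChem.GetMorganFingerprintAsBitVect(mol, 2).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format(bit)
    return row

fields = smi2fp("mol_1", "CCO").split("\t")
print(fields[0])      # the molecule ID
print(fields[1][:3])  # each on-bit becomes a FP_<bit> feature with weight 1.0
```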

I ran the code with the RDKit cdk2 data.
47 compounds were clustered into 5 clusters within 0.006 s!

iwatobipen$ python cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s

It seems to work fine. 😉

Calculate USRCAT with RDKit #RDKit

Some years ago, I wrote a blog post about USRCAT.
USRCAT is a shape-based method like ROCS, and it works very fast. The code was freely available, but users had to install it themselves.
As you know, the new version of RDKit implements this function! That is good news, isn't it?
I tried the function just now. The source code is below.

import os
import seaborn as sns
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import GetUSRScore, GetUSRCAT
from rdkit.Chem import DataStructs
print( rdBase.rdkitVersion )
path = os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

mols = [ mol for mol in Chem.SDMolSupplier( path ) ]
for mol in mols:
    AllChem.EmbedMolecule( mol, 
                           useExpTorsionAnglePrefs = True,
                           useBasicKnowledge = True )
usrcats = [ GetUSRCAT( mol ) for mol in mols ]
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2 ) for mol in mols ]

data = { "tanimoto":[], "usrscore":[] }

for i in range( len( usrcats )):
    for j in range( i ):
        tc = DataStructs.TanimotoSimilarity( fps[ i ], fps[ j ] )
        score = GetUSRScore( usrcats[ i ], usrcats[ j ] )
        data["tanimoto"].append( tc )
        data["usrscore"].append( score )
        print( score, tc )
df = pd.DataFrame( data )

fig = sns.pairplot( df )
fig.savefig( 'plot.png' )

Run the code.

iwatobipen$ python
# output
0.4878222403055059 0.46296296296296297
0.2983740604270427 0.48148148148148145
0.36022943735904756 0.5660377358490566
0.3480531986117265 0.5
0.3593106395905704 0.6595744680851063
0.25662588527525304 0.6122448979591837
0.18452571918677163 0.46296296296296297
0.18534407651655047 0.5769230769230769
0.1698894448811921 0.5660377358490566
0.19927998441539707 0.6956521739130435
0.2052241644475582 0.15714285714285714
0.21930710455068858 0.10526315789473684
0.21800341857284924 0.1038961038961039

The Tanimoto coefficient and the USRScore rank pairs differently (2D vs 3D pharmacophore). I think USRScore provides a new way to estimate molecular similarity.

RDKit is a really cool toolkit. I love it. 😉

Draw high quality molecular image in RDKit #rdkit

Recently, I wanted to draw high-quality images of molecules using RDKit. The PNG output of older versions of RDKit was not good enough for me.
I found the solution on RDKit-Discuss; the discussion recommended installing cairocffi.
I installed cairocffi via conda.

iwatobipen$ conda install -c conda-forge cairocffi

But… the result was still not good enough for me. (This was on my Mac; the same setup worked fine on Linux.)

Next I tried converting SVG to PNG. Fortunately, I found cairosvg, a package that can convert SVG to PNG.
The following code is an example.
I also found a tip for RDKitters!
DrawingOptions can modify the drawing settings, for example font size, bond width, etc.

import argparse
import cairosvg
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import DrawingOptions

parser = argparse.ArgumentParser( 'smiles to png inmage' )
parser.add_argument( 'smiles' )
parser.add_argument( '--filename', default="mol." )
DrawingOptions.atomLabelFontSize = 55
DrawingOptions.dotsPerAngstrom = 100
DrawingOptions.bondLineWidth = 3.0 


if __name__=='__main__':
    param = parser.parse_args()
    smiles = param.smiles
    fname = param.filename
    mol = Chem.MolFromSmiles( smiles )
    Draw.MolToFile( mol, fname+"png" )
    Draw.MolToFile( mol, "temp.svg" )
    cairosvg.svg2png( url='./temp.svg', write_to= "svg_"+fname+"png" )

The code generates a PNG image not only directly from SMILES but also from SVG.
Here is the result.

iwatobipen$ python smiles "CC1=CC=C(C=C1)C2=CC(=NN2C3=CC=C(C=C3)S(=O)(=O)N)C(F)(F)F"

Direct PNG:


The image from SVG is higher quality, I think. 😉
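As a footnote, the SVG→PNG conversion step can be checked in isolation (a minimal sketch with a hand-written SVG rather than RDKit output; cairosvg returns the PNG bytes when no write_to target is given):

```python
import cairosvg

# a hand-written 10x10 red square, just to exercise the conversion
svg = ('<svg xmlns="" '
       'width="10" height="10"><rect width="10" height="10" fill="red"/></svg>')
png_bytes = cairosvg.svg2png(bytestring=svg.encode('utf-8'))
print(png_bytes[:4])  # the PNG magic header
```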