Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

I caught a flu last Monday. So I stay home and rest in this week…… 😦
I’m having a fever and hope will get better soon.
BTW, recently I found new technique for dimension reduction called Uniform Manifold Approximation and Projection (UMAP). It was also topics in my twitter’s TL.
URL links of original paper and github repository are below.
The repository provides example notebook how does UMAP works. And it is very easy to perform the analysis. I tried to use UMAP with almost default settings and to perform clustering with the data.
To perform the clustering I used HDBSCAN today it is based on DBSCAN. DBSCAN is density based clustering method and it is not required number of clusters for input. It is main difference for other clustering method.
DBSCAN is implemented Scikit-learn so it is easy to perform it.
One of difficulty of DBSCAN is parameter selection and the handling of variable density clusters.
In Fig1 of the original paper[*] shows difference between k-means, DBSCAN and HDBSCAN. DBSCAN can not detect some clusters but HDBSCAN can detect them it is interesting for me.
** [DBSCAN original paper]
HDBSCAN is hierarchical clustering algorithm but it works so fast. It is good point of the algorithm. It uses ‘runt pruning’ with parameter m ‘minimum cluster size’. Any branch of the cluster tree that represents a cluster of less than the size is pruned out of the tree.

Good news! HDBSCAN is python package for unsupervised learning to find clusters. So you can install HDBSCAN via pip or conda. ⭐️

Now move to code. I used GSK3b inhibitor as dataset and each Fingerprint was calculated with RDKit MorganFP.
Then perfomed tSNE and UMAP with original metrics ‘Tanimoto dissimilarity’.
@number.njit() decorator accelerates calculation speed.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

def tanimoto_dist(a,b):
    dotprod =,b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0-tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)

Next do HDBSCAN. It is simple just call hdbscan.HDBSCAN and fit data! πŸ˜‰

import hdbscan
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)

After clustering the object can make spanning tree plot by simple coding.


Image like ..

And make scatter plot with tSNE and UMAP data.

plt.scatter(tsne_X.T[0], tsne_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below
plt.scatter(umap_X.T[0], umap_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below



I think results of UMAP and HDBSCAN are dependent on parameters but both library is easy to implement with several parameters. Nice example is provided from original repository.

And my whole code is pushed to my repository.


Get 3D feature descriptors from PDB file

If reader is interested in drug discovery, chemoinformatics, deep learning or MD, I think the reader might read the article below.
KDEEP is predictor that uses Deep learning(CNN) for affinity prediction. Regarding the article, I found the new python library named HTMD ( High Through Put Molecular Dynamics ). Really I am not good at Computational area such as ab initio or MD calculation. So I can not evaluate the real value of the library but I think it is useful for Comp Chemist.
HTMD can be installed from acellera channel of anaconda with some dependencies. Or can be build from source from github.
One of the function I was interested was “get Voxel descriptors”. A voxel represents a value on a regular grid in three-dimensional space. As same as pixel in 2D space.
It means HTMD can produce 3D descriptor from mol file. And It is easy to calculate the descriptor by using the library. Also HTMD has lots of functions to manipulate PDB or SDF or any file format files.
Following code is example to calculate the descriptors.

from htmd.ui import *
from htmd.molecule import voxeldescriptors
from htmd.molecule.util import boundingBox
# easy to read pdb by using Molecule.
# Also easy to get focused residue id 
atp = Molecule('1atp.pdb')
lck = Molecule('2pl0.pdb')
print(atp.numAtoms, lck.numAtoms)
print(atp.numResidues, lck.numResidues)
atp.get('resid', sel='resname CYS') # get residue id of cysteines
3070 2312
462 362
array([199, 199, 199, 199, 199, 199, 343, 343, 343, 343, 343, 343])
# get sequence

getVoxelDescriptors calculates feature of the mol object and return features as array, centers of voxel.
usercenters is user defined center of voxel. Following example is but example because there are voxels of oblique direction…
If user do not define usercenters the function return suitable size of features but it depends on input file. It means the function returns atypical shape of features.
BTW, each voxel has 8 features(see results of atpeat.shape). The features define the 8 feature of the voxel, (‘hydrophobic’, ‘aromatic’, ‘hbond_acceptor’, ‘hbond_donor’, ‘positive_ionizable’, ‘negative_ionizable’, ‘metal’, ‘occupancies’).
And parameters of these features from Autodock 4.
So the performance of the predictive models using the voxelfeatures depends on performance of Autodock4. But useful for applying machine learning because easy to calculate the 3D features.

Recently there are lots of tools for calculation of molecular features. Which tool shows best performance? πŸ™‚
I do not have the answer but I am interested in the area. I would like to discuss with Comp Chemist.
And I would like to calculate voxelDescriptors in fixed area around the binding ligand.
I wii be happy if reader who is good at HTMD give the hint or tips to do that.

atpfeat, atpcenters = voxeldescriptors.getVoxelDescriptors(atp, usercenters=np.array([[i,i,i] for i in range(100)]), buffer=8)
lckfeat, lckcenters = voxeldescriptors.getVoxelDescriptors(lck, usercenters=np.array([[i,i,i] for i in range(100)]), buffer=8)
print(atpfeat.shape, lckfeat.shape)
print(atpcenters.shape, lckcenters.shape)
(100, 8) (100, 8)
(100, 3) (100, 3)
array([[3.81532829e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [9.99980260e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.08001621e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.64279406e-01],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],

Simple way for making SMILES file #RDKit

To convert SDF to SMILES I write like a following code.

sdf = Chem.SDMolSupplier( 'some.sdf' )
with open('smiles.smi', 'w') as f:
    for mol in sdf:
        smi = Chem.MolToSmiles(mol)

In this way, to write smiles strings with properties it is needed to get properties by using GetProp(“some prop”). If I need several properties my code tend to be long.
Greg who is developer of RDKit advised me to use SmilesMolWriter. πŸ˜‰

I have never used this function so I used it and found it was very useful.
Let me show the example. (copy & paste from Jupyter notebook)

import pprint
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
# make writer object with a file name.
writer = SmilesWriter('cdk2smi1.smi')
#Check prop names.

#SetProps method can set properties that will be written to files with SMILES.
#The way of writing molecules can perform common way.
for mol in mols:
    writer.write( mol )

Then check the file.

!head -n 10 cdk2smi1.smi
SMILES Name Cluster
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 2
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 2

How about set all props ?

writer = SmilesWriter('cdk2smi2.smi')
for mol in mols:
    writer.write( mol )
!head -n 10 cdk2smi2.smi
SMILES Name id Cluster MODEL.SOURCE MODEL.CCRATIO r_mmffld_Potential_Energy-OPLS_2005 r_mmffld_RMS_Derivative-OPLS_2005 b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 ZINC03814457 1 CORINA 3.44 0027  09.01.2008 1 -78.6454 0.000213629 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 ZINC03814459 2 CORINA 3.44 0027  09.01.2008 1 -67.4705 9.48919e-05 1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 ZINC03814460 2 CORINA 3.44 0027  09.01.2008 1 -89.4303 5.17485e-05 1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1 ZINC00023543 ZINC00023543 3 CORINA 3.44 0027  09.01.2008 1 -70.2463 6.35949e-05 1

Wow it works fine. This method is useful and easy to use.
Do most of the RDKiters use the function?
I’m totally out of fashion!!!

I pushed my Fast Clustering code to rdkit/Contrib repository. It was big news for me.

Ultra fast clustering script with RDKit #RDKit

Some years ago, I got very useful information for molecular clustering. Bayon is ultra fast clustering tool. The author made not only Japanese-tutorial but also English-tutorial.
This tools is easy to use but to use bayon in chemoinformatics area, user needs data preparation.

I wrote simple script that converts smiles to bayon input format and then perform clustering and parse output data.
Main code is following.

import argparse
import subprocess
import pickle
import os
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser( "Fast clustering for chemoinformatics" )
parser.add_argument( "input" )
parser.add_argument( "n", help = "the number of clusters" )
#parser.add_argument( "-l", help = "limit value of cluster bisection" )
#parser.add_argument( "-p", help = "output similarity points" )
parser.add_argument( "--output", default = "clustered.tsv" )
parser.add_argument( "-c", help = "filename of centroid", default = "centroid.tsv" )

def smi2fp( molid, smiles ):
    mol = Chem.MolFromSmiles( smiles )
    onbits = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format( bit )
    row += "\n"
    return row

if __name__ == "__main__":
    args = parser.parse_args()
    inputf = open( args.input, "r" )
    n = args.n
    c = args.c
    output = args.output
    tempf = open( "fp.tsv", "w" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        tempf.write( smi2fp( line[0], line[1] ))
    res = "time bayon -p -c {} -n {} fp.tsv > {}".format( c, n, output ), shell=True )

    #parse results
    parsefile = open( output.split(".")[0]+"_parse.tsv", "w" )
    inputf = open( output, "r" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        cluster_id = line[0]
        for i in range( 1, len( line )-1, 2 ) :
            molid = line[ i ]
            point = line[ i + 1 ]
            parsefile.write("{}\t{}\tCLS_ID_{}\n".format( molid, point, cluster_id ))

The code perform clustering molecules and output cluster with point ( similarity ) and parse default bayon format.

I ran the code with rdkit cdk2.sdf data.
47 compound clustered into 5 clusters within 0.006s!

iwatobipen$ python cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s

It seems work fine. πŸ˜‰

plot data using RDKit-PandasTools

I often use SDMolSupplier method to read SDF. But, PandasTool is another useful way to read SDF.
But, to handle property data, I need to convert data to float.
Today I read a major supplier’s SDF. And plot data using seaborn.

%matplotlib inline
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit.Chem import PandasTools

nx = [ mol for mol in Chem.SDMolSupplier( 'dataset.sdf' ) ]
print( m = nx[0][n for n in m.GetPropNames()] )


nx = PandasTools.LoadSDF( 'dataset.sdf' )
#convert data
nx.FSP3 = natx.FSP3.apply( np.float32 )
nx.ROT = natx.ROT.apply( np.float32 )
nx.TPSA = natx.TPSA.apply( np.float32 )
nx.CLogP = natx.CLogP.apply( np.float32 )
nx.MolWeight = natx.MolWeight.apply( np.float32 )

sns.lmplot( 'ROT','FSP3', data=natx, )
sns.jointplot(natx.MolWeight, natx.FSP3, kind='hex')

Fig 3 indicates that the dataset have many FSp3 rich molecules.
I’m interested in the dataset and will analyse more details of the Dataset.
What is the most important thing in the library design ? Novelty, synthetic accessibility, cost, diversity, druggability, etc. How do you define the quality of compound / library ?


Generate diamine ring systems using RDKit # RDKit

Somedays ago, I posted new way to represent chemical space using exit vector.
So, I want to generate set of diamine ring systems. It’s easy to generate molecule set using RDKit.
Let’s try it.
At first I defined ring systems.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

c5 = Chem.MolFromSmiles( 'C1CCCC1' )
c6 = Chem.MolFromSmiles( 'C1CCCCC1' )
c7 = Chem.MolFromSmiles( 'C1CCCCCC1' )
c45 = Chem.MolFromSmiles( 'C2CC1CCC1C2' )
c46 = Chem.MolFromSmiles( 'C2CCC1CCC1C2' )
c55 = Chem.MolFromSmiles( 'C1CC2CCCC2C1' )
c56 = Chem.MolFromSmiles( 'C2CCC1CCCC1C2' )
c66 = Chem.MolFromSmiles( 'C1CCC2CCCCC2C1' )
sp35 = Chem.MolFromSmiles( 'C1CC11CCCC1' )
sp36 =  Chem.MolFromSmiles( 'C1CC11CCCCC1' )
sp45 = Chem.MolFromSmiles( 'C1CCC11CCCC1' )
sp46 =  Chem.MolFromSmiles( 'C1CCC11CCCCC1' )
sp55 = Chem.MolFromSmiles( 'C1CCC11CCCC1' )
sp55 = Chem.MolFromSmiles( 'C1CCCC11CCCC1' )
sp65 = Chem.MolFromSmiles( 'C1CCCC11CCCCC1' )
sp66 =  Chem.MolFromSmiles( 'C1CCCCC11CCCCC1' )

ringsets = [
    c5, c6, c7,
    c45, c46, c55, c56, c66,
    sp35, sp36, sp45, sp46, sp55, sp65, sp66]
rxn = AllChem.ReactionFromSmarts( '[CH2&$(C(CC)(CC)):2]>>[NH:2]' )
monoamine = []

Then, I defined function that returns only unique mols.

def get_unipro( ps ):
    uniq = set( [ Chem.MolToSmiles( mol[0], isomericSmiles =True ) for mol in ps ] )
    mols = [ Chem.MolFromSmiles( smi ) for smi in uniq ]
    return mols

def get_unimol( mols ):
    uniq = set( [ Chem.MolToSmiles( mol, isomericSmiles =True ) for mol in mols ] )
    mols = [ Chem.MolFromSmiles( smi ) for smi in uniq ]
    return mols

OK, Ready. Run reaction!

for ring in ringsets:
    ps = rxn.RunReactants( (ring,) )
    res = get_unipro( ps )
    monoamine.extend( res )
diamine = []
for ring in monoamine:
    ps = rxn.RunReactants( (ring,) )
    res = get_unipro( ps )
    diamine.extend( res )
diamine = get_unimol( diamine )
Draw.MolsToGridImage( ringsets, molsPerRow = 5  )


Draw.MolsToGridImage(diamine, molsPerRow= 5 )


It’s works fine.

calculate exit vector distance of each molecule

Some days ago, I participated a seminar about drug discovery.
All presentations were nice, and I interested in one of unique method to representation chemical space, named “exit vector plot”.
ref is following
Following Ramachandran: exit vector plots (EVP) as a tool to navigate chemical space covered by 3D bifunctional scaffolds. The case of cycloalkanes

The concept is, chemical space is represented as two vectors n1, n2 with 4 parameters phi1, phi2, theta( dihedral angle ), r.
It’s like a ramachandran plot that is used to plot structure of protein.


In this work, the author calculate exit vector about several ring systems and clustering them.
I think this concept is very useful because if chemist want to replace a ring system to another one, it’s difficult to search molecules that shows low structure similarity but has high exit vector similarity.

So, if we can search molecules use exit vector similarity, it will be new way to expand chemical space.
In the paper, the author clustering molecules using Euclidean distance of exit vector. Distance is almost dissimilarity.
Now let’s try to write exit vector calculator using RDKit!
In following code, I focused in 2nd-diamine set because it was easy to define two vectors(n1, n2).
First I prepare sample sdf that has energy minimized 3D structure.(there is not in the code)
Then write code!
RDKit has cool function to calculate distance and angle in rdMolTransform class.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolTransforms
import numpy as np
from rdkit.Chem import Draw
sdf = Chem.SDMolSupplier( "3d_amine.sdf", removeHs=False )
mols = [ m for m in sdf ]
mol = mols[1]
mol.GetSubstructMatch( Chem.MolFromSmarts( "N[H]" ) )
atoms = [atom for atom in mol.GetAtoms()]

def getev( mol ):
    if mol.GetNumConformers() >= 1:
        matches = mol.GetSubstructMatches( Chem.MolFromSmarts( "N[H]" ) )
        conf = mol.GetConformer()
        theta = rdMolTransforms.GetDihedralDeg( conf,
                                                matches[1][1]  )
        temp_phi1 = 180 - rdMolTransforms.GetAngleDeg(conf,
        temp_phi2 = 180 - rdMolTransforms.GetAngleDeg(conf,
        if temp_phi1 >= temp_phi2:
            phi1 = temp_phi1
            phi2 = temp_phi2
            phi1 = temp_phi2
            phi2 = temp_phi1

        r = rdMolTransforms.GetBondLength( conf, matches[0][0], matches[1][0] )
        return theata, phi1, phi2, r
        print( "No conformer!" )

def transform_cartegian( theta, phi1, phi2, r ):
    theta = np.deg2rad( theta )
    phi1 = np.deg2rad( phi1 )
    phi2 = np.deg2rad( phi2 )
    x = np.sin( theta ) * np.sin( phi1 ) * np.sin( phi2 ) *r
    y = np.sin( theta ) * np.sin( phi1 ) * np.cos( phi2 ) *r
    z = np.sin( theta ) * np.cos( phi1 ) *r
    t = np.cos( theta ) *r
    return x, y, z, t

def get_dist(v1,v2):
    v1 = np.asarray( v1 )
    v2 = np.asarray( v2 )
    delta =  v1 - v2
    d = np.linalg.norm( delta )
    return d

def calc_distance( mol1, mol2 ):
    theta1, phi11, phi21, r1 = getev( mol1 )
    theta2, phi12, phi22, r2 = getev( mol2 )
    cart1 = transform_cartegian( theta1, phi11, phi21, r1 )
    cart2 = transform_cartegian( theta2, phi12, phi22, r2 )
    d = get_dist( cart1, cart2 )
    return d

All done!
Then run code.

for i, mol in enumerate( mols ):
    d = calc_distance(mols[3],mol)
    print( 3, i, d )

I got following response.
idx1 idx2 distace
3 0 4.90798883048
3 1 1.48017700543
3 2 6.08814365316
3 3 0.0
3 4 3.93873942603
3 5 5.45068906229
3 6 1.40146560569
3 7 0.0619944778129
3 8 6.92182054937
3 9 6.22560061852
3 10 6.19913412456
3 11 5.75093874617
3 12 5.38531808804
3 13 1.85179121146
3 14 1.80235655185
idx1, 6, 7 are near to idx4.
Let see…

Draw.MolsToGridImage( [mols[3],mols[1], mols[6], mols[7]] )
Draw.MolsToGridImage( mols )

I got following images.
It’s seems works fine.

selected mols
all mols

I uploaded the code to github.