Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

I caught a flu last Monday. So I stay home and rest in this week…… 😦
I’m having a fever and hope will get better soon.
BTW, recently I found new technique for dimension reduction called Uniform Manifold Approximation and Projection (UMAP). It was also topics in my twitter’s TL.
URL links of original paper and github repository are below.
The repository provides example notebook how does UMAP works. And it is very easy to perform the analysis. I tried to use UMAP with almost default settings and to perform clustering with the data.
To perform the clustering I used HDBSCAN today it is based on DBSCAN. DBSCAN is density based clustering method and it is not required number of clusters for input. It is main difference for other clustering method.
DBSCAN is implemented Scikit-learn so it is easy to perform it.
One of difficulty of DBSCAN is parameter selection and the handling of variable density clusters.
In Fig1 of the original paper[*] shows difference between k-means, DBSCAN and HDBSCAN. DBSCAN can not detect some clusters but HDBSCAN can detect them it is interesting for me.
** [DBSCAN original paper]
HDBSCAN is hierarchical clustering algorithm but it works so fast. It is good point of the algorithm. It uses ‘runt pruning’ with parameter m ‘minimum cluster size’. Any branch of the cluster tree that represents a cluster of less than the size is pruned out of the tree.

Good news! HDBSCAN is python package for unsupervised learning to find clusters. So you can install HDBSCAN via pip or conda. ⭐️

Now move to code. I used GSK3b inhibitor as dataset and each Fingerprint was calculated with RDKit MorganFP.
Then perfomed tSNE and UMAP with original metrics ‘Tanimoto dissimilarity’.
@number.njit() decorator accelerates calculation speed.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

def tanimoto_dist(a,b):
    dotprod =,b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0-tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)

Next do HDBSCAN. It is simple just call hdbscan.HDBSCAN and fit data! πŸ˜‰

import hdbscan
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)

After clustering the object can make spanning tree plot by simple coding.


Image like ..

And make scatter plot with tSNE and UMAP data.

plt.scatter(tsne_X.T[0], tsne_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below
plt.scatter(umap_X.T[0], umap_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below



I think results of UMAP and HDBSCAN are dependent on parameters but both library is easy to implement with several parameters. Nice example is provided from original repository.

And my whole code is pushed to my repository.


mol encoder with Pytorch

Variable Auto Encoder (VAE) is unique method that is used for learning latent representations. VAE encodes discriminative vector to continuous vector in latent space. There are lots of examples in github.
In 2016, AlΓ‘n Aspuru-Guzik reported new de novo design method by using VAE. The approach represents molecules as SMLIES and SMILES strings are converted one-hot vector. The vectors used for deep learning.
And Maxhodak implemented molencoder with Keras! It is nice work and the strong impact is given to the chemoinformatics area I think.
Repository is below.

The code depends on keras 1.1.1 and python 2.7 and some packages. Recently keras version is 2.x and major code supports python 3.x. I would like to test mol VAE in python 3.x environment. Also I am now learning pytorch, so I would like to convert the code from keras based to pytorch based.
I spent several days for coding, but I found that it was pioneer!
Researcher who is Pande’s group published nice work.

The molencoder works on python3.5, 3.6, and CUDA-8.0 environments with pytorch. The code seems almost same as keras-molecules but there are some differences.
Main difference is activation function. The original code uses ReLU but the code uses SeLU for activation.
I have not check the effect of difference of the function.
I wrote interpolation script by using molencoder.
The code is below.

Description about some arguments.
SOUCE; The starting point of interpolation.
TARGET; The goal point of the experiment.
STEPS; How many steps the model will generate SMILES strings from latent space between source and target.
* Following code will run only with GPU machine because I hard-coded “someprop.cuda()” instead of using if function.

import numpy as np
import torch
import argparse
from torch.autograd import Variable
from rdkit import Chem
from molencoder.models import MolEncoder, MolDecoder
from molencoder.models import MolEncoder, MolDecoder
from molencoder.utils import( load_dataset, initialize_weights,ReduceLROnPlateau, save_checkpoint, validate_model)
from molencoder.featurizers import Featurizer, OneHotFeaturizer

SOURCE = 'c1ccccn1'
DEST =  'c1ccccc1'
STEPS = 200
#charset from chembl

charset = [' ', '#', '%', '(', ')', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', '@', 'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'S', 'T', 'V', 'X', 'Z', '[', '\\', ']', 'a', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't']

def decode_smiles_from_index(vec, charset):
    return "".join(map(lambda x:charset[x],vec)).strip()

def get_arguments():
    parser = argparse.ArgumentParser(description="Interpolate from source to dest in steps")
    parser.add_argument("--source", type=str, default=DEST)
    parser.add_argument("--dest", type=str, default=SOURCE)
    parser.add_argument("--steps", type=int, default=STEPS)
    return parser.parse_args()

def interpolate(source, dest, steps, charset, encoder, decoder):
    source_just = source.ljust(width)
    dest_just = dest.ljust(width)
    onehot = OneHotFeaturizer(charset=charset)
    sourcevec = onehot.featurize(smiles=[source_just])
    destvec = onehot.featurize(smiles=[dest_just])
    source_encoded = Variable(torch.from_numpy(sourcevec).float()).cuda()
    dest_encoded = Variable(torch.from_numpy(destvec).float()).cuda()
    source_x_latent = encoder(source_encoded)
    dest_x_latent = encoder(dest_encoded)
    step = (dest_x_latent-source_x_latent)/float(steps)
    results = []
    for i in range(steps):
        item = source_x_latent + (step*i)
        sampled = np.argmax(decoder(item).cpu().data.numpy(), axis=2)
        decode_smiles = decode_smiles_from_index(sampled[0], charset)
        results.append((i, item, decode_smiles))
    return results

def main():
    args= get_arguments()
    encoder = MolEncoder( c = len(charset))
    decoder = MolDecoder( c = len(charset))
    print( torch.cuda.is_available() )
    encoder = MolEncoder( c = len(charset)).cuda()
    decoder = MolDecoder( c = len(charset)).cuda()
    bestmodel = torch.load("model_best.pth.tar")
    #bestmodel = torch.load("tempcheckpoint.pth.tar")

    results = interpolate( args.source, args.dest, args.steps, charset, encoder, decoder )
    for result in results:
        print(result[0], result[2])

if __name__=="__main__":

Just a note for me.

Get 3D feature descriptors from PDB file

If reader is interested in drug discovery, chemoinformatics, deep learning or MD, I think the reader might read the article below.
KDEEP is predictor that uses Deep learning(CNN) for affinity prediction. Regarding the article, I found the new python library named HTMD ( High Through Put Molecular Dynamics ). Really I am not good at Computational area such as ab initio or MD calculation. So I can not evaluate the real value of the library but I think it is useful for Comp Chemist.
HTMD can be installed from acellera channel of anaconda with some dependencies. Or can be build from source from github.
One of the function I was interested was “get Voxel descriptors”. A voxel represents a value on a regular grid in three-dimensional space. As same as pixel in 2D space.
It means HTMD can produce 3D descriptor from mol file. And It is easy to calculate the descriptor by using the library. Also HTMD has lots of functions to manipulate PDB or SDF or any file format files.
Following code is example to calculate the descriptors.

from htmd.ui import *
from htmd.molecule import voxeldescriptors
from htmd.molecule.util import boundingBox
# easy to read pdb by using Molecule.
# Also easy to get focused residue id 
atp = Molecule('1atp.pdb')
lck = Molecule('2pl0.pdb')
print(atp.numAtoms, lck.numAtoms)
print(atp.numResidues, lck.numResidues)
atp.get('resid', sel='resname CYS') # get residue id of cysteines
3070 2312
462 362
array([199, 199, 199, 199, 199, 199, 343, 343, 343, 343, 343, 343])
# get sequence

getVoxelDescriptors calculates feature of the mol object and return features as array, centers of voxel.
usercenters is user defined center of voxel. Following example is but example because there are voxels of oblique direction…
If user do not define usercenters the function return suitable size of features but it depends on input file. It means the function returns atypical shape of features.
BTW, each voxel has 8 features(see results of atpeat.shape). The features define the 8 feature of the voxel, (‘hydrophobic’, ‘aromatic’, ‘hbond_acceptor’, ‘hbond_donor’, ‘positive_ionizable’, ‘negative_ionizable’, ‘metal’, ‘occupancies’).
And parameters of these features from Autodock 4.
So the performance of the predictive models using the voxelfeatures depends on performance of Autodock4. But useful for applying machine learning because easy to calculate the 3D features.

Recently there are lots of tools for calculation of molecular features. Which tool shows best performance? πŸ™‚
I do not have the answer but I am interested in the area. I would like to discuss with Comp Chemist.
And I would like to calculate voxelDescriptors in fixed area around the binding ligand.
I wii be happy if reader who is good at HTMD give the hint or tips to do that.

atpfeat, atpcenters = voxeldescriptors.getVoxelDescriptors(atp, usercenters=np.array([[i,i,i] for i in range(100)]), buffer=8)
lckfeat, lckcenters = voxeldescriptors.getVoxelDescriptors(lck, usercenters=np.array([[i,i,i] for i in range(100)]), buffer=8)
print(atpfeat.shape, lckfeat.shape)
print(atpcenters.shape, lckcenters.shape)
(100, 8) (100, 8)
(100, 3) (100, 3)
array([[3.81532829e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [9.99980260e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.08001621e-01, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.64279406e-01],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00],

Replace strings in pandas dataframe / Memo

I would like to convert smiles string which has Cl or Br atoms convert to new one which has L as Cl and R as Br.
I wrote simple code for my note.

import pandas as pd
df = pd.read_table('cdk2.smi', sep=' ')
# make replace function with lumbda function
rep_func = lambda x: x.replace('Cl', 'L').replace('Br','R')
df['repsmi'] = df.SMILES.apply(rep_func)
for row in df.repsmi:                                      
    ...:     if "L" in row:
    ...:         print(row)
    ...:     else: pass

Result was..


OK. If I use vectrize function the performance will be increased.

mol2vec analogy of word2vec #RDKit

Last year researcher who is in Bio MedX published following article.
And recently the article was published from ACS.
The concept of mol2vec is same as word2vec.
Word2vec converts word to vector with large data set of corpus and showed success in NLP. Mol2Vec converts molecules to vector with ECFP information.
Fortunately Mol2Vec source code is uploaded to github.
I played mol2vec by reference to original repository.
mol2vec wraps gensim that is python library for topic modeling so user can make corpus and model very easily.
At first I make corpus and model by using very small dataset.
*github repo. provides model which is made from ZINC DB. Following code is sample of how to make model in mol2vec.

from mol2vec.features import DfVec, mol2alt_sentence, mol2sentence, generate_corpus, insert_unk
from mol2vec.features import train_word2vec_model
generate_corpus("cdk2.sdf", "cdk2.corp", 1 )
#I do not need to make unk file because I used same data for analysis.
#insert_unk('cdk2.corp', 'unk_cdk2.corp')
train_word2vec_model('cdk2.corp', 'cdk2word2vec.model', min_count=0)

Cool! Easy to make model.
Next I conduct tSNE analysis and visualize data.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from gensim.models import word2vec
from rdkit import Chem
from rdkit.Chem import Draw
from mol2vec.features import mol2alt_sentence, MolSentence, DfVec, sentences2vec
from mol2vec.helpers import depict_identifier, plot_2D_vectors, IdentifierTable, mol_to_svg

#from supplied data (skip gram word size of 10, radi 1, UNK to replace all identifiers that appear less than 4 time)
#model = word2vec.Word2Vec.load('model_300dim.pkl')
model = word2vec.Word2Vec.load('cdk2word2vec.model')
print("****number of unique identifiers in the model: {}".format(len(model.wv.vocab.keys())))
# get feature vec of 2246728737

mols = [ mol for mol in Chem.SDMolSupplier("cdk2.sdf")]

# encode molecules as sentence(represented by MorganFP)
sentences = [ mol2alt_sentence(mol, 1) for mol in mols ]
flat_list = [ item for sublist in sentences for item in sublist ]
mol_identifiers_unique = list(set( flat_list ))
print("mols has {} identifiers and has {} unique identifiers".format(len(flat_list), len(mol_identifiers_unique) ))
# make projection of 300D vectors to 2D using PCA/TSNE combination
df_vec = pd.DataFrame()
df_vec['identifier'] = mol_identifiers_unique
df_vec.index = df_vec['identifier']

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_model = PCA(n_components=30)
#tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='cosine')
# I use sklearn v0.19, metric = 'cosine' did not work see following URL / SOF.
tsne_model = TSNE(n_components=2, perplexity=10, n_iter=1000, metric='euclidean')

tsne_pca = tsne_model.fit_transform( pca_model.fit_transform([model.wv.word_vec(x) for x in mol_identifiers_unique]))

df_vec['PCA-tSNE-c1'] = tsne_pca.T[0]
df_vec['PCA-tSNE-c2'] = tsne_pca.T[1]
projections = df_vec.to_dict()
# Function that extracts projected values for plotting
def get_values(identifier, projections):
    return np.array((projections['PCA-tSNE-c1'][str(identifier)],projections['PCA-tSNE-c2'][str(identifier)]))

# substructure vectors are plotted as green and molecule vectors is plotted as cyan.

mol_vals = [ get_values(x, projections) for x in sentences[0]]
subplt = plot_2D_vectors( mol_vals )

plot_2D_vectors can visualize vector of each fingerprint and molecule.
I often use PCA to analyze of chemical space but results of PCA depends on dataset. But Mol2Vec make model first and apply data to model. It means Mol2vec analyze different dataset on the same basis(model).
My example code is very simple and shows a small part of Mol2Vec.
BTW, word2vec can calculate word as vec for example tokyo – japan + paris = france.
How about Mol2Vec ? mol A – mol B + mol C = mol D???? Maybe it is difficult because Finger print algorithm uses Hash function I think.
Reader who is interested in the package, please visit official repository and enjoy. πŸ˜‰

Simple way for making SMILES file #RDKit

To convert SDF to SMILES I write like a following code.

sdf = Chem.SDMolSupplier( 'some.sdf' )
with open('smiles.smi', 'w') as f:
    for mol in sdf:
        smi = Chem.MolToSmiles(mol)

In this way, to write smiles strings with properties it is needed to get properties by using GetProp(“some prop”). If I need several properties my code tend to be long.
Greg who is developer of RDKit advised me to use SmilesMolWriter. πŸ˜‰

I have never used this function so I used it and found it was very useful.
Let me show the example. (copy & paste from Jupyter notebook)

import pprint
from rdkit import rdBase
from rdkit import Chem
from rdkit.Chem.rdmolfiles import SmilesWriter

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol != None]
# make writer object with a file name.
writer = SmilesWriter('cdk2smi1.smi')
#Check prop names.

#SetProps method can set properties that will be written to files with SMILES.
#The way of writing molecules can perform common way.
for mol in mols:
    writer.write( mol )

Then check the file.

!head -n 10 cdk2smi1.smi
SMILES Name Cluster
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 2
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 2

How about set all props ?

writer = SmilesWriter('cdk2smi2.smi')
for mol in mols:
    writer.write( mol )
!head -n 10 cdk2smi2.smi
SMILES Name id Cluster MODEL.SOURCE MODEL.CCRATIO r_mmffld_Potential_Energy-OPLS_2005 r_mmffld_RMS_Derivative-OPLS_2005 b_mmffld_Minimization_Converged-OPLS_2005
CC(C)C(=O)COc1nc(N)nc2[nH]cnc12 ZINC03814457 ZINC03814457 1 CORINA 3.44 0027  09.01.2008 1 -78.6454 0.000213629 1
Nc1nc(OCC2CCCO2)c2nc[nH]c2n1 ZINC03814459 ZINC03814459 2 CORINA 3.44 0027  09.01.2008 1 -67.4705 9.48919e-05 1
Nc1nc(OCC2CCC(=O)N2)c2nc[nH]c2n1 ZINC03814460 ZINC03814460 2 CORINA 3.44 0027  09.01.2008 1 -89.4303 5.17485e-05 1
Nc1nc(OCC2CCCCC2)c2nc[nH]c2n1 ZINC00023543 ZINC00023543 3 CORINA 3.44 0027  09.01.2008 1 -70.2463 6.35949e-05 1

Wow it works fine. This method is useful and easy to use.
Do most of the RDKiters use the function?
I’m totally out of fashion!!!

I pushed my Fast Clustering code to rdkit/Contrib repository. It was big news for me.

API for opentargets

Association of drug targets with diseases are important information for drug discovery. There are lots of databases to provide these information I think.

I like python. πŸ˜‰ So, I am interested in following article.
Opentargets is a ” a data integration and visualization platform that provides evidence about the association of known and potential drug targets with diseases”.
That sounds good. The platform is provided by web app. You can use the data from
UI is very simple and sophisticated.
And also Python API is provided from pypi / github. I used the package by referring to the document.
At first I installed the package from pip command.

iwatobipen$ pip install git+git://

Next, I tried to get evidence for a disease. Retrieved data related my query and printed it.

import pprint
from opentargets import OpenTargetsClient
ot = OpenTargetsClient()
e_for_disease = ot.get_evidence_for_disease('arthritis')
res = []
for e in e_for_disease:
        res.append([ e['disease']['efo_info']['label'], 
                   e['drug']['molecule_name']] )
# get unique dataset
uniqres = []
for i in res:
    if i in uniqres:

# check the result
for i in uniqres:

['gout', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'INDOMETHACIN']
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD86 inhibitor', 'T-lymphocyte activation antigen CD86', 'ABATACEPT']
['rheumatoid arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
['rheumatoid arthritis', 'T-lymphocyte activation antigen CD80 inhibitor', 'T-lymphocyte activation antigen CD80', 'ABATACEPT']
['rheumatoid arthritis', 'Dihydroorotate dehydrogenase inhibitor', 'Dihydroorotate dehydrogenase', 'LEFLUNOMIDE']
['osteoarthritis', 'Plasminogen inhibitor', 'Plasminogen', 'TRANEXAMIC ACID']
['gout', 'Xanthine dehydrogenase inhibitor', 'Xanthine dehydrogenase', 'ALLOPURINOL']
['osteoarthritis', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
['rheumatoid arthritis', 'Janus Kinase (JAK) inhibitor', 'Janus Kinase (JAK)', 'TOFACITINIB']
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'ADALIMUMAB']
['chronic childhood arthritis', 'Dihydrofolate reductase inhibitor', 'Dihydrofolate reductase', 'METHOTREXATE']
['osteoarthritis, hip', 'Adrenergic receptor agonist', 'Adrenergic receptor', 'EPINEPHRINE']
['rheumatoid arthritis', 'Interleukin-6 receptor alpha subunit inhibitor', 'Interleukin-6 receptor alpha subunit', 'TOCILIZUMAB']
['osteoarthritis', 'Coagulation factor X inhibitor', 'Coagulation factor X', 'RIVAROXABAN']
['osteoarthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['rheumatoid arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['chronic childhood arthritis', 'Cyclooxygenase-2 inhibitor', 'Cyclooxygenase-2', 'CELECOXIB']
['rheumatoid arthritis', 'FK506-binding protein 1A inhibitor', 'FK506-binding protein 1A', 'TACROLIMUS']
['osteoarthritis', 'Mu opioid receptor agonist', 'Mu opioid receptor', 'MORPHINE']
['rheumatoid arthritis', 'TNF-alpha inhibitor', 'TNF-alpha', 'INFLIXIMAB']
['chronic childhood arthritis', 'Histamine H2 receptor antagonist', 'Histamine H2 receptor', 'FAMOTIDINE']
['osteoarthritis', 'Adrenergic receptor alpha-2 agonist', 'Adrenergic receptor alpha-2', 'DEXMEDETOMIDINE']
['chronic childhood arthritis', 'Cyclooxygenase inhibitor', 'Cyclooxygenase', 'IBUPROFEN']

Next, retrieve data with limited field. The returned object can be converted pandas dataframe.

from opentargets import OpenTargetsClient
ct = OpenTargetsClient()
res = ct.get_associations_for_target('PDCD1',
resdf = res.to_dataframe()

Statistics method can handle scores. (I do not understand the part well so I need to read document)

from opentargets import OpenTargetsClient
from opentargets.statistics import HarmonicSumScorer
ot = OpenTargetsClient()
r = ot.get_associations_for_target('PDCD1')
interesting_datatypes = ['genetic_association', 'known_drug', 'somatic_mutation']

def score_with_datatype_subset(datatypes,results):
    for i in results:
        datatype_scores = i['association_score']['datatypes']
        filtered_scores = [datatype_scores[dt] for dt in datatypes]
        custom_score = HarmonicSumScorer.harmonic_sum(filtered_scores)
        if custom_score:
            yield(custom_score, i['disease']['id'], dict(zip(datatypes, filtered_scores)))
for i in score_with_datatype_subset(interesting_datatypes, r):

Whole code is pushed my repo.