Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

I caught a flu last Monday. So I stay home and rest in this week…… :-(
I’m having a fever and hope will get better soon.
BTW, recently I found new technique for dimension reduction called Uniform Manifold Approximation and Projection (UMAP). It was also topics in my twitter’s TL.
URL links of original paper and github repository are below.
https://arxiv.org/pdf/1802.03426.pdf
https://github.com/lmcinnes/umap
The repository provides example notebook how does UMAP works. And it is very easy to perform the analysis. I tried to use UMAP with almost default settings and to perform clustering with the data.
To perform the clustering I used HDBSCAN today it is based on DBSCAN. DBSCAN is density based clustering method and it is not required number of clusters for input. It is main difference for other clustering method.
DBSCAN is implemented Scikit-learn so it is easy to perform it.
One of difficulty of DBSCAN is parameter selection and the handling of variable density clusters.
In Fig1 of the original paper[*] shows difference between k-means, DBSCAN and HDBSCAN. DBSCAN can not detect some clusters but HDBSCAN can detect them it is interesting for me.
*https://arxiv.org/pdf/1705.07321.pdf [HDBSCAN]
**https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf [DBSCAN original paper]
HDBSCAN is hierarchical clustering algorithm but it works so fast. It is good point of the algorithm. It uses ‘runt pruning’ with parameter m ‘minimum cluster size’. Any branch of the cluster tree that represents a cluster of less than the size is pruned out of the tree.

Good news! HDBSCAN is python package for unsupervised learning to find clusters. So you can install HDBSCAN via pip or conda. ⭐️

Now move to code. I used GSK3b inhibitor as dataset and each Fingerprint was calculated with RDKit MorganFP.
Then perfomed tSNE and UMAP with original metrics ‘Tanimoto dissimilarity’.
@number.njit() decorator accelerates calculation speed.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap
sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

@numba.njit()
def tanimoto_dist(a,b):
    dotprod = np.dot(a,b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0-tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)

Next do HDBSCAN. It is simple just call hdbscan.HDBSCAN and fit data! ;-)

import hdbscan
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_tsne.fit(tsne_X)
cluster_umap.fit(umap_X)

After clustering the object can make spanning tree plot by simple coding.

cluster_umap.minimum_spanning_tree_.plot(edge_cmap='viridis',
                                   edge_alpha=0.6,
                                   node_size=90,
                                   edge_linewidth=2)

Image like ..

And make scatter plot with tSNE and UMAP data.

plt.scatter(tsne_X.T[0], tsne_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below
plt.scatter(umap_X.T[0], umap_X.T[1], c = cluster_umap.labels_, cmap='plasma')
# image below

tSNE

UMAP

I think results of UMAP and HDBSCAN are dependent on parameters but both library is easy to implement with several parameters. Nice example is provided from original repository.

And my whole code is pushed to my repository.
https://github.com/iwatobipen/chemo_info/blob/master/chemicalspace2/HDBSCAN_Chemoinfo.ipynb

Advertisements

4 thoughts on “Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

  1. Hi and thanks for sharing your implementation.
    My question is probably very basic but I’m relatively new to python:
    Does the fit_transform functions (in tSNE and UMAP) and the fit function (HDBSCAN) preserve the data order? So can I assume that X[i] correspond to tsne_X[i] and umap_X[i]?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.