I caught the flu last Monday, so I am staying home and resting this week... :-(
I have a fever and hope to get better soon.
BTW, I recently found a new technique for dimensionality reduction called Uniform Manifold Approximation and Projection (UMAP). It has also been a topic on my Twitter timeline.
Links to the original paper and the GitHub repository are below.
1802.03426.pdf (the UMAP paper on arXiv)
https://github.com/lmcinnes/umap
The repository provides example notebooks that show how UMAP works, and the analysis is very easy to run. I tried UMAP with almost default settings and then clustered the embedded data.
For the clustering I used HDBSCAN, which is based on DBSCAN. DBSCAN is a density-based clustering method that does not require the number of clusters as input. That is the main difference from methods such as k-means.
DBSCAN is implemented in scikit-learn, so it is easy to run.
The difficulties with DBSCAN are parameter selection and the handling of clusters with varying density.
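To make the variable-density problem concrete, here is a minimal sketch using scikit-learn's DBSCAN on toy data. The grid layout, the `grid` and `summarize` helpers and the two `eps` values are my own assumptions for illustration, not taken from the paper. Two dense groups sit close together and one sparse group sits far away, so no single `eps` separates all three.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def grid(x0, y0, spacing):
    """Return a 4 x 5 grid of 20 points starting at (x0, y0)."""
    xs, ys = np.meshgrid(np.arange(4), np.arange(5))
    return np.column_stack([x0 + spacing * xs.ravel(),
                            y0 + spacing * ys.ravel()])


# Rows 0-19: dense group A, rows 20-39: dense group B (close to A),
# rows 40-59: sparse group C (far away). All values are made up.
points = np.vstack([grid(0.0, 0.0, 0.1),
                    grid(1.0, 0.0, 0.1),
                    grid(20.0, 20.0, 1.0)])


def summarize(eps):
    """Run DBSCAN and return labels, number of clusters and noise count."""
    labels = DBSCAN(eps=eps, min_samples=4).fit(points).labels_
    n_clusters = len(set(labels.tolist()) - {-1})  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    return labels, n_clusters, n_noise


# A small eps separates A and B but treats the sparse group C as noise.
labels_small, k_small, noise_small = summarize(0.15)
# A large eps recovers C but merges A and B into one cluster.
labels_large, k_large, noise_large = summarize(1.5)

print("eps=0.15:", k_small, "clusters,", noise_small, "noise points")
print("eps=1.5 :", k_large, "clusters,", noise_large, "noise points")
```

Both runs report two clusters, but they are different pairs of clusters, which is the kind of situation HDBSCAN is meant to handle.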
Fig. 1 of the original paper[*] shows the difference between k-means, DBSCAN and HDBSCAN. DBSCAN fails to detect some clusters that HDBSCAN does detect, which I find interesting.
*https://arxiv.org/pdf/1705.07321.pdf [HDBSCAN]
**https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf [DBSCAN original paper]
HDBSCAN is a hierarchical clustering algorithm, yet it runs fast, which is a strong point. It uses ‘runt pruning’ with a parameter m, the ‘minimum cluster size’: any branch of the cluster tree that represents a cluster smaller than m is pruned out of the tree.
Good news! HDBSCAN is available as a Python package for unsupervised learning, so you can install it via pip or conda. ⭐️
Now let's move on to the code. I used GSK3b inhibitors as the dataset and calculated a fingerprint for each compound with RDKit's Morgan fingerprint (MorganFP).
Then I performed tSNE and UMAP with a custom metric, ‘Tanimoto dissimilarity’.
The @numba.njit() decorator speeds up the calculation.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap

sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

@numba.njit()
def tanimoto_dist(a,b):
    dotprod = np.dot(a,b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0-tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)
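As a quick sanity check of the Tanimoto dissimilarity, here is a small sketch that applies the same formula to three tiny bit vectors. The vectors `fp1`, `fp2` and `fp3` are made-up examples, not real Morgan fingerprints, and I left out the numba decorator so that the snippet needs only NumPy.

```python
import numpy as np


def tanimoto_dist(a, b):
    """Tanimoto dissimilarity (1 - Tc) between two binary vectors."""
    dotprod = np.dot(a, b)  # number of bits set in both vectors
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0 - tc


# Made-up 6-bit "fingerprints" for illustration only.
fp1 = np.array([1, 1, 0, 1, 0, 0], dtype=np.float64)
fp2 = np.array([1, 1, 0, 0, 1, 0], dtype=np.float64)
fp3 = np.array([0, 0, 1, 0, 0, 1], dtype=np.float64)

d_same = tanimoto_dist(fp1, fp1)      # identical vectors -> 0.0
d_partial = tanimoto_dist(fp1, fp2)   # 2 shared bits out of 4 -> 0.5
d_disjoint = tanimoto_dist(fp1, fp3)  # no shared bits -> 1.0

print(d_same, d_partial, d_disjoint)
```

Note that the formula divides by zero when both vectors have no bits set, so this sketch assumes every fingerprint has at least one bit on.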
Next, run HDBSCAN. It is simple: just call hdbscan.HDBSCAN and fit the data! ;-)
import hdbscan

cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
cluster_tsne.fit(tsne_X)
cluster_umap.fit(umap_X)
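After fitting, the cluster assignments are available in the `labels_` attribute, where -1 marks noise points. The sketch below shows one way to summarize such an array. The `labels` values are a hypothetical stand-in for `cluster_umap.labels_`, not results from my dataset.

```python
import numpy as np

# Hypothetical label array standing in for cluster_umap.labels_;
# -1 means the point was treated as noise.
labels = np.array([0, 0, 1, -1, 1, 2, 2, 2, -1, 0])

# Count the members of each real cluster (noise excluded).
cluster_ids, sizes = np.unique(labels[labels >= 0], return_counts=True)
n_noise = int(np.sum(labels == -1))

for cid, size in zip(cluster_ids, sizes):
    print("cluster", cid, "has", size, "points")
print("noise points:", n_noise)
```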
After clustering, the fitted object can draw a minimum spanning tree plot with a single line of code.
cluster_umap.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.6, node_size=90, edge_linewidth=2)
Then make scatter plots of the tSNE and UMAP data.
plt.scatter(tsne_X.T[0], tsne_X.T[1], c = cluster_umap.labels_, cmap='plasma') # image below
plt.scatter(umap_X.T[0], umap_X.T[1], c = cluster_umap.labels_, cmap='plasma') # image below
I think the results of UMAP and HDBSCAN depend on the parameters, but both libraries are easy to use with only a few parameters. The original repository provides nice examples.
My whole code is pushed to my repository.
https://github.com/iwatobipen/chemo_info/blob/master/chemicalspace2/HDBSCAN_Chemoinfo.ipynb
Hi and thanks for sharing your implementation.
My question is probably very basic but I’m relatively new to Python:
Do the fit_transform functions (in tSNE and UMAP) and the fit function (HDBSCAN) preserve the data order? So can I assume that X[i] corresponds to tsne_X[i] and umap_X[i]?
Thank you for your comment.
Regarding the API, the fit function of DBSCAN (and HDBSCAN) stores the cluster labels in the labels_ attribute as an ndarray of shape (n_samples,), and fit_transform of tSNE returns an array of shape (n_samples, n_components).
So my answer is yes, it does.
I hope this helps. ;-)
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
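If you want to check the row order yourself, a small experiment like the following may help. It is only a sketch with made-up points, and it uses scikit-learn's DBSCAN rather than HDBSCAN to keep the dependencies small. It fits the same data in original and reversed order and compares the groupings row by row.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data: rows 0-4 form one tight group, rows 5-9 another.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [10.0, 10.0], [10.1, 10.0], [10.0, 10.1], [10.1, 10.1],
              [10.05, 10.05]])

# labels[i] is the cluster of X[i]; one label per input row.
labels = DBSCAN(eps=0.5, min_samples=3).fit(X).labels_

# Fit again on the reversed rows, then undo the reversal on the output.
labels_rev = DBSCAN(eps=0.5, min_samples=3).fit(X[::-1].copy()).labels_
restored = labels_rev[::-1]

# The cluster ids may be numbered differently, so compare the grouping:
# two rows share a cluster in one run exactly when they do in the other.
n = len(X)
same_partition = all(
    (labels[i] == labels[j]) == (restored[i] == restored[j])
    for i in range(n) for j in range(n)
)
print(labels.shape, same_partition)
```

The ids themselves can change when the input order changes, so it is safer to compare groupings than raw label values.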
It definitely will. Thank you very much :)
You’re welcome!
Hi, thank you for the nice tutorial. I have one basic question regarding your Tanimoto function, as I am new to Python. You said “Tanimoto dissimilarity”. So, if I want to compute t-SNE with the Tanimoto distance between the compounds as the metric, should I change the function so it returns tc rather than 1.0-tc (Tanimoto being calculated as c / (a + b - c))? Is that correct?
Hi Zinasi, thank you for your comment.
By default, TSNE uses Euclidean distance rather than Tanimoto distance (called Tanimoto dissimilarity in my post).
https://scikit-learn.org/0.15/modules/generated/sklearn.manifold.TSNE.html
TSNE expects a ‘distance’ as its metric, so the value for identical compounds should be zero. So I think you should use 1.0-tc rather than tc for TSNE.
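To illustrate why 1.0-tc behaves like a distance, here is a rough sketch that builds the full pairwise matrix for a few made-up bit vectors. The function name `tanimoto_distance_matrix` and the data are my own assumptions for this reply. Identical fingerprints end up at distance 0 and fingerprints with no shared bits at distance 1.

```python
import numpy as np


def tanimoto_distance_matrix(fps):
    """Pairwise Tanimoto distance (1 - Tc) for rows of a binary matrix.

    Assumes every row has at least one bit set, otherwise the
    division below would hit 0 / 0.
    """
    fps = np.asarray(fps, dtype=np.float64)
    shared = fps @ fps.T                  # bits set in both rows
    counts = fps.sum(axis=1)              # bits set in each row
    union = counts[:, None] + counts[None, :] - shared
    return 1.0 - shared / union


# Made-up fingerprints: rows 0 and 1 are identical, row 2 shares nothing.
fps = [[1, 1, 0, 1],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
D = tanimoto_distance_matrix(fps)
print(D)
```

A matrix like this has a zero diagonal and is symmetric, which is what a distance-based method expects. The TSNE documentation also describes a metric="precomputed" option for passing a distance matrix directly, though I did not use it in this post.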
Thanks.