Clustering molecules with HDBSCAN and predict chemical space with scikit-learn #chemoinformatics #RDKit

The contents of the post is almost same as yesterday’s one that was for souyaku-advent calendar 2019.

In drug discovery area we often discuss about compound chemical space. Chemist should be explore chemical space to find new IP space etc. It is defined with descriptors, fingerprint and biological response etc… It is often high dimensional space so we would like to reduce the dimension with some mathematical approach such as PCA, TSNE and UMAP.

And also we use clustering method for grouping molecules.

There are many clustering method for example k-means, hierarchical clustering. However k-means need to define k value and hierarchical method need to define threshold for doing that. So try and error is required. If you would like to get clustering label conveniently, I recommend to try to use HDBSCAN. HDBSCAN clusters data with their density. Ref URL is below.

Fortunately we can use HDBSCAN from python. It is easy to install via pip or conda. Please check original document to install. ;)

I tried to demo with HDBSCAN. Following data is bollowed from chaper8 in py4chemoinformatics (Thanks@fmkz___ for providing data and code!)

Let’s write code! Import packages at first. I’ll use mlinsght later.

%matplotlib inline
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit import RDLogger
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.manifold import TSNE
from mlinsights.mlmodel import PredictableTSNE
from hdbscan import HDBSCAN

plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}
seed = 794

I defined tanimoto-dis tfunction for HDBSCAN. HDBSCAN requires metric arg. for measuring distances of each data point. And there isn’t tanimoto distance as a default metric. If user would like to original metric function, pass ‘pyfunc’ for metric arugment and your own function is passed to func argument. Func should be took two numpy array and return distance.
My tanimoto dist function is tooooooooooooooooooooooooo slow it has room for improvement I fill improve it later. ;)

oxrs = [("CHEMBL3098111", "Merck" ),  ("CHEMBL3867477", "Merck" ),  ("CHEMBL2380240", "Rottapharm" ),
             ("CHEMBL3352684", "Merck" ),  ("CHEMBL3769367", "Merck" ),  ("CHEMBL3526050", "Actelion" ),
             ("CHEMBL3112474", "Actelion" ),  ("CHEMBL3739366", "Heptares" ),  ("CHEMBL3739395", "Actelion" ), 
             ("CHEMBL3351489", "Eisai" )]

fps = []
docs = []
companies = []
mol_list = []
for cid, company in oxrs:
    sdf_file = os.path.join("data", cid + ".sdf")
    mols = Chem.SDMolSupplier(sdf_file)
    for mol in mols:
        if mol is not None:
            fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
            arr = np.zeros((1,))
            DataStructs.ConvertToNumpyArray(fp, arr)
fps = np.array(fps)
companies = np.array(companies)def tanimoto_dist(ar1, ar2):
    a =, ar2)
    b = ar1 + ar2 - ar1*ar2
    return 1 - a/np.sum(b)

clusterer = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
docs = np.array(docs)

trainIDX, testIDX = train_test_split(range(len(fps)), random_state=seed)

Now ready, let’s check data without and with HDBSCAN. I made two plots one is coloured with company name and the other is coloured with HDBCAN label.

tsne = TSNE(random_state=seed)
res = tsne.fit_transform(fps)
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=companies, **plot_kwds)

plt.figure(figsize=(12, 6))
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat)
                 if col >= 0 else (0.5, 0.5, 0.5) for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(res[:,0], res[:,1], c=cluster_colors, **plot_kwds)
coloured with company name
coloured with hdbscan label.

Coloured with Gray color point is not labelled data. The result shows HDBSCAN could do almost reasonable labelling in this case.

Plot new compound with defined chemical space

Next topic is how to plot new compound in pre-defined chemical space.
Theoretically, PCA and tSNE can’t add new data to pre-defined latent space. Re calculation is required if user would like to add new data. But we would like to map new designed compounds to current chemical space. Hmm…. how to do it?

The answer for the question is provided from mlinghts. The package has PredictableTSNE class. The class take transformer and estimator object. And it makes predictive model for transformer output (such as PCA, TSNE output) with estimator. It will be clear if you read source code. After learning, PredictableTSNE can predict latent space from data features.

Following code, I used random sampling and random forest and GP regressor for estimator.

trainFP = [fps[i] for i in trainIDX]
train_mol = [mol_list[i] for i in trainIDX]

testFP = [fps[i] for i in testIDX]
test_mol = [mol_list[i] for i in testIDX]
allFP = trainFP + testFP
tsne_ref = TSNE(random_state=seed)
res = tsne_ref.fit_transform(allFP)
plt.figure(figsize=(12, 6))
sns.scatterplot(res[:,0], res[:,1], hue=['train' for i in range(len(trainFP))] + ['test' for i in range(len(testFP))])
blue is training data, orange is test data.

Use predictiveTSNE for new compound chemical space prediction. Following examples are in RF case and GP case.

rfr = RandomForestRegressor(random_state=seed)
tsne1 = TSNE(random_state=seed)
pred_tsne_rfr = PredictableTSNE(transformer=tsne1, estimator=rfr, keep_tsne_outputs=True), list(range(len(trainFP))))

pred1 = pred_tsne_rfr.transform(testFP)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred1[:,0], pred1[:,1], c='red', alpha=0.5)

gbr = GaussianProcessRegressor(random_state=seed)
tsne2 = TSNE(random_state=seed)
pred_tsne_gbr = PredictableTSNE(transformer=tsne2, estimator=gbr, keep_tsne_outputs=True), list(range(len(trainFP)))pred2 = pred_tsne_gbr.transform(testFP)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_gbr.tsne_outputs_[:,0], pred_tsne_gbr.tsne_outputs_[:,1], c='blue', alpha=0.5)
plt.scatter(pred2[:,0], pred2[:,1], c='red', alpha=0.5))
RF as an estimator
GP as an estimator

In above case, random forest seems better performance for new compound prediction.

The method of estimator has a big influence for preditiveTSNE performance. It is useful for plot new data but it has risk for miss-leading for design.

At the end, I checked cluster label and preditiveTSNE data.

allmol = train_mol + test_mol
fps2 = []
clusterer2 = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist)
for mol in allmol:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
fps2 = np.array(fps2)
plt.figure(figsize=(12, 6))
plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c=clusterer2.labels_[:len(trainFP)], alpha=0.5, marker='x')
plt.scatter(pred1[:,0], pred1[:,1], c=clusterer2.labels_[len(trainFP):], alpha=0.5, marker='o')

x training data, o test data, colour : class label.

Not only training data but also test data which has same colour are plotted almost same area. It indicates that the approach works well in this case.

Finally we often discuss chemicals pace and use the term ‘chemical space’ but it is very difficult to define, use and compare it.

All code is uploaded in my gist. Any comments and/or suggestions are greatly appreciated.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.


Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: