The contents of the post is almost same as yesterday’s one that was for souyaku-advent calendar 2019.
In drug discovery area we often discuss about compound chemical space. Chemist should be explore chemical space to find new IP space etc. It is defined with descriptors, fingerprint and biological response etc… It is often high dimensional space so we would like to reduce the dimension with some mathematical approach such as PCA, TSNE and UMAP.
And also we use clustering method for grouping molecules.
There are many clustering method for example k-means, hierarchical clustering. However k-means need to define k value and hierarchical method need to define threshold for doing that. So try and error is required. If you would like to get clustering label conveniently, I recommend to try to use HDBSCAN. HDBSCAN clusters data with their density. Ref URL is below.
https://link.springer.com/chapter/10.1007%2F978-3-642-37456-2_14
Fortunately we can use HDBSCAN from python. It is easy to install via pip or conda. Please check original document to install. ;)
I tried to demo with HDBSCAN. Following data is bollowed from chaper8 in py4chemoinformatics (Thanks@fmkz___ for providing data and code!)
Let’s write code! Import packages at first. I’ll use mlinsght later.
%matplotlib inline import os import numpy as np import matplotlib.pyplot as plt import seaborn as sns from rdkit import Chem, DataStructs from rdkit.Chem import AllChem, Draw from rdkit import RDLogger from sklearn.model_selection import train_test_split from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.ensemble import RandomForestRegressor from sklearn.manifold import TSNE from mlinsights.mlmodel import PredictableTSNE from hdbscan import HDBSCAN sns.set_context('poster') sns.set_style('white') sns.set_color_codes() plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0} RDLogger.DisableLog('rdApp.*') seed = 794
I defined tanimoto-dis tfunction for HDBSCAN. HDBSCAN requires metric arg. for measuring distances of each data point. And there isn’t tanimoto distance as a default metric. If user would like to original metric function, pass ‘pyfunc’ for metric arugment and your own function is passed to func argument. Func should be took two numpy array and return distance.
@@@@@
My tanimoto dist function is tooooooooooooooooooooooooo slow it has room for improvement I fill improve it later. ;)
@@@@@
oxrs = [("CHEMBL3098111", "Merck" ), ("CHEMBL3867477", "Merck" ), ("CHEMBL2380240", "Rottapharm" ), ("CHEMBL3352684", "Merck" ), ("CHEMBL3769367", "Merck" ), ("CHEMBL3526050", "Actelion" ), ("CHEMBL3112474", "Actelion" ), ("CHEMBL3739366", "Heptares" ), ("CHEMBL3739395", "Actelion" ), ("CHEMBL3351489", "Eisai" )] fps = [] docs = [] companies = [] mol_list = [] for cid, company in oxrs: sdf_file = os.path.join("data", cid + ".sdf") mols = Chem.SDMolSupplier(sdf_file) for mol in mols: if mol is not None: mol_list.append(mol) fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2) arr = np.zeros((1,)) DataStructs.ConvertToNumpyArray(fp, arr) docs.append(cid) companies.append(company) fps.append(arr) fps = np.array(fps) companies = np.array(companies)def tanimoto_dist(ar1, ar2): a = np.dot(ar1, ar2) b = ar1 + ar2 - ar1*ar2 return 1 - a/np.sum(b) clusterer = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist) clusterer.fit(fps) docs = np.array(docs) trainIDX, testIDX = train_test_split(range(len(fps)), random_state=seed)
Now ready, let’s check data without and with HDBSCAN. I made two plots one is coloured with company name and the other is coloured with HDBCAN label.
tsne = TSNE(random_state=seed) res = tsne.fit_transform(fps) plt.clf() plt.figure(figsize=(12, 6)) sns.scatterplot(res[:,0], res[:,1], hue=companies, **plot_kwds) plt.clf() plt.figure(figsize=(12, 6)) palette = sns.color_palette() cluster_colors = [sns.desaturate(palette[col], sat) if col >= 0 else (0.5, 0.5, 0.5) for col, sat in zip(clusterer.labels_, clusterer.probabilities_)] plt.scatter(res[:,0], res[:,1], c=cluster_colors, **plot_kwds)


Coloured with Gray color point is not labelled data. The result shows HDBSCAN could do almost reasonable labelling in this case.
Plot new compound with defined chemical space
Next topic is how to plot new compound in pre-defined chemical space.
Theoretically, PCA and tSNE can’t add new data to pre-defined latent space. Re calculation is required if user would like to add new data. But we would like to map new designed compounds to current chemical space. Hmm…. how to do it?
The answer for the question is provided from mlinghts. The package has PredictableTSNE class. The class take transformer and estimator object. And it makes predictive model for transformer output (such as PCA, TSNE output) with estimator. It will be clear if you read source code. After learning, PredictableTSNE can predict latent space from data features.
Following code, I used random sampling and random forest and GP regressor for estimator.
trainFP = [fps[i] for i in trainIDX] train_mol = [mol_list[i] for i in trainIDX] testFP = [fps[i] for i in testIDX] test_mol = [mol_list[i] for i in testIDX] allFP = trainFP + testFP tsne_ref = TSNE(random_state=seed) res = tsne_ref.fit_transform(allFP) plt.clf() plt.figure(figsize=(12, 6)) sns.scatterplot(res[:,0], res[:,1], hue=['train' for i in range(len(trainFP))] + ['test' for i in range(len(testFP))])

Use predictiveTSNE for new compound chemical space prediction. Following examples are in RF case and GP case.
rfr = RandomForestRegressor(random_state=seed) tsne1 = TSNE(random_state=seed) pred_tsne_rfr = PredictableTSNE(transformer=tsne1, estimator=rfr, keep_tsne_outputs=True) pred_tsne_rfr.fit(trainFP, list(range(len(trainFP)))) pred1 = pred_tsne_rfr.transform(testFP) plt.clf() plt.figure(figsize=(12, 6)) plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c='blue', alpha=0.5) plt.scatter(pred1[:,0], pred1[:,1], c='red', alpha=0.5) gbr = GaussianProcessRegressor(random_state=seed) tsne2 = TSNE(random_state=seed) pred_tsne_gbr = PredictableTSNE(transformer=tsne2, estimator=gbr, keep_tsne_outputs=True) pred_tsne_gbr.fit(trainFP, list(range(len(trainFP)))pred2 = pred_tsne_gbr.transform(testFP) plt.clf() plt.figure(figsize=(12, 6)) plt.scatter(pred_tsne_gbr.tsne_outputs_[:,0], pred_tsne_gbr.tsne_outputs_[:,1], c='blue', alpha=0.5) plt.scatter(pred2[:,0], pred2[:,1], c='red', alpha=0.5))


In above case, random forest seems better performance for new compound prediction.
The method of estimator has a big influence for preditiveTSNE performance. It is useful for plot new data but it has risk for miss-leading for design.
At the end, I checked cluster label and preditiveTSNE data.
allmol = train_mol + test_mol fps2 = [] clusterer2 = HDBSCAN(algorithm='best', min_samples=5, metric='pyfunc', func=tanimoto_dist) for mol in allmol: fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2) arr = np.zeros((1,)) DataStructs.ConvertToNumpyArray(fp, arr) fps2.append(arr) fps2 = np.array(fps2) clusterer2.fit(fps2) plt.clf() plt.figure(figsize=(12, 6)) plt.scatter(pred_tsne_rfr.tsne_outputs_[:,0], pred_tsne_rfr.tsne_outputs_[:,1], c=clusterer2.labels_[:len(trainFP)], alpha=0.5, marker='x') plt.scatter(pred1[:,0], pred1[:,1], c=clusterer2.labels_[len(trainFP):], alpha=0.5, marker='o')

Not only training data but also test data which has same colour are plotted almost same area. It indicates that the approach works well in this case.
Finally we often discuss chemicals pace and use the term ‘chemical space’ but it is very difficult to define, use and compare it.
All code is uploaded in my gist. Any comments and/or suggestions are greatly appreciated.