Chemical space visualization and clustering with HDBSCAN and RDKit #RDKit

I caught the flu last Monday, so I am staying home and resting this week… 😦
I have a fever and hope to get better soon.
BTW, I recently found a new technique for dimensionality reduction called Uniform Manifold Approximation and Projection (UMAP). It was also a topic on my Twitter timeline.
URL links to the original paper and the GitHub repository are below.
The repository provides an example notebook showing how UMAP works, and the analysis is very easy to perform. I tried UMAP with almost default settings and then performed clustering on the result.
To perform the clustering I used HDBSCAN, which is based on DBSCAN. DBSCAN is a density-based clustering method that does not require the number of clusters as input; this is the main difference from other clustering methods.
DBSCAN is implemented in scikit-learn, so it is easy to use.
One difficulty of DBSCAN is parameter selection and the handling of clusters of variable density.
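As a minimal sketch of how density-based clustering behaves (toy data and parameters of my own choosing, using scikit-learn's DBSCAN):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # group A
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # group B
    [5.0, 50.0],                                # outlier
])

# eps is the neighborhood radius; min_samples the density threshold.
labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)
print(labels)  # two clusters found, the outlier is labeled -1
```

Note that the number of clusters is never specified in advance; the outlier simply falls out as noise (label -1).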
Fig. 1 of the original paper[*] shows the difference between k-means, DBSCAN, and HDBSCAN. DBSCAN cannot detect some clusters that HDBSCAN can detect, which I find interesting.
[*] DBSCAN original paper
HDBSCAN is a hierarchical clustering algorithm, but it runs fast, which is a good point of the algorithm. It uses 'runt pruning' with a parameter m, the 'minimum cluster size': any branch of the cluster tree that represents a cluster smaller than this size is pruned out of the tree.

Good news! HDBSCAN is available as a Python package for unsupervised clustering, so you can install it via pip or conda. ⭐️

Now let's move to the code. I used a set of GSK3β inhibitors as the dataset, and each fingerprint was calculated with the RDKit Morgan fingerprint.
Then I performed t-SNE and UMAP with a custom metric, 'Tanimoto dissimilarity'.
The @numba.njit() decorator accelerates the calculation.
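As a quick sanity check of the Tanimoto dissimilarity used as the metric, here is the arithmetic on two tiny hand-made bit vectors (plain NumPy, no numba, just for illustration):

```python
import numpy as np

def tanimoto_dist(a, b):
    # Tanimoto coefficient on bit vectors: |a & b| / (|a| + |b| - |a & b|)
    dotprod = np.dot(a, b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0 - tc

a = np.array([1, 1, 0, 1], dtype=np.float64)
b = np.array([1, 0, 1, 1], dtype=np.float64)
print(tanimoto_dist(a, b))  # 2 shared bits / (3 + 3 - 2) = 0.5 -> distance 0.5
```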

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.manifold import TSNE
import umap
plot_kwds = {'alpha':0.5, 's':80, 'linewidth':0}

#### snip ####
#### load SDF and calculate fingerprint step ####
#### snip ####

import numba

@numba.njit()
def tanimoto_dist(a, b):
    dotprod =, b)
    tc = dotprod / (np.sum(a) + np.sum(b) - dotprod)
    return 1.0 - tc

tsne = TSNE(n_components=2, metric=tanimoto_dist)
tsne_X = tsne.fit_transform(X)
umap_X = umap.UMAP(n_neighbors=8, min_dist=0.3, metric=tanimoto_dist).fit_transform(X)

Next, do HDBSCAN. It is simple: just call hdbscan.HDBSCAN and fit the data! 😉

import hdbscan
# fit on the 2-D embeddings computed above
cluster_tsne = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(tsne_X)
cluster_umap = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True).fit(umap_X)

After clustering, the fitted object can draw a spanning tree plot with simple code.


(image: minimum spanning tree plot)

And make scatter plots with the t-SNE and UMAP results.

plt.scatter(tsne_X.T[0], tsne_X.T[1], c=cluster_tsne.labels_, cmap='plasma')
# image below
plt.scatter(umap_X.T[0], umap_X.T[1], c=cluster_umap.labels_, cmap='plasma')
# image below



I think the results of UMAP and HDBSCAN depend on the parameters, but both libraries are easy to use with only a few parameters. A nice example is provided in the original repository.

And my whole code is pushed to my repository.


Mol encoder with PyTorch

A Variational Autoencoder (VAE) is a method used for learning latent representations. A VAE encodes a discrete vector to a continuous vector in latent space. There are lots of examples on GitHub.
In 2016, Alán Aspuru-Guzik's group reported a new de novo design method using a VAE. The approach represents molecules as SMILES, converts the SMILES strings to one-hot vectors, and uses those vectors for deep learning.
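The SMILES-to-one-hot step can be sketched roughly like this (the charset and padding width here are toy values of my own, not the paper's):

```python
import numpy as np

charset = [' ', '(', ')', '1', '=', 'c', 'n', 'O']  # toy charset
width = 12  # pad every SMILES to a fixed length

def one_hot_smiles(smiles, charset, width):
    # pad with spaces, then set one bit per character position
    padded = smiles.ljust(width)
    vec = np.zeros((width, len(charset)))
    for i, ch in enumerate(padded):
        vec[i, charset.index(ch)] = 1.0
    return vec

x = one_hot_smiles('c1ccccn1', charset, width)
print(x.shape)  # (12, 8): one row per character position
```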
And Maxhodak implemented a molencoder with Keras! It is nice work, and I think it had a strong impact on the chemoinformatics area.
Repository is below.

The code depends on Keras 1.1.1, Python 2.7, and some other packages. The recent Keras version is 2.x, and most code now supports Python 3.x. I wanted to test the mol VAE in a Python 3.x environment. Also, I am now learning PyTorch, so I wanted to convert the code from Keras to PyTorch.
I spent several days coding, but then I found a pioneer!
A researcher in Pande's group has published nice work.

The molencoder works on Python 3.5/3.6 and CUDA 8.0 with PyTorch. The code looks almost the same as keras-molecules, but there are some differences.
The main difference is the activation function: the original code uses ReLU, but this code uses SELU.
I have not checked the effect of this difference.
I wrote an interpolation script using the molencoder.
The code is below.

Description of some arguments:
SOURCE: the starting point of the interpolation.
DEST: the goal point of the experiment.
STEPS: how many steps the model takes when generating SMILES strings from the latent space between source and dest.
* The following code will run only on a GPU machine because I hard-coded "someprop.cuda()" instead of guarding it with a condition.
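The interpolation itself is just linear steps in latent space; a toy sketch with 2-D vectors (real latent vectors are of course much larger):

```python
import numpy as np

source = np.zeros(2)  # latent vector of the source molecule (toy)
dest = np.ones(2)     # latent vector of the destination molecule (toy)
steps = 4

# walk from source towards dest in equal increments
step = (dest - source) / float(steps)
points = [source + step * i for i in range(steps)]
print(points[0], points[2])  # [0. 0.] [0.5 0.5]
```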

import numpy as np
import torch
import argparse
from torch.autograd import Variable
from rdkit import Chem
from molencoder.models import MolEncoder, MolDecoder
from molencoder.utils import (load_dataset, initialize_weights, ReduceLROnPlateau, save_checkpoint, validate_model)
from molencoder.featurizers import Featurizer, OneHotFeaturizer

SOURCE = 'c1ccccn1'
DEST =  'c1ccccc1'
STEPS = 200
#charset from chembl

charset = [' ', '#', '%', '(', ')', '+', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '=', '@', 'A', 'B', 'C', 'F', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'S', 'T', 'V', 'X', 'Z', '[', '\\', ']', 'a', 'c', 'e', 'g', 'i', 'l', 'n', 'o', 'p', 'r', 's', 't']

def decode_smiles_from_index(vec, charset):
    return "".join(map(lambda x:charset[x],vec)).strip()

def get_arguments():
    parser = argparse.ArgumentParser(description="Interpolate from source to dest in steps")
    parser.add_argument("--source", type=str, default=SOURCE)
    parser.add_argument("--dest", type=str, default=DEST)
    parser.add_argument("--steps", type=int, default=STEPS)
    return parser.parse_args()

def interpolate(source, dest, steps, charset, encoder, decoder):
    width = 120  # fixed SMILES padding width (assumed; must match the trained model)
    source_just = source.ljust(width)
    dest_just = dest.ljust(width)
    onehot = OneHotFeaturizer(charset=charset)
    sourcevec = onehot.featurize(smiles=[source_just])
    destvec = onehot.featurize(smiles=[dest_just])
    source_encoded = Variable(torch.from_numpy(sourcevec).float()).cuda()
    dest_encoded = Variable(torch.from_numpy(destvec).float()).cuda()
    source_x_latent = encoder(source_encoded)
    dest_x_latent = encoder(dest_encoded)
    step = (dest_x_latent-source_x_latent)/float(steps)
    results = []
    for i in range(steps):
        item = source_x_latent + (step*i)
        sampled = np.argmax(decoder(item).cpu().data.numpy(), axis=2)
        decode_smiles = decode_smiles_from_index(sampled[0], charset)
        results.append((i, item, decode_smiles))
    return results

def main():
    args = get_arguments()
    print(torch.cuda.is_available())
    encoder = MolEncoder(c=len(charset)).cuda()
    decoder = MolDecoder(c=len(charset)).cuda()
    bestmodel = torch.load("model_best.pth.tar")
    #bestmodel = torch.load("tempcheckpoint.pth.tar")
    # checkpoint keys are assumed here; adjust to how the model was saved
    encoder.load_state_dict(bestmodel["encoder"])
    decoder.load_state_dict(bestmodel["decoder"])

    results = interpolate( args.source, args.dest, args.steps, charset, encoder, decoder )
    for result in results:
        print(result[0], result[2])

if __name__ == "__main__":
    main()

Just a note for me.

Peptide design x Deep learning

As you know, recurrent neural networks (RNNs) are used universally in machine learning for natural language, handwriting, speech, and also chemistry.
Recently there have been lots of reports that apply RNNs to SMILES strings to solve chemoinformatics problems. Today I read a short article published by Prof. Gisbert Schneider's group.
URL is below.

They applied an RNN (LSTM) to the design of antimicrobial peptides (AMPs). The strategy is basic: first, add tags to each peptide sequence and pad it to a fixed length, then encode it as one-hot vectors.
I think the key point of their method is the selection of training peptides. They removed sequences containing Cys, because Cys residues can form S-S bridges, which would complicate the problem.
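A rough sketch of that preprocessing (the tag characters, padding length, and example sequence are my own assumptions, not taken from the paper):

```python
import numpy as np

AMINO_ACIDS = list('ACDEFGHIKLMNPQRSTVWY')
# drop Cys as in the paper, then add start/end tags and a pad symbol
ALPHABET = ['<', '>', '_'] + [a for a in AMINO_ACIDS if a != 'C']
MAXLEN = 16

def encode_peptide(seq, maxlen=MAXLEN):
    # tag the sequence, pad to a fixed length, then one-hot encode
    tagged = '<' + seq + '>'
    padded = tagged.ljust(maxlen, '_')
    onehot = np.zeros((maxlen, len(ALPHABET)))
    for i, ch in enumerate(padded):
        onehot[i, ALPHABET.index(ch)] = 1.0
    return onehot

x = encode_peptide('GIGKFLHS')  # hypothetical AMP-like fragment
print(x.shape)  # (16, 22)
```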

Finally, they evaluated the trained model, and the model generated novel peptides with suitable hydrophobicity and length.
I think their strategy (removing Cys residues) is nice and fits RNNs well.
BTW, with this method the machine learns peptides as strings and does not learn the features of the individual amino acids. This is the same issue as with SMILES in the chemoinformatics area.
I have no answer to it.

If you are interested in the approach, you can get the source code from the following URL.

Applications of Fluorine atom in Drug discovery

Some years ago, there was a good review about the applications of fluorine in drug discovery. The perspective was published by researchers at BMS. It contained a lot of information about fluorine based on their experience and published data. I think it is still useful for medicinal chemists.
It was published in 2015 by ACS.

And recently the same author at BMS published a new review following up on it!
I skimmed the review today (59 pages! Too long for me ;-)). Some examples were already reported in the previous review, but there are lots of new insights and examples of fluorine. The article mainly focuses on bioisosteric replacement with fluorine. Bioisosteric replacement is often used in drug discovery not only to maintain potency but also to improve metabolic stability, solubility, or other parameters.
For example, the author describes the replacement of a tert-butyl group with a trifluoro-cyclopropane analogue in Table 3. It was interesting for me because it is not a simple replacement from tert-Bu to a trifluoro-dimethyl group. There are also examples of the same replacement for different protein targets.

The strategy of blocking metabolism with fluorine atoms is sometimes easy to understand, and medicinal chemists often try to introduce fluorine atoms into their compounds. But the application of fluorine is not limited to that strategy; interesting examples are described in the article.
Introducing a fluorine atom into the aromatic ring of compound 174 improved the solubility of the compound from 15 mg/mL to >500 mg/mL. The effect of the fluorine is not entirely clear, but the author suggests that the fluorine atom polarizes the adjacent N−H, which affects the solubility.

And I was surprised that the fluorine strategy is also effective in peptide drug discovery. In Table 29, fluoro derivatives of a 36-residue peptide derived from the amino terminus of human parathyroid hormone (hPTH) show binding potency for the PTH receptor. If lots of cost-effective fluorinated amino acids become available, can we design more potent peptide derivatives?

New findings on fluorine effects create new strategies for drug design. And sometimes new chemistry is needed to make fluorinated building blocks or to conduct fluorination reactions.

I think medicinal chemists need to catch up on both the new design strategies and the synthetic chemistry.

Rational design of GPCR biased Ligand

GPCRs are one of the major druggable target classes. GPCR activation controls many signaling pathways, which for most receptors are mediated by both G proteins and beta-arrestins. Different signaling pathways give different effects. To avoid side effects from G-protein signaling, designing beta-arrestin-selective ligands is a useful strategy for drug discovery, and there have been many reports about biased ligands over the past few years.

I am interested in this area and found the following article:
"Structure-inspired design of β-arrestin-biased ligands for aminergic GPCRs"

The authors designed a selective biased ligand of the D2 receptor by using homology modeling/SBDD and MD.

At first they focused on the TM5 and EL2 regions, which are important for G protein/beta-arrestin selectivity. They designed a new molecule from aripiprazole by replacing the dichloro moiety with an indole (compound 1). Compound 1 was biased.
Next they designed substituted analogues of compound 1 and obtained a clear SAR for the substituents. They also performed MD simulations on the indole motif and revealed the effect of the substituents.

Finally, they rationally designed a biased ligand more selective than the initial compound 1 (Fig. 5). The bias index is 20 vs 2 (compound 7 vs compound 1).

It was interesting for me because all the molecules have quite similar structures, yet small differences affect the protein-ligand contacts and can control the signaling pathway!

The computational approach helps rational biased-drug design. I feel small-molecule drug discovery is still an exciting area of science.

BTW, in the article aripiprazole is used as the starting point.
Aripiprazole is one of the major drugs for schizophrenia and bipolar disorder. Rexulti is also an approved drug for schizophrenia and major depression. The structural difference between these molecules is the tail part: dichlorobenzene or benzothiophene.

These compounds show different pharmacological profiles, and there are also differences in their metabolic profiles.
Receptor   Rexulti (Ki, nM)   Abilify (Ki, nM)
5-HT1A     0.12               5.6
5-HT2A     0.47               8.7
5-HT2B     1.9                0.4
D2         0.3                1.6
D3         1.1                5.4
H1         19                 27.9
a1b        0.17               34.4
a2c        0.59               37.6

I am now interested in the patent strategy. I will check it.

Think about HTS

Recently there are many approaches for starting drug discovery: FBDD, SBDD, LBDD, phenotypic screening, HTS, etc.
High-throughput screening (HTS) was the common strategy for drug discovery when I joined my company. That is still true now, but it is not always the appropriate strategy.

For pharma, it requires a lot of investment to maintain a large in-house screening deck and screening capability. And now there are lots of CROs that have their own libraries and can perform HTS, so HTS can be outsourced.

I searched for HTS in Google Trends. It shows a decreasing trend, which I think indicates that HTS is now a mature technology.

To perform HTS, a compound library is required. Some years ago compound library design was a competitive area, but I think it has moved to a pre-competitive one. For example, the European Lead Factory (ELF) and J-CLIC (Japan Compound Library Consortium) involve multiple companies. Also, many drug-likeness indicators and filters such as LE, LLE, QED, and PAINS have been published in journals, and the information spreads rapidly. Everyone can apply these rules easily.

The hit rate of HTS is not so high, and it needs a lot of investment. However, if HTS provides novel, drug-like seed compounds, it is a good starting point for drug discovery. HTS can also be applied to targets of unknown structure.
In my opinion, HTS is a mature technology but not a dead one. We (I?) still need HTS for seed finding.
If any reader has comments, I would appreciate them.

Get 3D feature descriptors from PDB file

If you are interested in drug discovery, chemoinformatics, deep learning, or MD, you might have read the article below.
KDEEP is a predictor that uses deep learning (a CNN) for affinity prediction. Through the article, I found a new Python library named HTMD (High-Throughput Molecular Dynamics). Honestly, I am not good at computational areas such as ab initio or MD calculations, so I cannot evaluate the real value of the library, but I think it is useful for computational chemists.
HTMD can be installed from the acellera channel of Anaconda with some dependencies, or it can be built from source on GitHub.
One function I was interested in is getVoxelDescriptors. A voxel represents a value on a regular grid in three-dimensional space, just like a pixel in 2D space.
This means HTMD can produce 3D descriptors from a mol file, and it is easy to calculate the descriptors using the library. HTMD also has lots of functions to manipulate PDB, SDF, and other file formats.
The following code is an example of calculating the descriptors.

from htmd.ui import *
from htmd.molecule import voxeldescriptors
from htmd.molecule.util import boundingBox
# easy to read a PDB by using Molecule.
# Also easy to get the residue ids of interest
atp = Molecule('1atp.pdb')
lck = Molecule('2pl0.pdb')
print(atp.numAtoms, lck.numAtoms)        # 3070 2312
print(atp.numResidues, lck.numResidues)  # 462 362
atp.get('resid', sel='resname CYS')      # residue ids of cysteines:
# array([199, 199, 199, 199, 199, 199, 343, 343, 343, 343, 343, 343])

getVoxelDescriptors calculates features of the mol object and returns the features as an array together with the centers of the voxels.
usercenters are user-defined voxel centers. The following example is only a toy example, because the centers lie along a single diagonal…
If the user does not define usercenters, the function returns a suitably sized feature array, but the size depends on the input file, i.e. the function returns features of varying shape.
BTW, each voxel has 8 features (see the result of atpfeat.shape). They are ('hydrophobic', 'aromatic', 'hbond_acceptor', 'hbond_donor', 'positive_ionizable', 'negative_ionizable', 'metal', 'occupancies').
The parameters for these features come from AutoDock 4.
So the performance of predictive models built on the voxel features depends on AutoDock 4, but they are useful for machine learning because the 3D features are so easy to calculate.
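One way to lay user centers out on a proper regular grid instead of along a single diagonal (plain NumPy; the origin, grid size, and spacing here are arbitrary values for illustration):

```python
import numpy as np

def make_grid_centers(origin, n, spacing):
    # regular n x n x n grid of voxel centers starting at `origin`
    axes = [origin[d] + spacing * np.arange(n) for d in range(3)]
    xx, yy, zz = np.meshgrid(*axes, indexing='ij')
    return np.stack([xx.ravel(), yy.ravel(), zz.ravel()], axis=1)

centers = make_grid_centers(origin=(0.0, 0.0, 0.0), n=10, spacing=1.0)
print(centers.shape)  # (1000, 3)
```

The resulting (n³, 3) array could then be passed as the usercenters argument of getVoxelDescriptors.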

Recently there are lots of tools for calculating molecular features. Which tool performs best? 🙂
I do not have the answer, but I am interested in the area and would like to discuss it with computational chemists.
I would also like to calculate voxel descriptors in a fixed area around a bound ligand.
I will be happy if any reader who is good at HTMD can give me hints or tips on how to do that.

atpfeat, atpcenters = voxeldescriptors.getVoxelDescriptors(atp, usercenters=np.array([[i, i, i] for i in range(100)]), buffer=8)
lckfeat, lckcenters = voxeldescriptors.getVoxelDescriptors(lck, usercenters=np.array([[i, i, i] for i in range(100)]), buffer=8)
print(atpfeat.shape, lckfeat.shape)        # (100, 8) (100, 8)
print(atpcenters.shape, lckcenters.shape)  # (100, 3) (100, 3)
# atpfeat begins like this (mostly the hydrophobic and occupancy channels):
# array([[3.81532829e-01, 0., 0., 0., 0., 0., 0., 1.00000000e+00],
#        [1.00000000e+00, 0., 0., 0., 0., 0., 0., 1.00000000e+00],
#        [9.99980260e-01, 0., 0., 0., 0., 0., 0., 1.00000000e+00],
#        [1.08001621e-01, 0., 0., 0., 0., 0., 0., 2.64279406e-01],
#        ...