Visualize chemical space as a grid #chemoinformatics #rdkit

Visualizing chemical space is important for medicinal chemists, I think. Recently, Prof. Bajorath's group published a nice article. The URL is below.

https://pubs.acs.org/doi/pdf/10.1021/acsomega.9b00595

The authors described a new approach that combines SARMatrix and molecular grid maps. SARMatrix is a method for SAR analysis, like Free-Wilson analysis.

I was interested in their approach because it uses molecular grid maps. I often use PCA and/or t-SNE for chemical space mapping, but neither of them places molecules on a grid.

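Molecular grid maps are similar to a SOM (self-organizing map). To build the maps, they used the Jonker-Volgenant (J-V) algorithm, which solves the linear assignment problem. The details are described at the following URL.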
https://link.springer.com/article/10.1007/BF02278710
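
To make the idea of an assignment problem concrete, here is a minimal sketch that solves a tiny one by brute force. The 3 x 3 cost matrix and the function name brute_force_assignment are hypothetical and only for illustration. This is not the J-V algorithm itself: it simply tries every permutation, which shows the problem J-V solves far more efficiently.

```python
from itertools import permutations

# Hypothetical cost matrix: cost[i][j] is the cost of placing item j in cell i.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]

def brute_force_assignment(cost):
    """Try every one-to-one assignment and keep the cheapest one.

    Returns (best_total, best_perm), where best_perm[i] is the item
    placed in cell i. Only feasible for very small matrices.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, best_perm

best_total, best_perm = brute_force_assignment(cost)
print(best_total, best_perm)
```

Each cell gets exactly one item and each item is used exactly once. That is the property that makes a grid map free of overlaps.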

I would like to try the mapping method.

Fortunately, a Python package for the J-V algorithm is available on GitHub! Its name is lapjv. I installed it and tried it out.
https://github.com/src-d/lapjv
My code is below.

First, import the packages and load the data. The sample data came from ChEMBL.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
df = pd.read_csv('CHEMBL3888506.tsv', sep='\t', header=0)

To build a grid map, the number of samples must be a perfect square (N^2). The dataset has 467 molecules, so I used 400 molecules for the test. This means I will embed them in a 20 x 20 grid.

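# Parse SMILES into RDKit molecules
mols = [Chem.MolFromSmiles(smi) for smi in df.Smiles]
# Randomly pick 400 molecules without replacement (20 x 20 grid)
sampleidx = np.random.choice(len(mols), size=400, replace=False)
samplemols = [mols[i] for i in sampleidx]
# Convert activity to a negative log scale (assumes Standard Value is in nM)
sampleact = [9-np.log10(df['Standard Value'][idx]) for idx in sampleidx]
# Morgan fingerprints, radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in samplemols]
def fp2arr(fp):
    # ConvertToNumpyArray resizes arr to the fingerprint length
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp,arr)
    return arr
X = np.asarray([fp2arr(fp) for fp in fps])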
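
The 20 x 20 choice above is one valid square. As a small sketch, the largest square grid that fits a given number of samples can be computed as follows. The helper name largest_square_grid is hypothetical and not part of any library.

```python
import math

def largest_square_grid(n_samples):
    """Return (side, n_cells) for the largest side x side grid with n_cells <= n_samples."""
    side = math.isqrt(n_samples)  # integer square root, rounded down
    return side, side * side

side, n_cells = largest_square_grid(467)
print(side, n_cells)
```

For 467 molecules this gives a 21 x 21 grid (441 cells). Any smaller square, such as the 400 used here, works as well.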

Then run PCA followed by t-SNE to get the chemical space, and normalize the result to the range 0 to 1.

size = 20
N = size*size
data = PCA(n_components=100).fit_transform(X.astype(np.float32))
embeddings = TSNE(init='pca', random_state=794, verbose=2).fit_transform(data)
embeddings -= embeddings.min(axis=0)
embeddings /= embeddings.max(axis=0)
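
The last two lines are a min-max scaling applied to each axis separately. Here is a minimal sketch on a toy array (the values are made up for illustration) showing what it does:

```python
import numpy as np

# Toy 2D coordinates standing in for t-SNE output
toy = np.array([[-30.0, 5.0],
                [10.0, 25.0],
                [50.0, -15.0]])

scaled = toy - toy.min(axis=0)        # shift each column so its minimum is 0
scaled = scaled / scaled.max(axis=0)  # divide each column by its new maximum
print(scaled)
```

Because each axis is rescaled independently, both coordinates end up spanning exactly 0 to 1, which matches the unit-square grid used for the assignment step.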

Check the t-SNE result. Activity is used for color mapping.

plt.scatter(embeddings[:,0], embeddings[:,1], c=sampleact, cmap='hsv')

Next, let's project the chemical space onto a grid. Using lapjv is very simple. First, calculate the cost matrix (squared Euclidean distances between grid points and embeddings) with the scipy cdist function. Then pass the matrix to lapjv.

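from lapjv import lapjv
# Regular size x size grid of points in the unit square
grid = np.dstack(np.meshgrid(np.linspace(0,1,size), np.linspace(0,1,size))).reshape(-1,2)
# Cost of placing each embedding (column) on each grid point (row)
cost_mat = cdist(grid, embeddings, 'sqeuclidean').astype(np.float32)
# Rescale the costs so that the maximum is 10000
cost_mat2 = cost_mat * (10000 / cost_mat.max())
row_asses, col_asses, _ = lapjv(cost_mat2)
# col_asses[j] is the grid point assigned to molecule j
grid_lap = grid[col_asses]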
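
For readers who cannot install lapjv, the same kind of assignment can be sketched with scipy.optimize.linear_sum_assignment. Note that this is a swapped-in solver, not the lapjv package, and the random points below are a toy stand-in for the normalized t-SNE embeddings. The variable names are mine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
size = 4
points = rng.random((size * size, 2))  # toy stand-in for normalized embeddings

# Regular size x size grid in the unit square
grid = np.dstack(np.meshgrid(np.linspace(0, 1, size),
                             np.linspace(0, 1, size))).reshape(-1, 2)

# Rows are grid cells, columns are points
cost = cdist(grid, points, "sqeuclidean")

# col_ind[i] is the point assigned to grid cell row_ind[i]
row_ind, col_ind = linear_sum_assignment(cost)

# Reorder so that grid_for_point[j] is the grid cell of point j
grid_for_point = np.empty_like(grid)
grid_for_point[col_ind] = grid[row_ind]

# Every point should land on its own cell
n_unique_cells = len(np.unique(grid_for_point, axis=0))
print(n_unique_cells == size * size)
```

The final check confirms that the mapping is one-to-one, so no two points share a grid cell.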

Now everything is ready. Let's plot the grid map.

plt.scatter(grid_lap[:,0], grid_lap[:,1], c=sampleact, cmap='hsv')

It seems to work well. Dots with similar colors are located near each other.

A grid plot is useful because it avoids overlapping dots.

The authors developed a more sophisticated tool. However, the source code is not disclosed. It seems very attractive for medchem ;-)

Computer-assisted rational drug design is very important. But I think visualization and analysis methods for medicinal chemists are equally important for drug design.

Today’s code is below.

code example

Readers who are interested in lapjv, please try it.