Visualizing chemical space is important for medicinal chemists, I think. Recently, Prof. Bajorath's group published a nice article. The URL is below.
The authors described a new approach that combines SARMatrix and molecular grid maps. SARMatrix is one of the methods for SAR analysis, like Free-Wilson analysis.
I was interested in their approach because they use molecular grid maps. I often use PCA and/or t-SNE for chemical space mapping, but neither places molecules on a grid.
Molecular grid maps are similar to SOMs (self-organizing maps). To make the maps, they used the J-V (Jonker-Volgenant) algorithm. The details are described at the following URL.
I wanted to try the mapping method.
Fortunately, a Python package for the J-V algorithm is available on GitHub! Its name is lapjv. I installed it and tried it out.
My code is below.
First, import the packages and load the data. The sample data came from ChEMBL.
%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist
df = pd.read_csv('CHEMBL3888506.tsv', sep='\t', header=0)
To map molecules onto a square grid, the number of samples must be a perfect square (N^2). The dataset has 467 molecules, so I used 400 molecules for the test. This means I will embed them in a 20 x 20 grid.
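As a small illustration of the size constraint above, the largest square grid that a dataset can fill follows from the integer square root of the molecule count. This is only a sketch; the helper name `largest_square_grid` is hypothetical and not part of any package used in this post.

```python
import math

def largest_square_grid(n_samples):
    """Return (side, n_used) for the largest side x side grid n_samples can fill."""
    side = math.isqrt(n_samples)  # integer square root, rounded down
    return side, side * side

side, n_used = largest_square_grid(467)
print(side, n_used)
```

For 467 molecules this gives a 21 x 21 grid (441 molecules); the 20 x 20 grid with 400 molecules is a smaller square that also fits.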
mols = [Chem.MolFromSmiles(smi) for smi in df.Smiles]
# Randomly pick 400 molecules without replacement
sampleidx = np.random.choice(list(range(len(mols))), size=400, replace=False)
samplemols = [mols[i] for i in sampleidx]
# 9 - log10(value): a pIC50-style activity, assuming Standard Value is in nM
sampleact = [9-np.log10(df['Standard Value'][idx]) for idx in sampleidx]

# Morgan fingerprints (radius 2) as bit vectors
fps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in samplemols]
def fp2arr(fp):
    # Convert an RDKit bit vector to a numpy array
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp,arr)
    return arr
X = np.asarray([fp2arr(fp) for fp in fps])
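The activity conversion above, 9 - log10(value), gives a negative log of the molar concentration only if the 'Standard Value' column is in nM. That unit is an assumption here, so please check the units in your own export. A minimal sketch with a hypothetical helper name:

```python
import math

def p_activity_from_nm(value_nm):
    """Convert a concentration in nM to -log10 of the molar concentration."""
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9 - math.log10(value_nm)

for nm in (1, 100, 10000):
    print(nm, p_activity_from_nm(nm))
```

A smaller concentration gives a larger value, so more potent compounds get higher numbers on the color scale.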
Then run PCA followed by t-SNE to get the chemical space, and normalize the embedding to the range 0 to 1 on each axis.
size = 20
N = size*size
data = PCA(n_components=100).fit_transform(X.astype(np.float32))
embeddings = TSNE(init='pca', random_state=794, verbose=2).fit_transform(data)
embeddings -= embeddings.min(axis=0)
embeddings /= embeddings.max(axis=0)
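The last two statements of the code above are a min-max normalization: each axis is shifted so that its minimum is 0 and then scaled so that its maximum is 1, which puts the embedding in the same unit square as the grid. A small sketch on made-up coordinates (the numbers are illustrative only):

```python
import numpy as np

# Toy 2D coordinates standing in for a t-SNE embedding (made-up values)
emb = np.array([[-5.0, 2.0],
                [0.0, 4.0],
                [5.0, 10.0]])

emb -= emb.min(axis=0)  # shift each column so its minimum is 0
emb /= emb.max(axis=0)  # scale each column so its maximum is 1

print(emb)
```

Both columns now run from 0 to 1, so distances to grid points are comparable on the two axes.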
Check the t-SNE result. Activity is used for the color mapping.
plt.scatter(embeddings[:,0], embeddings[:,1], c=sampleact, cmap='hsv')
Next, let's project the chemical space onto the grid. Using lapjv is very simple. First, calculate the distance (cost) matrix between the grid points and the embedded points with the SciPy cdist function. Then pass the matrix to lapjv.
grid = np.dstack(np.meshgrid(np.linspace(0,1,size), np.linspace(0,1,size))).reshape(-1,2)
from lapjv import lapjv
cost_mat = cdist(grid, embeddings, 'sqeuclidean').astype(np.float32)
cost_mat2 = cost_mat * (10000 / cost_mat.max())
row_asses, col_asses, _ = lapjv(cost_mat2)
grid_lap = grid[col_asses]
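To see what the assignment step does, here is a minimal sketch on a 2 x 2 grid. It uses SciPy's linear_sum_assignment as a stand-in for lapjv (it solves the same kind of linear assignment problem, but it is not the package used above), and the four points are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

size = 2
grid = np.dstack(np.meshgrid(np.linspace(0, 1, size),
                             np.linspace(0, 1, size))).reshape(-1, 2)

# Four made-up points, already scaled into the unit square
points = np.array([[0.9, 0.9],
                   [0.1, 0.2],
                   [0.8, 0.1],
                   [0.2, 0.7]])

# Rows are grid cells, columns are points
cost = cdist(grid, points, 'sqeuclidean')

# cell_idx[k] is matched with point_idx[k] so that the total cost is minimal
cell_idx, point_idx = linear_sum_assignment(cost)

# Grid position for each point, in the original point order
grid_pos = np.empty_like(points)
grid_pos[point_idx] = grid[cell_idx]

print(grid_pos)
```

Each point is moved to its own grid cell, and the matching is chosen so that the sum of squared distances moved is as small as possible.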
Now we are ready. Let’s plot the grid map.
plt.scatter(grid_lap[:,0], grid_lap[:,1], c=sampleact, cmap='hsv')
It seems to work well. Dots with the same color are located near each other.
A grid plot is useful because it avoids overlapping dots.
The authors developed a more sophisticated tool. However, the source code is not disclosed. It seems very attractive for medicinal chemists ;-)
Computer-assisted rational drug design is very important. But I think visualization and analysis methods for medicinal chemists are equally important for drug design.
Today’s code is below.
Readers who are interested in lapjv, please try it.