Similarity / Dissimilarity matrix is often used for not only checking the bulk of molecular similarity but also the chemical space (such as MDS).
The cost of pair similarity matrix is O(N^2). After calculation, I can get N x N similarity / dis similarity matrix.
I found some functions to calculate similarity metrics in rdkit named GetTanimotoDistMat and GetTanimotoSimMat. Official document is below.
These function compute the distance/simirality matrix from a list of BitVects using the Tanimoto distance metric. Returns 1D l-triangular matrix.
I used the function and after getting the results, I tried to convert 1D l-tri matrix to 2D symmetric matrix. At first I load some functions to do that.
%matplotlib inline import os from rdkit import Chem from rdkit.Chem import AllChem from rdkit.Chem import DataStructs import numpy as np from rdkit.Chem.Draw import IPythonConsole from rdkit.Chem import Draw import seaborn as sns from rdkit.DataManip.Metric import GetTanimotoDistMat from rdkit.DataManip.Metric import GetTanimotoSimMat from rdkit import rdBase from rdkit.Chem import RDConfig print(rdBase.rdkitVersion) >2019.03.2
Then read cdk2.sdf as an example.
sdfdir = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf') mols = [m for m in Chem.SDMolSupplier(sdfdir)]
To use GetTanimotoSim(Dist)Mat, list of bit vector is required. So I made the list and calculate the matrix. Got MorganFP with common way in RDKit.
morganfps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in mols] distmat = GetTanimotoDistMat(morganfps) simmat = GetTanimotoSimMat(morganfps) print(distmat, len(distmat)) print(simmat, len(simmat)) print(len(mols)) > [0.53703704 0.51851852 0.43396226 ... 0.88659794 0.92079208 0.86538462] 1081 > [0.46296296 0.48148148 0.56603774 ... 0.11340206 0.07920792 0.13461538] 1081 >47
OK it seems work well. By the way, how can I convert the l-tri mat to 2d array?
Mathematically, the number of similarity combination of M molecules are calculated with following equation.
N = M (M + 1)/2 – M
The above case, I have 47 molecules, so N = 47(1+47)/2 – 47 = 1081 ;)
So, I can calculate M from N with quadratic formula.
M = [1 + sqr(1 + 4 * 2 * N)]/2
Now ready to write code! It seems redundant. If reader have more smart code, please let me know ;)
def tri2mat(tri_arr): n = len(tri_arr) m = int((np.sqrt(1 + 4 * 2 * n) + 1) / 2) arr = np.ones([m, m]) for i in range(m): for j in range(i): arr[i][j] = tri_arr[i + j - 1] arr[j][i] = tri_arr[i + j - 1] return arr
Check the code and visualize these matrix.
distarr = tri2mat(distmat) simmat = tri2mat(simmat) sns.heatmap(simmat[:10,:10])
Molecules with index 8 and 9 showed very dissimilar for me and heat map showed the trend. And Molecules with index 0 and 1 showed high similarity.
I think it is rare case convert from tri-mat to symmetric-mat because tri-mat form is computer cost effective. Symmetric-mat form needs more memory.
The code is just my practice. I posted whole code to my gist.