GetTanimtoSim/DistMat as a lower triangular matrix and convert 2D array #RDKit #Chemoinformatics

Similarity / Dissimilarity matrix is often used for not only checking the bulk of molecular similarity but also the chemical space (such as MDS).

The cost of pair similarity matrix is O(N^2). After calculation, I can get N x N similarity / dis similarity matrix.

I found some functions to calculate similarity metrics in rdkit named GetTanimotoDistMat and GetTanimotoSimMat. Official document is below.

These function compute the distance/simirality matrix from a list of BitVects using the Tanimoto distance metric. Returns 1D l-triangular matrix.

I used the function and after getting the results, I tried to convert 1D l-tri matrix to 2D symmetric matrix. At first I load some functions to do that.

%matplotlib inline
import os
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import seaborn as sns
from rdkit.DataManip.Metric import GetTanimotoDistMat
from rdkit.DataManip.Metric import GetTanimotoSimMat
from rdkit import rdBase
from rdkit.Chem import RDConfig

Then read cdk2.sdf as an example.

sdfdir = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdfdir)]

To use GetTanimotoSim(Dist)Mat, list of bit vector is required. So I made the list and calculate the matrix. Got MorganFP with common way in RDKit.

morganfps = [AllChem.GetMorganFingerprintAsBitVect(m,2) for m in mols]
distmat = GetTanimotoDistMat(morganfps)
simmat = GetTanimotoSimMat(morganfps)
print(distmat, len(distmat))
print(simmat, len(simmat))
> [0.53703704 0.51851852 0.43396226 ... 0.88659794 0.92079208  0.86538462] 1081
> [0.46296296 0.48148148 0.56603774 ... 0.11340206 0.07920792 0.13461538] 1081

OK it seems work well. By the way, how can I convert the l-tri mat to 2d array?

Mathematically, the number of similarity combination of M molecules are calculated with following equation.

N = M (M + 1)/2 – M

The above case, I have 47 molecules, so N = 47(1+47)/2 – 47 = 1081 ;)

So, I can calculate M from N with quadratic formula.

M = [1 + sqr(1 + 4 * 2 * N)]/2

Now ready to write code! It seems redundant. If reader have more smart code, please let me know ;)

def tri2mat(tri_arr):
    n = len(tri_arr)
    m = int((np.sqrt(1 + 4 * 2 * n) + 1) / 2)
    arr = np.ones([m, m])
    for i in range(m):
        for j in range(i):
            arr[i][j] = tri_arr[i + j - 1]
            arr[j][i] = tri_arr[i + j - 1]
    return arr

Check the code and visualize these matrix.

distarr = tri2mat(distmat)
simmat = tri2mat(simmat)

Molecules with index 8 and 9 showed very dissimilar for me and heat map showed the trend. And Molecules with index 0 and 1 showed high similarity.

I think it is rare case convert from tri-mat to symmetric-mat because tri-mat form is computer cost effective. Symmetric-mat form needs more memory.

The code is just my practice. I posted whole code to my gist.