Recently I found an article that describes a new method for fast fingerprint calculation.
You can read the article on ChemRxiv; the URL is below.
The authors use the MinHash method, which is a way to estimate Jaccard similarity very efficiently. They developed MHFP (MinHash Fingerprint) and compared its performance with ECFP4.
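To make the idea concrete, here is a minimal sketch of how MinHash estimates Jaccard similarity between two sets of string tokens. This is not the MHFP implementation: the function names, the linear hash family, and the toy token sets are my own assumptions for illustration only.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large Mersenne prime used as the hash modulus


def token_hash(token):
    """Map a string token to a stable 32-bit integer."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(n_perm=256, seed=42):
    """Draw (a, b) coefficients for n_perm random linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(n_perm)]


def minhash(tokens, perms):
    """Signature = the minimum hash value over the set, per hash function."""
    hashed = [token_hash(t) for t in tokens]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in perms]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where both minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    # Toy "substructure" tokens (made up for this example)
    set_a = {"C", "CC", "CCO", "c1ccccc1", "N", "C=O"}
    set_b = {"C", "CC", "CCN", "c1ccccc1", "O", "C=O"}
    perms = make_permutations()
    sig_a, sig_b = minhash(set_a, perms), minhash(set_b, perms)
    print("exact    :", exact_jaccard(set_a, set_b))
    print("estimated:", estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of set size, so comparing two molecules costs the same no matter how many substructures they contain. More hash functions give a tighter estimate at the cost of a longer signature.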
They discuss the performance of MHFP6 (6 means radius 3), which generally outperforms MHFP4 and the ECFP variants.
Fig. 6 shows a performance analysis of k-nearest-neighbor search, where MHFP6 works very well and is fast.
Fortunately, the authors have published the source code on GitHub, so you can use it if you like.
I tried it and compared the similarity values from ECFP and MHFP.
The code is below.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder

mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

# Load the molecules, skipping entries RDKit could not parse
mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
nmols = len(mols)
# Calc Morgan FP (radius 3)
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
# Calc MinHash FP
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]
# Pairwise Tanimoto similarity on the Morgan fingerprints
tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
        tanimoto_sim.append(tc)

# Pairwise similarity on MHFP (1 - estimated Jaccard distance)
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
        mhfps_sim.append(jaccard)
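# Fit a straight line: mhfps_sim = a * tanimoto_sim + b
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
# Keep a and b as floats; casting them to int64 would truncate them to 1 and 0
y2 = a * np.array(tanimoto_sim) + b
print(a, b)
> 1.033917242502858 -0.031604772419224866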
I think this result shows that ECFP6 and MHFP6 similarities correlate well: the fitted line is close to y = x (slope about 1.03, intercept about -0.03).
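A fitted slope and intercept alone do not say how tightly the points follow the line, so a correlation coefficient is a useful complement. Below is a minimal sketch of a Pearson correlation helper using only the standard library. The function name `pearson_r` and the toy numbers are my own and are not results from the cdk2 data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up similarity values, only to show the call
    toy_x = [0.10, 0.25, 0.40, 0.55, 0.80]
    toy_y = [0.08, 0.22, 0.41, 0.50, 0.83]
    print(pearson_r(toy_x, toy_y))
```

In the notebook, the two lists of pairwise similarities could be passed in directly, for example `pearson_r(tanimoto_sim, mhfps_sim)`.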
Finally I made a simple scatter plot.
plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.xlabel('tanimoto')
plt.ylabel('mhfp sim')
All the code has been pushed to my repo.
In summary, I tried MHFP and it shows good correlation with ECFP.
I used a very small dataset (47 molecules), so I could not check the speed on a large dataset.
I would like to check that in the near future.
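For a future speed check, a small timing helper like the one below might be enough. This is only a sketch: the name `time_per_item` is hypothetical, and it works with any single-argument function, so nothing here is specific to RDKit or mhfp.

```python
import time


def time_per_item(func, items, repeats=3):
    """Return the best-of-`repeats` average time (seconds) per item."""
    if not items:
        raise ValueError("items must not be empty")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in items:
            func(item)
        # Keep the fastest run to reduce noise from other processes
        best = min(best, time.perf_counter() - start)
    return best / len(items)


if __name__ == "__main__":
    # Stand-in workload; in the notebook one would pass, for example,
    # mhfp_encoder.encode_mol and the list of molecules instead.
    words = ["CCO", "c1ccccc1", "CC(=O)O"] * 1000
    print(time_per_item(len, words))
```

Taking the best of several runs is a common way to get a more stable number on a shared machine, but a real benchmark would also need a much larger molecule set than the 47 used here.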
Last week I participated in CBI and a software UGM.
I am happy that I had fruitful discussions, and I got many ideas for my next challenge!