Recently I found an article that describes a new method for fast fingerprint calculation.
You can read the article on ChemRxiv; the URL is below.
https://chemrxiv.org/articles/A_Probabilistic_Molecular_Fingerprint_for_Big_Data_Settings/7176350
They used the MinHash method.
MinHash is a technique for estimating Jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared its performance with ECFP4.
For background on MinHash, see for example:
http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
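To make the idea concrete, here is a minimal MinHash sketch in plain Python. It is not the MHFP implementation from the paper. The helper names (`seeded_hash`, `minhash_signature`, `estimated_jaccard`), the use of seeded BLAKE2b hashes in place of random permutations, and the toy "fragment" sets are all my own assumptions for illustration.

```python
import hashlib


def seeded_hash(item: str, seed: int) -> int:
    """Hash a string together with a seed; each seed acts as one 'permutation'."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, n_perm=256):
    """Keep the minimum hash value of a non-empty set under each seeded hash."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(n_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only as a reference value."""
    return len(set_a & set_b) / len(set_a | set_b)


# Toy sets standing in for molecular substructures (made-up labels).
set_a = {f"frag{i}" for i in range(0, 80)}
set_b = {f"frag{i}" for i in range(40, 120)}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

exact = exact_jaccard(set_a, set_b)
approx = estimated_jaccard(sig_a, sig_b)
print("exact:", exact, "estimated:", approx)
```

The estimate gets closer to the exact value as `n_perm` grows, and comparing two signatures costs the same no matter how large the original sets are. That fixed cost is what makes the approach attractive for big data sets.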
They discussed the performance of MHFP6 (6 means radius 3), which generally outperforms MHFP4 and the ECFPs.
Fig. 6 shows a performance analysis of k-nearest neighbor search, where MHFP6 works very well and is fast.
Fortunately, the authors disclosed the source code on GitHub, so you can try it yourself.
https://github.com/reymond-group/mhfp
I tried it and compared the similarity values from ECFP and MHFP.
The code is below (run in a Jupyter notebook).
Load packages.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

# Skip entries that RDKit could not parse (the supplier yields None for them)
mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
nmols = len(mols)
# Calc Morgan FP (radius 3, i.e. ECFP6-like, despite the variable name)
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
# Calc MinHash FP
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]
Check them!
# Pairwise Tanimoto similarity for the Morgan fingerprints
tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
        tanimoto_sim.append(tc)

# Pairwise estimated Jaccard similarity (1 - distance) for MHFP
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
        mhfps_sim.append(jaccard)
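# Fit a straight line: mhfps_sim ~ a * tanimoto_sim + b
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
# Keep a and b as floats; casting them to np.int64 would truncate them to 1 and 0
y2 = a * np.array(tanimoto_sim) + b
print(a, b)
> 1.033917242502858 -0.031604772419224866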
The slope is close to 1 and the intercept is close to 0, so I think the ECFP6 and MHFP6 similarities correlate well.
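A slope and intercept alone do not say how tightly the points follow the line, so a correlation coefficient is a useful extra check. The sketch below computes a least-squares fit and Pearson's r in plain Python. The function names and the toy values are made up for illustration and are not the results from my notebook. In the notebook itself, `np.corrcoef(tanimoto_sim, mhfps_sim)[0, 1]` should give the same r.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Made-up similarity pairs, only to exercise the functions.
toy_tanimoto = [0.1, 0.2, 0.4, 0.6, 0.9]
toy_mhfp = [0.08, 0.19, 0.37, 0.61, 0.88]

slope, intercept = linear_fit(toy_tanimoto, toy_mhfp)
r = pearson_r(toy_tanimoto, toy_mhfp)
print(slope, intercept, r)
```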
Finally I made a simple scatter plot.
plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.xlabel('tanimoto')
plt.ylabel('mhfp sim')
All code is pushed to my repo.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/MHFP_example.ipynb
In summary, I tried MHFP and it shows good correlation with ECFP.
I used a very small dataset (47 molecules), so I could not evaluate the speed on a large dataset.
I would like to check that in the near future.
Last week, I participated in CBI and a software UGM.
I am happy that I had fruitful discussions there. I got many ideas for my next challenge!
;-)