Recently I found an article that describes a new method for fast fingerprint calculation.

You can read the article on ChemRxiv; the URL is below.

https://chemrxiv.org/articles/A_Probabilistic_Molecular_Fingerprint_for_Big_Data_Settings/7176350

The authors used the MinHash method.

MinHash is a way to estimate Jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared its performance with ECFP4.

What is MinHash? See, for example, this tutorial:

http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
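To make the idea concrete, here is a minimal MinHash sketch in pure Python. It is not the MHFP implementation: the helper names (`hash_item`, `minhash_signature`, `estimate_jaccard`), the use of salted BLAKE2 hashes, and the toy integer sets are all my own assumptions for illustration. It only shows that the fraction of matching signature positions approximates the exact Jaccard similarity.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def hash_item(item, salt):
    """Hash one item to a 64-bit integer; each salt acts as a different hash function."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=512):
    """For every hash function, keep only the smallest hash value seen in the set."""
    signature = []
    for i in range(num_hashes):
        salt = i.to_bytes(8, "little")
        signature.append(min(hash_item(x, salt) for x in items))
    return signature


def estimate_jaccard(sig1, sig2):
    """The fraction of positions where two signatures agree estimates the Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


# Toy sets (hypothetical data): 40 shared items out of 120 distinct items
set_a = set(range(0, 80))
set_b = set(range(40, 120))

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

exact = jaccard(set_a, set_b)
approx = estimate_jaccard(sig_a, sig_b)
print(exact, approx)
```

The signatures have a fixed length no matter how large the original sets are, which is why comparing them is cheap. More hash functions give a tighter estimate at the cost of longer signatures.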

They discussed the performance of MHFP6 (6 means a diameter of 6, i.e. radius 3); according to the article, this fingerprint generally outperforms MHFP4 and the ECFP variants.

Fig. 6 of the article shows a performance analysis of k-nearest neighbor search, where MHFP6 works very well and is fast.

Fortunately, the authors have released the source code on GitHub, so you can try it yourself.

https://github.com/reymond-group/mhfp

I tried it and compared the similarity values from ECFP and MHFP.

The code, run in a Jupyter notebook, is below.

Load packages.

    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem
    from rdkit.Chem import DataStructs
    from mhfp.encoder import MHFPEncoder

    mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

    # Skip molecules that RDKit failed to parse
    mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
    nmols = len(mols)
    # Calc Morgan FP with radius 3 (ECFP6-like)
    mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
    # Calc MinHash FP
    mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]

Check them!

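    # Pairwise Tanimoto similarity on the Morgan fingerprints
    tanimoto_sim = []
    for i in range(nmols):
        for j in range(i):
            tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
            tanimoto_sim.append(tc)

    # Pairwise MinHash estimate of the Jaccard similarity (1 - distance)
    mhfps_sim = []
    for i in range(nmols):
        for j in range(i):
            jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
            mhfps_sim.append(jaccard)

Fit a straight line to the two sets of similarity values.

    a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
    # Fitted line evaluated at each Tanimoto value
    y2 = a * np.array(tanimoto_sim) + b
    print(a, b)
    > 1.033917242502858 -0.031604772419224866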

The slope is close to 1 and the intercept is close to 0, so I think the ECFP6 and MHFP6 similarities correlate well.
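A line fit alone does not say how tightly the points follow the line, so a Pearson correlation coefficient is a useful companion number. Below is a small sketch in pure Python; the function `pearson_r` and the two short lists are hypothetical illustrations, not the cdk2 results. On the notebook's own lists, `np.corrcoef(tanimoto_sim, mhfps_sim)[0, 1]` should give the same quantity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical similarity values, for illustration only
ecfp_sim = [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]
mhfp_sim = [0.08, 0.22, 0.41, 0.52, 0.73, 0.91]

r = pearson_r(ecfp_sim, mhfp_sim)
print(r)
```

A value of r near 1 means the two similarity measures rank and scale the pairs almost identically.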

Finally I made a simple scatter plot.

    plt.scatter(tanimoto_sim, mhfps_sim)
    plt.plot(tanimoto_sim, y2, color='black')
    plt.xlabel('tanimoto')
    plt.ylabel('mhfp sim')

All the code is pushed to my repo.

https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/MHFP_example.ipynb

In summary, I tried MHFP and its similarity values showed good correlation with those from ECFP.

I used a very small dataset (47 molecules), so I could not check the speed on a large dataset.

I would like to check that in the near future.

Last week, I participated in CBI and a software UGM.

I am happy that I could have fruitful discussions. I got many ideas for my next challenge!

;-)