I’m interested in chemoinformatics and machine learning, and I’m a fan of RDKit.
(But I’m still …..)
Calculating molecular fingerprints is a very important step in building a QSAR model.
I sometimes use the RDKit Morgan fingerprint (similar to SciTegic ECFP4).
Chemfp is a good tool for calculating many types of fingerprints.
It can generate and search fingerprint files. The ob2fps, oe2fps, and rdkit2fps commands use the Open Babel, OpenEye, and RDKit chemistry toolkits, respectively, to convert structure files into fingerprint files.
It’s easy to install using pip.
pip install chemfp
OK, let’s calculate some fingerprints.
As an example, I used cdk2.sdf, which comes from the RDKit sample data files.
iwatobipen$ time rdkit2fps cdk2.sdf --morgan --radius 2 --output cdk2fp_morgan2.fps

real 0m0.290s
user 0m0.216s
sys 0m0.061s
I used the Morgan fingerprint as an example, but chemfp can also calculate atom pair, RDKit, and MACCS fingerprints.
Now I have an fps file.
iwatobipen$ head -n 7 cdk2fp_morgan2.fps
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
#software=RDKit/2014.03.1
#source=cdk2.sdf
#date=2014-08-31T11:45:34
02000000000000000000010000800000000400000000800000000000000000000000000800040000000000000000000400000202000000000000000000000000000000000000000000000000000000000004001000008100000200000000000000000000c00000000600000000000000000000000000000020000000000000000000000002000000000000040000000001000800000000000000000100004000000000000000000020000004100000000000000000100200000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000080000002000000002000000000400000000000900000000000 ZINC03814457
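The layout of the file above is simple enough to read by hand: header lines start with "#", and each data line holds a hex-encoded fingerprint followed by a tab and the molecule id. The sketch below is only an illustration of that layout in plain Python 3, not how chemfp reads files. The function name parse_fps and the 16-bit sample records (mol_a, mol_b) are made up for this example.

```python
import io


def parse_fps(lines):
    """Parse FPS text into (metadata dict, list of (id, fingerprint bytes)).

    Hypothetical helper for illustration; chemfp has its own reader.
    """
    metadata = {}
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; the "#FPS1" line has no "=".
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key] = value
            continue
        # Data line: hex fingerprint, tab, identifier (extra fields are ignored).
        fields = line.split("\t")
        records.append((fields[1], bytes.fromhex(fields[0])))
    return metadata, records


# Made-up sample with 16-bit fingerprints and hypothetical ids.
sample = io.StringIO(
    "#FPS1\n"
    "#num_bits=16\n"
    "#type=Example/1\n"
    "0f01\tmol_a\n"
    "0300\tmol_b\n"
)
metadata, records = parse_fps(sample)
print(metadata["num_bits"], [identifier for identifier, _ in records])
```

For real work, chemfp.load_fingerprints is the better choice because it also builds the search arena.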
I can access the fps file from the chemfp Python module.
In [2]: import chemfp
In [3]: from chemfp import bitops
In [5]: fps = chemfp.load_fingerprints("cdk2fp_morgan2.fps")
In [6]: print(fps.metadata)
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
#software=RDKit/2014.03.1
#source=cdk2.sdf
#date=2014-08-31T11:45:34
In [10]: l = [(i,fp) for i, fp in fps]
In [12]: l[0][0]
Out[12]: 'ZINC03814479'
In [13]: l[0][1].encode("hex")
Out[13]: '04000020000000000000010000080000000000000000000000000000000000000000000000000000000000000000000400000202000000000000000011000000000000000000000000000000000000000004000000008100000020000000000000000000000000000000000801000004000000400000000000000800000000081000000000000000000000000000010000000800000000000000000100000000000000000020000000000000100100000000000000120a00020000400000000000000000000400000000000000000000020000000000000000000000000000000000000000000080000000000000000000000000000000000000000000000000'
In [14]: fp1 = l[0][1]
In [15]: fp2 = l[1][1]
In [16]: tc = bitops.byte_tanimoto(fp1, fp2)
In [17]: print tc
0.5
# search using Tanimoto
In [41]: q_id, q_fp = fps[0]
In [42]: targets = fps[1:]
In [43]: time print chemfp.search.count_tanimoto_hits_fp(q_fp, fps[1:], threshold=0.2)
8
CPU times: user 437 µs, sys: 195 µs, total: 632 µs
Wall time: 462 µs
# can also search for the k nearest targets
In [44]: q = next(fps.iter_arenas(3))
In [45]: q.ids
Out[45]: ['ZINC03814479', 'ZINC00023543', 'ZINC03814440']
In [46]: targes = fps  # note the spelling: "targets" itself is still fps[1:] from In [42]
In [47]: res = [ i for i in chemfp.knearest_tanimoto_search(q, targets, k=2, threshold=0.2)]
In [48]: res[0]
Out[48]: ('ZINC03814479', <chemfp.search.SearchResult at 0x107eb4690>)
In [49]: res[0][1].get_ids()
Out[49]: []
In [50]: res[1][1].get_ids()
Out[50]: ['ZINC00023543', 'ZINC03814457', 'ZINC03814440']
In [51]: res[1][1].get_scores()
Out[51]: array('d', [1.0, 0.5, 0.04225352112676056])
It’s useful for me.
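To make the searches above less of a black box, here is a rough pure-Python 3 sketch of what a Tanimoto score, a threshold count, and a k-nearest search compute on byte fingerprints. It is not chemfp's implementation (which is far faster), and the names tanimoto, count_hits, and knearest, as well as the 16-bit toy fingerprints, are hypothetical. Returning 0.0 for two empty fingerprints is a choice made in this sketch.

```python
def popcount(fp):
    """Number of set bits in a fingerprint given as bytes."""
    return sum(bin(byte).count("1") for byte in fp)


def tanimoto(fp1, fp2):
    """Tanimoto similarity: shared on-bits divided by on-bits in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    common = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = popcount(fp1) + popcount(fp2) - common
    # Sketch-level choice: two empty fingerprints score 0.0.
    return common / union if union else 0.0


def count_hits(query, targets, threshold):
    """Count targets whose similarity to the query is at least threshold."""
    return sum(1 for _, fp in targets if tanimoto(query, fp) >= threshold)


def knearest(query, targets, k, threshold=0.0):
    """Return up to k (id, score) pairs at or above threshold, best first."""
    scored = [(identifier, tanimoto(query, fp)) for identifier, fp in targets]
    hits = [hit for hit in scored if hit[1] >= threshold]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:k]


# Toy 16-bit fingerprints with made-up ids.
query = bytes.fromhex("0f01")
targets = [
    ("mol_a", bytes.fromhex("0300")),
    ("mol_b", bytes.fromhex("0f01")),
    ("mol_c", bytes.fromhex("f000")),
]
print(count_hits(query, targets, threshold=0.2))
print(knearest(query, targets, k=2, threshold=0.2))
```

Every query here scans all targets linearly, which is fine for a handful of molecules but is exactly the work that chemfp's arena search is designed to speed up.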