Calculate fingerprints

I’m interested in chemoinformatics and machine learning, and I’m a fan of RDKit.
(But I’m still …..)

Calculating molecular fingerprints is a very important step in building a QSAR model.
I sometimes use the RDKit Morgan fingerprint (similar to SciTegic ECFP4).
Chemfp is a good tool for calculating many types of fingerprints.
It can generate and search fingerprint files. ob2fps, oe2fps, and rdkit2fps use the Open Babel, OpenEye, and RDKit chemistry toolkits, respectively, to convert structure files into fingerprint files.
It’s easy to install with pip.

pip install chemfp

OK, let’s calculate fingerprints.
As an example, I used cdk2.sdf, which comes from the RDKit sample data files.

iwatobipen$ time rdkit2fps cdk2.sdf --morgan --radius 2 --output cdk2fp_morgan2.fps

real	0m0.290s
user	0m0.216s
sys	0m0.061s

I used the Morgan fingerprint as an example, but chemfp can also calculate atom pair, RDKit, or MACCS fingerprints.

Now I have an fps file.

iwatobipen$ head -n 7 cdk2fp_morgan2.fps 
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1
02000000000000000000010000800000000400000000800000000000000000000000000800040000000000000000000400000202000000000000000000000000000000000000000000000000000000000004001000008100000200000000000000000000c00000000600000000000000000000000000000020000000000000000000000002000000000000040000000001000800000000000000000100004000000000000000000020000004100000000000000000100200000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000080000002000000002000000000400000000000900000000000	ZINC03814457
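
Each data line in the fps file is a hex-encoded fingerprint, a tab, and the record identifier, and header lines start with "#". As a rough illustration of that layout, here is a minimal pure-Python sketch that reads such text without chemfp. The function names parse_fps and popcount and the tiny 16-bit sample record are made up for this example, and it is written for Python 3.

```python
# Parse FPS-formatted text by hand: header lines start with '#',
# data lines are "<hex fingerprint><TAB><identifier>".
def parse_fps(text):
    metadata = {}
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Header lines look like "#key=value"; a bare "#FPS1" has no '='.
            key, _, value = line[1:].partition("=")
            metadata[key] = value
        else:
            hex_fp, _, mol_id = line.partition("\t")
            records.append((mol_id, bytes.fromhex(hex_fp)))
    return metadata, records


def popcount(fp):
    """Number of set bits in a bytes fingerprint."""
    return sum(bin(byte).count("1") for byte in fp)


# Hypothetical 16-bit record for illustration (the Morgan records above are 2048 bits).
sample = "#FPS1\n#num_bits=16\n0300\texample_id\n"
metadata, records = parse_fps(sample)
print(metadata["num_bits"], records[0][0], popcount(records[0][1]))
```

For real work chemfp.load_fingerprints is the better choice. This sketch only shows what is stored in the file.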

I can access the fps file from the Python module.

In [2]: import chemfp
In [3]: from chemfp import bitops
In [5]: fps = chemfp.load_fingerprints("cdk2fp_morgan2.fps")
In [6]: print(fps.metadata)
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1

In [10]: l = [(i,fp) for i, fp in fps]
In [12]: l[0][0]
Out[12]: 'ZINC03814479'
In [13]: l[0][1].encode("hex")
Out[13]: '04000020000000000000010000080000000000000000000000000000000000000000000000000000000000000000000400000202000000000000000011000000000000000000000000000000000000000004000000008100000020000000000000000000000000000000000801000004000000400000000000000800000000081000000000000000000000000000010000000800000000000000000100000000000000000020000000000000100100000000000000120a00020000400000000000000000000400000000000000000000020000000000000000000000000000000000000000000080000000000000000000000000000000000000000000000000'

In [14]: fp1 = l[0][1]
In [15]: fp2 = l[1][1]
In [16]: tc = bitops.byte_tanimoto(fp1, fp2)
In [17]: print tc
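
The Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal pure-Python sketch of that calculation on byte strings, not chemfp's implementation. The function name tanimoto and the two short example fingerprints are made up, returning 0.0 when neither fingerprint has any bits set is an assumption of this sketch, and the code is written for Python 3.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both fingerprints (intersection).
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    # Bits set in at least one fingerprint (union).
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Assumption: two empty fingerprints score 0.0.
    return both / either if either else 0.0


# Hypothetical 16-bit fingerprints: fp_a has 4 bits set, fp_b has 2, both shared.
fp_a = bytes.fromhex("0f00")
fp_b = bytes.fromhex("0300")
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 4 bits in the union = 0.5
```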

# search using tanimoto
In [41]: q_id, q_fp = fps[0]
In [42]: targets = fps[1:]
In [43]: time print chemfp.search.threshold_tanimoto_search_fp(q_fp, fps[1:], threshold=0.2)
CPU times: user 437 µs, sys: 195 µs, total: 632 µs
Wall time: 462 µs

# can also search for the k nearest targets
In [44]: q = next(fps.iter_arenas(3))
In [45]: q.ids
Out[45]: ['ZINC03814479', 'ZINC00023543', 'ZINC03814440']
In [46]: targes = fps
In [47]: res = [ i for i in chemfp.knearest_tanimoto_search(q, targets, k=2, threshold=0.2)]

In [48]: res[0]
Out[48]: ('ZINC03814479', <chemfp.search.SearchResult object at 0x107eb4690>)

In [49]: res[0][1].get_ids()
Out[49]: []

In [50]: res[1][1].get_ids()
Out[50]: ['ZINC00023543', 'ZINC03814457', 'ZINC03814440']
In [51]: res[1][1].get_scores()
Out[51]: array('d', [1.0, 0.5, 0.04225352112676056])
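
To make the two search styles concrete, here is a small brute-force sketch in pure Python. A threshold search keeps every target whose score is at least the threshold, and a k-nearest search keeps only the k best of those. This is only an illustration of the idea, not how chemfp does it internally. The names threshold_search and knearest_search and the one-byte example fingerprints are made up, and the code is written for Python 3.

```python
import heapq


def tanimoto(fp1, fp2):
    """Tanimoto similarity of two equal-length byte fingerprints."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    return both / either if either else 0.0


def threshold_search(query_fp, targets, threshold):
    """Return (id, score) for every target scoring at least `threshold`."""
    scored = ((tid, tanimoto(query_fp, tfp)) for tid, tfp in targets)
    return [(tid, score) for tid, score in scored if score >= threshold]


def knearest_search(query_fp, targets, k, threshold=0.0):
    """Return the k highest-scoring hits at or above `threshold`, best first."""
    hits = threshold_search(query_fp, targets, threshold)
    return heapq.nlargest(k, hits, key=lambda hit: hit[1])


# Hypothetical 8-bit targets, just for illustration.
targets = [
    ("t1", bytes.fromhex("0f")),
    ("t2", bytes.fromhex("03")),
    ("t3", bytes.fromhex("80")),
]
query = bytes.fromhex("0f")
print(threshold_search(query, targets, 0.4))  # [('t1', 1.0), ('t2', 0.5)]
print(knearest_search(query, targets, k=1))   # [('t1', 1.0)]
```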

It’s useful for me.