Generate fragment fingerprint using RDKit

I often use MorganFP( ECFP or FCFP like ) for machine learning.
But for the chemist, I think it’s difficult to understand. On the other hand, fragment based fingerprint is easy to understand.
There are lots of report about fragment based fingerprint for QSAR. Fortunately, RDKit already implemented the method to generate FragmentFingerprint. ;-)
It seems fun! I wrote a snippet to generate fragmentfingerprint.
I used a python library called Click for option parser. “Click” is very easy to understand and the library can improve readability of the code.
My code is following.
This code need one argument and one option. The argument is input.sdf and option is output filename. Default name of outputfile is calc_frag_fp.npz.
npz extension is needed because of I used savezmethod,

import click
import os
import numpy as np
from rdkit.Chem import RDConfig
from rdkit import Chem
from rdkit.Chem import FragmentCatalog
from rdkit.Chem import DataStructs

fName = os.path.join( RDConfig.RDDataDir, 'FunctionalGroups.txt' )
fparams = FragmentCatalog.FragCatParams( 1, 6, fName )
fcat = FragmentCatalog.FragCatalog( fparams )
fcgen = FragmentCatalog.FragCatGenerator()
fpgen = FragmentCatalog.FragFPGenerator()

@click.argument( 'infile' )
@click.option( '--output','-o', default = 'calc_frag_fp.npz' )
def getfragmentfp( infile, output ):
    fragfps = []
    sdf = Chem.SDMolSupplier( infile )
    counter = 0
    for mol in sdf:
        if mol == None:
        counter += 1
        nAdded = fcgen.AddFragsFromMol( mol, fcat )
    print( "{} mols read".format( counter ) )
    for mol in sdf:
        if mol == None:
        arr = np.zeros((1,))
        fp = fpgen.GetFPForMol( mol, fcat )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fragfps.append( arr )
    fragfps = np.asarray( fragfps )
    np.savez( output, x = fragfps )
    print( fragfps.shape )
    print( 'done!' )

if __name__ == '__main__':

To use the code, command is very simple. Just type..

# not set option value.
iwatobipen$ python cdk2.sdf

Then some massages got from console and output.npz file is generated.

47 mols read
(47, 6118)

The npz file has ndarray data of fragmentfingerprint of the SDF. The array can use input for Machine learning.
I pushed the code to my repository,


Published by iwatobipen

I'm medicinal chemist in mid size of pharmaceutical company. I love chemoinfo, cording, organic synthesis, my family.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: