I often use Morgan fingerprints (ECFP- or FCFP-like) for machine learning.
But I think they are difficult for chemists to interpret. Fragment-based fingerprints, on the other hand, are easy to understand.
There are many reports on fragment-based fingerprints for QSAR. Fortunately, RDKit already implements a method to generate fragment fingerprints. ;-)
It seems fun! I wrote a snippet to generate fragment fingerprints.
I used a Python library called Click as the option parser. Click is very easy to understand, and it can improve the readability of the code.
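For readers who have not used Click before, here is a minimal sketch of the argument-plus-option pattern. The command name show_args and the default value result.txt are hypothetical and exist only for this illustration; CliRunner is Click's own testing helper, used here so the command can be tried without a shell.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("infile")
@click.option("--output", "-o", default="result.txt", help="Output file name.")
def show_args(infile, output):
    """Echo the parsed argument and option (hypothetical toy command)."""
    click.echo("infile={} output={}".format(infile, output))


# CliRunner invokes the command in-process, so no shell is needed to try it.
runner = CliRunner()
default_run = runner.invoke(show_args, ["input.sdf"])
custom_run = runner.invoke(show_args, ["input.sdf", "-o", "my_fp.npz"])
print(default_run.output, end="")
print(custom_run.output, end="")
```

The decorators replace hand-written parsing: the positional argument and the option both arrive as ordinary function parameters.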
My code is as follows.
This code takes one argument and one option. The argument is the input SDF file and the option is the output filename. The default output filename is calc_frag_fp.npz.
The .npz extension is used because the array is saved with NumPy's savez function.
import os

import click
import numpy as np
from rdkit import Chem, DataStructs, RDConfig
from rdkit.Chem import FragmentCatalog

# Functional group definitions shipped with RDKit
fName = os.path.join( RDConfig.RDDataDir, 'FunctionalGroups.txt' )
# Fragments of 1 to 6 bonds are stored in the catalog
fparams = FragmentCatalog.FragCatParams( 1, 6, fName )
fcat = FragmentCatalog.FragCatalog( fparams )
fcgen = FragmentCatalog.FragCatGenerator()
fpgen = FragmentCatalog.FragFPGenerator()

@click.command()
@click.argument( 'infile' )
@click.option( '--output','-o', default = 'calc_frag_fp.npz' )
def getfragmentfp( infile, output ):
    fragfps = []
    sdf = Chem.SDMolSupplier( infile )
    counter = 0
    # First pass: build the fragment catalog from every readable molecule
    for mol in sdf:
        if mol is None:
            continue
        counter += 1
        nAdded = fcgen.AddFragsFromMol( mol, fcat )
    print( "{} mols read".format( counter ) )
    # Second pass: compute a fingerprint for each molecule against the full catalog
    for mol in sdf:
        if mol is None:
            continue
        arr = np.zeros((1,))
        fp = fpgen.GetFPForMol( mol, fcat )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fragfps.append( arr )
    fragfps = np.asarray( fragfps )
    # Store the fingerprint matrix under the key 'x'
    np.savez( output, x = fragfps )
    print( fragfps.shape )
    print( 'done!' )

if __name__ == '__main__':
    getfragmentfp()
To use the code, the command is very simple. Just type:
# without setting the option value
iwatobipen$ python fragfpgen.py cdk2.sdf
Then some messages are printed to the console and the output .npz file (calc_frag_fp.npz by default) is generated.
47 mols read
(47, 6118)
done!
The npz file holds the fragment fingerprints of the SDF as an ndarray. The array can be used as input for machine learning.
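Reading the array back might look like the sketch below. It assumes the fingerprints were stored under the key 'x', as in the savez call of the script. The helper load_fragment_fps, the toy matrix and the temporary file name are hypothetical stand-ins, so that the example runs without RDKit or a real SDF.

```python
import os
import tempfile

import numpy as np


def load_fragment_fps(path, key="x"):
    """Load the fingerprint matrix stored under `key` in an .npz archive."""
    with np.load(path) as data:
        return data[key]


# Toy stand-in for the real output: 3 molecules x 5 fragment bits.
toy_fps = np.array([[1, 0, 0, 1, 0],
                    [0, 1, 0, 1, 0],
                    [1, 1, 0, 0, 0]], dtype=float)

# Write the toy matrix the same way the script does (keyword name 'x').
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_frag_fp.npz")
np.savez(toy_path, x=toy_fps)

# Rows are molecules, columns are fragment bits.
X = load_fragment_fps(toy_path)
print(X.shape)
```

With the real file, X would have one row per molecule read from the SDF, and can be passed to a model as the feature matrix.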
I pushed the code to my repository.