Generate fragment fingerprint using RDKit

I often use Morgan fingerprints (ECFP/FCFP-like) for machine learning.
But for chemists, I think they are difficult to interpret. On the other hand, fragment-based fingerprints are easy to understand.
There are lots of reports about fragment-based fingerprints for QSAR. Fortunately, RDKit already implements a method to generate fragment fingerprints. 😉
It seems fun! So I wrote a snippet to generate fragment fingerprints.
I used a Python library called Click for the option parser. Click is very easy to use and it improves the readability of the code.
My code is below.
The code takes one argument and one option. The argument is the input SDF file and the option is the output filename. The default output filename is calc_frag_fp.npz.
The .npz extension is needed because I used the np.savez method.

import click
import os
import numpy as np
from rdkit.Chem import RDConfig
from rdkit import Chem
from rdkit.Chem import FragmentCatalog
from rdkit.Chem import DataStructs

# build a fragment catalog from RDKit's predefined functional-group definitions
fName = os.path.join( RDConfig.RDDataDir, 'FunctionalGroups.txt' )
fparams = FragmentCatalog.FragCatParams( 1, 6, fName )  # fragments of 1-6 bonds
fcat = FragmentCatalog.FragCatalog( fparams )
fcgen = FragmentCatalog.FragCatGenerator()
fpgen = FragmentCatalog.FragFPGenerator()

@click.command()
@click.argument( 'infile' )
@click.option( '--output', '-o', default = 'calc_frag_fp.npz' )
def getfragmentfp( infile, output ):
    fragfps = []
    sdf = Chem.SDMolSupplier( infile )
    counter = 0
    # first pass: populate the catalog with fragments from every molecule
    for mol in sdf:
        if mol is None:
            continue
        counter += 1
        nAdded = fcgen.AddFragsFromMol( mol, fcat )
    print( "{} mols read".format( counter ) )
    # second pass: generate a fragment fingerprint for each molecule against the catalog
    for mol in sdf:
        if mol is None:
            continue
        arr = np.zeros( (1,) )
        fp = fpgen.GetFPForMol( mol, fcat )
        DataStructs.ConvertToNumpyArray( fp, arr )
        fragfps.append( arr )
    fragfps = np.asarray( fragfps )
    np.savez( output, x = fragfps )
    print( fragfps.shape )
    print( 'done!' )

if __name__ == '__main__':
    getfragmentfp()

Using the code is very simple. Just type:

# the --output option is not set, so the default filename is used.
iwatobipen$ python fragfpgen.py cdk2.sdf

Some messages are printed to the console and the calc_frag_fp.npz file is generated.

47 mols read
(47, 6118)
done!
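The second dimension of the array (6118 here) corresponds to the number of fragment entries collected in the catalog. One nice property of this fingerprint is that every bit maps back to a catalog entry, so the on-bits of a molecule can be translated into fragment descriptions. Here is a minimal sketch, assuming fcat, fcgen and fpgen from the script above are available in the same session; the SMILES is just a hypothetical example:

from rdkit import Chem

# hypothetical example molecule; in practice use a mol from the SDF
mol = Chem.MolFromSmiles( 'OCCc1ccccc1' )
fcgen.AddFragsFromMol( mol, fcat )         # make sure its fragments are in the catalog
fp = fpgen.GetFPForMol( mol, fcat )
for bit_id in list( fp.GetOnBits() )[:5]:  # inspect the first few on-bits
    print( bit_id, fcat.GetEntryDescription( bit_id ) )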

The npz file contains an ndarray of the fragment fingerprints of the molecules in the SDF. The array can be used as input for machine learning.
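For example, here is a minimal sketch of loading the saved array back and feeding it to a model. The key 'x' comes from the np.savez call above; the scikit-learn estimator and the label vector y are hypothetical placeholders.

import numpy as np

data = np.load( 'calc_frag_fp.npz' )
X = data['x']            # shape: (n_mols, n_fragment_bits)
print( X.shape )

# hypothetical usage: feed X into any scikit-learn style estimator
# (y would be a label vector you supply yourself)
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier().fit( X, y )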
I pushed the code to my repository:

https://github.com/iwatobipen/fragfp

Calculate exit vector distance of each molecule

Some days ago, I attended a seminar about drug discovery.
All presentations were nice, and I was especially interested in a unique way of representing chemical space, named the "exit vector plot" (EVP).
The reference is:
Following Ramachandran: exit vector plots (EVP) as a tool to navigate chemical space covered by 3D bifunctional scaffolds. The case of cycloalkanes
http://pubs.rsc.org/en/content/articlelanding/2016/ra/c5ra19958a

The concept is that a bifunctional scaffold is described by its two exit vectors n1 and n2, using four parameters: phi1, phi2, theta (the dihedral angle) and r (the distance between the two vector origins).
It is similar to a Ramachandran plot, which is used to describe protein structure.

[Image: exit vector definition]

In this work, the authors calculated exit vectors for several ring systems and clustered them.
I think this concept is very useful, because when a chemist wants to replace one ring system with another, it is difficult to search for molecules that show low structural similarity but high exit vector similarity.

So, if we can search molecules by exit vector similarity, it will be a new way to expand chemical space.
In the paper, the authors cluster molecules using the Euclidean distance between exit vectors; this distance works as a dissimilarity measure.
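Concretely, the way the snippet below computes it, each set of parameters is mapped to a point in a 4-dimensional space and the dissimilarity between two scaffolds is the Euclidean distance between those points:

$$
\begin{aligned}
x &= r\,\sin\theta\,\sin\phi_1\,\sin\phi_2, \\
y &= r\,\sin\theta\,\sin\phi_1\,\cos\phi_2, \\
z &= r\,\sin\theta\,\cos\phi_1, \\
t &= r\,\cos\theta,
\end{aligned}
\qquad
d_{12} = \lVert (x_1, y_1, z_1, t_1) - (x_2, y_2, z_2, t_2) \rVert .
$$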
Now let's try to write an exit vector calculator using RDKit!
In the following code I focused on secondary diamines, because the two vectors (n1, n2) are easy to define from the N-H bonds.
First I prepared a sample SDF with energy-minimized 3D structures (the preparation step is not shown in the code).
Then let's write the code!
RDKit has handy functions for calculating distances, angles and dihedrals in the rdMolTransforms module.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolTransforms
import numpy as np

# load 3D structures; keep hydrogens because the N-H exit vectors are needed
sdf = Chem.SDMolSupplier( "3d_amine.sdf", removeHs=False )
mols = [ m for m in sdf ]

# quick interactive check of the N-H substructure match (not used below)
mol = mols[1]
mol.GetSubstructMatch( Chem.MolFromSmarts( "N[H]" ) )
atoms = [ atom for atom in mol.GetAtoms() ]

def getev( mol ):
    """Return the exit vector parameters (theta, phi1, phi2, r) of a secondary diamine."""
    if mol.GetNumConformers() >= 1:
        # the two N-H groups define the exit vectors n1 and n2
        matches = mol.GetSubstructMatches( Chem.MolFromSmarts( "N[H]" ) )
        conf = mol.GetConformer()
        # theta: dihedral H1-N1-N2-H2
        theta = rdMolTransforms.GetDihedralDeg( conf,
                                                matches[0][1],
                                                matches[0][0],
                                                matches[1][0],
                                                matches[1][1] )
        # phi: angle between each N-H vector and the N1-N2 axis
        temp_phi1 = 180 - rdMolTransforms.GetAngleDeg( conf,
                                                       matches[1][0],
                                                       matches[0][0],
                                                       matches[0][1] )
        temp_phi2 = 180 - rdMolTransforms.GetAngleDeg( conf,
                                                       matches[0][0],
                                                       matches[1][0],
                                                       matches[1][1] )
        # by convention phi1 is the larger of the two angles
        if temp_phi1 >= temp_phi2:
            phi1, phi2 = temp_phi1, temp_phi2
        else:
            phi1, phi2 = temp_phi2, temp_phi1

        # r: distance between the two nitrogen atoms
        r = rdMolTransforms.GetBondLength( conf, matches[0][0], matches[1][0] )
        return theta, phi1, phi2, r
    else:
        print( "No conformer!" )

def transform_cartesian( theta, phi1, phi2, r ):
    """Map the four exit vector parameters onto a point in 4D space."""
    theta = np.deg2rad( theta )
    phi1 = np.deg2rad( phi1 )
    phi2 = np.deg2rad( phi2 )
    x = np.sin( theta ) * np.sin( phi1 ) * np.sin( phi2 ) * r
    y = np.sin( theta ) * np.sin( phi1 ) * np.cos( phi2 ) * r
    z = np.sin( theta ) * np.cos( phi1 ) * r
    t = np.cos( theta ) * r
    return x, y, z, t

def get_dist( v1, v2 ):
    # Euclidean distance between two points in the 4D exit vector space
    v1 = np.asarray( v1 )
    v2 = np.asarray( v2 )
    delta = v1 - v2
    d = np.linalg.norm( delta )
    return d

def calc_distance( mol1, mol2 ):
    # exit vector dissimilarity between two molecules
    theta1, phi11, phi21, r1 = getev( mol1 )
    theta2, phi12, phi22, r2 = getev( mol2 )
    cart1 = transform_cartesian( theta1, phi11, phi21, r1 )
    cart2 = transform_cartesian( theta2, phi12, phi22, r2 )
    d = get_dist( cart1, cart2 )
    return d

All done!
Then let's run the code.

# distance from mols[3] to every molecule in the set
for i, mol in enumerate( mols ):
    d = calc_distance( mols[3], mol )
    print( 3, i, d )

I got the following output.
idx1 idx2 distance
3 0 4.90798883048
3 1 1.48017700543
3 2 6.08814365316
3 3 0.0
3 4 3.93873942603
3 5 5.45068906229
3 6 1.40146560569
3 7 0.0619944778129
3 8 6.92182054937
3 9 6.22560061852
3 10 6.19913412456
3 11 5.75093874617
3 12 5.38531808804
3 13 1.85179121146
3 14 1.80235655185
The molecules at indices 1, 6 and 7 are close to index 3.
Let's see…

Draw.MolsToGridImage( [mols[3],mols[1], mols[6], mols[7]] )
Draw.MolsToGridImage( mols )

I got the following images.
It seems to work fine.

[Image: selected mols]
[Image: all mols]
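Going one step further, the same functions can be used to build the full pairwise distance matrix, which could then be handed to a clustering routine as the paper does. Here is a minimal sketch, assuming the mols list and the calc_distance function defined above:

import numpy as np

# pairwise exit vector distance matrix (symmetric, zero diagonal)
n = len( mols )
dist_mat = np.zeros( (n, n) )
for i in range( n ):
    for j in range( i + 1, n ):
        d = calc_distance( mols[i], mols[j] )
        dist_mat[i, j] = dist_mat[j, i] = d
print( np.round( dist_mat, 2 ) )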

I uploaded the code to GitHub.
https://github.com/iwatobipen/rdkit_evp

Speed up of array calculation

My colleague on the CADD team told me that his recommendation for handling massive data in Python is numpy.
And he showed me some very nice code. (Nice stuff.)
Numpy is the fundamental package for scientific computing with Python. It handles data as vectors and arrays.

Native Python processes a list of N elements with an explicit loop of N iterations. Numpy operates on the whole array at once, so the user does not have to write the loop and the iteration runs in fast compiled code. 😉
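A minimal sketch of that difference (not in the original post): the same element-wise multiplication written once as an explicit Python loop and once as a single vectorized numpy expression.

import numpy as np

a = list( range( 1000000 ) )
b = list( range( 1000000 ) )

# plain Python: an explicit loop over the N elements
res_loop = [ ai * bi for ai, bi in zip( a, b ) ]

# numpy: one expression, the loop runs in compiled code
a_arr = np.asarray( a )
b_arr = np.asarray( b )
res_vec = a_arr * b_arr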

Some days ago, I found numexpr on PyPI.
The library can speed up array calculations in some cases.
Numexpr can be installed with pip or conda.
http://www.numpy.org/

1st try: vector handling

import numpy as np
import numexpr as ne
a = np.arange(1e6)
b = np.arange(1e6)

#Use numpy
%timeit a*b-4.1*a > 2.5*b
#100 loops, best of 3: 8.51 ms per loop

#Use numexpr
%timeit ne.evaluate( "a*b-4.1*a > 2.5*b" )
#1000 loops, best of 3: 1.12 ms per loop

2nd try: array handling

ar1 = np.random.random((1000, 1000))
ar2 = np.random.random((1000, 1000))
#use numpy
%timeit ar1 * ar2
#100 loops, best of 3: 1.95 ms per loop

#use numexpr
%timeit ne.evaluate("ar1*ar2")
#1000 loops, best of 3: 673 µs per loop

In both examples, numexpr showed better performance. This is mainly because numexpr evaluates the whole expression in a single pass over the data, avoiding the large temporary arrays that numpy allocates for each intermediate result, and it can use multiple threads.