Somedays ago, I attended CBI about AI for drug discovery. A presenter introduced about sorting algorithm named “sketch sort”.
Detail of the algorithm is described in following url.
The algorithm can k-search nearest neighbor in short time.
The source code has been published in google corde archive. I tried to sort compound fingerprint using “sketch sort”.
First, I compiled code in my PC.
iwatobipen$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/sketchsort/sketchsort-0.0.8.tar.bz2 iwatobipen$ tar -xjvf sketchsort-X.X.X.tar.bz2 cd sketchsort-X.X.X/src iwatobipen$ cd sketchsort-X.X.X/src iwatobipen$ make
Just ready, Next I made input file from SDF.
Input format is space separated values. I used MorganFP bit strings for input. The convert code is following.
import click from rdkit import Chem from rdkit.Chem import AllChem def converter( fpobject ): fpstring = fpobject.ToBitString() res = "" for s in fpstring: res += s + " " res = res.rstrip()+"\n" return res @click.command() @click.argument( 'inputfile' ) def cmd( inputfile ): mols = [ mol for mol in Chem.SDMolSupplier( inputfile ) if mol != None ] fps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in mols ] output = open( 'fps.txt', 'w' ) for fp in fps: fpstring = converter( fp ) output.write( fpstring ) def main(): cmd() if __name__ == '__main__': main()
When I run the code with sdffilename, I could get fingerprint as text file.
Let’s sort!
Following code is same as the document. (https://code.google.com/archive/p/sketchsort/)
User need to define threshold of cosine distance and missing ratio. Chuk size is default 3.
iwatobipen$ time ./sketchsort -auto -centering -cosdist 0.5 -missingratio 0.001 ../dat/fps.txt output4 SketchSort version 0.0.8 Written by Yasuo Tabei deciding parameters such that the missing edge ratio is no more than 0.001 decided parameters: hamming distance threshold: 9 number of blocks: 12 number of chunks: 30 missing edge ratio:0.000850366 start reading end reading readtime:0.021719 start making input-data centered at 0 end making input-data centered at 0 centering time:0.000668 number of data:47 data dimension:2048 projected dimension:32 length of strings:960 number of chunks:30 start projection end projection projecttime:0.100002 chunk distance:9 the number of blocks:12 start enumeration chunk no 1 start enumeration chunk no 2 start enumeration chunk no 3 start enumeration chunk no 4 start enumeration chunk no 5 start enumeration chunk no 6 .... start enumeration chunk no 26 start enumeration chunk no 27 start enumeration chunk no 28 start enumeration chunk no 29 start enumeration chunk no 30 msmtime:0.010782 cputime:0.111323 numSort:480 numHamDist:10023 numCosDist:285 real 0m0.142s user 0m0.134s sys 0m0.003s # Check result iwatobiepn$ cat output4 iwatobipen$ cat output4 3 4 0.243067 1 4 0.356637 1 3 0.278808 35 34 0.237244 29 28 0.245148 12 32 0.323389 25 26 0.262942 ....
Format of output is indexA indexB cosine distance.
In this case, I used very small dataset, I can not feel how fast the algorithm is. But it work so fast against massive dataset.
If I use the soft, I need to make output parser that convert index to sdf or mol ids.