High-speed sorting algorithm (SketchSort)

A few days ago, I attended a CBI meeting about AI for drug discovery. A presenter introduced a sorting algorithm named "SketchSort".
Details of the algorithm are described in the following paper.

Paper: tabei10a.pdf

The algorithm can find nearest neighbors (k-NN-style search) in a short time.
The source code has been published in the Google Code Archive. I tried to sort compound fingerprints using SketchSort.
First, I compiled the code on my PC.

iwatobipen$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/sketchsort/sketchsort-0.0.8.tar.bz2
iwatobipen$ tar -xjvf sketchsort-X.X.X.tar.bz2
iwatobipen$ cd sketchsort-X.X.X/src
iwatobipen$ make

Now everything is ready. Next, I made an input file from an SDF.
The input format is space-separated values. I used Morgan fingerprint bit strings as input. The conversion code is below.

import click
from rdkit import Chem
from rdkit.Chem import AllChem

def converter( fpobject ):
    # convert an RDKit bit vector to a space-separated bit string
    fpstring = fpobject.ToBitString()
    return " ".join( fpstring ) + "\n"

@click.command()
@click.argument( 'inputfile' )
def cmd( inputfile ):
    mols = [ mol for mol in Chem.SDMolSupplier( inputfile ) if mol is not None ]
    fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2 ) for mol in mols ]
    with open( 'fps.txt', 'w' ) as output:
        for fp in fps:
            output.write( converter( fp ) )

if __name__ == '__main__':
    cmd()

When I ran the code with an SDF filename, I got the fingerprints as a text file.
Let’s sort!
The following command is the same as in the documentation (https://code.google.com/archive/p/sketchsort/).
The user needs to define the cosine distance threshold and the missing edge ratio. The chunk size is 3 by default.

iwatobipen$ time ./sketchsort -auto -centering -cosdist 0.5 -missingratio 0.001 ../dat/fps.txt output4
SketchSort version 0.0.8
Written by Yasuo Tabei

deciding parameters such that the missing edge ratio is no more than 0.001
decided parameters:
hamming distance threshold: 9
number of blocks: 12
number of chunks: 30

missing edge ratio:0.000850366
start reading
end reading
start making input-data centered at 0
end making input-data centered at 0
centering time:0.000668
number of data:47
data dimension:2048
projected dimension:32
length of strings:960
number of chunks:30
start projection
end projection
chunk distance:9
the number of blocks:12
start enumeration chunk no 1
start enumeration chunk no 2
start enumeration chunk no 3
start enumeration chunk no 4
start enumeration chunk no 5
start enumeration chunk no 6
start enumeration chunk no 26
start enumeration chunk no 27
start enumeration chunk no 28
start enumeration chunk no 29
start enumeration chunk no 30

real	0m0.142s
user	0m0.134s
sys	0m0.003s

# Check result
iwatobipen$ cat output4
3 4 0.243067
1 4 0.356637
1 3 0.278808
35 34 0.237244
29 28 0.245148
12 32 0.323389
25 26 0.262942

The output format is: indexA indexB cosine-distance.
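Since the third column is a cosine distance, we can sanity-check the output by recomputing it ourselves. Below is a minimal sketch of my own (not part of SketchSort); note that with the -centering option the tool centers the data first, so values computed on the raw bit vectors may differ slightly from the reported ones.

```python
import math

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# identical vectors -> distance 0; orthogonal vectors -> distance 1
print(cosine_distance([1, 0, 1], [1, 0, 1]))  # 0.0
print(cosine_distance([1, 0], [0, 1]))        # 1.0
```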
In this case I used a very small dataset, so I could not feel how fast the algorithm is, but it works very fast on massive datasets.
To use this software in practice, I need to write an output parser that converts the indices back to SDF records or mol IDs.
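Such a parser could look like the sketch below (my own code, not part of SketchSort; `parse_pairs` and the names list are assumptions). The mapping only holds if the names are taken from the SDF in the same order used when writing fps.txt, and the index base (0 or 1) should be checked against your own data.

```python
def parse_pairs(result_file, names):
    """Convert sketchsort 'indexA indexB distance' lines to name pairs.

    names: molecule IDs in the same order used to write fps.txt,
    e.g. [m.GetProp('_Name') for m in Chem.SDMolSupplier(sdf)].
    NOTE: assumes 0-based indices; adjust if the output is 1-based.
    """
    pairs = []
    with open(result_file) as f:
        for line in f:
            i, j, dist = line.split()
            pairs.append((names[int(i)], names[int(j)], float(dist)))
    return pairs
```

Sorting the returned list by distance then gives the most similar compound pairs by name.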


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.
