High-speed sorting algorithm (SketchSort)

A few days ago I attended a CBI meeting about AI for drug discovery. A presenter introduced a sorting algorithm named "SketchSort".
Details of the algorithm are described in the following paper.
http://www.jmlr.org/proceedings/papers/v13/tabei10a/tabei10a.pdf
The algorithm can perform k-nearest-neighbor search in a very short time.
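Roughly speaking (this is just my understanding of the paper, and the toy NumPy snippet below is only an illustration, not SketchSort's actual code), each data point is turned into a short bit sketch by random hyperplane projections, so that the Hamming distance between sketches approximates the cosine distance between the original vectors, and similar points can then be found quickly by sorting the sketches.

import numpy as np

# Toy illustration of the random-hyperplane sketch idea (not SketchSort itself):
# the fraction of differing sketch bits approximates angle(x, y) / pi,
# so near neighbours get near-identical bit strings.
rng = np.random.default_rng(0)

def sketch(x, planes):
    # one bit per random hyperplane: 1 if x lies on its positive side
    return (planes @ x > 0).astype(np.uint8)

dim, nbits = 2048, 960                    # 2048-d data, 960 sketch bits (cf. the log further down)
planes = rng.standard_normal((nbits, dim))

x = rng.random(dim)
y = x + 0.1 * rng.standard_normal(dim)    # a near neighbour of x

hamming = np.mean(sketch(x, planes) != sketch(y, planes))
angle = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(hamming, angle / np.pi)             # these two values should be close
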
The source code has been published in the Google Code Archive. I tried to sort compound fingerprints using SketchSort.
First, I compiled the code on my PC.

iwatobipen$ wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/sketchsort/sketchsort-0.0.8.tar.bz2
iwatobipen$ tar -xjvf sketchsort-0.0.8.tar.bz2
iwatobipen$ cd sketchsort-0.0.8/src
iwatobipen$ make

Now everything is ready. Next, I made an input file from an SDF.
The input format is space-separated values. I used Morgan fingerprint bit strings as input. The conversion code is shown below.

import click
from rdkit import Chem
from rdkit.Chem import AllChem

def converter(fpobject):
    # convert an RDKit bit vector into a space-separated 0/1 string
    fpstring = fpobject.ToBitString()
    return " ".join(fpstring) + "\n"

@click.command()
@click.argument('inputfile')
def cmd(inputfile):
    # read molecules from the SDF, skipping entries RDKit fails to parse
    mols = [mol for mol in Chem.SDMolSupplier(inputfile) if mol is not None]
    # Morgan fingerprint, radius 2 (2048 bits by default)
    fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
    # write one fingerprint per line in SketchSort's space-separated format
    with open('fps.txt', 'w') as output:
        for fp in fps:
            output.write(converter(fp))

def main():
    cmd()

if __name__ == '__main__':
    main()

When I ran the code with an SDF file name, I got the fingerprints as a text file.
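For example, if the script above is saved as fpconverter.py (the script name and mols.sdf here are just placeholders), it can be run like this and writes fps.txt to the current directory.

iwatobipen$ python fpconverter.py mols.sdf
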
Let’s sort!
The following command is the same as in the documentation. (https://code.google.com/archive/p/sketchsort/)
The user needs to set the cosine distance threshold and the missing edge ratio. The chunk size defaults to 3.

iwatobipen$ time ./sketchsort -auto -centering -cosdist 0.5 -missingratio 0.001 ../dat/fps.txt output4
SketchSort version 0.0.8
Written by Yasuo Tabei

deciding parameters such that the missing edge ratio is no more than 0.001
decided parameters:
hamming distance threshold: 9
number of blocks: 12
number of chunks: 30

missing edge ratio:0.000850366
start reading
end reading
readtime:0.021719
start making input-data centered at 0
end making input-data centered at 0
centering time:0.000668
number of data:47
data dimension:2048
projected dimension:32
length of strings:960
number of chunks:30
start projection
end projection
projecttime:0.100002
chunk distance:9
the number of blocks:12
start enumeration chunk no 1
start enumeration chunk no 2
start enumeration chunk no 3
start enumeration chunk no 4
start enumeration chunk no 5
start enumeration chunk no 6
....
start enumeration chunk no 26
start enumeration chunk no 27
start enumeration chunk no 28
start enumeration chunk no 29
start enumeration chunk no 30
msmtime:0.010782
cputime:0.111323
numSort:480
numHamDist:10023
numCosDist:285

real	0m0.142s
user	0m0.134s
sys	0m0.003s

# Check result
iwatobipen$ cat output4
3 4 0.243067
1 4 0.356637
1 3 0.278808
35 34 0.237244
29 28 0.245148
12 32 0.323389
25 26 0.262942
....

The output format is: indexA indexB cosine-distance.
In this case I used a very small dataset, so I could not really feel how fast the algorithm is, but it should also work very quickly on massive datasets.
To use this software in practice, I need to write an output parser that converts the indices back to SDF entries or molecule IDs, as sketched below.
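
A minimal sketch of such a parser could look like the following. It assumes the indices in output4 are zero-based row numbers of fps.txt, i.e. the order of the valid molecules read from the SDF (please check whether SketchSort counts from 0 or 1 before relying on this), and it takes the molecule ID from the SDF title line (_Name property). The file names are placeholders.

from rdkit import Chem

def load_ids(sdf_file):
    # keep the same filtering as the converter script so that
    # the molecule order matches the rows of fps.txt
    mols = [m for m in Chem.SDMolSupplier(sdf_file) if m is not None]
    return [m.GetProp('_Name') if m.HasProp('_Name') else 'mol%d' % i
            for i, m in enumerate(mols)]

def parse_output(result_file, ids):
    # each line of the SketchSort output is: indexA indexB cosine-distance
    pairs = []
    with open(result_file) as f:
        for line in f:
            i, j, dist = line.split()
            # assumption: indices are zero-based row numbers in fps.txt
            pairs.append((ids[int(i)], ids[int(j)], float(dist)))
    return pairs

if __name__ == '__main__':
    # 'input.sdf' and 'output4' are placeholder file names
    ids = load_ids('input.sdf')
    for id_a, id_b, dist in parse_output('output4', ids):
        print(id_a, id_b, dist)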
