Ultra fast similarity search with GPU #RDKit #chemoinformatics #postgresql-rdkit

Chemoinformaticians increasingly have to handle huge numbers of molecules, for example when searching for similar molecules in databases of millions of compounds. Last year, Schrödinger, a computational chemistry software company, released the source code of gpusimilarity, a module for fast compound similarity search.

You can find the details of the module in Schrödinger's GitHub repository: https://github.com/schrodinger/gpusimilarity

The algorithm is implemented in LiveDesign, one of their products, but the source code is available on GitHub, so I tried it out.

To build the module, you need cmake, Boost and CUDA, as described in the README. The README uses ccmake, but I used cmake because ccmake gave me an error.

$ git clone https://github.com/schrodinger/gpusimilarity.git
$ mkdir bld
$ cd bld
$ cmake -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DBOOST_ROOT=/home/iwatobipen/boost ../gpusimilarity
$ make -j4

I got some error messages about ctest, but the build succeeded.

Then I made the database. For this I used ChEMBL 26 with the RDKit cartridge.

I built the ChEMBL 26 database in PostgreSQL and installed the RDKit PostgreSQL cartridge.

Then I ran the following SQL command, which picks up molecules whose AMW (average molecular weight) is 700 or less.

\COPY (select m, molregno from rdk.mols where mol_amw(m) <= 700) to '/home/iwatobipen/dev/gpusim/bld/python/chembl26.csv' with csv delimiter ' ';

#from shell
$ gzip chembl26.csv
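
Before building the database, it can be worth a quick sanity check of the exported file. The sketch below assumes each row holds a SMILES string and an ID separated by whitespace, as the \COPY command above should produce. The function name `check_smiles_file` is my own and is not part of gpusimilarity.

```python
import gzip


def check_smiles_file(path):
    """Count well-formed "SMILES ID" rows and collect malformed line numbers."""
    n_ok = 0
    bad_lines = []
    with gzip.open(path, "rt") as fh:
        for line_no, line in enumerate(fh, start=1):
            fields = line.split()
            # Expect exactly two fields: the SMILES and the molregno
            if len(fields) == 2:
                n_ok += 1
            else:
                bad_lines.append(line_no)
    return n_ok, bad_lines


# Example usage (path taken from this post):
# n_ok, bad_lines = check_smiles_file("chembl26.csv.gz")
# print(n_ok, "good rows;", len(bad_lines), "malformed rows")
```

If the number of good rows matches the row count reported later by gpusim_createdb.py, the export most likely went as intended.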

Now everything is ready. Let’s make the database for gpusim.

#current directory is bld
$ cd python
$ time python gpusim_createdb.py chembl26.csv.gz default.fsim

# it will take a few minutes...
>Processed 1846513 rows
>Database generation finished with key: 

>real    8m52.190s
>user    8m51.743s
>sys     0m0.464s

Now I have the default.fsim file for the GPU search. Next, run the search server.

$ python gpusim_server.py default.fsim --http_interface

Starting up GPUSim Server
Utilizing 1 GPUs for calculation.
Extracting data: "default.fsim"
  loading FP  1 of 1
  loading SMI  1 of 1
  loading ID  1 of 1
  waiting for data processing threads to finish...
Running HTTP server...
  merging smiles vectors
  merging ID vectors
  finished merging vectors
Finished extracting data
Database loaded with 1846513 molecules
Database:   225 MB GPU Memory:  3840 MB
Putting graphics card data up.
Finished putting graphics card data up.
Ready for searches.
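
The log above allows a rough back-of-the-envelope check of how much GPU memory each molecule needs. The calculation below assumes that "MB" means 2**20 bytes and that the 225 MB figure covers only the per-molecule fingerprint data. Both are my assumptions, not something the log states.

```python
n_molecules = 1_846_513        # "Database loaded with 1846513 molecules"
db_bytes = 225 * 1024 ** 2     # "Database: 225 MB", assuming MB = 2**20 bytes
gpu_mb = 3840                  # "GPU Memory: 3840 MB"

bytes_per_molecule = db_bytes / n_molecules
bits_per_molecule = bytes_per_molecule * 8
gpu_fraction = 225 / gpu_mb    # share of GPU memory used by the database

print(f"{bytes_per_molecule:.1f} bytes (~{bits_per_molecule:.0f} bits) per molecule")
print(f"{gpu_fraction:.1%} of the reported GPU memory")
```

The result is close to 128 bytes per molecule, which would be consistent with a 1024-bit fingerprint, although that is only an inference from the rounded log values. The database uses roughly 6% of the GPU memory, so a much larger collection should still fit on this card.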

After starting the server with default settings, I could access ‘localhost:8080’.

As an example, I searched for tofacitinib, whose SMILES is ‘CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N’.

The web interface displayed the hits and reported ‘Search completed, time elapsed: 0.022’.
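
For readers unfamiliar with what a fingerprint similarity search computes, here is a minimal CPU sketch of a Tanimoto search over bit fingerprints. It is not gpusimilarity's code: the fingerprints are toy Python integers used as bit vectors, and `tanimoto` and `top_hits` are names I made up for illustration. As I understand it, a GPU implementation gains its speed by running this kind of comparison across the whole database in parallel.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integer bit vectors."""
    common = bin(fp_a & fp_b).count("1")   # bits set in both fingerprints
    total = bin(fp_a | fp_b).count("1")    # bits set in either fingerprint
    return common / total if total else 0.0


def top_hits(query_fp, database, n=3):
    """Return the n (id, score) pairs most similar to the query, best first."""
    scored = [(mol_id, tanimoto(query_fp, fp)) for mol_id, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Toy 8-bit fingerprints keyed by hypothetical molecule IDs
database = {
    "mol_a": 0b10110010,
    "mol_b": 0b10110000,
    "mol_c": 0b01001101,
}
query = 0b10110010

for mol_id, score in top_hits(query, database):
    print(mol_id, round(score, 2))
```

Here `mol_a` is identical to the query (score 1.0), `mol_b` shares three of four set bits (0.75), and `mol_c` shares none (0.0).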

The test was run on my PC, which has a single GeForce GTX 1650. With a more powerful GPU, the search should be even faster. The code can also return results in JSON format, which opens up many possibilities for developing your own chemoinformatics services!
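
To give an idea of how a JSON result could be consumed from your own service, here is a small client sketch using only the Python standard library. Please treat the details as assumptions: the endpoint path `/similarity_search_json` and the form field names `smiles` and `return_count` are placeholders that I have not verified here, so check gpusim_server.py for the real ones. Only the address localhost:8080 comes from this post.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8080"      # address used in this post
ENDPOINT = "/similarity_search_json"    # ASSUMPTION: verify in gpusim_server.py


def build_search_request(smiles, return_count=10):
    """Build a POST request carrying the query SMILES as form data.

    The field names "smiles" and "return_count" are assumptions.
    """
    form = urllib.parse.urlencode({"smiles": smiles, "return_count": return_count})
    return urllib.request.Request(
        BASE_URL + ENDPOINT, data=form.encode("utf-8"), method="POST"
    )


def search(smiles, return_count=10, timeout=10):
    """Send the query to a running server and decode the JSON reply."""
    request = build_search_request(smiles, return_count)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
print(request.get_method(), request.full_url)

# With the server running, the query itself would be:
# results = search("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
```

Building the request separately from sending it makes it easy to inspect what would be posted before pointing the client at a live server.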

In summary, GPU similarity search is a very useful tool for chemoinformatics.


Published by iwatobipen

I'm a medicinal chemist at a mid-sized pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.

2 thoughts on “Ultra fast similarity search with GPU #RDKit #chemoinformatics #postgresql-rdkit”

  1. Thanks for sharing this very useful script. I'm facing an issue when launching gpusim_server.py (FileNotFoundError: [WinError 2] The system cannot find the file specified). Any suggestions on this?

    1. Hi,
      Thanks for your comment. Which OS do you use? I think it is related to a Windows system problem.
      I could find some information on the web. Could you please provide more details of the error?

