FPSim2 for fast compound search #fpsim2 #rdkit #chemoinformatics

In previous posts, I described various ways to search compounds from a data source such as ChEMBL: for example, the RDKit PostgreSQL cartridge, GPUsim, which is developed by Schrödinger, and rdSubstructLibrary, which is implemented in RDKit. All of these methods are very useful.

Today I tried FPSim2, which was described on the ChEMBL-og blog. The package can be installed with conda.

$ conda install -c efelix fpsim2

If your PC has a GPU and CuPy, FPSim2 can search compounds with GPU power! And it is easy to use. OK, let’s test it.

As the first step, I made two fingerprint databases: a Morgan fingerprint DB, which is used for similarity search, and an RDKit pattern fingerprint DB, which is used for substructure search. Making a DB requires only a few lines of code.

# Retrieve SMILES from ChEMBL27 and save them to a text file (one SMILES per line).
from pychembldb import *
from rdkit import Chem
q = chembldb.query(CompoundStructure.canonical_smiles)
with open('chembl27.smi', 'w') as outf:
    for row in q:
        # keep only SMILES that RDKit can parse
        if Chem.MolFromSmiles(row[0]):
            outf.write(row[0] + '\n')

# Make the fingerprint databases.
from FPSim2.io import create_db_file

create_db_file('chembl27.smi', 'chembl27_morgan2_2048.h5', 'Morgan', {'radius': 2, 'nBits': 2048})
# no fingerprint parameters are passed here, so FPSim2's defaults apply
create_db_file('chembl27.smi', 'chembl27_rdk_pat.h5', 'RDKPatternFingerprint')

It took about 30-40 min to make the FP databases on my PC (Dell XPS 7500).

Now I'm ready to search! I used aspirin as the test molecule, the same as in the original documentation. FPSim2CudaEngine is for GPU search and FPSim2Engine is for CPU search.

from FPSim2 import FPSim2Engine
from FPSim2 import FPSim2CudaEngine
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import linecache
chemblsmi = 'chembl27.smi'

fpe = FPSim2Engine('chembl27_morgan2_2048.h5')
fpe_cuda = FPSim2CudaEngine('chembl27_morgan2_2048.h5')

query = 'CC(=O)Oc1ccccc1C(=O)O'
qmol = Chem.MolFromSmiles(query)

The FPSim2 search engine returns the IDs of hit molecules, so I defined a simple converter from ID to SMILES. In this example, the compound ID is the same as the line number in the SMILES file.

$ wc chembl27.smi
1941410 1941410 112956700 chembl27.smi ## I have 1,941,410 molecules.

def getmol(index):
    smi = linecache.getline(chemblsmi, index).rstrip()
    return smi
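
As a small usage sketch, the same linecache idea can turn a list of hits into (SMILES, score) pairs. It assumes each hit can be read as an (ID, coefficient) pair and that IDs are 1-based line numbers in the SMILES file. `hits_to_smiles`, the temporary file and the demo hits are hypothetical and only stand in for the real search results.

```python
import linecache
import os
import tempfile


def hits_to_smiles(hits, smi_path):
    """Map (mol_id, coeff) pairs to (smiles, coeff) pairs.

    Assumption: mol_id is the 1-based line number in smi_path.
    """
    out = []
    for mol_id, coeff in hits:
        line = linecache.getline(smi_path, int(mol_id)).strip()
        if line:  # getline returns '' when the line number does not exist
            out.append((line.split()[0], float(coeff)))
    return out


# Tiny stand-in for chembl27.smi (one SMILES per line).
_tmp = tempfile.NamedTemporaryFile('w', suffix='.smi', delete=False)
_tmp.write('CCO\nCC(=O)Oc1ccccc1C(=O)O\nc1ccccc1\n')
_tmp.close()

demo_hits = [(2, 1.0), (3, 0.25), (99, 0.1)]  # made-up IDs and scores
demo = hits_to_smiles(demo_hits, _tmp.name)
print(demo)
os.remove(_tmp.name)
```

The hit with ID 99 is dropped because the demo file has only three lines.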

OK, let's try a similarity search with a Tanimoto coefficient of 0.7 as the threshold.

%time results = fpe.similarity(query, 0.7) #CPU search
> Wall time: 8.58 ms
%time results = fpe_cuda.similarity(query, 0.7) #GPU search
> Wall time: 19.6 ms

In the above case, CPU search is faster than GPU search. I think this is because the CPU-to-GPU overhead takes a few msec. When I used a lower threshold, the GPU engine was faster than the CPU engine.

For example, with threshold = 0.5 and the same query molecule:
CPU: Wall time: 34 ms
GPU: Wall time: 20.9 ms
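
One likely reason the CPU time grows as the threshold drops is that fewer fingerprints can be skipped before comparison. The FPSim2 README mentions popcount bounds for this kind of pruning. The sketch below is not FPSim2's code, only a pure-Python illustration of the idea, with toy fingerprints stored as Python integers and function names made up for this example. The thresholds 0.75 and 0.5 are chosen so that the arithmetic is exact.

```python
import math


def popcount(fp):
    """Number of bits set in an integer fingerprint."""
    return bin(fp).count('1')


def tanimoto(a, b):
    """Tanimoto coefficient: shared bits / bits set in either fingerprint."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def popcount_bounds(query_count, threshold):
    """Popcount range in which a fingerprint can still reach the threshold.

    Tanimoto can never exceed min(a, b) / max(a, b) for popcounts a and b.
    """
    return math.ceil(query_count * threshold), math.floor(query_count / threshold)


def search(query, db, threshold):
    """Return (hits, number of fingerprints fully compared)."""
    low, high = popcount_bounds(popcount(query), threshold)
    hits, compared = [], 0
    for idx, fp in enumerate(db):
        if not low <= popcount(fp) <= high:
            continue  # cannot reach the threshold, so skip the comparison
        compared += 1
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((idx, score))
    return hits, compared


toy_query = 0b11110000
toy_db = [0b11110000, 0b11100000, 0b11111111, 0b00000001, 0b11111100, 0b00001111]

hits_75, n_75 = search(toy_query, toy_db, 0.75)
hits_50, n_50 = search(toy_query, toy_db, 0.5)
print(n_75, [i for i, _ in hits_75])
print(n_50, [i for i, _ in hits_50])
```

With the higher threshold, fewer of the toy fingerprints need a full comparison, which matches the direction of the CPU timings above.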

Next, I tried substructure search (SSS). FPSim2Engine provides a method for it but FPSim2CudaEngine doesn’t, so I tried SSS with the CPU engine only.

pattern_fpe = FPSim2Engine('chembl27_rdk_pat.h5')
%time results_pat = pattern_fpe.substructure(query)
> Wall time: 83.8 ms # search with single thread
%time results_pat = pattern_fpe.substructure(query, n_workers=16)
> Wall time: 33.2 ms # search with multiple threads
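
As a side note on why a pattern fingerprint DB is used for SSS: if the query is a substructure of a target, every bit set in the query's pattern fingerprint should also be set in the target's, so a bitwise subset test can screen out most molecules quickly. As I understand the FPSim2 documentation, its substructure method is this kind of screen, so hits can still contain false positives and a full atom-by-atom match is needed to confirm them. The sketch below is only an illustration with toy integer fingerprints; `may_contain` and `screen` are hypothetical names.

```python
def may_contain(query_fp, target_fp):
    """True if every bit set in query_fp is also set in target_fp."""
    return (query_fp & target_fp) == query_fp


def screen(query_fp, db):
    """Indices of fingerprints that pass the subset screen."""
    return [idx for idx, fp in enumerate(db) if may_contain(query_fp, fp)]


toy_db = [0b1110, 0b0110, 0b1011, 0b0111]
candidates = screen(0b0110, toy_db)
print(candidates)
```

Only the candidates that pass the screen would need the expensive substructure match.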

As the timings show, using multiple cores improves search performance. Finally, I tried RDKit's rdSubstructLibrary.

import pickle
library = pickle.load(open('chembl27_ssslib.pkl', 'rb'))
%time res = library.GetMatches(qmol)
> Wall time: 39.9 ms

rdSubstructLibrary performs the search with multiple threads, so its performance seems almost the same as FPSim2 SSS.

In brief, RDKit's rdSubstructLibrary and FPSim2 are both useful for compound search. FPSim2 can store molecule IDs in the FP DB, which seems like a good point; however, the current version accepts only integers. I’m happy to be able to use many tools for compound search with very few lines of code.

Today’s experiments are uploaded to my gist.


Published by iwatobipen

I'm a medicinal chemist at a mid-size pharmaceutical company. I love chemoinformatics, coding, organic synthesis, and my family.
