Ultra fast clustering script with RDKit #RDKit

Some years ago, I got very useful information for molecular clustering. Bayon is ultra fast clustering tool. The author made not only Japanese-tutorial but also English-tutorial.
This tools is easy to use but to use bayon in chemoinformatics area, user needs data preparation.

I wrote simple script that converts smiles to bayon input format and then perform clustering and parse output data.
Main code is following.

import argparse
import subprocess
import pickle
import os
from rdkit import Chem
from rdkit.Chem import AllChem

parser = argparse.ArgumentParser( "Fast clustering for chemoinformatics" )
parser.add_argument( "input" )
parser.add_argument( "n", help = "the number of clusters" )
#parser.add_argument( "-l", help = "limit value of cluster bisection" )
#parser.add_argument( "-p", help = "output similarity points" )
parser.add_argument( "--output", default = "clustered.tsv" )
parser.add_argument( "-c", help = "filename of centroid", default = "centroid.tsv" )

def smi2fp( molid, smiles ):
    mol = Chem.MolFromSmiles( smiles )
    onbits = AllChem.GetMorganFingerprintAsBitVect( mol, 2 ).GetOnBits()
    row = molid
    for bit in onbits:
        row += "\tFP_{}\t1.0".format( bit )
    row += "\n"
    return row

if __name__ == "__main__":
    args = parser.parse_args()
    inputf = open( args.input, "r" )
    n = args.n
    c = args.c
    output = args.output
    tempf = open( "fp.tsv", "w" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        tempf.write( smi2fp( line[0], line[1] ))
    res = subprocess.call( "time bayon -p -c {} -n {} fp.tsv > {}".format( c, n, output ), shell=True )

    #parse results
    parsefile = open( output.split(".")[0]+"_parse.tsv", "w" )
    inputf = open( output, "r" )
    for line in inputf:
        line = line.rstrip().split( "\t" )
        cluster_id = line[0]
        for i in range( 1, len( line )-1, 2 ) :
            molid = line[ i ]
            point = line[ i + 1 ]
            parsefile.write("{}\t{}\tCLS_ID_{}\n".format( molid, point, cluster_id ))

The code perform clustering molecules and output cluster with point ( similarity ) and parse default bayon format.

I ran the code with rdkit cdk2.sdf data.
47 compound clustered into 5 clusters within 0.006s!

iwatobipen$ python fastcluster.py cdk2.smi 5

real	0m0.015s
user	0m0.006s
sys	0m0.002s

It seems work fine. 😉


Calculate USRCAT with RDKit #RDKit

Some years ago, I posted blog about USRCAT.
USRCAT is shape based method like ROCS. And it works very fast. The code was freely available but to use the code, user need to install it.
But as you know, new version of RDKit implements this function! That is good news isn’t it.
I tried the function just now.
Source code is following.

import os
import seaborn as sns
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem.rdMolDescriptors import GetUSRScore, GetUSRCAT
from rdkit.Chem import DataStructs
print( rdBase.rdkitVersion )
path = os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )

mols = [ mol for mol in Chem.SDMolSupplier( path ) ]
for mol in mols:
    AllChem.EmbedMolecule( mol, 
                           useExpTorsionAnglePrefs = True,
                           useBasicKnowledge = True )
usrcats = [ GetUSRCAT( mol ) for mol in mols ]
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol, 2 ) for mol in mols ]

data = { "tanimoto":[], "usrscore":[] }

for i in range( len( usrcats )):
    for j in range( i ):
        tc = DataStructs.TanimotoSimilarity( fps[ i ], fps[ j ] )
        score = GetUSRScore( usrcats[ i ], usrcats[ j ] )
        data["tanimoto"].append( tc )
        data["usrscore"].append( score )
        print( score, tc )
df = pd.DataFrame( data )

fig = sns.pairplot( df )
fig.savefig( 'plot.png' )

Run the code.

iwatobipen$ python usrcattest.py
# output
0.4878222403055059 0.46296296296296297
0.2983740604270427 0.48148148148148145
0.36022943735904756 0.5660377358490566
0.3480531986117265 0.5
0.3593106395905704 0.6595744680851063
0.25662588527525304 0.6122448979591837
0.18452571918677163 0.46296296296296297
0.18534407651655047 0.5769230769230769
0.1698894448811921 0.5660377358490566
0.19927998441539707 0.6956521739130435
0.2052241644475582 0.15714285714285714
0.21930710455068858 0.10526315789473684
0.21800341857284924 0.1038961038961039

Tanimoto coeff and USRScore showed different score ( 2D vs 3D pharmacophore ). I think USRScore provides new way to estimate molecular similarity.

RDKit is really cool toolkit. I love it. 😉

Draw high quality molecular image in RDKit #rdkit

Recently, I want to draw high quality image molecule using RDKit. Older version of RDKit png image is not enough for me.
I found the solution in RDKit discuss. The discussion recommended to install cairocffi.
I installed cairocffi via conda.

iwatobipen$ conda install -c conda-forge cairocffi

But… Result is not enough for me. ( this case is my Mac environment. Linux environment worked fine. )

Next I tried to conversion of SVG to PNG. Fortunately, I found cairosvg and the package can convet svg to PNG.
Following code is example.
And I found tips for RDKitters!.
DrawingOptions can modify the drawing settings. For example, font size, bond width etc.

import argparse
import cairosvg
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import DrawingOptions

parser = argparse.ArgumentParser( 'smiles to png inmage' )
parser.add_argument( 'smiles' )
parser.add_argument( '--filename', default="mol." )
DrawingOptions.atomLabelFontSize = 55
DrawingOptions.dotsPerAngstrom = 100
DrawingOptions.bondLineWidth = 3.0 

parser.add_argument( 'smiles' )

if __name__=='__main__':
    param = parser.parse_args()
    smiles = param.smiles
    fname = param.filename
    mol = Chem.MolFromSmiles( smiles )
    Draw.MolToFile( mol, fname+"png" )
    Draw.MolToFile( mol, "temp.svg" )
    cairosvg.svg2png( url='./temp.svg', write_to= "svg_"+fname+"png" )

The code generate png image not only directly from smiles but also from svg.
Here is result.

iwatobipen$ python drawing.py smiles "CC1=CC=C(C=C1)C2=CC(=NN2C3=CC=C(C=C3)S(=O)(=O)N)C(F)(F)F"

Direct png.


Image from SVG is high quality I think. 😉

QED calculation on RDKit 2017.09 #RDKit

QED (quantitative estimate of drug-likeness ) is an one of score of drug likeness reported by Hopkins group.

The author provided QED calculator for pipeline pilot. So QED could not calculate without pipeline pilot.
But, now we can calculate QED by using RDKit!
RDKit 201709 was implemented QED descriptor. Seems good, let’s use the function.
It is very simple. Just call qed!. I used dataset the same as yesterday.

import os
from rdkit.Chem import rdBase, RDConfig
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Descriptors import qed
print( rdBase.rdkitVersion )

sdfpath = os.path.join( RDConfig.RDDocsDir, "Book/data/cdk2.sdf" )
mols = [ m for m in Chem.SDMolSupplier( sdfpath ) if m != None ]
df = PandasTools.LoadSDF( sdfpath )
print( len( mols ))

df.head( 2 )

df[ "QED" ] =  df.ROMol.apply( qed )
df.head(2 )

from rdkit.Chem import QED
for mol in mols:
    print( QED.properties( mol ) )

It is easy isn’t it ?
I pushed sample code to my repository.
By the way, original QED score was based on ChEMBL ver 09. So, dataset is old. Does the score show difference when we use new version of ChEMBL ? 😉

New function of RDKit 2017.09 #RDKit

Recently I updated my rdkit env from 201703 to 201709 by using conda.
New version of rdkit was implemented cool function named rdRGroupDeompositon.
The function enable us to render RGroups as DataFrame.
I tried to visualize cdk2.sdf dataset.
Code that I wrote is bellow.(using jupyter notebook)

from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from rdkit.Chem import PandasTools
from rdkit.Chem import rdBase
from rdkit.Chem import RDConfig
from rdkit.Chem.Draw import IPythonConsole
import os
base = RDConfig.RDDocsDir
datapath = os.path.join( base, "Book/data/cdk2.sdf")
mols = [ mol for mol in Chem.SDMolSupplier( datapath ) if mol != None ]
# mol object that has 3D conformer information did not work well. So I remove the conformation info.
for m in mols: tmp = m.RemoveAllConformers()
# define core to RG decomposition.
core = Chem.MolFromSmiles('[nH]1cnc2cncnc21')
from rdkit.Chem import rdRGroupDecomposition
tables = PandasTools.LoadSDF( datapath )
rg = rdRGroupDecomposition.RGroupDecomposition( core )
for mol in mols[:5]:
    rg.Add( mol )
# Do RG deconpositon.

Then visualize RGdecomp result.

import pandas as pd
modlf = PandasTools.LoadSDF( datapath )
frame = pd.DataFrame( rg.GetRGroupsAsColumns() )

Result is following image. 😉
New version of RDKit is cool & powerful tool for chemoinformatics. I really respect the developer of rdkit.

Create MMPDB ( matched molecular pair )!

Matched molecular pair analysis is very common method to analyze SAR for medicinal chemists. There are lots of publications about it and applications in these area.
I often use rdkit/Contrib/mmpa to make my own MMP dataset.
The origin of the algorithm is described in following URL.

Yesterday, good news announced by @RDKit_org. It is release the package that can make MMPDB.
I tried to use the package immediately.
This package is provided from github repo. And to use the package, I need to install apsw at first. APSW can install by using conda.
And the install mmpdb by python script.

iwatobipen$ conda install -c conda-forge apsw
iwatobipen$ git clone https://github.com/rdkit/mmpdb.git
iwatobipen$ cd mmpdb
iwatobipen$ python setup.py install

After success of installation, I could found mmpdb command in terminal.
I used CYP3A4 inhibition data from ChEMBL for the test.
I prepared two files, one has smiles and id, and the another has id and ic50 value.
* means missing value. In the following case, I provided single property ( IC50 ) but the package can handle multiple properties. If reader who is interested the package, please show more details by using mmpdb –help command.

iwatobipen$ head -n 10 chembl_cyp3a4.csv 
iwatobipen$ head -n 10 prop.csv 
924282	*
605	*
1698776	*
59721	19952.62
759749	2511.89
819161	2511.89

mmdb fragment has –cut-smarts option.
It seems attractive for me! 😉
–cut-smarts SMARTS alternate SMARTS pattern to use for cutting (default:
[CH2])]’), or use one of: ‘default’,
‘cut_AlkylChains’, ‘cut_Amides’, ‘cut_all’,
‘exocyclic’, ‘exocyclic_NoMethyl’
Next step, make mmpdb and join the property to db.

# run fragmentation and my input file has header, delimiter is comma ( default is white space ). Output file is cyp3a4.fragments.
# Each line of inputfile must be unique!
iwatobipen$ mmpdb fragment chembl_cyp3a4.csv --has-header --delimiter 'comma' -o cyp3a4.fragments
# rung indexing with fragmented file and create a mmpdb. 
iwatobipen$ mmpdb index cyp3a4.fragments -o cyp3a4.mmpdb

OK I got cyp3a4.mmpdb file. (sqlite3 format)
Add properties to a DB.
Type following command.

iwatobipen$ mmpdb loadprops -p prop.csv cyp3a4.mmpdb
Using dataset: MMPs from 'cyp3a4.fragments'
Reading properties from 'prop.csv'
Read 1 properties for 17143 compounds from 'prop.csv'
5944 compounds from 'prop.csv' are not in the dataset at 'cyp3a4.mmpdb'
Imported 5586 'STANDARD_VALUE' records (5586 new, 0 updated).
Generated 83759 rule statistics (1329408 rule environments, 1 properties)
Number of rule statistics added: 83759 updated: 0 deleted: 0
Loaded all properties and re-computed all rule statistics.

Ready to use DB. Let’s play with the DB.
Identify possible transforms.

iwatobipen$ mmpdb transform --smiles 'c1ccc(O)cc1' cyp3a4.mmpdb --min-pair 10 -o transfom_res.txt
iwatobipen$ head -n3 transfom_res.txt 
1	CC(=O)NCCO	[*:1]c1ccccc1	[*:1]CCNC(C)=O	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1049493	14	3632	5313.6	-0.71409	-0.033683	-6279.7	498.81	2190.5	7363.4	12530	-2.5576	0.023849
2	CC(C)CO	[*:1]c1ccccc1	[*:1]CC(C)C	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1026671	20	7390.7	8556.1	-1.1253	-0.082107	-6503.9	-0	8666.3	13903	23534	-3.863	0.0010478

Output file has information of transformation with statistics values.
And the db can use to make a prediction.
Following command can generate two files with prefix CYP3A-.

iwatobipen$ mmpdb predict --reference 'c1ccc(O)cc1' --smiles 'c1ccccc1' cyp3a4.mmpdb  -p STANDARD_VALUE --save-details --prefix CYP3A
iwatobipen$ head -n 3 CYP3A_pairs.txt
rule_environment_id	from_smiles	to_smiles	radius	fingerprint	lhs_public_id	rhs_public_id	lhs_smiles	rhs_smiles	lhs_value	rhs_value	delta
868610	[*:1]O	[*:1][H]	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	1016823	839661	C[C@]12CC[C@@H]3[C@H](CC[C@H]4C[C@@H](O)CC[C@@]43C)[C@@H]1CC[C@H]2C(=O)CO	CC(=O)[C@@H]1CC[C@H]2[C@H]3CC[C@H]4C[C@@H](O)CC[C@]4(C)[C@@H]3CC[C@@]21C	1000	15849	14849
868610	[*:1]O	[*:1][H]	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	3666	47209	O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12	O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12	15849	5011.9	-10837
iwatobipen$ head -n 3 CYP3A_rules.txt 
rule_environment_statistics_id	rule_id	rule_environment_id	radius	fingerprint	from_smiles	to_smiles	count	avg	std	kurtosis	skewness	min	q1	median	q3	max	paired_t	p_value
28699	143276	868610	0	59SlQURkWt98BOD1VlKTGRkiqFDbG6JVkeTJ3ex3bOA	[*:1]O	[*:1][H]	16	-587.88	14102	-0.47579	-0.065761	-28460	-8991.5	-3247.8	10238	23962	0.16674	0.8698
54091	143276	1140189	1	tLP3hvftAkp3EUY+MHSruGd0iZ/pu5nwnEwNA+NiAh8	[*:1]O	[*:1][H]	15	-1617	13962	-0.25757	-0.18897	-28460	-9534.4	-4646	7271.1	23962	0.44855	0.66062

It is worth that the package ca handle not only structure based information but also properties.
I learned a lot of things from the source code.
RDKit org is cool community!
I pushed my code to my repo.

original repo URL is
Do not miss it!

3d conformer fingerprint calculation using RDKit # RDKit

Recently, attractive article was published in ACS journal.
The article describes how to calculate 3D structure based fingerprint and compare some finger prints that are well known in these area.
New method called “E3FP” is algorithm to calculate 3D conformer fingerprint like Extended Connectivity Fingerprint(ECFP). E3FP encodes information only atoms that are connected but also atoms that are not connected.

The author showed several examples. Very similar in 2D but not similar in 3D and vice versa.
Also compare E3FP similarity and ROCS score( TANIMOTO COMBO ) and showed good performance.
I was interested in the fingerprint. Fortunately, the author published the code in Anaconda cloud!!!!!!!
Install it and use it ASAP. ;-D
I am mac user, so installation is very very very easy! Just type.
I found some tips to use the package.
At first, molecules need _Name property to perform calculation.
Second, mol_from_sdf can read molecule from sdf but the function can not read sdf that has multiple molecules. So, I recommend to use molecule list instead of SDF.

conda install -c sdaxen sdaxen_python_utilities
conda install -c keiserlab e3fp

I used CDK2.sdf for test.
E3FP calculates unfolded finger print. But it can convert folded fingerprint and rdkit fingerprint using flod and to_rdkit function.

%matplotlib inline
import pandas as pd
import numpy as np
from rdkit import Chem
from e3fp.fingerprint.generate import fp, fprints_dict_from_mol
from e3fp.conformer.generate import generate_conformers
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
# this sdf has 3D conformer, so I do not need to generate 3D conf.
mols = [ mol for mol in Chem.SDMolSupplier( "cdk2.sdf", removeHs=False ) ]
fpdicts = [ fprints_dict_from_mol( mol ) for mol in mols ]
# get e3fp fingerprint
# if molecule has multiple conformers the function will generate multiple fingerprints.
fps = [ fp[5][0] for fp in fpdicts]
# convert to rdkit fp from e3fp fingerprint
binfp = [ fp.fold().to_rdkit() for fp in fps ]
# getmorganfp
morganfp = [ AllChem.GetMorganFingerprintAsBitVect(mol,2) for mol in mols ]

# calculate pair wise TC
df = {"MOLI":[], "MOLJ":[], "E3FPTC":[], "MORGANTC":[],"pairidx":[]}
for i in range( len(binfp) ):
    for j in range( i ):
        e3fpTC = DataStructs.TanimotoSimilarity( binfp[i], binfp[j] )
        morganTC = DataStructs.TanimotoSimilarity( morganfp[i], morganfp[j] )
        moli = mols[i].GetProp("_Name")
        molj = mols[j].GetProp("_Name")
        df["MOLI"].append( moli )
        df["MOLJ"].append( molj )
        df["E3FPTC"].append( e3fpTC )
        df["MORGANTC"].append( morganTC )
        df["pairidx"].append( str(i)+"_vs_"+str(j) )

The method is fast and easy to use. Bottle neck is how to generate suitable conformer(s).
Readers who interested in the package, please check the authors article.

I pushed my sample code to my repo.