Install rdkit to CentOS7

I use OSX or unbuntu for coding usually.
But, some case CentOS is needed to make some services.
So, I checked the way to install rdkit to centos7.
I used docker for my test, because docker can make virtual environment in my mac book.
First I pull the Cent OS ver7.

iwatobipen$ docker pull centos:7

Then run the image.

iwatobipen$ docker run -it centos:7

OK I can log in centos7.😉
Next, I installed pyenv for install anaconda.
I used dnf instead of yum. So, I installed dnf using yum.
Command is following.

[root@1db835ac61b4 ~]# mkdir rdkittest & cd rdkittest
[root@1db835ac61b4 rdkittest]# yum install epel-release
[root@1db835ac61b4 rdkittest]# yum install dnf

OK dnf is installed.
But I couldn’t update dnf, so I referred following url and updated dnf.
http://stackoverflow.com/questions/32541196/i-attempted-to-enable-the-epel-repo-on-my-fedora-22-machine-and-i-broke-it-now

[root@1db835ac61b4 rdkittest]# rm -rf /etc/yum.repos.d/epel*
[root@1db835ac61b4 rdkittest]# dnf clear all
[root@1db835ac61b4 rdkittest]# dnf install epel-release
[root@1db835ac61b4 rdkittest]# dnf upgrade

Next, Install pyenv.

[root@1db835ac61b4 rdkittest]# dnf install -y gcc zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel
[root@1db835ac61b4 rdkittest]# git clone https://github.com/yyuu/pyenv.git ~/.pyenv
[root@1db835ac61b4 rdkittest]# echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
[root@1db835ac61b4 rdkittest]# echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
[root@1db835ac61b4 rdkittest]# echo 'eval "$(pyenv init -)"' >> ~/.bash_profile

Done! I just installed pyenv. So install anaconda ASAP.😉
And install rdkit!

[root@1db835ac61b4 rdkit]# pyenv install anaconda3-4.1.0
[root@1db835ac61b4 rdkit]# pyenv local anaconda3-4.1.0
[root@1db835ac61b4 rdkit]# conda install -c rdkit rdkit=2016.03.4

Finished! Now I got rdkit in CentOS7.
Check it.

[root@1db835ac61b4 rdkittest]# ipython
Python 3.5.1 |Anaconda 4.1.0 (64-bit)| (default, Jun 15 2016, 15:32:45)
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from rdkit import Chem

In [2]: from rdkit import rdBase

In [3]: rdBase.rdkitVersion
Out[3]: '2016.03.4'

Works fine.

Anaconda is very useful tool for every one I think!

BTW why I install dnf ?

The answer is yum can handle only python2, but dnf can handle python3.

https://www.linux.com/learn/what-you-need-know-about-fedoras-switch-yum-dnf

 

 

 

Visualize chemical space using Knime rdkit node

Usually I use python for analyse, visualize chemical space. Because, I love coding.😉
I know, work flow tool is useful solution to do that.

So, I tried to plot chemical space using Knime. Knime is one of famous work flow tool and lots of nodes are developed.

I made very simple work flow to do PCA. My work flow is following.
workflow

At first, the flow read smiles strings from excel file. And convert smies to RDKit molecule.
Then calculate morgan FP using RDKit Finger printer. You know, the node can also calculate various FP like MACCS, topological etc.
Next, extend bit vector to 1024 bit columns.
And do PCA and make scatter plot. The plotting node is implemented in Erlwood chemoinformatics node.
When I call view scatter plot, I got following dynamic scatter plot.
scatter plot
The node can select each columns easily and user can set color or size own criteria. And visualize structure as label. Wow cool!

And I set activity cliff viewer.
The node needs two parameter, one of smiles and another is distance matrix of similarity.
N x N distance matrix is generated using distance matrix calculate node.
Finally run the flow, I got network view of activity cliffs.
Screen Shot 2016-08-24 at 11.28.47 PM
Edges that are colored green are indicated activity cliffs. ( in my case delta pIC50 >= 1.0 and similarity >= 0.5 )
Hmm but the image seems to difficult to understand SAR. Cytoscape is suitable tool to visualize network.
Mistake ???

Activity cliffs table seems good.

Knime is powerful tool for medchem.

Scoring 3D diversity using RDKit #RDKit

Recently importance of 3D character of molecules are increasing.
If I design a libraries, I want to estimate not only 2D, but also 3D diversity.
Fortunately RDKit implemented function for characterize the 3D character of molecules named ‘plane of best fit’ (PBF).
You can call this function from rdkit/Contrib/PBF folder.😉 Great!!!
And reference of PBF is following.
Plane of Best Fit: A Novel Method to Characterize the Three-Dimensionality of Molecules
http://pubs.acs.org/doi/abs/10.1021/ci300293f

In the paper, the authors compare with relationship between PBFscore and PMI( principal moment of inertia ). I’m interested in the algorithm, so I calculated two scores.
At first, I can calculate PBF by using RDKit Contrib/PBF code. But code for calculate PMI is not available. So, I write calculator.
My code is following. calc_pmi.py.

import sys
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from numpy import linalg
from sklearn.preprocessing import normalize
import numpy as np

def inertia( mol,confId=-1,  principal = True, moments_only = True  ):
    numatoms = mol.GetNumAtoms()
    conf = mol.GetConformer( confId )
    if not conf.Is3D():
        return 0
    #get cordinate of each atoms
    pts =  np.array( [ list( conf.GetAtomPosition(atmidx) ) for atmidx in range( numatoms ) ]  )
    atoms = [ atom for atom in mol.GetAtoms() ]
    mass = Descriptors.MolWt( mol )
    #get center of mass
    center_of_mass = np.array(np.sum( atoms[i].GetMass() * pts[i] for i in range( numatoms ))) /mass

    inertia = np.zeros( (3,3) )
    for atmidx in range( numatoms ):
        cmx,cmy,cmz = pts[ atmidx ] - center_of_mass
        inertia += -atoms[ atmidx ].GetMass() / mass * np.matrix(
                [[ 0, -cmz, cmy ],
                 [ cmz, 0, -cmx ],
                 [ -cmy, cmx, 0 ]]
                ) ** 2
    if principal:
        if moments_only:
            return np.linalg.eigvalsh( inertia )
        else:
            return np.linalg.eigh( inertia )
    else:
        return inertia
if __name__=='__main__':
    mols = [ mol for mol in Chem.SDMolSupplier( sys.argv[1], removeHs=False )  ]
    for mol in mols:
        I = inertia(mol)
        npr1, npr2 = I[0]/I[2], I[1]/I[2]
        print( npr1, npr2 )

OK, let’s start calculation!
I picked up sample from one molecule from the reference.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
IPythonConsole.ipython_useSVG=True
# pbf imported from rdkit/Contrib/PBF
import pbf
import calc_pmi

#(1) mol in the ref.
mol = Chem.MolFromSmiles( 'c1cc(O)c(C(=O)O)cc1N=Nc1ccc(O)c(C(=O)O)c1' )
hmol = Chem.AddHs(mol)
AllChem.EmbedMolecule(hmol,useExpTorsionAnglePrefs=True ,useBasicKnowledge=True)
AllChem.UFFOptimizeMolecule(hmol)
rhmol = Chem.RemoveHs(hmol)

hmolI = calc_pmi.inertia(hmol)
rmolI = calc_pmi.inertia(rhmol)

Compare score with results in the reference.

#PBF score
print(pbf.PBFRD(hmol), pbf.PBFRD(rhmol))
#out
#0.140618602774 0.0769181645171

#NPR1, NPR2
hmolI = calc_pmi.inertia(hmol)
rmolI = calc_pmi.inertia(rhmol)
print(hmolI[0]/hmolI[2], hmolI[1]/hmolI[2])
print(rmolI[0]/rmolI[2], rmolI[1]/rmolI[2])
# out
#with hydrogen
#0.128978367365 0.876583937787
#without hydorgen
#0.125631084212 0.879566191929

In the reference data is following.

NPR1 = 0.089 ( 0.128978367365, 0.125631084212 )
NPR2 = 0.910 ( 0.876583937787, 0.879566191929 )
PBF = 0.00313 ( 0.140618602774 , 0.0769181645171)
*data which surrounded by parentheses is calculated data with hydrogen, and without hydrogen in my code.

My data indicate that PMI does not see depend on existence of hydrogen, but PBF score is depended on hydrogen.
And data without hydrogen seems good correlation of literature data. I can’t understand the reason.😦
I need to study more deep in the method.
BTW RDKit has a lots of function for chemoinformatics. It’s Swiss knife for chemoinformatics!

I referred following URL.
The URL gave good suggestion for me.
https://github.com/wrwrwr/mmoi-calc

Also, You can calculate PMI using Knime vernalis community nodes!!!
Wow! that’s nice for OSS! That’s so cool!😉

Screen Shot 2016-08-16 at 11.19.13 PM

Make virtual machine for chemoinformatics #RDKit

Recently stable version of docker for mac is released. It’s good news for me.😉
https://www.docker.com/products/docker
I used boot2docker before but, now I switched docker for mac. Because it’s easy to install and share the file with host OS.
Of course, I installed docker for mac !

Virtual machine is one of useful way to test my code and keep native environment clean.
Today I wrote sample Docker file for chemoinformatics.
The virtual machine can run rdkit and keras. It’s means the machine can do deep learning with rdkit.
My Docker file is following.
https://github.com/iwatobipen/docker4chmoinfo/blob/master/docker4chemoinfo/Dockerfile
Linux environment can use new version of RDKit.

FROM ubuntu:16.04 
MAINTAINER iwatobipen <seritaka@gmail.com>
RUN apt-get -y update && apt-get -y install wget
RUN apt-get -y install bzip2
RUN apt-get -y install git-all
RUN apt-get -y install libfreetype6-dev libxft-dev
RUN wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
RUN bash ./Anaconda3-4.1.1-Linux-x86_64.sh -b -p /opt/conda 
RUN rm ./Anaconda3-4.1.1-Linux-x86_64.sh
ENV PATH /bin:/usr/bin:/opt/conda/bin:PATH
RUN conda install -y -c rdkit rdkit=2016.03.3
RUN pip install seaborn
RUN conda install -y -c conda-forge keras=1.0.7
CMD /bin/bash

The run the image use docker run command.

 docker run -i -t iwatobipen/chemoinfo_test /bin/bash

OK check the VM.

root@b33953263220:/# ipython
Python 3.5.2 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:53:06) 
Type "copyright", "credits" or "license" for more information.

IPython 4.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from rdkit import Chem
In [2]: from rdkit import rdBase  
In [3]: rdBase.rdkitVersion
Out[3]: '2016.03.3'

In [4]: import keras
Using Theano backend.

In [5]: 

Hmm, my vm works well.
If user run the machine, it already to do deep learning.
My dockerhub repo is following.
https://hub.docker.com/r/iwatobipen/chemoinfo_test/

Enjoy.

Make polar plots of exit vector about di-amine molecules #RDKit

I posted blog about compare exit vector distance of two molecules some days ago.
I used cartesian system to calculate distance of two molecules before. Today I tried to make polar plot using RDKit and matplotlib.😉

I used 3d amine dataset and made my old code.

At first I make scatter plot of similarity and exit vector distance.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdMolTransforms
from rdkit.Chem import DataStructs
import numpy as np
from rdkit.Chem import Draw
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
sdf = Chem.SDMolSupplier( "3d_amine.sdf", removeHs=False )
mols = [ m for m in sdf ]
fps = [ AllChem.GetMorganFingerprintAsBitVect( mol,2 ) for mol in mols ]
mol = mols[1]
mol.GetSubstructMatch( Chem.MolFromSmarts( "N[H]" ) )
atoms = [atom for atom in mol.GetAtoms()]

def getev( mol ):
    if mol.GetNumConformers() >= 1:
        matches = mol.GetSubstructMatches( Chem.MolFromSmarts( "N[H]" ) )
        conf = mol.GetConformer()
        theta = rdMolTransforms.GetDihedralDeg( conf,
                                                matches[0][1],
                                                matches[0][0],
                                                matches[1][0],
                                                matches[1][1]  )
        temp_phi1 = 180 - rdMolTransforms.GetAngleDeg(conf,
                                           matches[1][0],
                                           matches[0][0],
                                           matches[0][1]
                                          )
        temp_phi2 = 180 - rdMolTransforms.GetAngleDeg(conf,
                                           matches[0][0],
                                           matches[1][0],
                                           matches[1][1]
                                          )
        if temp_phi1 >= temp_phi2:
            phi1 = temp_phi1
            phi2 = temp_phi2
        else:
            phi1 = temp_phi2
            phi2 = temp_phi1
            
        r = rdMolTransforms.GetBondLength( conf, matches[0][0], matches[1][0] )
        return theta, phi1, phi2, r
    else:
        print( "No conformer!" )


def transform_cartegian( theta, phi1, phi2, r ):
    theta = np.deg2rad( theta )
    phi1 = np.deg2rad( phi1 )
    phi2 = np.deg2rad( phi2 )
    x = np.sin( theta ) * np.sin( phi1 ) * np.sin( phi2 ) *r
    y = np.sin( theta ) * np.sin( phi1 ) * np.cos( phi2 ) *r
    z = np.sin( theta ) * np.cos( phi1 ) *r
    t = np.cos( theta ) *r
    return x, y, z, t

def get_dist(v1,v2):
    v1 = np.asarray( v1 )
    v2 = np.asarray( v2 )
    delta =  v1 - v2
    d = np.linalg.norm( delta )
    return d

def calc_distance( mol1, mol2 ):
    theta1, phi11, phi21, r1 = getev( mol1 )
    theta2, phi12, phi22, r2 = getev( mol2 )
    cart1 = transform_cartegian( theta1, phi11, phi21, r1 )
    cart2 = transform_cartegian( theta2, phi12, phi22, r2 )
    d = get_dist( cart1, cart2 )
    return d
df = pd.DataFrame(dataset, columns=['dist','sim'])
g=sns.lmplot('dist','sim',df, fit_reg=False)

scatter1

OK, next I made polar plot.

datasets = [ getev(mol) for mol in mols ]
datadict = { 'theta': [data[0] for data in datasets ] , 
            'phi1': [data[1] for data in datasets ],
            'phi2': [data[2] for data in datasets ],
            'r': [data[3] for data in datasets ] }
df = pd.DataFrame(datadict)
df.head()

I got dataframe that has phi, r, theta.

idx phi1 phi2 r theta
0 95.144560 73.623772 3.146983 -72.921352
1 96.238121 68.160414 3.413222 154.397291
2 123.369913 79.976269 4.396864 69.708195
3 79.441975 79.441975 2.953801 180.000000
4 72.223406 70.292055 3.744892 -109.089956

Function of subplot can make multi plots, the three digit number indicates that number of row, number of column and index. So following code, I make three polar plot in one row and three columns.
‘polar = True’ makes polar plot.

polarplot = plt.subplot(131,polar=True)
polarplot.scatter( df.theta, df.r,color='r' )
plt.legend(['theta'],bbox_to_anchor=(1.00, 0))

polarplot = plt.subplot(132,polar=True)
polarplot.scatter( df.phi1, df.r, color='g' )
plt.legend(['phi1'],bbox_to_anchor=(1.00, 0))

polarplot = plt.subplot(133,polar=True)
polarplot.scatter( df.phi2, df.r, color='b' )
plt.legend(['phi2'],bbox_to_anchor=(1.00, 0))
plt.show()

I got following image.
polar plot

Some example of molecules that used in the code….
mols

The plot is interesting for me!
I pushed the snippet to my repo.
https://github.com/iwatobipen/rdkit_evp/blob/master/samplescript/exit_vector_radialplot.ipynb

Natural Product likenes Score

Some years ago, large amount of molecules produced by using palladium catalysed cross coupling reaction, like suzuki-miyaura, negishi, stille, etc.
It showed great impact for medicinal chemistry but these reaction tend to produce flat molecules like low fsp3 score.
Now I often read the word ‘Escape from flat land, sp3 rich molecules, 3D diversity …’.
Natural products shows complex, rich sp3 structure.
So, natural product likeness is one of the score for estimation of the library.
Fortunately, we can get NPscore using RDKit.😉
RDKit implemented following algorithms and easy to use it.
Natural Product-likeness Score and Its Application for Prioritization of Compound Libraries
Peter Ertl, Silvio Roggo, and Ansgar Schuffenhauer
Journal of Chemical Information and Modeling, 48, 68-74 (2008)

Lest try it. At first, I got dataset from NCI.
https://wiki.nci.nih.gov/display/NCIDTPdata/Compound+Sets
Diversity set 5, and Natural products set as SDF. And convert SDF to smiles.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import rdBase
div5 = [ m for m in Chem.SDMolSupplier('Div5_2DStructures_Oct2014.sdf') if m != None ]
nat = [ m for m in Chem.SDMolSupplier( 'NAtProd4.sdf' ) if m != None ]
f = open(  'dataset.txt', 'w' )
for m in div5:
    name = m.GetProp( 'NSC' )
    smi = Chem.MolToSmiles( m )
    f.write(  smi + ' ' + name + ' DIV\n' )
for m in nat:
    name = m.GetProp( 'NSC' )
    smi = Chem.MolToSmiles( m )
    f.write(  smi + ' ' + name + ' NAT\n' )
f.close()

OK, next I run npscore.py and merge resultdata.

NP_Score iwatobipen$ python npscorer.py dataset.txt > res.txt 
import seaborn as sns
import pandas as pd
df1 = pd.read_table( 'dataset.txt', sep=' ', names=['smi','nsc','cat'] )
df2 = pd.read_table( 'res.txt', sep='\t', names=['smi','nsc','np'] )
df=df1.join(df2.np)
sns.distplot( df[df.cat == 'DIV'].np)
sns.distplot( df[df.cat == 'NAT'].np)

I got following image.
NP set showed higher score than Div5 set.

screen shot

Data summary is following.

count    1593.000000
mean       -0.654590
std         1.035716
min        -3.258000
25%        -1.325000
50%        -0.753000
75%        -0.162000
max         4.054000
Name: np, dtype: float64
In [39]:

df[df.cat=='NAT'].np.describe()
Out[39]:
count    419.000000
mean       1.594697
std        1.012731
min       -1.541000
25%        0.911500
50%        1.485000
75%        2.228000
max        4.054000
Name: np, dtype: float64

plot data using RDKit-PandasTools

I often use SDMolSupplier method to read SDF. But, PandasTool is another useful way to read SDF.
But, to handle property data, I need to convert data to float.
Today I read a major supplier’s SDF. And plot data using seaborn.

%matplotlib inline
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit.Chem import PandasTools

IPythonConsole.ipython_useSVG=True
nx = [ mol for mol in Chem.SDMolSupplier( 'dataset.sdf' ) ]
print( m = nx[0][n for n in m.GetPropNames()] )

[out]
[------------
 'CLogP',
 ------------
 'FSP3',
 -------------
'MolWeight',
 -------------
 'Purity_percent',
 'ROT',
-------------
 'Salt',
 'TPSA']

nx = PandasTools.LoadSDF( 'dataset.sdf' )
#convert data
nx.FSP3 = natx.FSP3.apply( np.float32 )
nx.ROT = natx.ROT.apply( np.float32 )
nx.TPSA = natx.TPSA.apply( np.float32 )
nx.CLogP = natx.CLogP.apply( np.float32 )
nx.MolWeight = natx.MolWeight.apply( np.float32 )

sns.lmplot( 'ROT','FSP3', data=natx, )
sns.distplot(natx.MolWeight)
sns.jointplot(natx.MolWeight, natx.FSP3, kind='hex')

Fig 3 indicates that the dataset have many FSp3 rich molecules.
I’m interested in the dataset and will analyse more details of the Dataset.
What is the most important thing in the library design ? Novelty, synthetic accessibility, cost, diversity, druggability, etc. How do you define the quality of compound / library ?

fig1
fsp3rot
fig2
mw
fig3
fsp3_molwt