Useful package for descriptor calculation #chemoinformatics #rdkit

Descriptor calculation is an important task for chemoinfomatics. I often use rdkit to do it. And today I found very useful package for descriptor calculation which name is descriptorus. URL is below.

https://github.com/bp-kelley/descriptastorus

It is very easy to install the package. Just following command.

pip install git+https://github.com/bp-kelley/descriptastorus

After did it, I could use the package.

By using the package, following descriptors are calculated very efficiently.

* atompaircounts
* morgan3counts
* morganchiral3counts
* morganfeature3counts
* rdkit2d
* rdkit2dnormalized
* rdkitfpbits

I tried to use the package. Very simple example.

from descriptastorus.descriptors.DescriptorGenerator import MakeGenerator
gen1 = MakeGenerator((‘rdkit2d’,))
smi = ‘c1ccncc1’
data=gen1.process(smi)
data
>out
[True,
3.000000000000001,
75.86113958768547,
4.242640687119286,
3.333964941448087,
3.333964941448087,
…]

Of course it is easy to get column names.

for col in gen1.GetColumns():
print(col[0])
>out

RDKit2D_calculated
BalabanJ
BertzCT
Chi0
Chi0n
Chi0v
Chi1
Chi1n
Chi1v
Chi2n
Chi2v
Chi3n
Chi3v
Chi4n
Chi4v
EState_VSA1
EState_VSA10

The first row indicates whether the calculation is successful or not.

And the package provides normalized descriptors which are useful for machine learning. It is easy to get it.

from descriptastorus.descriptors import rdDescriptors
from descriptastorus.descriptors import rdNormalizedDescriptors
gen2 = rdDescriptors.RDKit2D()
gen3 = rdNormalizedDescriptors.RDKit2DNormalized()
data2 = gen2.process(smi)
data3 = gen3.process(smi)
for i in range(len(data2)):
print(data2[i], data3[i])
>out
True True
3.000000000000001 0.9749367759562906
75.86113958768547 0.0018713336795980954
4.242640687119286 0.0001226379059079931
3.333964941448087 0.00012708489238074725
3.333964941448087 9.889979057774486e-05
3.0 0.00031922002005261777
1.8497311128276561 0.0003685049537727884
1.8497311128276561 0.0005439421913480759
….

Descriptorus can make a DescriptaStore. I failed it because I couldn’t kyotocabinet in my env….

In summary, the package is very useful for chemoinformatician because user can get many rdkit’s descriptors only typing few lines.

It is amazing for me that we can use such an useful open source packages very easily.

Thanks for OSS developers!


Today’s gist is below.

Advertisements

Calculate solvent effect in Psi4 #psi4 #quantumchemistry

Recently I use not only chemoinformatics tools but also quantum chemistry tool, my favorite is Psi4. Psi4 has many options and plug-ins for quantum calculation. Most setting of calculation is vacuum, but it actually true. So considering the solvent around the molecules is important.

Can psi4 perform calculation with solvent effect?

Yes!

PCMSolver is plugin for PCM calculation. It can be installed via conda.

$ conda install -c psi4 pcmsolver

PCM means ‘polarizable continuum model’. PCM describes solute-solvent interactions only by means of electrostatics and polarization between the solute molecule and solvent. The solvent is modeled as a dielectric continuum with a given permittivity ε.

To use PCMsolver in psi4, user need to make specific input file for the calculation which should has kind of solvent, basis function, CPCM or IEFPCM, etc.

I made to input file from toulene.mol file which is retrieved from ChEMBL.

input files are below. PCMsolver can not call from python interpreter so I need to make input.dat.

# toluene in water
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
no_reorient
no_com
units angstrom

}

set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total
}

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Water
}

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit
}
}

energy_pte, wfn = energy(‘scf’, return_wfn=True)

# toluene in benzene
#! pcm

molecule mol {

0 1
C -2.0705902285428452 0.09342117730391691 0.21438842743524625
C 0.06881622580040792 1.2107631870687914 0.004377262725162135
C -0.043709196272760036 -1.211640074147377 -0.026831077837745607
C -1.3202142335884217 1.271231179761799 0.15648583022799556
C 0.714364020910022 -0.03184758851751915 -0.10370200526939809
C 2.2076605753403253 -0.0996279156897954 -0.22684462930707974
C -1.4325455046382927 -1.1469904802166477 0.1253309895373383
H -3.1452716192643337 0.14180771794289201 0.3335057008466673
H 0.6396267675023236 2.1305648085368674 -0.0295934787763336
H 0.4394292831589224 -2.1792111515476216 -0.08511724686650844
H -1.8144108252945523 2.2311166905139106 0.23329105314598383
H 2.59680405039162 0.7889905119305699 -0.7681174009541791
H 2.5134086150024317 -1.0062726983460601 -0.7912601501934461
H 2.660308829389096 -0.1337122631434785 0.78606242092797
H -2.013676759894258 -2.058593101450173 0.17802430435839692
no_reorient
no_com
units angstrom

}

set {
basis 6-31G**
scf_type pk
pcm true
pcm_scf_type total
}

pcm = {
Units = Angstrom
Medium {
SolverType = IEFPCM
Solvent = Benzene
}

Cavity {
RadiiSet = UFF
Type = GePol
Scaling = false
Area = 0.3
Mode = Implicit
}
}

energy_pte, wfn = energy(‘scf’, return_wfn=True)

PCMsolver can has many kinds of solvents listed below.

    1:'Water',
    2:'Propylene Carbonate',
    3:'Dimethylsulfoxide',
    4:'Nitromethane',
    5:'Acetonitrile',
    6:'Methanol',
    7:'Ethanol',
    8:'Acetone',
    9:'1,2-Dichloroethane',
    10:'Methylenechloride',
    11:'Tetrahydrofurane',
    12:'Aniline',
    13:'Chlorobenzene',
    14:'Chloroform',
    15:'Toluene',
    16:'1,4-Dioxane',
    17:'Benzene',
    18:'Carbon Tetrachloride',
    19:'Cyclohexane',
    20:'N-heptane',
    21:'Explicit' 

After making the input files, following step is as same as default psi4 calculation.

$ psi4 input.dat -o output.dat

After waiting few seconds, I could get two output results. Picked up line 355-363.

## toulene in water
@RHF Final Energy: -269.75544369636918

=> Energetics Energetics <=

Nuclear Repulsion Energy = 268.6733324038011688
One-Electron Energy = -895.7789138809855558
Two-Electron Energy = 357.3548941675961714
PCM Polarization Energy = -0.0019460266422257
Total Energy = -269.7526333362304172

Computation Completed

There is difference in PCM Polarization Energy between in water and in toluene, in water it is -0.0052558561183017A.U.(-3.298kcal/mol) and in tolunene, it is -0.0019460266422257A.U.(-1.22kcal/mol)
*1 a.u. =627.50 kcal/mol

I would like to test some molecules. Are there freely available experimental data set? I’ll try to find them.

Do Pharmas need more data for building better model? #memo #medchem

Is ‘AI’ & ‘BidData’ key for success? Recently many pharmaceutical companies are trying to use AI for their drug discovery project acceleration and cost reduction. However sometime it fails because of lack of training data. It is rare that lots of data is available in early drug discovery projects. It is still difficult to build accurate predictive model with small data set.

Recently attractive project is reported in nature reviews drug discovery.
https://www.nature.com/articles/d41573-019-00120-w

The project named MELLODDY(Machine Learning Ledger Orchestration for Drug Discovery) which is an an €18-million, 3-year IMI project.

According the following URL, the MELLODDY consists 17 partners. And the partner will share about the structure and activity data without IP.
https://www.janssen.com/emea/new-research-consortium-seeks-accelerate-drug-discovery-using-machine-learning-unlock-maximum

  • 10 pharmaceutical companies: Amgen, Astellas, AstraZeneca, Bayer, Boehringer Ingelheim, GSK, Janssen Pharmaceutica NV, Merck KgaA, Novartis, and Institut de Recherches Servier
  • 2 academic universities: KU Leuven, Budapesti Muszaki es Gazdasagtudomanyi Egyetem
  • 4 subject matter experts: Owkin, Substra Foundation, Loodse, Iktos
  • 1 large AI computing company: NVIDIA

The consortium uses Amazon web services for machine learning. I think it is nice approach for data and model management and it will be difficult for pharma in Japan (just my personal opinion)…..

How about the chemical space of the combined compounds data? The data is enough for overcoming the applicability domain?

I hope the project goes well.

Machine learning workflow tool for none programmer #memo #machinelearning #dss

I’m on summer vacation. This summer is high temperature and humidity….. So it is tough for me to running. ;-( And now very big typhoon is coming to Japan. Oops…

Let’s leave that aside for now.

Today I would like to introduce very cool tool for machine learning. Recently we can use many machine learning package with python. Of course I use them for example scikit-learn, tenthorflow, pytorch and kearas etc etc…

It is good tool for programmer but it is little bit difficult for none programmer. Today I found very interesting tool for machine learning which name is data science studio(DSS).

DSS is developed by dataiku where begines in 2013 very new company. DSS is the collaborative data science software. I saw demo on youtube and felt nice. I would like to use it. Fortunately DSS can use freely. User can choice some options for installation, I chose Docker for DSS installation.

It was very easy to build the environment. Just type following 2 lines. ;-)

docker pull dataiku/dss
docker run -p 10000:10000 -d dataiku/dss

Now dss container is build and start. I can access the container localhost:10000.

I made test project named boston which is machine learning project with boston dataset from sklearn.

After uploading the sample data, I could see data make regression model very easily. Following screen shot is view of script. It is easy to filter, remove data, just click and select action from column header.

And also it is very easy to make chart just select chart style and drag&drop the column which you want to visualize. Following example is the scatter plot TAX vs LSAT.

Model building is easy! Just click on/off panel which you want to use. Of course it is easy to set up many combination of hyperprameters.

Now ready just click train button for training.

DSS track all training result and after training I could access all training results.

Following sc is decision trees of Gradient Boost and variables importance.

DSS has many algorithms as default but it is easy to implement your own algorithms.

Work flow of machine learning is datapreparation, analyze data, prepare and optimize model and test the model. DSS is very easy to make the work flow.

The tool don’t have chemoinformatics tools such as rdkit, openbabel etc, but I think it can install DSS server.

Recently progress of machine learning are is very fast. I have to think which is more efficient coding by myself or using workflow tool kit.

Coding is very interesting for me but it will take more time….

Twist and to be square #MedicinalChemistry #memo

MedChem often care about lipophilicity of their own molecule. But it is very important to keep molecular drug like profiles. Spiro cyclic building block and oxetanes are very attractive components for drug discovery.

Following articles are interesting for me.

The first one is letter about azaspiroheptanes and the second one is about oxetane containing molecules. Both articles are published from researchers in AZ. URL is below.

https://pubs.acs.org/doi/10.1021/acsmedchemlett.9b00248
This article disclosed AZ’s experience of replacement of six membered ring such as piprazines, piperidinez and morphorines to azaspiroheptane. Generally the effect of the replacement rowing lipophilicity and increasing basicity even if the spiroring has more carbon atoms than traditional six membered ring.

The author reported some examples for their MMPA in drug discovery projects. ADMET profiles are improved in most of case however loss potency in some case. Because spiro ring has more straight structure compared to the six membered ring and is twisted 90 degree. It affects molecular conformation, so it changes binding pose of the molecule.

In the conclusion, the author analyzed their experience and give suggestions for the building block usage. It is worth to remember I think.


https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.9b00030

And the second one is also very interesting article about oxetane ring containing molecule.

In the article the author found that oxetane ring is metabolized by microzomal epoxide hydoraze. This is non CYP metabolic pathway.

So, to introduce the oxetane ring, the molecule has possibility of none CYP metabolic pathway to avoid DDI.

The article disclosed many example of metabolic profiles of oxatane containing molecules. I had interested that molecules that have oxetane ring in terminal position is not unstable and showed good drug clearance.

Medicinal chemistry is exciting!

Added Notebook Example for psikit #psi4 #RDKit

Some days ago. I updated psikit, added new function, SAPT.

The method can calculate inter-intra molecular interaction with psi4. Psikit can call the method very easily.

And I added sample notebook in the repository. Following codes are very simple example of SATP for water dimer and FSAPT for phenol dimer.

SAPT for water dimer.



from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from psikit import Sapt

w1 = Chem.MolFromMolFile('water1.mol', removeHs=False)
w2 = Chem.MolFromMolFile('water2.mol', removeHs=False)
Draw.MolsToGridImage([w1,w2])
sapt = Sapt()
sapt.monomer1_from_molfile('water1.mol')
sapt.monomer2_from_molfile('water2.mol')
sapt.make_dimer()
res = sapt.run_sapt()
---stdout is below
Initializing SAPT object...

RHF for monomer A finished in 0.55 seconds.
RHF for monomer B finished in 0.52 seconds.
Building ERI tensor...
...built ERI tensor in 3.180 seconds.
Size of the ERI tensor is 0.36 GB, 82 basis functions.

...finished initializing SAPT object in  4.49 seconds.

Starting electrostatics...
...electrostatics took a total of  0.17 seconds.

Starting exchange...
...exchange took a total of  0.63 seconds.

Starting dispersion...
...dispersion took a total of  8.92 seconds.

Starting induction...
Ind20,r (AB)           -1.37852223 mH       -0.86503511 kcal/mol
Exch-Ind20,r (AB)       0.88580457 mH        0.55585034 kcal/mol
...induction took a total of  15.90 seconds.

SAPT0 Results
----------------------------------------------------------------------
Exch10 (S^2)             10.53844851 mH        6.61297129 kcal/mol
Elst10                  -13.02830646 mH       -8.17537956 kcal/mol
Disp20                   -3.42996225 mH       -2.15233218 kcal/mol
Exch-Disp20               0.61399531 mH        0.38528758 kcal/mol
Ind20,r                  -4.33068313 mH       -2.71754264 kcal/mol
Exch-Ind20,r              2.30125737 mH        1.44405971 kcal/mol
----------------------------------------------------------------------
Total SAPT0              -7.33525065 mH       -4.60293580 kcal/mol

F-SAPT for phenol dimer

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from psikit import Sapt

p1 = Chem.MolFromMolFile('phenol1.mol', removeHs=False)
p2 = Chem.MolFromMolFile('phenol2.mol', removeHs=False)
Draw.MolsToGridImage([p1,p2])
sapt = Sapt()
sapt.monomer1_from_molfile('phenol1.mol')
sapt.monomer2_from_molfile('phenol2.mol')
sapt.make_dimer()
res = sapt.run_fisapt()

--fA.dat and fB.dat files are generated in fsapt folder the  call fsapt.py from terminal, it generates FSAPT results which is shown below.
   => Full Analysis  Reduced Analysis  F-ISAPT: Links 50-50  Full Analysis  Reduced Analysis <=

Frag1     Frag2         Elst     Exch    IndAB    IndBA     Disp    Total 
c1ccccc1_0 c1ccccc1_0    0.686    2.197    0.007   -0.208   -2.403    0.278 
c1ccccc1_0 O_0         -2.751    0.733   -0.147   -0.227   -0.675   -3.067 
O_0       c1ccccc1_0    1.392    0.720    0.222   -0.347   -0.793    1.194 
O_0       O_0         -8.421    6.218   -0.584   -1.512   -1.250   -5.549 
c1ccccc1_0 All         -2.065    2.930   -0.140   -0.435   -3.078   -2.789 
O_0       All         -7.030    6.938   -0.362   -1.859   -2.043   -4.356 
All       c1ccccc1_0    2.078    2.917    0.229   -0.556   -3.196    1.472 
All       O_0        -11.173    6.951   -0.731   -1.739   -1.925   -8.617 
All       All         -9.095    9.868   -0.502   -2.295   -5.121   -7.145 

Psikit can use SAPT, FSAPT efficiently I think. There is room for improvement for FSAPT fragmentation method. Any comments and advice will be greatly appreciated.

URL is below.

GITHUB
https://github.com/Mishima-syk/psikit
FSAPT
https://nbviewer.jupyter.org/github/Mishima-syk/psikit/blob/master/examples/example_sapt/fsapt_ex.ipynb
SAPT
https://nbviewer.jupyter.org/github/Mishima-syk/psikit/blob/master/examples/example_sapt/sapt_ex.ipynb

Lead Optimization Mapper(LOMAP) for FEP calculation #chemoinformatics #RDKit #Networkx

Free energy perturbation (FEP) is useful method for Computer aid drug design. The method calculates energy difference between several similar molecules. And the system is closed system. To perform the calculation for several molecules. The efficient method of making molecules system is required. I found an interesting article today. The url is below.
https://www.ncbi.nlm.nih.gov/pubmed/24072356

The authors developed the library for making the cycle named LOMAP. It is python library ;) so user can install lomap from conda or pypi.

LOMAP makes closed molecular system for FEP according to the following five rules.
– Compounds being compared should be as similar as possible, minimizing atomic deletions and insertions.
– Rings should be preserved as much as possible
– Ligands being compared must share the same net charge.
– Portions of multi-ring systems can only be deleted if rings are planar, and this should be avoided when possible
– Every molecule must be part of at least one closed therby relatmodynamic cycle
– The set of planned calculations should be spanned by relatively few calculations.

The library seems useful for not only FEP but also general drug discovery project. I would like to use it ASAP.

I tried to use it. However to install the package PyQt4.11.4 and networkx 1.11 and several old packages are required. I had many troubles to make the env. (of course Docker is good solution to solve it I know…)

So I decided to fork the repo and modify the code for PyQt5 and networkx 2.x.

There are some difficulties to do it because major version up from networkx1 to network2 changes some methods. So lomap didn’t work well at first. I struggled lots of time to modify the code and now the code worked.

I tested basic mols. Most of the following code is same as the code in README. User need to calculate similarity matrix by calling build_matrices function then call build_graph for graph building.

import lomap
import networkx as nx
db_mol = lomap.DBMolecules('test/basic/', output=True)
# or db_mol = lomap.DBMolecules('test/radial/', output=True)

strict, loose = db_mol.build_matrices()
nx_graph = db_mol.build_graph()

Now ready. Make GraphGen object for graph saving.

Then draw graph and call writeGraph.

from lomap import graphgen
ggen=graphgen.GraphGen(db_mol)
ggen.draw()
ggen.writeGraph()

writeGraph generates several file format graph images such as .eps, .png, .pdf.

basic

Blue edge shows allowed transformations which is non braking rings/bonds. My modified code assigned 2-Naphthol as a singleton. Is it correct behaviour?

And following view seems good I think.

radial

Next example seems work well.

I have to understand more details of the original code and compare result between my modified version and original version.

My modified version can get from git repo. URL is below.
https://github.com/iwatobipen/Lomap

And original repo URL is below.
https://github.com/MobleyLab/Lomap

Any comments and suggestions will be highly appreciated. ;)