Call psi4 from web app! #Psi4 #QuantumChemistry

Psi4 is a python package for quantum chemistry. I like it. Native psi4 has some difficulties for preparing input file. So psikit is useful I think. And I think webmo is useful package for calling psi4. I’ve always been interested webmo. So I tried to use it today.
Basic function of webmo is provided free licence.

At first I registered webmo for getting licence code. https://www.webmo.net/index.html

After registration, I could get webmo download link, licence number, and password.

Second, I installed several packages with apt command.

apt install apache2 apache2-suexec-custom libcgi-pm-perl

Then create directory structure.

$ cd ~
$ mkdir public_html
$ chmod 755 public_html
$ cd public_html
$ mkdir cgi-bin
$ chmod 755 cgi-bin
$ exit
Start apache
# a2enmod userdir
# a2enmod suexec
# a2enmod cgid
# service apache2 start

Then eddit /etc/apache2/mods-available/userdir.conf, to enable cgi script.


        UserDir public_html
        UserDir disabled root

        
                AllowOverride FileInfo AuthConfig Limit Indexes
                Options MultiViews Indexes SymLinksIfOwnerMatch IncludesNoExec ExecCGI #ExecCGI added
                AddHAndler cgi-script .cgi #Added
                Require method GET POST OPTIONS
        


Then restart apache ;)

# service apache2 restart

Then install webmo with perl script.

$ tar xzf WebMO.9.x.xxx.tar.gz
$ cd WebMO.install
$ perl setup.pl

Now ready, after restarting apache, I could access local webmo HP via web browser. Default setting is username is admin, password is blank.

After login as admin, system asked to change admin password. So I changed and the create user for calling Psi4.

sample URL is below
http://localhost/~yourusername/cgi-bin/webmo/login.cgi

You can get more details about installation is following URL.
https://www.webmo.net/support/installation.html

Then from admin top page, I set psi4 as quantum chemistry engine. Click interface tab, change psi4 from disable to enable. Then click edit button and set path for psi4 execution.

Tips, If reader installed psi4 via conda, the path schould be ‘/home/***/anaconda/envs/yourenv’. In the case, psi4 path is /home/***/anaconda/envs/yourenv/bin/psi4. bin/psi4 should not be included setting path!!!

After completing the configuration, I made user for QM calculation and login the system as a user.

Making input file is very easy, import file or draw molecule with implemented molecular editor. I used editor for the test.

After drawing the molecule, minimize energy and symmetrize(if possible) is done from selecting upper menu bar. It’s very easy isn’t it?

Then set job options which defines calculation method, basis function, reference and several QM related options. Also the setting can do from GUI.

After setting the options, input data can check from preview tab.

All setting is done, click submit job to run the calculation.

Following screenshot shows water as an example because benzoic acid required looooooooooooooong calculation time in my hobby PC.

Result can see jobmanager. I calculated water molecule vibration. So I could see calculated IR values and spectra.

The spectrum is not so accurate, the reason is that I used low level basis set I think.

Of course psi4 raw output file can download from webserver.

With free charge licence, it is not able to draw MO. But with commercial licence, rendering MO of molecule is available.

WebMO is very useful tool for education, and wet researchers. Because it is easy to use and rich GUI.

If reader who has interested in WebMO, try it! ;)

A new function of rdkit201909 #RDKit #Chemoinformatics

Last week, I got cool information that new version of rdkit is available by installing conda!

You can check new feature in ReleaseNotes in original repo,
https://github.com/rdkit/rdkit/blob/master/ReleaseNotes.md

I installed RDKit201909 now and am reading document now.

MolHash function originally developed from nextmove software is implemented in the version.
[RDKit Document]https://www.rdkit.org/docs/source/rdkit.Chem.rdMolHash.html
[Nextmove software]https://nextmovesoftware.github.io/molhash/commandline.html#usage

MolHash can generates some hashes of molecule. I used the module with a molecule. Code is below.

from rdkit import Chem                                                                       
from rdkit.Chem import rdBase                                                                
 print(rdBase.rdkitVersion)                                                                   
> 2019.09.1

from rdkit.Chem import rdMolHash                                                             
# read sample molecule from SMILES
tofa = Chem.MolFromSmiles('C[C@@H]1CCN(C[C@@H]1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N')             
# Generate molhash
print(rdMolHash.GenerateMoleculeHashString(tofa))                                            
100-23-25-Hx54Xg-jb4lRQ-VbT4fg-F92gVQ-wedGAQ-s3wCEg

MolHash module provides some molhash functions. And it can get as dictionary with ‘names’ method and by calling MolHash method with these functions, I could get hash string.

rdMolHash.HashFunction.names                                                         

{'AnonymousGraph': rdkit.Chem.rdMolHash.HashFunction.AnonymousGraph,
 'ElementGraph': rdkit.Chem.rdMolHash.HashFunction.ElementGraph,
 'CanonicalSmiles': rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles,
 'MurckoScaffold': rdkit.Chem.rdMolHash.HashFunction.MurckoScaffold,
 'ExtendedMurcko': rdkit.Chem.rdMolHash.HashFunction.ExtendedMurcko,
 'MolFormula': rdkit.Chem.rdMolHash.HashFunction.MolFormula,
 'AtomBondCounts': rdkit.Chem.rdMolHash.HashFunction.AtomBondCounts,
 'DegreeVector': rdkit.Chem.rdMolHash.HashFunction.DegreeVector,
 'Mesomer': rdkit.Chem.rdMolHash.HashFunction.Mesomer,
 'HetAtomTautomer': rdkit.Chem.rdMolHash.HashFunction.HetAtomTautomer,
 'HetAtomProtomer': rdkit.Chem.rdMolHash.HashFunction.HetAtomProtomer,
 'RedoxPair': rdkit.Chem.rdMolHash.HashFunction.RedoxPair,
 'Regioisomer': rdkit.Chem.rdMolHash.HashFunction.Regioisomer,
 'NetCharge': rdkit.Chem.rdMolHash.HashFunction.NetCharge,
 'SmallWorldIndexBR': rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBR,
 'SmallWorldIndexBRL': rdkit.Chem.rdMolHash.HashFunction.SmallWorldIndexBRL,
 'ArthorSubstructureOrder': rdkit.Chem.rdMolHash.HashFunction.ArthorSubstructureOrder}

Generate MolHash with all defined hash functions.

for k, v in molhashf.items(): 
    print(k, rdMolHash.MolHash(tofa, v)) 

>                                                                                             
AnonymousGraph ****(*)*1**[*@@](*)[*@@](*(*)*2****3****23)*1
ElementGraph C[C@@H]1CCN(C(O)CCN)C[C@@H]1N(C)C1NCNC2NCCC12
CanonicalSmiles C[C@@H]1CCN(C(=O)CC#N)C[C@@H]1N(C)c1ncnc2[nH]ccc12
MurckoScaffold c1nc(N[C@@H]2CCCNC2)c2cc[nH]c2n1
ExtendedMurcko *[C@@H]1CCN(*)C[C@@H]1N(*)c1ncnc2[nH]ccc12
MolFormula C16H20N6O
AtomBondCounts 23,25
DegreeVector 0,8,11,4
Mesomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2N[CH][CH][C]12_0
HetAtomTautomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2[N][CH][CH][C]21_1_0
HetAtomProtomer C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2[N][CH][CH][C]21_1
RedoxPair C[C@@H]1CCN([C]([O])C[C][N])C[C@@H]1N(C)[C]1[N][CH][N][C]2N[CH][CH][C]12
Regioisomer *C.*C(=O)CC#N.*N(*)*.C.C1CNC[C@H2][C@H2]1.c1ncc2cc[nH]c2n1
NetCharge 0
SmallWorldIndexBR B25R3
SmallWorldIndexBRL B25R3L11
ArthorSubstructureOrder 001700190100100007000092000000

Details of these functions are described in NextMove web page.
https://nextmovesoftware.github.io/molhash/introduction.html

It is interesting for getting several information by using the functions. I’ll think about the application of the method for chemoinformatics task.

Virtual screening with quantum computing! #Quantum_computing #chemoinformatics #memorandum

Quantum computing is one of the hot area in these days. Now google reported exciting article to nature.

Quantum computing reach is also very useful for drug discovery. Some days ago I found interesting article published by researcher in 1QBit, Univ. Drive, Accenture Labs and Biogen. URL is below.
A Quantum-Inspired Method for Three-Dimensional Ligand-Based Virtual Screening

The authors used quantum computer for Ligand Based Virtual Screening(LBVS). Their strategy is below.

Made molecular graph which like pharmacophore representation of molecule. And the search maximum common sub graph(MCS) with quantum computing.

Find MCS problem is NP hard. So it requires huge computational cost. But they solved the issue with quantum annealar.

To find the MCS, they used conflict graph. I’m not familiar the concept but regarding the publication, the graph made from two molecular graphs.

From the fig3 in the article. (Pls check original article because I can’t share figure)

Construction of the conflict graph G c from two given graphs G 1 and G 2 . A vertex (v 1 , v a ) is added to G c because at least one of the labels in the set of labels associated with v 1 from G 1 (3 v 1 ), and v a from G 2 (3 v a ) match. In this case, the set of labels match exactly so we designate the new vertex (v 1 , v a ) as an exact match. The rest of the vertices in the conflict graph are added in the same way. Edges are added according to two
conditions: bijective mapping and distance violations. Bijective mapping is violated if one of the nodes has been matched twice (represented by a red edge). Distance violation aims to incorporate 3D molecular information (represented by a green edge). An edge between two vertices (e.g.,
between (v 1 , v b ) and (v 2 , v c )) is added if the Euclidean distance between v 1 and v 2 is not comparable to the Euclidean distance between v b and v c .
Formally, an edge is added if |d(v 1 , v 2 ) − d(v b , v c )| > ε (ε = 0.4 in this example).

Regarding the concept described above, less edge graph is preferred for maximum common substructure.

And by using the method, graph based approach outperformed Morgan finger print based LBVS against several targets.

It indicates that quantum computer is useful for drug discovery.

Unfortunately the calculation performance of quantum computation is not described in this article.
I would like to know comparison between current traditional computation and quantum computation.

Electrostatic Potential Surface(ESP) calculation with GCNN #RDKit #chemoinformatics

ESP is a key feature in Drug Discovery. There are many publications discussing ESP in Drug Design. However getting accurate ESP is time-consuming because it needs high level QM calculations. To reduce the calculation cost of QM, predict quantum nature by using Deep learning is researched.

And some days ago, I found interesting article published by reseachers in Astex. URL is below.
https://pubs.acs.org/doi/10.1021/acs.jmedchem.9b01129

They build GCNN model with over than 100,000 diverse drug like molecules QM calculation (B3LYP/6-31G* basis set) data set. And their model showed good performance in ESP calculation, pKa prediction and binding affinity prediction.

And also GCNN based calculation is faster than QM based calculation, Fig 3 shows that their approach is 4 order faster than conventional DFT based calculation.

Luckly, author disclosed the code on github. ;)
https://github.com/AstexUK/ESP_DNN

….But,,,,,, the code is written for python 2.x.
Hmm…. I would like to use the code in python 3.x.

So I forked ESP_DNN and modify the original code. You can find the code here.
https://github.com/iwatobipen/ESP_DNN

My env for the ESP_DNN is below.
Python 3.7
numpy 1.17.2
rdkit 2019.03.4
xarray 0.14.0
tensorflow-gpu 1.14.0
keras-base 2.2.4

Install is very easy.

$ git clone https://github.com/iwatobipen/ESP_DNN.git
$ cd ESP_DNN
$ pip install -e . # or/ pip install .

Then run the example code which is introduced in README.md

ESP_DNN$ python -m esp_dnn.predict -m ligand -i examples/ligands/

After running the code forementioned, pqr files are generated in -i folder.
It can be visualized in http://nglviewer.org/ngl/ where is recommend in the document.

Open the URL with web browser and read a pqr file.
Then change and set visualize settings according to the document.
Finally I could get image with ESP. Following ESP is generated from provided GCNN model.

QM based approach is powerful for drug discovery however the calculation cost is high. On the other side, deep learning based approach is very fast but it requires high quality training data.

So I think the authors work is very useful because they provided not only their code but also high quality trained model. Awesome work!

If I have to say something, in the original repo seems that training dataset is not provided. I would like to check chemical space of training data.

Anyway, calculation of accurate ESP in short time is very useful for me. I would like to apply some prediction tasks ASAP.

Old but new molecular generator #RDKit #mishima.syk

In this February, we had opportunity to learn how to use molecular generator in Mishima.syk #13. We used REINVENT for generator.

It works on pytorch ver0.3, but raise error when the code run on new version of pytorch.

Deep learning framework is rapidly upgraded. It is difficult to maintenance OSS code I think.

I demonstrated hands-on code on google colab. But recent version of google colab can’t use old pytorch because cuda version mismatch issue.

So I modified original code of REINVENT and tested it now.

My code is uploaded my repo URL is below.

I modified model.py and train_prior.py. And the code is working now in my PC.

My env is..

Python=3.7
CUDA 10.1
GPU GTX 1650
cudatoolkit 10.1.168
cudnn 7.6.0
pytorch 1.3.0
torchvision 0.4.1
rdkit 2019.03.04

This is not enough for efficient learning because my GPU don’t have enough memory but the code work.
Any comment or feedback will be greatly appreciated.

Integration of RDKit and Neo4j #RDKit #Neo4j #GraphDB #Chemoinformatics

One of the most powerful storms of the year named ‘Hagibis’ is coming now. All weekend schedules are cancelled. I’m staying at my home and writing code…

BTW, In the RDKit UGM 2019, neo4j-rdkit integration project was introduced. The project is one of the topic in Google summer of code. You can find the project in following URL. https://github.com/rdkit/neo4j-rdkit

Neo4j is No-SQL graph based database. Some years ago I wrote post about the DB to store the MMP data but native neo4j function can’t conduct chemical specific query such as sub structure, exact structure search etc. But, if we use neo4j-rdkit plugin, we can do substructure search in neo4j. It seems nice. I would like to try it!

To do it, it is required to build plugin .jar file. The procedure is described in README.md.

At first, install neo4j and maeven. It is easy to do it. For debian, apt is used.
install neo4j
https://neo4j.com/docs/operations-manual/current/installation/
install maeven
https://linuxize.com/post/how-to-install-apache-maven-on-ubuntu-18-04/

$ wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
$ echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee -a /etc/apt/sources.list.d/neo4j.list
$ sudo apt-get update
$ sudo apt-get install neo4j=1:3.5.11
$ sudo apt install maven

After installing neo4j and maeven, the build neo4j-rdkit! Clone repo and build jar files by typing following command.

$ git clone https://github.com/rdkit/neo4j-rdkit.git
$ neo4j-rdkit
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKit.jar -DgroupId=org.rdkit \ 
                         -DartifactId=rdkit -Dversion=1.0.0 \
                         -Dpackaging=jar
                         
$ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
                         -Dfile=lib/org.RDKitDoc.jar -DgroupId=org.rdkit \ 
                         -DartifactId=rdkit-doc -Dversion=1.0.0 \
                         -Dpackaging=jar

Then generate .jar file with all dependencies. It can do just type ‘mvn packgage’

After do it cp jar file to neo4j plugins folder. Following my environment’s example.
Then launch neo4j. After staring neo4j, localhost:7474 is accessible from web browser.

$ mvn package
$ sudo cp ./target/rdkit-index-0.0.7.jar /var/lib/neo4j/plugins
$ sudo neo4j start

I imported MMP data by using load csv command. The data already has matched pair information. So there is two nodes information per line.

$ head mmp_data_sets/ChEMBL17_IC50_RECAP_MMP_List.dat 
Target_ChEMBLID	Target_Name	Cpd1_ChEMBLID	Cpd2_ChEMBLID	Cpd1_pIC50	Cpd2_pIC50	Num_Cuts	KeyFragment	Transformation	Cpd1_SMILES	Cpd2_SMILES
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2326095	6.28399665637	6.79588001734	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCC[C@@]([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2326085	CHEMBL2348486	6.56863623584	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCC[C@@]([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]
CHEMBL1075097	Arginase-1	CHEMBL2348488	CHEMBL2348486	6.28399665637	6.49485002168	1	[R1]CCC(CCCCB(O)O)(C(=O)O)N	[R1]N(CC)CC>>[R1]N1CCCC1	OB(O)CCCCC([NH3+])(CC[NH+](CC)CC)C(=O)[O-]	OB(O)CCCCC([NH3+])(CC[NH+]1CCCC1)C(=O)[O-]

From web UI, execute following command. TIP data should be stored in /var/lib/neo4j/import folder. And to use rdkit plugin, CALL the command at first.

#Cypher web browser
CALL org.rdkit.search.createIndex([‘Structure’, ‘Chemical’])

After installing the plugin, some chemoinformatics related property names are reserved listed below. And after importing the data several properties are automatically calculated.

  1. canonical_smiles
  2. inchi
  3. formula
  4. molecular_weight
  5. fp – bit-vector fingerprint in form of indexes of positive bits ("1 4 19 23")
  6. fp_ones – count of positive bits
  7. mdlmol

MDL format, fingerprint, inchi, etc is added !!!!

#Cypher web browser
LOAD CSV WITH HEADERS FROM "file:///ChEMBL17_IC50_RECAP_MMP_10k.csv" AS line
FIELDTERMINATOR '\t'
MERGE (m1:Chemical:Structure { mol_id: line.Cpd1_ChEMBLID, smiles: line.Cpd1_SMILES })
MERGE (m2:Chemical:Structure { mol_id: line.Cpd2_ChEMBLID, smiles: line.Cpd2_SMILES })
MERGE (m1)-[r:MMP { transform:line.Transformation, targetid:line.Target_ChEMBLID}]->(m2); # Maybe MERGE is better...

OK let’s do substructure search! It is really chemistry specific problem. It is useful for researcher for searching MMPs which has user defined substructure. rdkit.*.search function returns canonical smiles and score, so if you want to omit low score results, you should use where clauser I think.

CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'C1CCCCC1') YIELD canonical_smiles, score
MATCH (n)
#WHERE score>500 and n.smiles=canonical_smiles
RETURN n

Of course exact mach search is available.

Similarity search example is below.

Neo4j provides python driver https://neo4j.com/developer/python/.

This is very short example of neo4j-rdkit. LOAD CSV method of my code is not good for large data set. I would like to read neo4j document. And use it more deeply. This is very cool plugin thank you developer!

May the typhoon pass away early!

Particle Swarm Optimization for molecular design #RDKit #Chemoinformatics

I participated RDKit UGM last week. It was worth to go I think. And in the meeting I got useful information for de novo molecular design. You can find the slide deck following URL. https://github.com/rdkit/UGM_2019/blob/master/Presentations/Montanari_Winter_Utilizing_in_silico_models.pdf

They used Particle Swarm Optimization(PSO) for de novo molecular design. PSO is very simple method for parameter optimization. Details are described in wikipedia.

The algorithm of PSO is similar to Q-learning. PSO try to improve objective function by updating velocity during the iteration.

Fortunately, PSO algorithm for molecular generation is disclosed in github. ;)

So I installed mso from github and tried to use it.

At first I installed cddd because mso is depended with cddd. cddd encode molecules to latent space and decode latent space to molecules.
. In the readme of cddd, tensorflow==1.10 is required but it worked tensorflow==1.13. Then installed mso. Both library can install from github.

$ git clone https://github.com/jrwnter/cddd.git
$ cd cddd
$ pip install -e .
$ cd ../
$ git clone https://github.com/jrwnter/mso.git
$ cd mso
$ pip install -e .

After installing the package I used mso. To convenience, I used pretrained model which is provided from cddd repo. Downloaded default model from following URL and stored ./cddd/data folder and unzip it. URL: https://drive.google.com/open?id=1oyknOulq_j0w9kzOKKIHdTLo5HphT99h

Now ready, let’s try.

MSO optimize particle positions with objective functions. The package have many functions. Such as QED, substructure and alert, etc.
Multiple combination is scoring function is also available.

Following example is simple. At first, I used substructure and QED function. Substructure function return 1 if generated molecule has user defined structure. It is useful because RNN based generator often generates molecule randomly so it is difficult to keep specific substructure.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

from mso.optimizer import BasePSOptimizer
from mso.objectives.scoring import ScoringFunction
from mso.objectives.mol_functions import qed_score
from mso.objectives.mol_functions import sa_score
from mso.objectives.mol_functions import substructure_match_score
from functools import partial
from cddd.inference import InferenceModel

sub_match = partial(substructure_match_score, query=Chem.MolFromSmiles('c1ccncc1'))


init_smiles = "c1c(C)cccc1" 
scoring_functions = [ScoringFunction(func=qed_score, name="qed", is_mol_func=True), ScoringFunction(func=sub_match, name='subst', is_mol_func=True)]

substructure_match_score function requires several arguments including substructure. To use it, partial function is used for freeze several args. And then pass functions to ScoringFunction class.

infermodel = InferenceModel()
opt = BasePSOptimizer.from_query(
    init_smiles=init_smiles,
    num_part=200,
    num_swarms=1,
    inference_model=infermodel,
    scoring_functions=scoring_functions)

Then defined optimizer instance. num_part shows number of particles which means number of molecules. And finally run the optimization.

res = opt.run(20)
res0 = res[0]

res0 has best scored smiles and other generated smiles. I retrieve them and draw. Fitness is normalized score.


mols = []
for idx, smi in enumerate(res0.smiles):
    
    try:
        mol = Chem.MolFromSmiles(smi)
        Chem.SanitizeMol(mol)
        if mol != None and res0.fitness[idx] > 0.7:
            print(smi)
            mols.append(mol)
    except:
        pass

OK, generated molecules has use defined substructure. However it seems similar compounds, does it depend on training set?

opt instance has fitness history as pandas dataframe. Fortunately rdkit can PandasTools. ;)


from IPython.display import HTML
from rdkit.Chem import PandasTools
PandasTools.AddMoleculeColumnToFrame(opt.best_fitness_history, smilesCol='smiles')
HTML(opt.best_fitness_history.to_html())

It’s very easy. MSO optimizer does not use GPU, so the process works very fast if user has many CPUs.

MSO seems useful package for molecular generation. User can also define your own objective functions. I use it more deeply.

Today’s code is uploaded to gist.