Draw similarity network #RDKit #Cyjupyter

Recently Kei Ono, one of the Cytoscape developers, released cyjupyter.
https://pypi.org/project/cyjupyter/0.2.0/
It looks attractive to me because the library can draw network diagrams in a Jupyter notebook.
Chemoinformatics has a lot of network-structured data, for example molecules, molecular similarity maps, and MMPs (matched molecular pairs). Today I used the library to draw a similarity map of molecules.
I am new to the library, so the following code is very simple, but several useful examples are provided in the official repository.
First, load the modules.

import os
import numpy as np
import igraph
from py2cytoscape import util
from cyjupyter import Cytoscape
from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

I used CDK2.sdf as a sample dataset

filedir = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
mols = [mol for mol in Chem.SDMolSupplier(filedir) if mol is not None]
for mol in mols:
    AllChem.Compute2DCoords(mol)
fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 2) for mol in mols]
smiles_list = [Chem.MolToSmiles(mol) for mol in mols]

Then build a graph object in which each molecule is a node, and add an edge when the Tanimoto similarity is 0.5 or higher.

g = igraph.Graph()
for smiles in smiles_list:
    g.add_vertex(name=smiles)
for i in range(len(mols)):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(fps[i], fps[j])
        if tc >= 0.5:
            g.add_edge(smiles_list[i], smiles_list[j])
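
The thresholding step above can be sketched without RDKit or igraph. This is only a minimal illustration, assuming each fingerprint is represented as a Python set of on-bit indices; the names `tanimoto`, `similarity_edges`, and `toy_fps` are hypothetical and the toy data is not from cdk2.sdf.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity_edges(fingerprints, threshold=0.5):
    """Return (i, j, similarity) for every pair at or above the threshold."""
    edges = []
    for i, j in combinations(range(len(fingerprints)), 2):
        tc = tanimoto(fingerprints[i], fingerprints[j])
        if tc >= threshold:
            edges.append((i, j, tc))
    return edges

# Hypothetical toy fingerprints: each set holds the indices of the bits that are on.
toy_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {7, 8, 9}, {1, 2, 3, 4, 5}]
edges = similarity_edges(toy_fps)
print(edges)
```

Keeping the similarity value next to each pair makes it easy to use it later as an edge weight, which the igraph code above does not store.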

Finally, convert the graph object to JSON with py2cytoscape and draw the graph with the default settings.

graph_data = util.from_igraph(g)
Cytoscape(data=graph_data)

I got the following image.

And a closer look at the network.


Cyjupyter is a useful network drawing tool for Jupyter notebook users. Next I would like to look into how to control the visualization.
My example code is uploaded to my repository. The URL is below.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/Cyjupyter.ipynb


New fingerprint/MinHash FingerPrint #RDKit #Chemoinformatics

Recently I found an article that describes a new method for fast fingerprint calculation.
You can read the article on ChemRxiv. The URL is below.
https://chemrxiv.org/articles/A_Probabilistic_Molecular_Fingerprint_for_Big_Data_Settings/7176350
They used the MinHash method.
MinHash is a way to estimate Jaccard similarity very efficiently. The authors developed MHFP (MinHash Fingerprint) and compared its performance with ECFP4.
What is MinHash? For example, see:
http://mccormickml.com/2015/06/12/minhash-tutorial-with-python-code/
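
To make the idea concrete, here is a minimal MinHash sketch in plain Python. It is not the MHFP implementation: it only assumes that each molecule has already been turned into a set of hashable features, and the helper names (`_hash`, `minhash_signature`, `estimate_jaccard`) and the toy sets are hypothetical.

```python
import hashlib

def _hash(seed, item):
    """Deterministic 64-bit hash of an item, salted by the seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(items, n_hashes=256):
    """For each salted hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, item) for item in items) for seed in range(n_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minima agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

# Hypothetical feature sets standing in for the substructures of two molecules.
set_a = set(range(0, 60))
set_b = set(range(30, 90))
sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
print(exact_jaccard(set_a, set_b), estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of the size of the sets, which is what makes the comparison cheap. The estimate gets closer to the exact value as `n_hashes` grows.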
They discussed the performance of MHFP6 (6 means radius 3), and the fingerprint generally outperforms MHFP4 and the ECFPs.
Fig. 6 shows a performance analysis of k-nearest neighbor search, where MHFP6 works very well and is fast.

Fortunately, the authors released the source code on GitHub, so you can try it yourself.
https://github.com/reymond-group/mhfp

Now I tried it and compared the similarity values from ECFP and MHFP.
The code is below.

@jupyter notebook
Load packages.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()

Calculate fingerprints!

mols = [mol for mol in Chem.SDMolSupplier('cdk2.sdf') if mol is not None]
nmols = len(mols)
#Calc morgan fp
mg2fps = [AllChem.GetMorganFingerprintAsBitVect(mol, 3) for mol in mols]
#Calc min hash fp
mhfps = [mhfp_encoder.encode_mol(mol) for mol in mols]

Check them!

tanimoto_sim = []
for i in range(nmols):
    for j in range(i):
        tc = DataStructs.TanimotoSimilarity(mg2fps[i], mg2fps[j])
        tanimoto_sim.append(tc)
mhfps_sim = []
for i in range(nmols):
    for j in range(i):
        jaccard = 1. - MHFPEncoder.distance(mhfps[i], mhfps[j])
        mhfps_sim.append(jaccard)
a, b = np.polyfit(tanimoto_sim, mhfps_sim, 1)
# Use the fitted slope and intercept as floats; casting them to int would truncate them to 1 and 0
y2 = a * np.array(tanimoto_sim) + b
print(a, b)
> 1.033917242502858 -0.031604772419224866
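
A slope near 1 and an intercept near 0 say the two similarity scales agree on average, but they do not say how tight the relationship is. As a hedged sketch, the same least-squares fit plus a Pearson correlation coefficient can be written in plain Python; `linear_fit`, `pearson_r`, and the toy values below are hypothetical and are not the CDK2 results.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical similarity values for five molecule pairs.
toy_tanimoto = [0.10, 0.25, 0.40, 0.55, 0.80]
toy_mhfp = [0.08, 0.22, 0.41, 0.53, 0.79]
slope, intercept = linear_fit(toy_tanimoto, toy_mhfp)
r = pearson_r(toy_tanimoto, toy_mhfp)
print(slope, intercept, r)
```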

This result shows that ECFP6 and MHFP6 correlate well, I think.
Finally I made a simple scatter plot.

plt.scatter(tanimoto_sim, mhfps_sim)
plt.plot(tanimoto_sim, y2, color='black')
plt.xlabel('tanimoto')
plt.ylabel('mhfp sim')


All code is pushed to my repo.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/MHFP_example.ipynb

In summary, I tried MHFP and it shows a good correlation with ECFP.
I used a very small dataset (47 molecules), so I could not check the speed on a large dataset.
I would like to check that in the near future.

Last week, I participated in CBI and a software UGM.
I am happy that I had fruitful discussions and got many ideas for my next challenge!
;-)

3D Alignment function of RDKit #RDKit

During the UGM, I was interested in Ben Tehan & Rob Smith’s great work.
They showed me a nice example of molecular alignment with RDKit.
RDKit has several functions for 3D alignment. In drug discovery, 3D alignment of ligands is important not only for computational chemists but also for medicinal chemists. After their presentation I talked with them, and they told me that GetCrippenO3A is useful for 3D alignment.
Hmm, that sounds interesting.
I tried the function.
My example code is below. The following code runs in an IPython notebook. To visualize the 3D structures I used py3Dmol, which can show multiple 3D molecules and is easy to use!

import py3Dmol
import copy
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import AllChem
from rdkit.Chem import rdBase
from rdkit.Chem import rdMolAlign
from rdkit.Chem import rdMolDescriptors
import numpy as np
p = AllChem.ETKDGv2()
p.verbose = True

If p.verbose is set to True, you get the SMARTS patterns that hit the ETKDG definitions, like below.

[cH0:1]c:2!@;-[O:3][C:4]: 8 7 6 5, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 3 5 6 7, (0, 0, 4, 0, 0, 0)
[O:1][CX4:2]!@;-[CX3:3]=[O:4]: 6 5 3 4, (0, 3, 0, 0, 0, 0)
[C:1][CX4:2]!@;-[CX3:3]=[O:4]: 0 1 3 4, (0, 0, 1, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][OX2:4]: 10 11 12 16, (0, 0, 3, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)
[C:1][CX4H2:2]!@;-[OX2:3][c:4]: 12 11 10 5, (0, 0, 4, 0, 0, 0)
[OX2:1][CX4:2]!@;-[CX4:3][N:4]: 10 11 12 17, (0, 0, 4, 0, 0, 0)
[cH0:1]c:2!@;-[O:3][C:4]: 3 5 10 11, (100, 0, 0, 0, 0, 0)

Next, load the molecules and generate conformers. I used cdk2.sdf, which is provided with RDKit as a sample.

mols = [m for m in Chem.SDMolSupplier('cdk2.sdf') if m is not None][:6]
for mol in mols:
    mol.RemoveAllConformers()
hmols_1 = [Chem.AddHs(m) for m in mols]
hmols_2 = copy.deepcopy(hmols_1)
# Generate 100 conformers for each molecule
for mol in hmols_1:
    AllChem.EmbedMultipleConfs(mol, 100, p)

for mol in hmols_2:
    AllChem.EmbedMultipleConfs(mol, 100, p)
# for Ipython notebook
Draw.MolsToGridImage(mols)

To run GetCrippenO3A and GetO3A, I calculated the Crippen contributions of each atom and the MMFF parameters of each molecule.

crippen_contribs = [rdMolDescriptors._CalcCrippenContribs(mol) for mol in hmols_1]
crippen_ref_contrib = crippen_contribs[0]
crippen_prob_contribs = crippen_contribs[1:]
ref_mol1 = hmols_1[0]
prob_mols_1 = hmols_1[1:]

mmff_params = [AllChem.MMFFGetMoleculeProperties(mol) for mol in hmols_2]
mmff_ref_param = mmff_params[0]
mmff_prob_params = mmff_params[1:]
ref_mol2 = hmols_2[0]
prob_mols_2 = hmols_2[1:]

OK, let’s align the molecules and visualize them!
For each molecule I picked the conformer with the best alignment score and added it to the viewer.
For CrippenO3A…

p_crippen = py3Dmol.view(width=600, height=400)
p_crippen.addModel(Chem.MolToMolBlock(ref_mol1), 'sdf')
crippen_score = []
for idx, mol in enumerate(prob_mols_1):
    tempscore = []
    for cid in range(100):
        crippenO3A = rdMolAlign.GetCrippenO3A(mol, ref_mol1, crippen_prob_contribs[idx], crippen_ref_contrib, cid, 0)
        crippenO3A.Align()
        tempscore.append(crippenO3A.Score())
    best = np.argmax(tempscore)
    p_crippen.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')
    crippen_score.append(tempscore[best])
p_crippen.setStyle({'stick':{}})
p_crippen.render()
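
The selection step inside the loop above (collect one score per conformer, then keep the highest) can be shown in isolation. This is a small sketch with made-up scores; `best_conformer` and `toy_scores` are hypothetical names and no alignment is actually performed here.

```python
def best_conformer(scores):
    """Return (conformer_id, score) of the highest-scoring conformer."""
    best_id = max(range(len(scores)), key=lambda cid: scores[cid])
    return best_id, scores[best_id]

# Hypothetical alignment scores: one inner list per probe molecule,
# one value per conformer of that molecule.
toy_scores = [
    [41.2, 55.0, 48.7],
    [30.1, 29.9, 36.4],
]
best = [best_conformer(scores) for scores in toy_scores]
print(best)
```

Like `np.argmax`, this returns the first index when two conformers tie. The result here is a plain Python `int`, so the `int(best)` cast used above with the NumPy result is not needed.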

For O3A…

p_O3A = py3Dmol.view(width=600, height=400)
p_O3A.addModel(Chem.MolToMolBlock(ref_mol2), 'sdf')
pyO3A_score = []
for idx, mol in enumerate(prob_mols_2):
    tempscore = []
    for cid in range(100):
        pyO3A = rdMolAlign.GetO3A(mol, ref_mol2, mmff_prob_params[idx], mmff_ref_param, cid, 0)
        pyO3A.Align()
        tempscore.append(pyO3A.Score())
    best = np.argmax(tempscore)
    p_O3A.addModel(Chem.MolToMolBlock(mol, confId=int(best)), 'sdf')
    pyO3A_score.append(tempscore[best])
p_O3A.setStyle({'stick':{'colorscheme':'cyanCarbon'}})
p_O3A.render()

In my example, both methods show good results. To check the details, I will calculate ShapeTanimoto and/or RMSD, etc.
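
As a reminder of what the RMSD check would measure, here is a minimal sketch in plain Python. It assumes the two coordinate lists are already superimposed and list matched atoms in the same order, which is not automatically true for two different molecules; `rmsd` and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical coordinates in Angstrom: the second set is the first shifted by 1 along x.
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in ref]
print(rmsd(ref, ref), rmsd(ref, shifted))
```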

In summary, RDKit has many useful functions not only for 2D but also for 3D. I would like to use this function in my project.

I uploaded all the code of this post to my GitHub repo.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/rdkit_3d.ipynb

What is probabilistic programming?

I did not know what a PPL (probabilistic programming language) is.
Recently I learned about probabilistic programming and found a nice article on arXiv.
http://hirzels.com/martin/papers/arxiv18-deep-ppl.pdf
A deep probabilistic programming language is a language for specifying both deep NN and probabilistic models.
Probabilistic programming creates systems that help make decisions in the face of uncertainty.

In this article, the authors describe the pros and cons of deep learning and probabilistic programming.
The advantage of probabilistic programming (PP) is that it gives not only a prediction but also a probability. To my understanding, this means PP gives probabilistic models.
On the other hand, the advantage of deep learning (DL) is feature extraction from large datasets. DL gives high accuracy.
Stan and PyMC are major packages for PPL. If you have used TensorFlow for DL, I recommend considering Edward as one option.
http://edwardlib.org/
Edward does not support recent versions of TensorFlow. For new versions of TensorFlow, tensorflow-probability is recommended.

Today, I used Edward for PP.
The following example is a very simple case.

First, linear regression.
y = wx + b
Define a data builder and the variables.
In PP, w and b are stochastic variables, so define each of them with Normal().

%matplotlib inline
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal
from matplotlib import pyplot as plt
def buildtodayta(N, w):
    D = len(w)
    x = np.random.normal(0., 2., size=(N,D))
    y = np.dot(x, w) + np.random.normal(0., 0.01, size=N)
    return x, y
N = 40
D = 10
w_true = np.random.randn(D)*0.4
xt, yt = buildtodayta(N, w_true)
yt.shape, xt.shape

X = tf.placeholder(tf.float32, [N,D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w)+b, scale=tf.ones(N))
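
What it means for w and b to be stochastic can be sketched without Edward. The snippet below is only an illustration under simplifying assumptions: one input dimension instead of D = 10, standard normal priors, and hypothetical helper names (`sample_prior_lines`, `predict`). Every draw of (w, b) gives a different regression line.

```python
import random

def sample_prior_lines(n_lines, rng):
    """Draw (w, b) pairs from independent standard normal priors."""
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_lines)]

def predict(x, w, b):
    """Mean of y for input x under the line y = w*x + b."""
    return w * x + b

rng = random.Random(0)  # fixed seed so the draws are reproducible
lines = sample_prior_lines(3, rng)
for w, b in lines:
    print([round(predict(x, w, b), 2) for x in (-2.0, 0.0, 2.0)])
```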

Then define theta. Theta is the set of latent variables; in the following code, qw and qb are the variational approximations to w and b.

qw = Normal(loc=tf.get_variable("qw/loc", [D]), scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]), scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))
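
The scale of a Normal must be positive, while a free variable can take any real value. Passing the variable through softplus takes care of that. A minimal sketch of the function in plain Python (the name `softplus` here is my own helper, not the TensorFlow op):

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real x to a non-negative value."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for raw in (-5.0, 0.0, 5.0):
    print(raw, softplus(raw))
```

For large positive inputs softplus is close to the identity, and for large negative inputs it approaches zero from above.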

Finally, to find the optimal theta, minimize the Kullback-Leibler divergence between q(z) and the posterior p(z|x).

inference = ed.KLqp({w:qw, b:qb}, data={X:xt, y:yt})
inference.run(n_samples=5, n_iter=100)
y_post = ed.copy(y, {w: qw, b: qb})
def visualise(X_data, y_data, w, b, n_samples=10):
  w_samples = w.sample(n_samples)[:, 0].eval()
  b_samples = b.sample(n_samples).eval()
  plt.scatter(X_data[:, 0], y_data)
  plt.ylim([-10, 10])
  inputs = np.linspace(-8, 8, num=400)
  for ns in range(n_samples):
    output = inputs * w_samples[ns] + b_samples[ns]
    plt.plot(inputs, output)
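
To get a feeling for the quantity being minimized, the Kullback-Leibler divergence between two univariate normal distributions has a closed form. The sketch below only illustrates that formula; it is not how ed.KLqp computes its objective, and `kl_normal` is a hypothetical helper.

```python
import math

def kl_normal(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

print(kl_normal(0.0, 1.0, 0.0, 1.0))  # identical distributions
print(kl_normal(1.0, 1.0, 0.0, 1.0))  # same width, shifted mean
print(kl_normal(0.0, 0.5, 0.0, 1.0))  # same mean, narrower q
```

The divergence is zero only when q and p are the same distribution, and it is not symmetric in its two arguments.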

After learning, I can sample from qw and qb.
Let’s compare before and after.

Before learning, w and b are not optimized, so the regression lines are unreasonable.

visualise(xt, yt, w, b, n_samples=3)

After learning, the regression lines catch the trend of the dataset.

visualise(xt, yt, qw, qb, n_samples=3)

Next, try a cosine curve.

%matplotlib inline
import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from edward.models import Normal
plt.style.use('seaborn-pastel')
def build_toy_dataset(N=50, noise_std=0.1):
    x = np.linspace(-14,14, num=N)
    y = np.cos(x) + np.random.normal(0, noise_std, size=N)
    x = x.astype(np.float32).reshape((N, 1))
    y = y.astype(np.float32)
    return x, y
plt.scatter(build_toy_dataset()[0], build_toy_dataset()[1], c="b")

Next, define the DNN (2 layers, not so deeeeeeeeep ;-)).

def neural_network(x, W_0, W_1, b_0, b_1):
    h = tf.tanh(tf.matmul(x, W_0) + b_0)
    h = tf.matmul(h, W_1) + b_1
    return tf.reshape(h, [-1])
ed.set_seed(794)
N = 50
D = 1
x_train, y_train = build_toy_dataset(N)
W_0 = Normal(loc=tf.zeros([D,30]), scale=tf.ones([D,30]))
W_1 = Normal(loc=tf.zeros([30,1]), scale=tf.ones([30,1]))
b_0 = Normal(loc=tf.zeros(30), scale=tf.ones(30))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
x = x_train
y = Normal(loc = neural_network(x, W_0, W_1, b_0, b_1), scale=tf.ones(N) * 0.1)
qW_0 = Normal(loc=tf.get_variable("qW_0/loc", [D,30]), scale=tf.nn.softplus(tf.get_variable("qW_0/scale", [D,30])))
qW_1 = Normal(loc=tf.get_variable("qW_1/loc", [30,1]), scale=tf.nn.softplus(tf.get_variable("qW_1/scale", [30,1])))
qb_0 = Normal(loc=tf.get_variable("qb_0/loc", [30]), scale=tf.nn.softplus(tf.get_variable("qb_0/scale", [30])))
qb_1 = Normal(loc=tf.get_variable("qb_1/loc", [1]), scale=tf.nn.softplus(tf.get_variable("qb_1/scale", [1])))
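
The forward pass of this small network can be written in plain Python to see what one draw of the weights produces. This is a sketch for a single scalar input; the function mirrors `neural_network` above but works on Python lists, and `sample_standard_normal_weights` is a hypothetical helper that plays the role of sampling from the priors.

```python
import math
import random

def neural_network(x, W_0, W_1, b_0, b_1):
    """Two-layer network for a scalar x: tanh hidden layer, then a linear output."""
    hidden = [math.tanh(x * w + b) for w, b in zip(W_0, b_0)]
    return sum(h * w for h, w in zip(hidden, W_1)) + b_1

def sample_standard_normal_weights(n_hidden, rng):
    """Draw one set of weights and biases from standard normal priors."""
    W_0 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    W_1 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    b_0 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    b_1 = rng.gauss(0.0, 1.0)
    return W_0, W_1, b_0, b_1

rng = random.Random(794)
weights = sample_standard_normal_weights(30, rng)
curve = [neural_network(x, *weights) for x in (-1.0, 0.0, 1.0)]
print(curve)
```

Each new call to `sample_standard_normal_weights` gives a different curve, which corresponds to the spread of the prior draws in the plot.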

rs = np.random.RandomState(0)
inputs = np.linspace(-9, 9, num=400, dtype=np.float32)
x = tf.expand_dims(inputs, 1)
mus = tf.stack([neural_network(x, qW_0.sample(), qW_1.sample(), qb_0.sample(), qb_1.sample()) for _ in range(10)])
sess = ed.get_session()
tf.global_variables_initializer().run()
outputs = mus.eval()

fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(111)
ax.set_title('iteration: 0')
ax.plot(x_train, y_train, 'ks', alpha=0.5, label=('x', 'y'))
ax.plot(inputs, outputs[0].T, 'b', lw=2, alpha=0.5, label='prior draws')
ax.plot(inputs, outputs[1:].T, 'y', lw=2, alpha=0.5)
ax.set_xlim([-10, 10])
ax.set_ylim([-3,3])
ax.legend()
plt.show()

inference = ed.KLqp({W_0: qW_0, b_0:qb_0, W_1:qW_1, b_1:qb_1}, data={y: y_train})
inference.run(n_iter=2000, n_samples=5)
outputs = mus.eval()
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(111)
ax.set_title('iteration: 2000')
ax.plot(x_train, y_train, 'ks', alpha=0.5, label=('x', 'y'))
ax.plot(inputs, outputs[0].T, 'b', alpha=0.5, lw=2, label='post draws')
ax.plot(inputs, outputs[1:].T, 'y', alpha=0.5, lw=2, label='post draws')
ax.set_xlim([-14, 14])
ax.set_ylim([-3,3])
ax.legend()
plt.show()

Development of new force field

Developing molecular force fields requires human time and expertise. Computational chemists often use force fields (FF) for their tasks, so the force field is a key ingredient of the calculations. But there is still room for improvement.
I did not know that the major FFs of the AMBER family do not have access to bond order information.

Recently I read an article about new FF development. The title is ‘Open Force Field Consortium: Escaping atom types using direct chemical perception with SMIRNOFF v0.1’
https://www.biorxiv.org/content/early/2018/07/13/286542.full.pdf+html

Fig. 1 in this article shows representative geometries of 1,2,3,4-tetraphenylbenzene. The geometries differ considerably between force fields.

The authors proposed a new approach called ‘direct chemical perception’ instead of atom typing.
This approach is based on SMIRKS patterns, and the format is named SMIRKS Native Open Force Field (SMIRNOFF).
This new FF is now implemented via OpenMM and the OpenEye tools. And good news! The authors plan to implement the FF in RDKit.

If you can use the OpenEye toolkits, you are lucky, I think!
Several benchmarks are provided in the article, and SMIRNOFF shows good performance.

The new force field SMIRNOFF is simpler than other FFs but powerful for describing the features of molecules.
I am looking forward to the progress of the project and would like to think about how open source and industry can work together.

If you are interested in this FF, you can get it from the URL below.
https://github.com/openforcefield

Calculate HOMO and LUMO with Psi4 revised #RDKit #Psi4

Yesterday, I got a comment from a reader.
According to the comment, the correct way to calculate HOMO and LUMO with Psi4 is below.

import psi4
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
psi4.core.set_output_file("output1.dat", True)
def mol2xyz(mol):
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, useExpTorsionAnglePrefs=True,useBasicKnowledge=True)
    AllChem.UFFOptimizeMolecule(mol)
    atoms = mol.GetAtoms()
    string = "\n"
    for i, atom in enumerate(atoms):
        pos = mol.GetConformer().GetAtomPosition(atom.GetIdx())
        string += "{} {} {} {}\n".format(atom.GetSymbol(), pos.x, pos.y, pos.z)
    string += "units angstrom\n"
    return string, mol

Next, calculate HOMO-LUMO of benzene with the function and psi4.

mol = Chem.MolFromSmiles("c1ccccc1")
xyz, mol=mol2xyz(mol)
psi4.set_memory('4 GB')
psi4.set_num_threads(4)
benz = psi4.geometry(xyz)
%time scf_e, scf_wfn = psi4.energy("B3LYP/cc-pVDZ", return_wfn=True)
> CPU times: user 13.8 s, sys: 226 ms, total: 14 s
> Wall time: 4.51 s

After the calculation, I could access the HOMO and LUMO. The code is below.

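# Previous (off-by-one) indexing:
# HOMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha()]
# LUMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha() + 1]
# nalpha() is the number of occupied alpha orbitals, so with 0-based indexing the HOMO is at nalpha() - 1
HOMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha()-1]
LUMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha()]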
print(HOMO, LUMO, scf_e)
> -0.2529021800443842 -0.006506519238935586 -232.26252757817264
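
The indexing can be checked by hand against the orbital energies in the log below. Here is a small sketch in plain Python, assuming a closed-shell molecule and an energy list sorted in ascending order; `frontier_orbitals` is a hypothetical helper, the six energies are orbitals 19A to 24A copied from the log, and the conversion factor is the approximate value of one hartree in eV.

```python
HARTREE_TO_EV = 27.211386  # approximate conversion factor

def frontier_orbitals(orbital_energies, n_occupied):
    """Return (HOMO, LUMO) from ascending orbital energies with 0-based indexing."""
    homo = orbital_energies[n_occupied - 1]  # last occupied orbital
    lumo = orbital_energies[n_occupied]      # first virtual orbital
    return homo, lumo

# Orbitals 19A-24A from the benzene log, in hartree.
energies_19_to_24 = [-0.344262, -0.252905, -0.252902, -0.006507, -0.006506, 0.059666]
# Three of these six orbitals (19A, 20A, 21A) are doubly occupied.
homo, lumo = frontier_orbitals(energies_19_to_24, 3)
gap_ev = (lumo - homo) * HARTREE_TO_EV
print(homo, lumo, gap_ev)
```

With the earlier indexing the two values would have been orbitals 22A and 23A, which are both virtual and nearly degenerate. That explains why the first attempt printed two almost identical numbers.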

Check log file

!cat out2.dat

  Memory set to   3.725 GiB by Python driver.
  Threads set to 4 by Python driver.

*** tstart() called on takayukis-MacBook-Pro.local
*** at Mon Aug 27 22:26:56 2018

   => Loading Basis Set <=

    Name: CC-PVDZ
    Role: ORBITAL
    Keyword: BASIS
    atoms 1-6  entry C          line   138 file /Users/iwatobipen/.pyenv/versions/anaconda3-4.2.0/share/psi4/basis/cc-pvdz.gbs 
    atoms 7-12 entry H          line    22 file /Users/iwatobipen/.pyenv/versions/anaconda3-4.2.0/share/psi4/basis/cc-pvdz.gbs 

    There are an even number of electrons - assuming singlet.
    Specify the multiplicity in the molecule input block.


         ---------------------------------------------------------
                                   SCF
            by Justin Turney, Rob Parrish, Andy Simmonett
                             and Daniel Smith
                              RKS Reference
                        4 Threads,   3814 MiB Core
         ---------------------------------------------------------

  ==> Geometry <==

    Molecular point group: c1
    Full point group: C1

    Geometry (in Angstrom), charge = 0, multiplicity = 1:

       Center              X                  Y                   Z               Mass       
    ------------   -----------------  -----------------  -----------------  -----------------
         C            1.160791903590    -0.776928934439    -0.079150803890    12.000000000000
         C            1.255739388827     0.616792238521    -0.002681541758    12.000000000000
         C            0.094947479782     1.393721178119     0.076469246393    12.000000000000
         C           -1.160791914824     0.776928940041     0.079150771790    12.000000000000
         C           -1.255739387065    -0.616792248432     0.002681608472    12.000000000000
         C           -0.094947472815    -1.393721173263    -0.076469222838    12.000000000000
         H            2.058813166717    -1.377983059255    -0.140384154914     1.007825032070
         H            2.227214722117     1.093960079368    -0.004756245235     1.007825032070
         H            0.168401517636     2.471943131887     0.135627614553     1.007825032070
         H           -2.058813208566     1.377983069413     0.140383740077     1.007825032070
         H           -2.227214704360    -1.093960120700     0.004756292428     1.007825032070
         H           -0.168401463717    -2.471943107238    -0.135627939511     1.007825032070

  Running in c1 symmetry.

  Rotational constants: A =      0.18924  B =      0.18924  C =      0.09462 [cm^-1]
  Rotational constants: A =   5673.32397  B =   5673.32388  C =   2836.66196 [MHz]
  Nuclear repulsion =  203.019334728971273

  Charge       = 0
  Multiplicity = 1
  Electrons    = 42
  Nalpha       = 21
  Nbeta        = 21

  ==> Algorithm <==

  SCF Algorithm Type is DF.
  DIIS enabled.
  MOM disabled.
  Fractional occupation disabled.
  Guess Type is SAD.
  Energy threshold   = 1.00e-06
  Density threshold  = 1.00e-06
  Integral threshold = 0.00e+00

  ==> Primary Basis <==

  Basis Set: CC-PVDZ
    Blend: CC-PVDZ
    Number of shells: 54
    Number of basis function: 114
    Number of Cartesian functions: 120
    Spherical Harmonics?: true
    Max angular momentum: 2

  ==> DFT Potential <==

   => Composite Functional: B3LYP <= 

    B3LYP Hyb-GGA Exchange-Correlation Functional

    P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch, J. Phys. Chem. 98, 11623 (1994)

    Deriv               =              1
    GGA                 =           TRUE
    Meta                =          FALSE

    Exchange Hybrid     =           TRUE
    MP2 Hybrid          =          FALSE

   => Exchange Functionals <=

    0.0800   Slater exchange
    0.7200         Becke 88

   => Exact (HF) Exchange <=

    0.2000               HF 

   => Correlation Functionals <=

    0.1900   Vosko, Wilk & Nusair (VWN5_RPA)
    0.8100   Lee, Yang & Parr

   => Molecular Quadrature <=

    Radial Scheme       =       TREUTLER
    Pruning Scheme      =           FLAT
    Nuclear Scheme      =       TREUTLER

    BS radius alpha     =              1
    Pruning alpha       =              1
    Radial Points       =             75
    Spherical Points    =            302
    Total Points        =         266172
    Total Blocks        =           2084
    Max Points          =            255
    Max Functions       =            114

   => Loading Basis Set <=

    Name: (CC-PVDZ AUX)
    Role: JKFIT
    Keyword: DF_BASIS_SCF
    atoms 1-6  entry C          line   121 file /Users/iwatobipen/.pyenv/versions/anaconda3-4.2.0/share/psi4/basis/cc-pvdz-jkfit.gbs 
    atoms 7-12 entry H          line    51 file /Users/iwatobipen/.pyenv/versions/anaconda3-4.2.0/share/psi4/basis/cc-pvdz-jkfit.gbs 

  ==> Pre-Iterations <==

   -------------------------------------------------------
    Irrep   Nso     Nmo     Nalpha   Nbeta   Ndocc  Nsocc
   -------------------------------------------------------
     A        114     114       0       0       0       0
   -------------------------------------------------------
    Total     114     114      21      21      21       0
   -------------------------------------------------------

  ==> Integral Setup <==

  DFHelper Memory: AOs need 0.070 [GiB]; user supplied 2.794 [GiB]. Using in-core AOs.

  ==> MemDFJK: Density-Fitted J/K Matrices <==

    J tasked:                   Yes
    K tasked:                   Yes
    wK tasked:                   No
    OpenMP threads:               4
    Memory (MB):               2861
    Algorithm:                 Core
    Schwarz Cutoff:           1E-12
    Mask sparsity (%):       0.3693
    Fitting Condition:        1E-12

   => Auxiliary Basis Set <=

  Basis Set: (CC-PVDZ AUX)
    Blend: CC-PVDZ-JKFIT
    Number of shells: 198
    Number of basis function: 558
    Number of Cartesian functions: 636
    Spherical Harmonics?: true
    Max angular momentum: 3

  Minimum eigenvalue in the overlap matrix is 3.7184237071E-04.
  Using Symmetric Orthogonalization.

  SCF Guess: Superposition of Atomic Densities via on-the-fly atomic UHF.

  ==> Iterations <==

                           Total Energy        Delta E     RMS |[F,P]|

   @DF-RKS iter   0:  -232.99504225236745   -2.32995e+02   7.10044e-02 
   @DF-RKS iter   1:  -232.10322360969730    8.91819e-01   8.92096e-03 
   @DF-RKS iter   2:  -232.08890690303650    1.43167e-02   9.91157e-03 DIIS
   @DF-RKS iter   3:  -232.26148696855563   -1.72580e-01   7.72966e-04 DIIS
   @DF-RKS iter   4:  -232.26244386501853   -9.56896e-04   1.55832e-04 DIIS
   @DF-RKS iter   5:  -232.26249567324737   -5.18082e-05   1.40053e-04 DIIS
   @DF-RKS iter   6:  -232.26252736183829   -3.16886e-05   1.12370e-05 DIIS
   @DF-RKS iter   7:  -232.26252757817264   -2.16334e-07   5.75991e-07 DIIS

  ==> Post-Iterations <==

    Orbital Energies [Eh]
    ---------------------

    Doubly Occupied:                                                      

       1A    -10.190567     2A    -10.190357     3A    -10.190356  
       4A    -10.189878     5A    -10.189877     6A    -10.189659  
       7A     -0.851906     8A     -0.745429     9A     -0.745428  
      10A     -0.602584    11A     -0.602584    12A     -0.521871  
      13A     -0.463024    14A     -0.444273    15A     -0.421160  
      16A     -0.421159    17A     -0.365359    18A     -0.344262  
      19A     -0.344262    20A     -0.252905    21A     -0.252902  

    Virtual:                                                              

      22A     -0.006507    23A     -0.006506    24A      0.059666  
      25A      0.099451    26A      0.099452    27A      0.133761  
      28A      0.133764    29A      0.150228    30A      0.153828  
      31A      0.277077    32A      0.277080    33A      0.295653  
      34A      0.295656    35A      0.405118    36A      0.408148  
      37A      0.464413    38A      0.465367    39A      0.512461  
      40A      0.512464    41A      0.517424    42A      0.517430  
      43A      0.519856    44A      0.519868    45A      0.526820  
      46A      0.540745    47A      0.592924    48A      0.592924  
      49A      0.647786    50A      0.647787    51A      0.650076  
      52A      0.650083    53A      0.667968    54A      0.720592  
      55A      0.788201    56A      0.828354    57A      0.857690  
      58A      0.883159    59A      0.883161    60A      0.924558  
      61A      0.992120    62A      0.992122    63A      1.005964  
      64A      1.005973    65A      1.011555    66A      1.011587  
      67A      1.059132    68A      1.071624    69A      1.071625  
      70A      1.216812    71A      1.280232    72A      1.280245  
      73A      1.456675    74A      1.476446    75A      1.476447  
      76A      1.510782    77A      1.538630    78A      1.598096  
      79A      1.598099    80A      1.606414    81A      1.606425  
      82A      1.633651    83A      1.682795    84A      1.682801  
      85A      1.692381    86A      1.692388    87A      1.725702  
      88A      1.795113    89A      1.795115    90A      1.839139  
      91A      1.839159    92A      1.844378    93A      1.848385  
      94A      1.848386    95A      1.883194    96A      1.981556  
      97A      1.981559    98A      1.989993    99A      1.990001  
     100A      2.021139   101A      2.191006   102A      2.261460  
     103A      2.371980   104A      2.426824   105A      2.426829  
     106A      2.435274   107A      2.435279   108A      2.641049  
     109A      2.641053   110A      2.695891   111A      2.795913  
     112A      2.914328   113A      2.914351   114A      3.646462  

    Final Occupation by Irrep:
              A 
    DOCC [    21 ]

  Energy converged.

  @DF-RKS Final Energy:  -232.26252757817264

   => Energetics <=

    Nuclear Repulsion Energy =            203.0193347289712733
    One-Electron Energy =                -713.2777384427657807
    Two-Electron Energy =                 306.1488686089047064
    DFT Exchange-Correlation Energy =     -28.1529924732828292
    Empirical Dispersion Energy =           0.0000000000000000
    VV10 Nonlocal Energy =                  0.0000000000000000
    Total Energy =                       -232.2625275781726373



Properties will be evaluated at   0.000000,   0.000000,   0.000000 [a0]

Properties computed using the SCF density matrix

  Nuclear Dipole Moment: [e a0]
     X:     0.0000      Y:    -0.0000      Z:    -0.0000

  Electronic Dipole Moment: [e a0]
     X:     0.0000      Y:    -0.0000      Z:     0.0000

  Dipole Moment: [e a0]
     X:     0.0000      Y:    -0.0000      Z:    -0.0000     Total:     0.0000

  Dipole Moment: [D]
     X:     0.0000      Y:    -0.0000      Z:    -0.0000     Total:     0.0000


*** tstop() called on takayukis-MacBook-Pro.local at Mon Aug 27 22:27:00 2018
Module time:
	user time   =      13.71 seconds =       0.23 minutes
	system time =       0.22 seconds =       0.00 minutes
	total time  =          4 seconds =       0.07 minutes
Total time:
	user time   =      27.58 seconds =       0.46 minutes
	system time =       0.47 seconds =       0.01 minutes
	total time  =         32 seconds =       0.53 minutes

That’s all. I am happy because I get many responses to my blog posts and have many opportunities to learn.

Calculate HOMO and LUMO with Psi4 #RDKit #Psi4

As you may know, Psi4 is an open-source suite of ab initio quantum chemistry programs designed for efficient, high-accuracy simulations of a variety of molecular properties. It is very easy to use and has an optional Python interface.
I think it is useful for us. Because Psi4 can be used from Python, we can integrate it with many Python libraries! It is also worth knowing that exchanging data between NumPy and Psi4 is very easy.

Today, I calculated HOMO and LUMO with Psi4 and RDKit.

The HOMO-LUMO gap is often used to estimate the risk of drug-induced phototoxicity.

Let's start! ;-)

First, import the libraries and define the mol2xyz function. The function generates a conformation and converts an RDKit mol object to xyz format.

psi4.core.set_output_file("output1.dat", True) is used for logging.

import psi4
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
psi4.core.set_output_file("output1.dat", True)
def mol2xyz(mol):
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, useExpTorsionAnglePrefs=True,useBasicKnowledge=True)
    AllChem.UFFOptimizeMolecule(mol)
    atoms = mol.GetAtoms()
    string = "\n"
    for i, atom in enumerate(atoms):
        pos = mol.GetConformer().GetAtomPosition(atom.GetIdx())
        string += "{} {} {} {}\n".format(atom.GetSymbol(), pos.x, pos.y, pos.z)
    string += "units angstrom\n"
    return string, mol
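
The string that mol2xyz builds is simply one line per atom followed by a units line. The formatting part can be sketched without RDKit; `to_psi4_geometry` is a hypothetical helper and the H2 coordinates are made up for illustration.

```python
def to_psi4_geometry(atoms, units="angstrom"):
    """Format (symbol, x, y, z) tuples as a geometry string with a trailing units line."""
    lines = [""]  # leading empty line, as in mol2xyz
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol} {x} {y} {z}")
    lines.append(f"units {units}")
    return "\n".join(lines) + "\n"

# Hypothetical H2 geometry in Angstrom.
h2 = [("H", 0.0, 0.0, 0.0), ("H", 0.0, 0.0, 0.74)]
geometry = to_psi4_geometry(h2)
print(geometry)
```

Stating the units explicitly matters because the RDKit conformer coordinates are in Angstrom.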

Next, calculate HOMO-LUMO of benzene with the function and psi4.

mol = Chem.MolFromSmiles("c1ccccc1")
xyz, mol=mol2xyz(mol)
psi4.set_memory('4 GB')
psi4.set_num_threads(4)
benz = psi4.geometry(xyz)
%time scf_e, scf_wfn = psi4.energy("B3LYP/cc-pVDZ", return_wfn=True)
>CPU times: user 13.5 s, sys: 207 ms, total: 13.7 s
>Wall time: 4.46 s

I set the return_wfn argument to True because I want to get the wave function information.
Energy calculation needs a lot of CPU time. To calculate many compounds, I need more machine power!

After the calculation, I could access the HOMO and LUMO. The code is below.

HOMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha()]
LUMO = scf_wfn.epsilon_a_subset('AO', 'ALL').np[scf_wfn.nalpha() + 1]
print(HOMO, LUMO, scf_e)
>-0.006507529999155065 -0.006506586740442874 -232.26253075556204

Yah, easy… But the values are quite different from the literature. Hmm, am I wrong? Maybe yes…
Any advice and suggestions will be greatly appreciated.

It is very easy to check the log.
I am reading the API and reference of Psi4 now!

!cat output1.dat

 Memory set to   3.725 GiB by Python driver.
  Threads set to 4 by Python driver.

  Memory set to   3.725 GiB by Python driver.
  Threads set to 4 by Python driver.

*** tstart() called on ********
*** at Fri Aug 24 22:21:33 2018

   => Loading Basis Set <=

    Name: CC-PVDZ
    Role: ORBITAL
    Keyword: BASIS
---------snip------------

    There are an even number of electrons - assuming singlet.
    Specify the multiplicity in the molecule input block.


         ---------------------------------------------------------
                                   SCF
            by Justin Turney, Rob Parrish, Andy Simmonett
                             and Daniel Smith
                              RKS Reference
                        4 Threads,   3814 MiB Core
         ---------------------------------------------------------

  ==> Geometry <==

    Molecular point group: c1
    Full point group: C1

    Geometry (in Angstrom), charge = 0, multiplicity = 1:

       Center              X                  Y                   Z               Mass       
    ------------   -----------------  -----------------  -----------------  -----------------
         C            0.806497703012    -1.143092245995     0.014915282772    12.000000000000
         C            1.393280339340     0.126835003153    -0.002136692164    12.000000000000
         C            0.586782120312     1.269928194219    -0.017051393298    12.000000000000
         C           -0.806497629817     1.143092609376    -0.014913861108    12.000000000000
         C           -1.393280087652    -0.126835526704     0.002137132311    12.000000000000
         C           -0.586782336814    -1.269927969465     0.017053614836    12.000000000000
         H            1.430428517155    -2.027419681898     0.026439293008     1.007825032070
         H            2.471161063731     0.224955697903    -0.003795091078     1.007825032070
         H            1.040731718362     2.252381167101    -0.030251698606     1.007825032070
         H           -1.430425978428     2.027420563468    -0.026458153907     1.007825032070
         H           -2.471160894666    -0.224959372018     0.003781521713     1.007825032070
         H           -1.040735716617    -2.252379143548     0.030235509138     1.007825032070

  Running in c1 symmetry.

  Rotational constants: A =      0.18924  B =      0.18924  C =      0.09462 [cm^-1]
  Rotational constants: A =   5673.32470  B =   5673.32274  C =   2836.66186 [MHz]
  Nuclear repulsion =  203.019333245138824

  Charge       = 0
  Multiplicity = 1
  Electrons    = 42
  Nalpha       = 21
  Nbeta        = 21

  ==> Algorithm <==

  SCF Algorithm Type is DF.
  DIIS enabled.
  MOM disabled.
  Fractional occupation disabled.
  Guess Type is SAD.
  Energy threshold   = 1.00e-06
  Density threshold  = 1.00e-06
  Integral threshold = 0.00e+00

  ==> Primary Basis <==

  Basis Set: CC-PVDZ
    Blend: CC-PVDZ
    Number of shells: 54
    Number of basis function: 114
    Number of Cartesian functions: 120
    Spherical Harmonics?: true
    Max angular momentum: 2

  ==> DFT Potential <==

   => Composite Functional: B3LYP <= 

    B3LYP Hyb-GGA Exchange-Correlation Functional

    P. J. Stephens, F. J. Devlin, C. F. Chabalowski, and M. J. Frisch, J. Phys. Chem. 98, 11623 (1994)

    Deriv               =              1
    GGA                 =           TRUE
    Meta                =          FALSE

    Exchange Hybrid     =           TRUE
    MP2 Hybrid          =          FALSE

   => Exchange Functionals <=

    0.0800   Slater exchange
    0.7200         Becke 88

   => Exact (HF) Exchange <=

    0.2000               HF 

   => Correlation Functionals <=

    0.1900   Vosko, Wilk & Nusair (VWN5_RPA)
    0.8100   Lee, Yang & Parr

   => Molecular Quadrature <=

    Radial Scheme       =       TREUTLER
    Pruning Scheme      =           FLAT
    Nuclear Scheme      =       TREUTLER

    BS radius alpha     =              1
    Pruning alpha       =              1
    Radial Points       =             75
    Spherical Points    =            302
    Total Points        =         266204
    Total Blocks        =           2062
    Max Points          =            256
    Max Functions       =            114

   => Loading Basis Set <=

    Name: (CC-PVDZ AUX)
    Role: JKFIT
    Keyword: DF_BASIS_SCF
-----snip--------
  ==> Pre-Iterations <==

   -------------------------------------------------------
    Irrep   Nso     Nmo     Nalpha   Nbeta   Ndocc  Nsocc
   -------------------------------------------------------
     A        114     114       0       0       0       0
   -------------------------------------------------------
    Total     114     114      21      21      21       0
   -------------------------------------------------------

  ==> Integral Setup <==

  DFHelper Memory: AOs need 0.070 [GiB]; user supplied 2.794 [GiB]. Using in-core AOs.

  ==> MemDFJK: Density-Fitted J/K Matrices <==

    J tasked:                   Yes
    K tasked:                   Yes
    wK tasked:                   No
    OpenMP threads:               4
    Memory (MB):               2861
    Algorithm:                 Core
    Schwarz Cutoff:           1E-12
    Mask sparsity (%):       0.3693
    Fitting Condition:        1E-12

   => Auxiliary Basis Set <=

  Basis Set: (CC-PVDZ AUX)
    Blend: CC-PVDZ-JKFIT
    Number of shells: 198
    Number of basis function: 558
    Number of Cartesian functions: 636
    Spherical Harmonics?: true
    Max angular momentum: 3

  Minimum eigenvalue in the overlap matrix is 3.7184246400E-04.
  Using Symmetric Orthogonalization.

  SCF Guess: Superposition of Atomic Densities via on-the-fly atomic UHF.

  ==> Iterations <==

                           Total Energy        Delta E     RMS |[F,P]|

   @DF-RKS iter   0:  -232.98648910745848   -2.32986e+02   7.09616e-02 
   @DF-RKS iter   1:  -232.10265029904261    8.83839e-01   8.93885e-03 
   @DF-RKS iter   2:  -232.08812748497797    1.45228e-02   9.76151e-03 DIIS
   @DF-RKS iter   3:  -232.26148987002321   -1.73362e-01   7.73194e-04 DIIS
   @DF-RKS iter   4:  -232.26244676165251   -9.56892e-04   2.20816e-04 DIIS
   @DF-RKS iter   5:  -232.26249884952190   -5.20879e-05   1.40071e-04 DIIS
   @DF-RKS iter   6:  -232.26253054068945   -3.16912e-05   1.12245e-05 DIIS
   @DF-RKS iter   7:  -232.26253075556204   -2.14873e-07   5.66817e-07 DIIS

  ==> Post-Iterations <==

    Orbital Energies [Eh]
    ---------------------

    Doubly Occupied:                                                      

       1A    -10.190570     2A    -10.190360     3A    -10.190358  
       4A    -10.189881     5A    -10.189879     6A    -10.189662  
       7A     -0.851912     8A     -0.745431     9A     -0.745426  
      10A     -0.602585    11A     -0.602585    12A     -0.521876  
      13A     -0.463024    14A     -0.444275    15A     -0.421161  
      16A     -0.421155    17A     -0.365362    18A     -0.344264  
      19A     -0.344262    20A     -0.252907    21A     -0.252901  

    Virtual:                                                              

      22A     -0.006508    23A     -0.006507    24A      0.059658  
      25A      0.099450    26A      0.099455    27A      0.133761  
      28A      0.133763    29A      0.150228    30A      0.153828  
      31A      0.277076    32A      0.277078    33A      0.295654  
      34A      0.295658    35A      0.405112    36A      0.408146  
      37A      0.464412    38A      0.465356    39A      0.512461  
      40A      0.512461    41A      0.517423    42A      0.517433  
      43A      0.519852    44A      0.519872    45A      0.526750  
      46A      0.540745    47A      0.592922    48A      0.592924  
      49A      0.647783    50A      0.647787    51A      0.650079  
      52A      0.650080    53A      0.667968    54A      0.720591  
      55A      0.788173    56A      0.828353    57A      0.857688  
      58A      0.883160    59A      0.883161    60A      0.924558  
      61A      0.992121    62A      0.992123    63A      1.005962  
      64A      1.005979    65A      1.011551    66A      1.011617  
      67A      1.059133    68A      1.071623    69A      1.071625  
      70A      1.216802    71A      1.280227    72A      1.280249  
      73A      1.456674    74A      1.476443    75A      1.476447  
      76A      1.510781    77A      1.538598    78A      1.598094  
      79A      1.598098    80A      1.606419    81A      1.606454  
      82A      1.633651    83A      1.682793    84A      1.682805  
      85A      1.692380    86A      1.692389    87A      1.725701  
      88A      1.795112    89A      1.795115    90A      1.839139  
      91A      1.839180    92A      1.844376    93A      1.848382  
      94A      1.848384    95A      1.883194    96A      1.981552  
      97A      1.981560    98A      1.989989    99A      1.990002  
     100A      2.021138   101A      2.191006   102A      2.261459  
     103A      2.371976   104A      2.426825   105A      2.426828  
     106A      2.435277   107A      2.435291   108A      2.641052  
     109A      2.641053   110A      2.695891   111A      2.795915  
     112A      2.914318   113A      2.914347   114A      3.646453  

    Final Occupation by Irrep:
              A 
    DOCC [    21 ]

  Energy converged.

  @DF-RKS Final Energy:  -232.26253075556204

   => Energetics <=

    Nuclear Repulsion Energy =            203.0193332451388244
    One-Electron Energy =                -713.2776809379226961
    Two-Electron Energy =                 306.1488028199291875
    DFT Exchange-Correlation Energy =     -28.1529858827073518
    Empirical Dispersion Energy =           0.0000000000000000
    VV10 Nonlocal Energy =                  0.0000000000000000
    Total Energy =                       -232.2625307555620680



Properties will be evaluated at   0.000000,   0.000000,   0.000000 [a0]

Properties computed using the SCF density matrix

  Nuclear Dipole Moment: [e a0]
     X:    -0.0000      Y:    -0.0000      Z:    -0.0000

  Electronic Dipole Moment: [e a0]
     X:     0.0000      Y:    -0.0000      Z:     0.0000

  Dipole Moment: [e a0]
     X:    -0.0000      Y:    -0.0000      Z:    -0.0000     Total:     0.0000

  Dipole Moment: [D]
     X:    -0.0000      Y:    -0.0000      Z:    -0.0000     Total:     0.0000


*** tstop() ********l at Fri Aug 24 22:21:37 2018
Module time:
	user time   =      13.48 seconds =       0.22 minutes
	system time =       0.21 seconds =       0.00 minutes
	total time  =          4 seconds =       0.07 minutes
Total time:
	user time   =      13.48 seconds =       0.22 minutes
	system time =       0.21 seconds =       0.00 minutes
	total time  =          4 seconds =       0.07 minutes
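
The orbital table in the log above can also be read programmatically, which gives a second way to check the HOMO and LUMO values. The sketch below is only an illustration: frontier_orbitals is a hypothetical helper, and it assumes a closed-shell run in C1 symmetry, so that the table has one "Doubly Occupied:" block, one "Virtual:" block, and labels of the form 21A. The sample text is a shortened excerpt of the table.

```python
import re


def frontier_orbitals(log_text):
    """Return (HOMO, LUMO) in Eh from a Psi4 'Orbital Energies' table.

    Hypothetical helper; assumes RKS/RHF output with C1 labels such as '21A'.
    """
    # split the table into its occupied and virtual blocks
    table = log_text.split("Doubly Occupied:", 1)[1]
    occupied_part, virtual_part = table.split("Virtual:", 1)
    # stop before the occupation summary that follows the table
    virtual_part = virtual_part.split("Final Occupation", 1)[0]

    # each entry looks like "21A     -0.252901"
    entry = re.compile(r"\d+A\s+(-?\d+\.\d+)")
    occupied = [float(v) for v in entry.findall(occupied_part)]
    virtual = [float(v) for v in entry.findall(virtual_part)]
    return max(occupied), min(virtual)


# shortened excerpt of the orbital table from the log
sample = """
    Doubly Occupied:

      19A     -0.344262    20A     -0.252907    21A     -0.252901

    Virtual:

      22A     -0.006508    23A     -0.006507    24A      0.059658

    Final Occupation by Irrep:
"""

homo, lumo = frontier_orbitals(sample)
print(homo, lumo)
```

In the notebook the same function could be applied to the full contents of the output file instead of the excerpt.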
