Use SAPT/FISAPT more easily #chemoinformatics #psikit #quantumchemistry

I'm currently developing a new feature for psikit. It is not implemented in the original repository yet, but I have tested it in my own fork.

The new method is named SAPT.

Symmetry-adapted perturbation theory (SAPT) provides a means of directly computing the noncovalent interaction energy between two molecules. Like PIEDA (pair interaction energy decomposition analysis) in the fragment molecular orbital (FMO) method, SAPT decomposes the interaction energy into physically meaningful components: electrostatic, exchange, induction, and dispersion terms.
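For reference, the standard SAPT0 partitioning (see the Psi4 manual linked at the end of this post) sums the terms printed by the example below into the total interaction energy:

```latex
E_{\mathrm{int}}^{\mathrm{SAPT0}} =
    E_{\mathrm{elst}}^{(10)}
  + E_{\mathrm{exch}}^{(10)}
  + E_{\mathrm{ind,resp}}^{(20)}
  + E_{\mathrm{exch\text{-}ind,resp}}^{(20)}
  + E_{\mathrm{disp}}^{(20)}
  + E_{\mathrm{exch\text{-}disp}}^{(20)}
```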

F/I-SAPT extends this idea: functional-group SAPT (F-SAPT) partitions the various SAPT terms into effective two-body contributions between localized chemical functional groups, and intramolecular SAPT (I-SAPT) handles interactions within a single molecule. I think SAPT, and especially F-SAPT, is useful for chemists; however, preparing input files for the original Psi4 implementation is a little cumbersome.

So I'm implementing this functionality in psikit to make F-SAPT/SAPT easy to use.

For SAPT, the psi4numpy repository provides a useful example of a SAPT calculation, so most of the code is borrowed from that repo.

The following code shows basic SAPT usage. For convenience, the user only needs to provide two mol files with 3D structures, one for each monomer.

In my implementation, I defined a Sapt class that is separate from the Psikit class. The code is below.

from psikit import Sapt
sapt = Sapt()
# load monomer 1 and monomer 2 from mol files
sapt.monomer1_from_molfile('water1.mol')
sapt.monomer2_from_molfile('water2.mol')
# build the dimer geometry
# sapt.make_dimer()
# run the SAPT0 calculation; just call run_sapt()
sapt0, Exch100, Elst10, Disp200, ExchDisp20, Ind20r, ExchInd20r = sapt.run_sapt()
print('total sapt0', sapt0)
print('Exch10 (S^2)', Exch100)
print('Elst10', Elst10)
print('Disp20', Disp200)
print('Exch-Disp20', ExchDisp20)
print('Ind20,r', Ind20r)
print('Exch-Ind20,r', ExchInd20r)

The next example is F-SAPT. I had some trouble with this implementation, because Psi4's F-SAPT ships a useful post-processing script named fsapt.py, but the user must provide fA.dat and fB.dat files that define the atom indices of each functional group. At first I tried to generate these files with RDKit's pharmacophore modules, but that failed because the calculation does not allow one atom to be assigned to multiple groups. So I switched from the pharmacophore function to the RECAP function, and I also added some helper functions.
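For reference, fA.dat and fB.dat are plain-text files: one line per functional group, a group label followed by the atom indices belonging to it. The group names and indices below are purely illustrative, not taken from an actual input:

```
Hydroxyl 1 2
Phenyl 3 4 5 6 7 8
```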

Finally, I checked the function with a phenol dimer.

I think the dev version of psikit can perform an F-SAPT calculation easily.

import psikit
sapt = psikit.Sapt()
sapt.monomer1_from_molfile('phenol1.mol')
sapt.monomer2_from_molfile('phenol2.mol')
sapt.make_dimer()
sapt.run_fisapt()

After calling run_fisapt(), an fsapt folder is generated; all results, including the fA.dat and fB.dat files, are stored in that folder.

Then go to the fsapt folder and run fsapt.py, the post-processing script.

The resulting fsapt.dat file shows the functional-group-based interactions, like below.

=> Full Analysis <=

Frag1 Frag2 Elst Exch IndAB IndBA Disp Total
O_0 O_0 -8.730 6.201 -0.512 -1.546 -1.118 -5.704
O_0 c1ccccc1_0 2.471 0.686 0.256 -0.346 -0.661 2.406
O_0 Link-1 -0.981 0.029 -0.090 -0.122 -0.149 -1.313
c1ccccc1_0 O_0 -2.992 0.670 -0.083 -0.299 -0.558 -3.263
c1ccccc1_0 c1ccccc1_0 1.833 2.115 0.032 -0.245 -2.287 1.449
c1ccccc1_0 Link-1 -1.116 0.123 -0.074 -0.047 -0.117 -1.231
Link-1 O_0 1.261 0.002 -0.050 0.178 -0.107 1.285
Link-1 c1ccccc1_0 -1.514 0.038 0.027 0.107 -0.105 -1.447
Link-1 Link-1 0.673 0.003 -0.008 0.025 -0.019 0.674
O_0 All -7.239 6.916 -0.347 -2.014 -1.927 -4.611
c1ccccc1_0 All -2.275 2.908 -0.124 -0.590 -2.963 -3.045
Link-1 All 0.419 0.043 -0.030 0.310 -0.231 0.512
All O_0 -10.461 6.873 -0.644 -1.667 -1.783 -7.681
All c1ccccc1_0 2.789 2.839 0.315 -0.484 -3.053 2.407
All Link-1 -1.423 0.155 -0.173 -0.144 -0.285 -1.870
All All -9.095 9.868 -0.502 -2.295 -5.121 -7.145

=> Reduced Analysis <=

Frag1 Frag2 Elst Exch IndAB IndBA Disp Total
O_0 O_0 -8.421 6.218 -0.584 -1.512 -1.250 -5.549
O_0 c1ccccc1_0 1.392 0.720 0.222 -0.347 -0.793 1.194
c1ccccc1_0 O_0 -2.751 0.733 -0.147 -0.227 -0.675 -3.067
c1ccccc1_0 c1ccccc1_0 0.686 2.197 0.007 -0.208 -2.403 0.278
O_0 All -7.030 6.938 -0.362 -1.859 -2.043 -4.356
c1ccccc1_0 All -2.065 2.930 -0.140 -0.435 -3.078 -2.789
All O_0 -11.173 6.951 -0.731 -1.739 -1.925 -8.617
All c1ccccc1_0 2.078 2.917 0.229 -0.556 -3.196 1.472
All All -9.095 9.868 -0.502 -2.295 -5.121 -7.145

And if you have PyMOL, launch it and run the @run.pymol command from the PyMOL command line; you can then get an interaction view like the image below.

This is still a test version, not a finalized feature; I need to add test code and review it. After that, I would like to send a pull request to the original repo.

References
http://www.psicode.org/psi4manual/master/sapt.html?highlight=sapt
http://www.psicode.org/psi4manual/1.2/fisapt.html


Move your Building Block shelf from internal to external! #memo #journal #medchem

When I try to build a compound library with parallel chemistry, collecting and weighing the building blocks is a very time-consuming part. It also takes a long time if the required building blocks are not available on the internal reagent shelf.

Some reagent vendors offer services in which they weigh out reagents and deliver them. It seems very efficient.

Researchers at Pfizer and MilliporeSigma reported their strategy, named Quick Building Blocks (QBB), in ACS Medicinal Chemistry Letters.

They designed the strategy because vendors now offer lots of attractive BBs, and the selection is still expanding.

They named the strategy a 'just-in-time' purchasing model. They selected vendors that met their defined criteria, listed in Table 1; the criteria include weighing out custom BB quantities, of course. This means chemists don't need to do inventory control and can purchase BBs sized to the library scale.

Table 2 shows the shipping metrics of the model for 2016-2018.

For US-to-US shipments it took about 4 days, and for US-to-Asia shipments, 6-6.5 days.

How do you feel about that? I think it's great if hundreds or more pre-weighed BBs can be delivered in 4 days.

In addition, BB usage, properties, and other data can be retrieved and analyzed.

Such data will show trends in reagent usage across projects. It seems interesting.

Many parts of drug discovery are being outsourced. What is the core competence of a pharmaceutical company?

I’m always thinking about it…

Detect outliers with IsolationForest #chemoinformatics #scikit-learn #RDKit

Applicability domain (AD) is often discussed in chemoinformatics for QSAR.

This is because current QSAR models often can't predict compounds dissimilar to the training set, where dissimilarity is typically measured by Tanimoto similarity on molecular fingerprints.

So a QSAR model user needs to check whether a newly designed molecule is inside the model's AD or not.

Some days ago, I read an interesting arXiv article about synthetic route design with machine learning. The article used an isolation forest for outlier detection.

Isolation forest is a method for outlier detection, and I think it could be useful for defining the AD in QSAR. So I tried to apply the method to molecules.

The following code is very simple. First, calculate Morgan fingerprints.

Second, perform PCA for dimension reduction.

Third, fit the data to an isolation forest model. After building the model, the predict method returns 1 (inlier) or -1 (outlier).
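As a minimal sketch of that API on toy 2D data (not fingerprints; the data here is just illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 2))           # dense cluster around the origin
X_query = np.array([[0.0, 0.0], [8.0, 8.0]])  # a central point and a far-away point

clf = IsolationForest(random_state=0).fit(X_train)
print(clf.predict(X_query))  # [ 1 -1]: inlier, outlier
```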

OK let’s go coding!

Import library and load SDF.

%matplotlib inline
import os
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit import RDConfig
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import numpy as np
plt.style.use('ggplot')
sdf = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Then calculate the molecular fingerprints and perform PCA. It is very easy ;)

X = [list(AllChem.GetMorganFingerprintAsBitVect(mol, 2)) for mol in mols]
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcares = pca.fit_transform(X)

Next, train the isolation forest with the calculated (PCA-reduced) fingerprints.

isoforest = IsolationForest(contamination='auto')  # 'legacy' is removed in newer scikit-learn
isoforest_res = isoforest.fit_predict(pcares)  # fit_predict both fits and labels the training data

Now we're almost ready. Map the in/out regions with a meshed grid and plot the training data on top. IsolationForest has a decision_function method that scores how anomalous the input data is.

xx, yy = np.meshgrid(np.linspace(-5,5, 50), np.linspace(-5,5, 50))
cmap = {1:'blue', -1:'red'}
cs = [cmap[k] for k in isoforest_res]
z = isoforest.decision_function(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)  # contourf needs z shaped like the grid
plt.title("IsolationForest")
plt.contourf(xx, yy, z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(pcares[:, 0], pcares[:, 1], c=cs,
                 s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))

plt.show()

The outermost points are assigned as outliers (red points); however, the result is not very clear.

You can see clearer examples in the scikit-learn documentation.

I think applying an isolation forest to a PCA projection like this is not well suited for AD detection.

But the method is useful for outlier detection.
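As a sketch of how the fitted model could serve as a crude AD check for a new molecule, reusing the pca and isoforest objects fitted above (ad_score is a hypothetical helper name, not part of any library):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def ad_score(smiles, pca, isoforest):
    """Project a molecule into the fitted PCA space and score it with the isolation forest.

    Higher decision_function values mean 'more normal',
    i.e. closer to the training-set applicability domain."""
    mol = Chem.MolFromSmiles(smiles)
    fp = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2))
    xy = pca.transform([fp])
    return float(isoforest.decision_function(xy)[0])
```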

By the way, medicinal chemists can predict molecular properties with high accuracy when their project is at a late stage, so a QSAR model doesn't add much there. But the early stage of drug discovery has limited data, so it is difficult to build a suitable QSAR model for the project.

So I think designing the experimental space is very difficult. How to use QSAR models and how to build the experimental space are key questions in current medicinal chemistry.

How do you design your SAR space and collect data?

Today’s code URL is below.

https://github.com/iwatobipen/chemobinder

New fluorination reaction from JACS #memo #organicchemistry

Trifluoromethylation is a useful reaction in drug discovery, because CF3 and other fluorinated substituents often appear in drug-like molecules. It is also valuable when the reaction can be conducted in an asymmetric manner.

I read an interesting article in JACS today, published by Trost's group. The URL of the article is below.
https://pubs.acs.org/doi/10.1021/jacs.9b06231

You can see the abstract without an account.

My boss when I was a student liked the pi-allyl palladium reaction known as the Tsuji-Trost reaction, so allyl-halide-based palladium coupling is familiar to me. But in that reaction, allyl bromide is used as the substrate, whereas this article uses an allyl fluoride. That is the key point of the reaction; it is well designed!

The reaction uses TMS-Rf reagents (Rf meaning CF3, C6F5, and some other fluorinated substituents). Fluoride released from the allyl fluoride activates the TMS-Rf reagent, and the reaction proceeds well.

They used a chiral ligand, so these reactions proceed enantioselectively.

You can see many examples in Scheme 2; all reactions show good yield and enantioselectivity. The examples in the article were limited to cyclic allyl fluorides. I suppose cyclic substrates are more suitable for stereoselective reactions than acyclic ones.

The process of making something new is very exciting for me. Not only drug discovery but also compound synthesis, reaction design, etc. ;-)

Run RDKit on the cloud #RDKit

This is the first post from my new PC with a Linux environment.

Recently I decided to move from a MacBook Pro (MBP) to a Dell XPS 15, because recent MBPs don't have an NVIDIA GPU, which is not so attractive for me. The XPS 15 came with Win10 as the default OS; I removed Win10 ASAP and installed Ubuntu 18.04.

Now the OS works well without any problems ;-)

I have posted about running RDKit on the cloud, such as on Google Colab, before. Google Colab is very useful because it offers free GPU access, even if usage time is limited. Today I tried another service, named Binder.

Binder integrates with a GitHub repo and makes it easy to build your own environment on the cloud.

I can reproduce the same environment on the cloud with Binder just by writing an environment.yml like the one below. The following example is very simple and uses only RDKit ;-)


name: py4chemoinfo
channels:
  - conda-forge
dependencies:
  - rdkit

After putting the environment file and a Jupyter notebook in the GitHub repository, it is ready to start.

Go to the Binder website, paste your GitHub repo URL, and push launch!

It will take several minutes to build the environment with RDKit, and after that you can access a cloud Python environment with RDKit!

The GitHub URL and the generated Binder URL are below.

https://github.com/iwatobipen/chemobinder

https://hub.gke.mybinder.org/user/iwatobipen-chemobinder-pq6paghq/notebooks/hello_chemoinfo.ipynb

Reproducibility is important not only for wet experiments but also for dry work such as programming.

So Binder is a useful tool for sharing code.

Let’s teleportation! #Quantumcomputing #qiskit #memo

Recently I have been learning quantum computing and qiskit coding.
Learning new things is a lot of fun for me.

Today I tried to code quantum teleportation with qiskit.

https://natgeo.nikkeibp.co.jp/atcl/news/16/110100411/

'Teleportation' sounds attractive to me; I watched many SF movies as a child ;) BTW, the definition of teleportation is 'the ability to transport a person or object instantly from one place to another' (from Wikipedia).

It is not magic.

What is quantum teleportation?

Quantum teleportation is a process by which quantum information (e.g. the exact state of an atom or photon) can be transmitted (exactly, in principle) from one location to another. The method transports information, not objects.

It seems useful for information transfer.

OK, let's write the code. The following circuit uses 3 qubits: Alice has two qubits and Bob has one. Alice would like to transfer the state of her qubit |\Psi(1)>, defined as \Psi(1) = \alpha|0> + \beta|1>.

Bob would like to know alpha and beta.

First, import the packages and generate \Psi(1) with a unitary gate.

from qiskit import QuantumCircuit
from qiskit import QuantumRegister
from qiskit import ClassicalRegister
from qiskit import BasicAer
from qiskit import execute
from qiskit.visualization import plot_histogram
backend = BasicAer.get_backend('qasm_simulator')

q = QuantumRegister(3)
c0 = ClassicalRegister(1,'c0')
c1 = ClassicalRegister(1,'c1')
c2 = ClassicalRegister(1,'c2')

qc = QuantumCircuit(q, c0,c1,c2)
# set first state of q[0] which is psi(1)
# Alice has q[0] and q[1]
# Bob has q[2]
qc.u3(0.3, 0.2, 0.1, q[0])

Next, I built the quantum circuit. It consists of a Hadamard gate, controlled-NOT (CX) gates, and Pauli X and Z gates.

 
qc.h(q[1])
qc.cx(q[1],q[2])
qc.barrier(q)
qc.cx(q[0], q[1])
qc.h(q[0])
qc.barrier(q)

qc.measure(q[0], c0[0])
qc.measure(q[1], c1[0])
qc.z(q[2]).c_if(c0, 1)
qc.x(q[2]).c_if(c1, 1)

qc.measure(q[2], c2[0])
qc.draw(output='mpl')
Circuit overview

What does the circuit mean? OK, let's calculate it.

In the first step, q[0] passes through the unitary gate and q[1] passes through the Hadamard gate.

q[0] will be U3|0> \rightarrow \alpha|0> + \beta|1>

q[1] will be H|0> \rightarrow 1/\sqrt{2}(|0> + |1>)

So after the U3 and H gates, the circuit state will be:

1/\sqrt{2}(\alpha|0> + \beta|1>) \otimes (|0> + |1>) \otimes |0>
= 1/\sqrt{2}(\alpha|000> + \alpha|010> + \beta|100> + \beta|110> )

Then the CX gate is applied with q[1] as control and q[2] as target, giving:

1/\sqrt{2}(\alpha|000> + \alpha|011> + \beta|100> + \beta|111> )

Then the CX gate is applied with q[0] as control and q[1] as target:

1/\sqrt{2}(\alpha|000> + \alpha|011> + \beta|110> + \beta|101> )

Now q[2] cannot be separated; the qubits are in an entangled state.

Then q[0] passes through the Hadamard gate:

1/2(\alpha(|0>+|1>)|00> + \alpha(|0>+|1>)|11> + \beta(|0>-|1>)|10> + \beta(|0>-|1>)|01> )

=> 1/2(\alpha(|000>+|100>) + \alpha(|011>+|111>) + \beta(|010>-|110>) + \beta(|001>-|101>))


=> 1/2( |00>(\alpha|0>+\beta|1>) + |01>(\alpha|1>+\beta|0>) + |10>(\alpha|0>-\beta|1>) + |11>(\alpha|1>-\beta|0>))

Now the entangled state can be resolved.
Bob can recover Alice's information by applying a Pauli X/Z gate combination depending on the q[0], q[1] measurement results.

If q[0], q[1] = 00, q[2] is \alpha|0>+\beta|1>, so no operation is required.

If q[0], q[1] = 01, q[2] is \alpha|1>+\beta|0>, so a Pauli X gate is required.

If q[0], q[1] = 10, q[2] is \alpha|0>-\beta|1>, so a Pauli Z gate is required.

Finally, if q[0], q[1] = 11, q[2] is \alpha|1>-\beta|0>, so both Pauli X and Z gates are required.

Alice needs to tell Bob her measurement results for q[0] and q[1], so Bob cannot get the information faster than light.

Let's run the circuit.

 
res = execute(qc, backend, shots=1024)
plot_histogram(res.result().get_counts(qc))

c[2] is the top bit of the 3-bit strings. Most counts have c[2] = 0. And what about alpha and beta?

P(c[2]=0) = 0.245+0.258+0.219+0.256 = 0.978, an estimate of |\alpha|^2

P(c[2]=1) = 0.007+0.005+0.005+0.006 = 0.023, an estimate of |\beta|^2
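These measured fractions agree with what the U3 gate predicts. Assuming qiskit's u3(theta, phi, lam) acting on |0> gives cos(theta/2)|0> + e^{i*phi} sin(theta/2)|1>, the expected probabilities for theta = 0.3 can be checked quickly:

```python
import math

# u3(theta, phi, lam)|0> = cos(theta/2)|0> + e^{i*phi} sin(theta/2)|1>
theta = 0.3
p_alpha = math.cos(theta / 2) ** 2  # |alpha|^2
p_beta = math.sin(theta / 2) ** 2   # |beta|^2
print(round(p_alpha, 3), round(p_beta, 3))  # 0.978 0.022
```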

Let's check q[0]'s initial state.

q1 = QuantumRegister(1)
c5 = ClassicalRegister(1)
qc2 = QuantumCircuit(q1,c5)
qc2.u3(0.3, 0.2, 0.1, q1)
qc2.measure(q1, c5)
qc2.draw(output='mpl')
j = execute(qc2, backend, shots=1024)
plot_histogram(j.result().get_counts(qc2))

The probabilities are almost the same. If the code is run without the U3 gate, c[2] will always be 0.

I do not fully understand quantum teleportation now. So I’m reading some papers about it.

Today’s code is below.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/tereportation.ipynb

Node feature importance of Graph convolutional neural network #chemoinformatics #memo

I wrote a blog post about GCN with pytorch_geometric before.
https://iwatobipen.wordpress.com/2019/04/05/make-graph-convolution-model-with-geometric-deep-learning-extension-library-for-pytorch-rdkit-chemoinformatics-pytorch/

Recently, graph-based approaches have become a very hot area in chemoinformatics, I think. Lots of articles are published with new architectures; however, there is little discussion about node features.

Recently I have been thinking about the importance of node features. Today I checked it with very simple code.

I built a GCN whose node features contain only atomic-symbol information.

The molecule-to-graph converter is borrowed from deepchem.

OK let’s go to code ;-).

The main code is the same as in my old blog post; the difference is in mol2vec.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from torch.utils.data import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from torch_geometric.data import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
plt.style.use("ggplot")
train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}
print(len(train_mols), len(test_mols))
print(train_mols[0].GetPropsAsDict())
>1025 257
> {'ID': 1, 'NAME': 'n-pentane', 'SOL': -3.18, 'SOL_classification': '(A) low', 'smiles': 'CCCCC'}

Prepare the data for pytorch_geometric. Deepchem uses 21 different atom types for node featurization.

 
print(len(mol2graph.possible_atom_list))
train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
>21
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)
"""
possible_atom_list = [
     'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
      'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
  ]
"""
n_features = 21
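A minimal sketch of what symbol-only featurization looks like (a hypothetical helper using the atom list quoted above; the actual mol2graph/deepchem code may differ in detail):

```python
# Hypothetical sketch of symbol-only one-hot featurization;
# the real mol2graph code may differ.
possible_atom_list = [
    'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
    'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
]

def atom_symbol_onehot(symbol):
    """Return a 21-dim one-hot vector for the atom symbol (all zeros if unknown)."""
    vec = [0] * len(possible_atom_list)
    if symbol in possible_atom_list:
        vec[possible_atom_list.index(symbol)] = 1
    return vec

print(atom_symbol_onehot('N')[:5])  # [0, 1, 0, 0, 0]
```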

OK, next define the neural net, then train and test it!

 
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if cached=True, the batch shape must stay the same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
        
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x        
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
#optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train(epoch):
    model.train()
    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_X)
def test(loader):
    model.eval()
    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)

hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    hist["loss"].append(train_loss)
    hist["acc"].append(train_acc)
    hist["test_acc"].append(test_acc)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')

The output is below.

>Epoch: 1, Train loss: 0.829, Train_acc: 0.448, Test_acc: 0.447
...
>Epoch: 100, Train loss: 0.369, Train_acc: 0.841, Test_acc: 0.782

Hmm, it is not so bad, is it?

 
ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")
plt.xlabel("epoch")
ax.legend()
Result with only atom-symbol features and atom-atom connectivity

And the following image shows the result from my old post, which used 75-dimensional node features (symbol, valence, formal charge, hybridization, and number of radical electrons).

Result with all node features

The model with all node features showed slightly better, but almost the same, performance.

Does this indicate that the neural net can learn features from very simple information?

Is the architecture the key for GCNs, rather than the node features?

I'm not a specialist in graph-based learning. Any comments or suggestions would be greatly appreciated.