Detect outlier with Isolationforest #chemoinformatics #scikit-learn #RDKit

Applicability domain(AD) is often discussed in chemoinformatics for QSAR.

Because current QSAR often can’t predict unsimilar compounds against training set. The definition of unsimilar is often used with tanimoto similarity with molecular fingerprint.

So QSAR model user need to take care about where new designed molecule in the model AD or not.

Some days ago, I read interesting article reported in arxiv about synthetic route design with machine learning. The article used isolationforest for outlier detection.

Isolation forest is a method for outlier detection. It seems useful for AD definition in QSAR I think. So I tried to use the method for molecules.

Following code is very simple. At first calculate Morganfingerprint.

Second, conducted PCA for dimension reduction.

Third, fit the data to isolation forest model. After building the model, I can use predict method which returns 1(not outlier) or -1 (outlier).

OK let’s go coding!

Import library and load SDF.

%matplotlib inline
import os
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit import RDConfig
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
import numpy as np'ggplot')
sdf = os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for mol in mols:

Then calculate molecular fingerprint and perform PCA. It is very easy to do it ;)

X = [list(AllChem.GetMorganFingerprintAsBitVect(mol, 2)) for mol in mols]
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcares = pca.fit_transform(X)

Next, train isolationforest wish calculated fingerprint.

isoforest = IsolationForest(contamination='legacy')
isoforest_res = isoforest.fit_predict(pcares)

Now almost ready. Map in or out with meshed grid and plot training data on the sheet. Isolation class has decision function which decides where the input data is outlier or not.

xx, yy = np.meshgrid(np.linspace(-5,5, 50), np.linspace(-5,5, 50))
cmap = {1:'blue', -1:'red'}
cs = [cmap[k] for k in isoforest_res]
z = isoforest.decision_function(np.c_[xx.ravel(), yy.ravel()])
plt.contourf(xx, yy, z,

b1 = plt.scatter(pcares[:, 0], pcares[:, 1], c=cs,
                 s=20, edgecolor='k')
plt.xlim((-5, 5))
plt.ylim((-5, 5))

The most outer points are assigned as outlier(red point) however the results is not so clear.

You can see more clear data in Scikit-learn document.

Apply isolation forest against PCA result is not suitable for AD detection I think.

But the method is useful for outlier detection.

By the way, medicinal chemist can predict molecular property with high accuracy when their project is late stage. So QSAR model doesn’t work well. But early stage of drug discovery is limited data. So it is difficult to build suitable QSAR model for the project.

So I think design experiment space is very difficult. How to use QSAR model and to build experiment space are key point of current medicinal chemistry.

How do you design SAR space and take data?

Today’s code URL is below.


New fluolination reaction from JACS #memo #organicchemistry

Trifuloromethylation is useful reaction in drug discovery. Because CF3 and other fluorinated substituent are often used in drug like molecules. It is also useful that the reaction can conduct in asymmetric manner.

I read interesting article in JACS today published by Trost’s group. The URL of the article is below.

You can see abstract without account.

My boss when I was student liked pi-allyl palladium reaction named Tsuji-Trost reaction. So allyl halide based palladium coupling is familiar for me. But in the reaction allyl bromide is used for substrate. The article uses allyl-fuluoride as a substrate. It is key point of the reaction, it is well designed!

In the reaction TMS-Rf(Rf means CF3, F5Ph and some fluorinated substituent). Fluorine atom from allyl fuluoride activates the TMS-Rf reagent and the reaction proceeds well.

They used chiral ligand in the reaction so these reactions proceeds enantio selective.

You can see many examples in scheme2 and all reaction shows good yield and enantio selectivity. The example in the article was limited in cyclic allyl fuluride. I thought cyclic substrate is suitable for stereo selective reaction compared to acyclic compound.

Process of something new is very exciting for me. Not only drug discovery but also compound synthesis, reaction design and etc…. ;-)

Run rdkit on cloud #RDKit

This is the first post of my new PC with linux env.

Recently I decided to move from MacBookPro(MBP) to Dell XPS15. Because recent MBP doesn’t have nvidia GPU…. It is not so attractive for me. So I got XPS15 which has win10 as default OS. I removed Win10 ASAP and installed ubuntu 18.04.

Now the OS works well without any problems ;-)

I posted about run rdkit on cloud such as google colab before. Google colab is very useful because it can use GPU without charge even if it is limited usage time. And today I tried to use another service named binder.

Binder can integrate github repo and easy to make your own environment on cloud.

I can make same environment on could with binder just write environment.yml like below. Following example is very simple just only use rdkit ;-)

name: py4chemoinfo
        - conda-forge
        - rdkit

After making environment and jupyter notebook on github repositry it is ready to start.

Go binder web site and paste your github repo and push launch!

It will take several minutes for building env with rdkit and after that you can access cloud python env with rdkit!

Github url and generated binder URLs are below.

Reproducibility is important not only wet experiments but also dry experiments such as programming codes.

So binder is useful tool for share the code.

Let’s teleportation! #Quantumcomputing #qiskit #memo

Recently I am learning quantum computing and qiskit coding.
Learning new things is very fun for me.

Today I tried to code quantum teleportation with qiskit.

’Teleportation’ sounds attractive for me I watched many SF movies when I was child ;) BTW, definition of teleportation is ‘the ability to transport a person or object instantly from one place to another ‘.(from wikipedia)

It is not magic.

What is quantum teleportation?

Quantum teleportation is a process by which quantum information (e.g. the exact state of an atom or photon) can be transmitted (exactly, in principle) from one location to another. The method transport information not object.

It seems useful for information transfer.

OK let’s write code. Following circuit uses 3 qubits. Alice has two qubits and Bob has one qubit. Alice would like to transfer her one qubit information |\Psi(1)> Psi is defined as \Psi(1)=\alpha|0> + \beta|1>.

Bob would like to know alpha and beta.

At first import packages and generate \Psi(1) with unitary gate.

from qiskit import QuantumCircuit
from qiskit import QuantumRegister
from qiskit import ClassicalRegister
from qiskit import visualization
from qiskit import BasicAer
from qiskit.aqua import run_algorithm
from qiskit.visualization import plot_histogram
from qiskit import execute
backend = BasicAer.get_backend('qasm_simulator')

q = QuantumRegister(3)
c0 = ClassicalRegister(1,'c0')
c1 = ClassicalRegister(1,'c1')
c2 = ClassicalRegister(1,'c2')

qc = QuantumCircuit(q, c0,c1,c2)
# set first state of q[0] which is psi(1)
# Alice has q[0] and q[1]
# Bob has q[2]
qc.u3(0.3, 0.2, 0.1, q[0])

Next I made quantum circuit. The circuit is made with hadamard gate, controlled not gate, and pauli x, z gate.

qc.barrier(q)[0], q[1])

qc.measure(q[0], c0[0])
qc.measure(q[1], c1[0])
qc.z(q[2]).c_if(c0, 1)
qc.x(q[2]).c_if(c1, 1)

qc.measure(q[2], c2[0])
circuit over view

What does mean the circuit? OK calculate it.

First step is q[0] passed unitary gate and q[1] passed hadamard gate.

q[0] will be U3|0> \rightarrow \alpha|0> + \beta|1>

q[1] will be H|0> \rightarrow 1/\sqrt{2}(|0> + |1>)

So after U3 and H gate, circuit state will be…

1/\sqrt{2}(\alpha|0> + \beta|1>) \otimes (|0> + |1>) \otimes |0>
= 1/\sqrt{2}(\alpha|000> + \alpha|010> + \beta|100> + \beta|110> )

Then q[1] passed cx gate. Then qubit will be below.

1/\sqrt{2}(\alpha|000> + \alpha|011> + \beta|100> + \beta|111> )

Then q[0] passed cx gate. qubit will be.

1/\sqrt{2}(\alpha|000> + \alpha|011> + \beta|110> + \beta|101> )
=> 1/\sqrt{2}(\alpha|000> + \alpha|011> + \beta|110> + \beta|101> )

Now q[2] can not separate it is entangle state.

Then, q[0] passed hadamard gate….

1/2(\alpha(|0>+|1>)|00> + \alpha(|0>+|1>)|11> + \beta(|0>-|1>)|10> + \beta(|0>-|1>)|01> )

=> 1/2(\alpha(|000>+|100>) + \alpha(|011>+|111>) + \beta(|010>-|110>) + \beta(|001>-|101>))

=> 1/2( |00>(\alpha|0>+\beta|1>) + |01>(\alpha|1>+\beta|0>) + |10>(\alpha|0>-\beta|1>) + |11>(\alpha|1>-\beta|0>))

Now entangle state is solved.
Bob can get Alices information by applying pauli x, z gate combination depends on q[0], q[1] measurements.

If q[0], q[1] = 00, q[2] is \alpha|0>+\beta|1>, no operation is required.

If q[0], q[1] = 01, q[1] is \alpha|1>+\beta|0>, it is required Pauli x gate.

If q[0], q[1] = 10, q[1] is \alpha|0>-\beta|1>, it is required Pauli z gate.

Finally, q[0], q[1] = 11, q[1] is \alpha|0>-\beta|1>, it is required Pauli z gate.

Alice need to tell Bob her measurements of q[0], q[1], so Bob can not information faster than light.

Let’s run the circuit

res = execute(qc, backend, shots=1024)

c[2] is top of the 3 bit numbers. Most of c[2] is 0. And how about alpha and beta?

Alpha = 0.245+0.258+0.219+0.256 = 0.978

Beta = 0.007+0.005+0.005+0.006 = 0.023

Check q[0] first state.

q1 = QuantumRegister(1)
c5 = ClassicalRegister(1)
qc2 = QuantumCircuit(q1,c5)
qc2.u3(0.3, 0.2, 0.1, q1)
qc2.measure(q1, c5)
j = execute(qc2, backend, shots=1024)

Probability is almost same. If the code run without U3gate, c[2] will be only0.

I do not fully understand quantum teleportation now. So I’m reading some papers about it.

Today’s code is below.

Node feature importance of Graph convolutional neural network #chemoinformatics #memo

I wrote blog post about GCN with pytorch_geometric before.

Recently graph based approach is very hot area in chemoinformatics I think. Lots of articles are published with new architecture. However there are few discussions about node features.

Recently I am thinking about the importance of node feature. Today I checked it with very simple code.

I made GCNN with node feature only atomic symbol information.

Molecule to graph converter is borrowed from deepchem.

OK let’s go to code ;-).

Main code is same as my old blog post. Difference is mol2vec.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw"ggplot")
train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}
print(len(train_mols), len(test_mols))
>1025 257
> {'ID': 1, 'NAME': 'n-pentane', 'SOL': -3.18, 'SOL_classification': '(A) low', 'smiles': 'CCCCC'}

Prepare data for pytorch_geometric. Deepchem uses 21 different atoms for node featurizetion.

train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)
possible_atom_list = [
     'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
      'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
n_features = 21

OK, next define neural net, train and test it!

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if you defined cache=True, the shape of batch must be same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2,
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x        
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
#optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train(epoch):
    loss_all = 0
    for data in train_loader:
        data =
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss_all += loss.item() * data.num_graphs
    return loss_all / len(train_X)
def test(loader):
    correct = 0
    for data in loader:
        data =
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)

hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')

The output is below.

>Epoch: 1, Train loss: 0.829, Train_acc: 0.448, Test_acc: 0.447
>Epoch: 100, Train loss: 0.369, Train_acc: 0.841, Test_acc: 0.782

Hmm it is not so bat isn’t it?

ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")
result with only atom symbol feature and atom-atom connection

And following image is result of my old post with 75dim features which uses symbol, valence, formal charge, hybridization and number of radical electron.

result with whole features

Result with whole node features showed slightly better performance but almost same performance.

Is it indicate that neural net can learn feature from very simple information?

Is the architecture key for GCN not node feature?

I’m not specialist of graph based learning. Any comments or suggestions will be greatly appreciated.

Photredox reaction on DNA #memo #chemistry

Today I found very interesting article from researchers who is working at pfizer and Hitgen.

Pfizer and Hitgen launched drug discovery collaboration two years ago.

Hitgen is one of CRO with DNA-encoded library as core technology.

In the article the authors reported that they conducted photoredox reaction on DNA tag.

Photoredox reaction is very useful tool for medchem because it can perform C(sp3)-C(sp3) coupling in one step which is difficult to do with old traditional chemistry.

DeNA Encoded Library(DELI) is powerful method to generate large number of molecules however available reaction is limited for not DNA damaging reaction. So it is not suitable for DELI production which has harsh reaction conditions.

At first they optimized coupling reaction condition of decarboxylateve coupling between aryl-iodide and aliphatic carboxylic acid. Main side reaction is reductive de haloganation product. But they found good reaction condition and the condition can work very well.

Table 2 shows many examples of the coupling reaction with moderate ~ high yield. And table 3 shows that the reaction can apply for not only phenyl iodide but also hetero cycles. Such as pyridine, pyrimidine etc.

There are few examples at ortho position with moderate yield. Maybe it is steric effect.

Finally they showed example of parallel synthesis via LED Array.

Scheme3-c) shows 96 well prate type LED reactor. It is very beautiful ;)

To prevent oxygen exposure, they degassed and sealed the plate.

And also they tried different type of photoredox reaction such as desulfinative coupling, deborylative coupling adn cross electrophile coupling and all reaction progressed with moderate yield.

Optimized reaction conducted in DMSO/H2O media with room temperature.

Combination of new technology is always exciting isn’t it?

SVM on quantum computing! #quantumcomputing #qiskit

Support vector machine is one of major method of machine learning. It very useful and powerful algorithm. However, SVM is not suitable for large size dataset.

Recently I’m learning QISKIT and qiskit has an attractive method named QSVM.

It means that perform SVM on quantum circuit!

Dose it sound fun?

Original URL of the paper is below.

The author of the article defined quantum kernel functions for quantum SVM. The merit of quantum computing for SVM is parallel. Because quantum computing can generates entangle state which is parallel world.

And fortunately qiskit has QSVM method that means we can QSVM very conveniently!

Today I tried to use QSVM to iris dataset for practice.

At first, import packages and dataset.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import qiskit
from qiskit import BasicAer
from qiskit.aqua import QuantumInstance
from qiskit.aqua.algorithms import QSVM
from qiskit.aqua.components.multiclass_extensions import one_against_rest, all_pairs
from qiskit.aqua.components.feature_maps import SecondOrderExpansion
from qiskit.aqua.input import ClassificationInput
from qiskit.aqua import run_algorithm
backend = BasicAer.get_backend('qasm_simulator')
iris = load_iris()

X, Y =,
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
num_features = 4
training_size = 120
test_size = 30
feature_map = SecondOrderExpansion(feature_dimension=num_features, depth=2)

Then split data for train and test. Iris data set has 4 features so I defined num_features=4, it will be number of qubit.

X, Y =,
train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2)
feature_map = SecondOrderExpansion(feature_dimension=num_features, depth=2) 

Next defined parameters for QSVM and transformed data for QSVM.

params = {
            'problem': {'name': 'classification'},
            'algorithm': {
                'name': 'QSVM',
            'multiclass_extension': {'name': 'OneAgainstRest'},
            'feature_map': {'name': 'SecondOrderExpansion', 'depth': 2 }

params = {
            'problem': {'name': 'classification', 'random_seed': 794},
            'algorithm': {
                'name': 'QSVM',
            'backend': {'shots': 1024},
            'multiclass_extension': {'name': 'OneAgainstRest'},
            #'feature_map': {'name': 'SecondOrderExpansion', 'depth': 2, 'entangler_map': [[0, 1]]}
            'feature_map': {'name': 'SecondOrderExpansion', 'depth': 2}
total_arr = np.concatenate((test_dataset['A'],test_dataset['B'],test_dataset['C']))
alg_input = ClassificationInput(training_dataset, test_dataset, total_arr)

Finally let’s learning!

%time result = run_algorithm(params, algo_input=alg_input, backend=backend)

Simulated QSVM is not so fast. It took 17min on my MacBook(mid2015 model)

Check the performance.
> 0.733

Hmm it is not so high accuracy. Are there many room for improve?

And qiskit has more direct method for QSVM named QSVM.

I would like to use the method and tried. But it caused error.

I couldn’t find reason of above.

  q_instance = QuantumInstance(backend) qsvm = QSVM(feature_map=feature_map,            training_dataset={'A':train_x[train_y==0],                             'B':train_x[train_y==1],                             'C':train_x[train_y==2]},            test_dataset={'A':test_x[test_y==0],                         'B':test_x[test_y==1],                         'C':test_x[test_y==2]},            multiclass_extension=one_against_rest.OneAgainstRest)

It is very interesting for me about quantum computing.

programming and quantum computing is very interesting for me!!!!