Thin book of chemoinformatics_en #chemoinformatics

I am really happy to have the opportunity to write the thin book of chemoinformatics. I want to express my gratitude to @fmkz__.

The book is written in Japanese, and I received some requests from my followers for an English translation. It seemed like tough work for me, and I thought it would take a long time to do. However, fortunately a wonderful co-worker appeared, and the work went well. Thanks, joofio; I am happy to work with him.

Today I just finished making the English version PDF. Then I sent a PR to the repo.

The draft version can be obtained from my GitHub repo.

I will be happy if it is useful for someone who is interested in chemoinformatics. The book includes some sample Jupyter notebook code, so readers can run the code on their own PCs.

By the way, I would like to introduce the girl in the cover picture.

Her name is ‘Kusuri Murakumo’, also called ‘Souyaku-chan’. She is a child of social drug discovery. ;)

Souyaku means drug discovery in Japanese. She organized a drug discovery competition, which is a very exciting event for us. Recently there have been many open source tools for drug discovery, so if you want, you can do virtual screening on your own PC. Some years ago Foldit solved a complex protein structure with gamification techniques; she is trying the same approach to discover new drugs. It sounds interesting, doesn’t it?


Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use pandas for data analysis. RDKit provides a useful module named PandasTools, which can load an SDF and return the data as a pandas dataframe. With a dataframe there is no need to process molecules one by one in a for loop.

I found an interesting piece of information in the RDKit issues: a package named pyjanitor.

The package wraps pandas and provides useful methods not only for basic data handling but also for chemoinformatics.

The package can be installed with conda from the conda-forge channel. I tried the library; the following code is an example from my Jupyter notebook. At first, import the libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA'ggplot')

Then load the SDF data that is provided with RDKit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)

pyjanitor has chemoinformatics functions; one of them is morgan_fingerprint. It calculates Morgan fingerprints from a mol column and returns the result as a dataframe. It is very easy to do. ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get the fingerprint information as a pandas dataframe whose shape is (number of compounds, number of bits).

Next, conduct PCA with the count fingerprints.

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)
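The projected coordinates (pca_res above) can then be plotted with matplotlib. The sketch below uses a random binary matrix as a stand-in for the fingerprint dataframe, since the shapes and the plotting step are the same:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# stand-in for the (n_compounds, n_bits) fingerprint dataframe
rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(50, 512))

pca = PCA(n_components=3)
coords = pca.fit_transform(fps)  # shape: (50, 3)

# scatter plot of the first two principal components
plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
```
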

janitor.chemistry also has a smiles2mol method. It converts SMILES to mol objects in a pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has an add_column method. With it, I can add multiple columns using method chaining.

# add_column is not a native pandas method.
# equivalent pandas version: df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetNumAtoms)
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])

It is easy to make a fingerprint dataframe with janitor, and pyjanitor has other useful methods for pandas data handling.
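For comparison, plain pandas also supports a chained style of adding columns with assign. A minimal sketch (the column names here are just for illustration, not part of pyjanitor):

```python
import pandas as pd

df = pd.DataFrame({'SMILES': ['CCO', 'c1ccccc1', 'CC(=O)O']})

# chained column additions with native pandas assign
df = (df
      .assign(smiles_len=lambda d: d['SMILES'].str.len())
      .assign(has_ring=lambda d: d['SMILES'].str.contains('1')))
```

pyjanitor's add_column follows the same chaining idea, so it composes naturally with these native pandas methods.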


First game of fiscal year 2019 #dodgeball







New version of openforcefield supports RDKit! #RDKit #chemoinformatics #openforcefield

This morning I found great news on my Twitter timeline: openforcefield now supports RDKit! Older versions of openforcefield required the OpenEye toolkit (OETK), which needs a commercial license for industry users. I had been interested in the package but could not install it, because I am not in academia and our company does not have an OETK license now.

I couldn’t wait to use it, so I installed the package and checked it immediately.
The following code was run on Google Colab. If you would like to run it in a local environment, you can skip several of the steps.

# install conda
!chmod +x ./
!time bash ./ -b -f -p /usr/local
# install rdkit via conda command
!time conda install -c conda-forge rdkit -y

import sys
import os
from rdkit import Chem
from rdkit import rdBase

Then install openforcefield. Openforcefield requires OpenMM. The CUDA version on Colab was 10.0, so I did not need to change the OpenMM version to install. So, just type ‘conda install openforcefield’. ;)

! conda config --add channels omnia --add channels conda-forge
! conda install -y openforcefield

Now we are ready; let’s import openforcefield.

from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.utils import RDKitToolkitWrapper
from openforcefield.typing.engines.smirnoff import ForceField
from simtk import openmm
from simtk import unit

The following code was borrowed from the original repo.

def get_energy(system, positions):
    """
    Return the potential energy.

    system : simtk.openmm.System
        The system to check
    positions : simtk.unit.Quantity of dimension (natoms,3) with units of length
        The positions to use
    """
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Load a molecule with the OpenFF method and calculate its energy.

mol = Molecule.from_smiles('CCO')
# generate a 3D conformer so that mol.conformers is populated
mol.generate_conformers()
positions = mol.conformers[0]
# load Force Field
ff = ForceField('smirnoff99Frosst.offxml')
topology = mol.to_topology()
original_system = ff.create_openmm_system(topology)
orig_energy = get_energy(original_system, positions)
> 2.3399880921635527 kcal/mol

Next, calculate the energy from an RDKit mol object.

from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles('c1ccccc1')
# add hydrogens and embed a 3D conformer so that positions are available
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol)
rdw = RDKitToolkitWrapper()
ofmol = rdw.from_rdkit(mol)
positions = ofmol.conformers[0]
topology = ofmol.to_topology()
original_system = ff.create_openmm_system(topology)
orig_energy = get_energy(original_system, positions)
>7.987402593200566 kcal/mol

This is a very simple example of OpenFF usage with RDKit. The package seems powerful for CADD. I would like to learn more about the package and molecular dynamics.

The whole code of the post can be checked from my repo.

And I really thank the developers of RDKit, Openforcefield, and the many other open source chemoinformatics tools.

HTS with micro-fluidic instrument and DNA encoded library. #DEL #HTS #memo

DNA-Encoded Library (DEL) technology is one of the powerful tools for HTS. Recently many CROs offer it; however, the approach has limitations. It is limited to affinity-based hit identification, and it is difficult to apply to insoluble or unstable targets. Today I read an interesting article reported by researchers from Roche and the Scripps Research Institute. The URL is below.

Activity-Based DNA-Encoded Library Screening

The authors made the DEL as a ‘one-bead-one-compound’ library, and the compounds and beads are linked with a photocleavable linker.

Figure 1 in the article shows the whole structure of their assay platform. They selected autotaxin as the screening target. I was particularly interested in the assay platform: the assay of each compound is performed in a droplet. Laser-induced fluorescence detection is used, and hit droplets are then sorted by applying an AC voltage pulse.

You can see a movie of the instrument in the supporting information.

They prepared a 67,100-member DEL and performed the miniaturized HTS campaign. The throughput of the assay was 30,000 beads per hour! And they got ~7,000 hits with a good Z' (0.71).

As a result of the screening, the authors obtained some novel chemical series as hits.

Amazingly, it required only 8 h of instrument time and consumed vanishingly small quantities of activity assay reagents (<10 μg ATX; <20 nmol substrate) and library (0.0005 of the total library), even though a huge number of compounds were assayed.

I think it is difficult to understand the technology from my post alone, so I recommend reading the article.

Integration of cutting-edge science and technology produces new solutions for drug discovery. I hope the technology accelerates drug discovery projects for human health.

Make Graph convolution model with geometric deep learning extension library for PyTorch #RDKit #chemoinformatics #pytorch

In the chemoinformatics area, QSAR using a molecular graph as input is a very hot topic. Examples of major implementations are DeepChem and Chainer Chemistry, I think. I am also interested in graph-based QSAR model building. Recently I have been using PyTorch for my deep learning tasks, so I would like to build the model with PyTorch. Fortunately, a very elegant package is provided for PyTorch, named ‘pytorch_geometric‘. PyTorch Geometric is a geometric deep learning extension library for PyTorch.

Today I tried to build GCN model with the package.

At first I defined a mol-to-graph function, which converts a molecule to a graph vector. Most of the code is borrowed from DeepChem.

The data structure of torch_geometric is described at this URL. I defined the molecular graph as an undirected graph.

from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
import multiprocessing
import logging
import torch
from import Data

# following code was borrowed from deepchem

def one_of_k_encoding(x, allowable_set):
  if x not in allowable_set:
    raise Exception("input {0} not in allowable set{1}:".format(
        x, allowable_set))
  return list(map(lambda s: x == s, allowable_set))

def one_of_k_encoding_unk(x, allowable_set):
  """Maps inputs not in the allowable set to the last element."""
  if x not in allowable_set:
    x = allowable_set[-1]
  return list(map(lambda s: x == s, allowable_set))
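The key behavior of one_of_k_encoding_unk is that any symbol outside the allowable set falls back to the last element (an ‘Unknown’ slot). A quick check, redefining the function so the snippet is self-contained:

```python
def one_of_k_encoding_unk(x, allowable_set):
  """Maps inputs not in the allowable set to the last element."""
  if x not in allowable_set:
    x = allowable_set[-1]
  return list(map(lambda s: x == s, allowable_set))

# a known symbol gets its own one-hot slot
print(one_of_k_encoding_unk('N', ['C', 'N', 'O', 'Unknown']))
# → [False, True, False, False]

# an unseen symbol maps to the last ('Unknown') slot
print(one_of_k_encoding_unk('Se', ['C', 'N', 'O', 'Unknown']))
# → [False, False, False, True]
```
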

def get_intervals(l):
  """For list of lists, gets the cumulative products of the lengths"""
  intervals = len(l) * [0]
  # Initalize with 1
  intervals[0] = 1
  for k in range(1, len(l)):
    intervals[k] = (len(l[k]) + 1) * intervals[k - 1]

  return intervals
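To see what get_intervals returns, here is a quick check with two small reference lists (the function is redefined so the snippet is self-contained):

```python
def get_intervals(l):
  """For list of lists, gets the cumulative products of the lengths"""
  intervals = len(l) * [0]
  # Initalize with 1
  intervals[0] = 1
  for k in range(1, len(l)):
    intervals[k] = (len(l[k]) + 1) * intervals[k - 1]
  return intervals

# two feature lists with 2 and 3 possible values:
# the spacing for the second feature is (3 + 1) * 1 = 4
print(get_intervals([['a', 'b'], ['x', 'y', 'z']]))  # → [1, 4]
```

These spacings are what features_to_id below uses as the place values of a mixed-radix number.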

def safe_index(l, e):
  """Gets the index of e in l, providing an index of len(l) if not found"""
  try:
    return l.index(e)
  except ValueError:
    return len(l)

possible_atom_list = [
    'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
    'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
]
possible_numH_list = [0, 1, 2, 3, 4]
possible_valence_list = [0, 1, 2, 3, 4, 5, 6]
possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3]
possible_hybridization_list = [
    Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
    Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.SP3D,
    Chem.rdchem.HybridizationType.SP3D2
]
possible_number_radical_e_list = [0, 1, 2]
possible_chirality_list = ['R', 'S']

reference_lists = [
    possible_atom_list, possible_numH_list, possible_valence_list,
    possible_formal_charge_list, possible_number_radical_e_list,
    possible_hybridization_list, possible_chirality_list
]

intervals = get_intervals(reference_lists)

def get_feature_list(atom):
  features = 6 * [0]
  features[0] = safe_index(possible_atom_list, atom.GetSymbol())
  features[1] = safe_index(possible_numH_list, atom.GetTotalNumHs())
  features[2] = safe_index(possible_valence_list, atom.GetImplicitValence())
  features[3] = safe_index(possible_formal_charge_list, atom.GetFormalCharge())
  features[4] = safe_index(possible_number_radical_e_list,
                           atom.GetNumRadicalElectrons())
  features[5] = safe_index(possible_hybridization_list, atom.GetHybridization())
  return features

def features_to_id(features, intervals):
  """Convert list of features into index using spacings provided in intervals"""
  id = 0
  for k in range(len(intervals)):
    id += features[k] * intervals[k]

  # Allow 0 index to correspond to null molecule 1
  id = id + 1
  return id
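features_to_id and id_to_features form a round trip: the feature vector is packed into a single mixed-radix integer and unpacked again. A self-contained sanity check (toy_intervals here is an illustrative example, not the module-level intervals):

```python
def features_to_id(features, intervals):
  """Convert list of features into index using spacings provided in intervals"""
  id = 0
  for k in range(len(intervals)):
    id += features[k] * intervals[k]
  return id + 1  # 0 is reserved for the null molecule

def id_to_features(id, intervals):
  features = 6 * [0]
  id -= 1  # undo the null-molecule offset
  for k in range(0, 6 - 1):
    features[6 - k - 1] = id // intervals[6 - k - 1]
    id -= features[6 - k - 1] * intervals[6 - k - 1]
  features[0] = id
  return features

# toy intervals where every feature has two possible values (radix 2)
toy_intervals = [1, 2, 4, 8, 16, 32]
feats = [1, 0, 1, 1, 0, 1]
encoded = features_to_id(feats, toy_intervals)
print(encoded)                                 # → 46
print(id_to_features(encoded, toy_intervals))  # → [1, 0, 1, 1, 0, 1]
```
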

def id_to_features(id, intervals):
  features = 6 * [0]

  # Correct for null
  id -= 1

  for k in range(0, 6 - 1):
    # print(6-k-1, id)
    features[6 - k - 1] = id // intervals[6 - k - 1]
    id -= features[6 - k - 1] * intervals[6 - k - 1]
  # Correct for last one
  features[0] = id
  return features

def atom_to_id(atom):
  """Return a unique id corresponding to the atom type"""
  features = get_feature_list(atom)
  return features_to_id(features, intervals)

def atom_features(atom, bool_id_feat=False, explicit_H=False, use_chirality=False):
  if bool_id_feat:
    return np.array([atom_to_id(atom)])
  else:
    from rdkit import Chem
    results = one_of_k_encoding_unk(
      atom.GetSymbol(),
      [
        'C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca',
        'Fe', 'As', 'Al', 'I', 'B', 'V', 'K', 'Tl', 'Yb', 'Sb', 'Sn', 'Ag',
        'Pd', 'Co', 'Se', 'Ti', 'Zn',
        'H',  # H?
        'Li', 'Ge', 'Cu', 'Au', 'Ni', 'Cd', 'In', 'Mn', 'Zr', 'Cr', 'Pt',
        'Hg', 'Pb', 'Unknown'
      ]) + one_of_k_encoding(atom.GetDegree(),
                             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + \
              one_of_k_encoding_unk(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5, 6]) + \
              [atom.GetFormalCharge(), atom.GetNumRadicalElectrons()] + \
              one_of_k_encoding_unk(atom.GetHybridization(), [
                Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
                Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.SP3D,
                Chem.rdchem.HybridizationType.SP3D2
              ]) + [atom.GetIsAromatic()]
    # In case of explicit hydrogen(QM8, QM9), avoid calling `GetTotalNumHs`
    if not explicit_H:
      results = results + one_of_k_encoding_unk(atom.GetTotalNumHs(),
                                                [0, 1, 2, 3, 4])
    if use_chirality:
      try:
        results = results + one_of_k_encoding_unk(
            atom.GetProp('_CIPCode'),
            ['R', 'S']) + [atom.HasProp('_ChiralityPossible')]
      except KeyError:
        results = results + [False, False
                            ] + [atom.HasProp('_ChiralityPossible')]

    return np.array(results)

def bond_features(bond, use_chirality=False):
  from rdkit import Chem
  bt = bond.GetBondType()
  bond_feats = [
      bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE,
      bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC,
      bond.GetIsConjugated(),
      bond.IsInRing()
  ]
  if use_chirality:
    bond_feats = bond_feats + one_of_k_encoding_unk(
        str(bond.GetStereo()), ["STEREONONE", "STEREOANY", "STEREOZ", "STEREOE"])
  return np.array(bond_feats)

# pen added
def get_bond_pair(mol):
  bonds = mol.GetBonds()
  res = [[],[]]
  for bond in bonds:
    res[0] += [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
    res[1] += [bond.GetEndAtomIdx(), bond.GetBeginAtomIdx()]
  return res

def mol2vec(mol):
  atoms = mol.GetAtoms()
  bonds = mol.GetBonds()
  node_f = [atom_features(atom) for atom in atoms]
  edge_index = get_bond_pair(mol)
  edge_attr = [bond_features(bond, use_chirality=False) for bond in bonds]
  data = Data(x=torch.tensor(node_f, dtype=torch.float),
              edge_index=torch.tensor(edge_index, dtype=torch.long),
              edge_attr=torch.tensor(edge_attr, dtype=torch.float))
  return data

Now the mol2graph module is defined. Next I made a GCN class. The following code is written in an IPython notebook, so I call some methods for interactive visualization. At first, load the packages and the test data. I used the solubility dataset that is provided with RDKit.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw"ggplot")

train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}

Then convert the molecules to graphs with the defined function and add labels for training. Define the data loaders next.

train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)

Then I defined the model architecture for the GCN. The implementation is quite different from the original article.

n_features = 75
# definenet
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if you defined cache=True, the shape of batch must be same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2,
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x       

After defining the model, I trained it and evaluated the performance.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
def train(epoch):
    model.train()
    loss_all = 0
    for data in train_loader:
        data =
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_X)
def test(loader):
    model.eval()
    correct = 0
    for data in loader:
        data =
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)
hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    hist["loss"].append(train_loss)
    hist["acc"].append(train_acc)
    hist["test_acc"].append(test_acc)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')
ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")
ax.legend()

After training, I could get following image.

It does not seem so bad. PyTorch Geometric has many algorithms for graph-based deep learning.

I think it is a very cool and useful package for chemoinformaticians. ;)

All the code of this post is uploaded at the URL below.

Enjoyed SPARQLTHON and weekend #memo

This week I participated in SPARQLTHON #78, which was held at the National Institute of Genetics.

I am not familiar with SPARQL, so it was difficult to develop something with it, but I tried SPARQLWrapper, which is a package for using SPARQL from Python, and tried to retrieve information from ChEMBL RDF.

It was a good opportunity for me because I could focus on coding, and after the hackathon I had many fruitful discussions with other participants. I was surprised to hear that traditional analytic methods can be more useful than deep learning in the area of genetics: because genetics is governed by well-characterized mechanisms, established rules such as Mendel’s laws can be stronger than machine learning methods. Deep learning is useful, but I have to consider in which cases it will be effective.

And this weekend my kids and I went to a cherry-blossom viewing party. In Japan, the current season is good for this event. There are many cherry blossoms in bloom, they are really beautiful, and the temperature is getting warmer. I like this season, except for suffering from hay fever.

Here are some photos of the cherry blossoms. Tomorrow is the beginning of the fiscal year. I will enjoy it and do my best.