Thin book of chemoinformatics_en #chemoinformatics

I am really happy to have had the opportunity to write the thin book of chemoinformatics. I want to express my gratitude to @fmkz__.

The book is written in Japanese, and I got some requests from my followers for an English translation. It is tough work for me, and I thought it would take a long time. Fortunately, a wonderful collaborator appeared and the work is going well. Thank you joofio, I am happy to work with him.

Today I finished making the English version PDF and sent a PR to the repo. https://github.com/joofio

The draft version can be obtained from my GitHub repo. https://github.com/iwatobipen/py4chemoinformatics/tree/master/pdf

I will be happy if it is useful for someone who is interested in chemoinformatics. The book includes sample Jupyter notebook code, so readers can run the code on their own PC.

By the way, I would like to introduce the girl on the cover.

Her name is ‘Kusuri Murakumo’, also known as ‘Souyaku-chan’. She is a child of social drug discovery. ;)

Souyaku means drug discovery in Japanese. She organized a drug discovery competition, which is a very exciting event for us. These days there are many open-source tools for drug discovery, so if you want, you can do virtual screening on your own PC. Some years ago Foldit solved a complex protein structure using gamification; she is trying the same approach to discover new drugs. Interesting, isn’t it?


Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use pandas for data analysis. RDKit provides a useful module named PandasTools, which can load an SDF and return the data as a pandas dataframe. With a dataframe, there is no need to write explicit for loops.

I found an interesting piece of information in the RDKit issues: a package named pyjanitor.

The package wraps pandas and provides useful methods, not only for basic data handling but also for chemoinformatics.

The package can be installed with conda from the conda-forge channel. I tried the library; the following code is an example from my Jupyter notebook. First, import the libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA
plt.style.use('ggplot')

Then load the SDF data that is shipped with the RDKit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)
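
As a quick sanity check (just a small sketch; the exact property columns depend on what is stored in the SDF), you can look at what LoadSDF returned:

print(df.shape)
print(df.columns.tolist())  # the SDF properties plus the 'ROMol' column of RDKit mol objects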

pyjanitor has chemoinformatics functions; one of them is morgan_fingerprint, which computes Morgan fingerprints from the mol column of a dataframe and returns them as a dataframe. It is very easy to do. ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get the fingerprint information as a pandas dataframe whose shape is (number of compounds, number of bits).
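
A one-line check of that claim, using fp1 from the cell above:

print(fp1.shape)  # (number of compounds in the SDF, 512 bits)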

Next, conduct PCA with the fingerprints (this time using count fingerprints).

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)
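
The scatter plot itself is not reproduced here, but a minimal sketch of how the first two principal components of pca_res (computed above) could be drawn looks like this:

plt.scatter(pca_res[:, 0], pca_res[:, 1], alpha=0.7)  # each point is one compound in PC space
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()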

janitor.chemistry also has a smiles2mol method, which converts SMILES strings into mol objects inside the pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df.head(2)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has an add_column method. Since it returns the dataframe, I can add multiple columns with method chaining (see the small sketch after the next cell).

# add_column is not a native pandas method.
# the plain-pandas equivalent would be: df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetNumAtoms)
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])
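
Here is that chaining sketch: because add_column returns the dataframe, several descriptor columns can be appended in one chained expression (the column names and descriptors here are just illustrative):

df = (df
      .add_column('NumHeavyAtm', [mol.GetNumHeavyAtoms() for mol in df.ROMol])
      .add_column('NumBonds', [mol.GetNumBonds() for mol in df.ROMol])
     )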

It is easy to make a fingerprint dataframe with janitor, and pyjanitor has other useful methods for pandas data handling.

code
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/pyjanitor_test.ipynb
pyjanitor
https://pyjanitor.readthedocs.io/index.html


First matches of the 2019 season #dodgeball

On Sunday, my eldest son's dodgeball team had a practice match against a team from another prefecture.

This time he played in the 'Official' division, the senior team made up of kids up to sixth grade. My eldest son has only just started third grade this April. The new sixth-grade ace's throws are really fast, and since the Official ball is bigger and heavier than the Junior ball (the Junior division is for kids up to fourth grade), a hard throw coming at you is quite intimidating. If you keep throwing it before you are used to it, you can easily hurt your shoulder. lol

On his team, too, the senior ace attacker is the main offensive weapon, so my son is basically a defensive player. In this practice match he was put in for almost every game and worked hard to catch the opposing ace's throws. When he first joined the team, he sometimes cried in practice after taking a full-power throw from the older kids and failing to catch it, so I was a little surprised at how much he has grown.

After the matches he happily told me how much fun it was, and I thought he can still get a lot better. :)

Competitive dodgeball matches are only five minutes each, but it is a fast-paced and fun sport. He still has a long way to go, so above all I want him to enjoy it without getting injured.

I can't let myself fall behind, either...

New version of openforcefield supports RDKit! #RDKit #chemoinformatics #openforcefield

This morning I found great news on my Twitter timeline: openforcefield now supports RDKit! Older versions of openforcefield required the OpenEye toolkit, which needs a commercial license for industrial users. I was interested in the package but could not install it, because I am not in academia and our company does not have an OpenEye toolkit license.

https://open-forcefield-toolkit.readthedocs.io/en/latest/releasehistory.html

I couldn't wait to use it, so I installed the package and checked it immediately.
The following code was run on Google Colab. If you would like to run it in a local environment, you can skip the first several steps.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod +x ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local
# install rdkit via conda command
!time conda install -c conda-forge rdkit -y

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)

Then install openforcefield. Openforcefield requires OpenMM. The CUDA version on Colab was 10.0, so I did not need to pin a particular OpenMM version; just type 'conda install openforcefield'. ;)

! conda config --add channels omnia --add channels conda-forge
! conda install -y openforcefield

Now we are ready; let's import openforcefield.

from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.utils import RDKitToolkitWrapper
from openforcefield.typing.engines.smirnoff import ForceField
from simtk import openmm
from simtk import unit

The following code was borrowed from the original repo.

def get_energy(system, positions):
    """
    Return the potential energy.

    Parameters
    ----------
    system : simtk.openmm.System
        The system to check
    positions : simtk.unit.Quantity of dimension (natoms,3) with units of length
        The positions to use
    Returns
    ---------
    energy
    """

    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Load a molecule with the OpenFF method and calculate its energy.

mol = Molecule.from_smiles('CCO')
mol.generate_conformers()
positions = mol.conformers[0]
#load Force Field
ff = ForceField('smirnoff99Frosst.offxml')
topology = mol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
> 2.3399880921635527 kcal/mol

Next, calculate the energy starting from an RDKit mol object.

mol = Chem.MolFromSmiles('c1ccccc1')
rdw = RDKitToolkitWrapper()
ofmol = rdw.from_rdkit(mol)
ofmol.generate_conformers()
positions = ofmol.conformers[0]
topology = ofmol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
>7.987402593200566 kcal/mol

This is a very simple example of using OpenFF with RDKit. The package seems powerful for CADD. I would like to learn more about the package and about molecular dynamics.

The whole code for this post can be checked at my repo.

And I really thank the developers of RDKit, openforcefield, and the many other open-source chemoinformatics tools.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/OpenFF_test.ipynb

HTS with a microfluidic instrument and a DNA-encoded library #DEL #HTS #memo

A DNA-encoded library (DEL) is a powerful tool for HTS. There are now many CROs offering it, but the approach has limitations: it is restricted to affinity-based hit identification, and it is difficult to apply to insoluble or unstable targets. Today I read an interesting article by researchers from Roche and the Scripps Research Institute. The link is below.

Activity-Based DNA-Encoded Library Screening

The authors built their DEL as a 'one-bead-one-compound' library, with the compounds attached to the beads through a photocleavable linker.

Figure 1 in the article shows the overall structure of their assay platform. They selected autotaxin (ATX) as the screening target. I was particularly interested in the assay platform: each compound is assayed in its own droplet, detection is done by laser-induced fluorescence, and hit droplets are then sorted by applying an AC voltage pulse.

You can see a movie of the instrument in the supporting information.

They prepared a 67,100-member DEL and performed the miniaturized HTS campaign. The throughput of the assay was 30,000 beads per hour! They obtained ~7,000 hits with a good Z' factor (0.71).

As a result of the screening, the authors identified some novel chemical series as hits.

Amazingly, it required only 8 h of instrument time and consumed vanishingly small quantities of activity assay reagents (<10 μg ATX; <20 nmol substrate) and library (0.0005 of the total library), even though a huge number of compounds were assayed.

It is probably difficult to fully understand the technology from my post alone, so I recommend reading the article.

Integrating cutting-edge science and technology produces new solutions for drug discovery. I hope this technology accelerates drug discovery projects for human health.

Make a graph convolution model with the geometric deep learning extension library for PyTorch #RDKit #chemoinformatics #pytorch

In the chemoinformatics area, QSAR using molecular graphs as input is a very hot topic. Major implementations include DeepChem and chainer-chemistry, I think. I am also interested in graph-based QSAR model building. Recently I have been using PyTorch for my deep learning tasks, so I would like to build the model with PyTorch. Fortunately, a very elegant package named ‘pytorch_geometric‘ is available: PyTorch Geometric is a geometric deep learning extension library for PyTorch.

Today I tried to build a GCN model with the package.

First, I defined a mol-to-graph function that converts a molecule into graph features. Most of the code is borrowed from DeepChem.

The data structure of torch_geometric is described at this URL. I defined the molecular graph as an undirected graph.

from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
import multiprocessing
import logging
import torch
from torch_geometric.data import Data

# following code was borrowed from deepchem
# https://raw.githubusercontent.com/deepchem/deepchem/master/deepchem/feat/graph_features.py


def one_of_k_encoding(x, allowable_set):
  if x not in allowable_set:
    raise Exception("input {0} not in allowable set{1}:".format(
        x, allowable_set))
  return list(map(lambda s: x == s, allowable_set))


def one_of_k_encoding_unk(x, allowable_set):
  """Maps inputs not in the allowable set to the last element."""
  if x not in allowable_set:
    x = allowable_set[-1]
  return list(map(lambda s: x == s, allowable_set))


def get_intervals(l):
  """For list of lists, gets the cumulative products of the lengths"""
  intervals = len(l) * [0]
  # Initalize with 1
  intervals[0] = 1
  for k in range(1, len(l)):
    intervals[k] = (len(l[k]) + 1) * intervals[k - 1]

  return intervals


def safe_index(l, e):
  """Gets the index of e in l, providing an index of len(l) if not found"""
  try:
    return l.index(e)
  except:
    return len(l)


possible_atom_list = [
    'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
    'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
]
possible_numH_list = [0, 1, 2, 3, 4]
possible_valence_list = [0, 1, 2, 3, 4, 5, 6]
possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3]
possible_hybridization_list = [
    Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
    Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.SP3D,
    Chem.rdchem.HybridizationType.SP3D2
]
possible_number_radical_e_list = [0, 1, 2]
possible_chirality_list = ['R', 'S']

reference_lists = [
    possible_atom_list, possible_numH_list, possible_valence_list,
    possible_formal_charge_list, possible_number_radical_e_list,
    possible_hybridization_list, possible_chirality_list
]

intervals = get_intervals(reference_lists)


def get_feature_list(atom):
  features = 6 * [0]
  features[0] = safe_index(possible_atom_list, atom.GetSymbol())
  features[1] = safe_index(possible_numH_list, atom.GetTotalNumHs())
  features[2] = safe_index(possible_valence_list, atom.GetImplicitValence())
  features[3] = safe_index(possible_formal_charge_list, atom.GetFormalCharge())
  features[4] = safe_index(possible_number_radical_e_list,
                           atom.GetNumRadicalElectrons())
  features[5] = safe_index(possible_hybridization_list, atom.GetHybridization())
  return features


def features_to_id(features, intervals):
  """Convert list of features into index using spacings provided in intervals"""
  id = 0
  for k in range(len(intervals)):
    id += features[k] * intervals[k]

  # Allow 0 index to correspond to null molecule 1
  id = id + 1
  return id


def id_to_features(id, intervals):
  features = 6 * [0]

  # Correct for null
  id -= 1

  for k in range(0, 6 - 1):
    # print(6-k-1, id)
    features[6 - k - 1] = id // intervals[6 - k - 1]
    id -= features[6 - k - 1] * intervals[6 - k - 1]
  # Correct for last one
  features[0] = id
  return features


def atom_to_id(atom):
  """Return a unique id corresponding to the atom type"""
  features = get_feature_list(atom)
  return features_to_id(features, intervals)


def atom_features(atom,
                  bool_id_feat=False,
                  explicit_H=False,
                  use_chirality=False):
  if bool_id_feat:
    return np.array([atom_to_id(atom)])
  else:
    from rdkit import Chem
    results = one_of_k_encoding_unk(
      atom.GetSymbol(),
      [
        'C',
        'N',
        'O',
        'S',
        'F',
        'Si',
        'P',
        'Cl',
        'Br',
        'Mg',
        'Na',
        'Ca',
        'Fe',
        'As',
        'Al',
        'I',
        'B',
        'V',
        'K',
        'Tl',
        'Yb',
        'Sb',
        'Sn',
        'Ag',
        'Pd',
        'Co',
        'Se',
        'Ti',
        'Zn',
        'H',  # H?
        'Li',
        'Ge',
        'Cu',
        'Au',
        'Ni',
        'Cd',
        'In',
        'Mn',
        'Zr',
        'Cr',
        'Pt',
        'Hg',
        'Pb',
        'Unknown'
      ]) + one_of_k_encoding(atom.GetDegree(),
                             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + \
              one_of_k_encoding_unk(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5, 6]) + \
              [atom.GetFormalCharge(), atom.GetNumRadicalElectrons()] + \
              one_of_k_encoding_unk(atom.GetHybridization(), [
                Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
                Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.
                                    SP3D, Chem.rdchem.HybridizationType.SP3D2
              ]) + [atom.GetIsAromatic()]
    # In case of explicit hydrogen(QM8, QM9), avoid calling `GetTotalNumHs`
    if not explicit_H:
      results = results + one_of_k_encoding_unk(atom.GetTotalNumHs(),
                                                [0, 1, 2, 3, 4])
    if use_chirality:
      try:
        results = results + one_of_k_encoding_unk(
            atom.GetProp('_CIPCode'),
            ['R', 'S']) + [atom.HasProp('_ChiralityPossible')]
      except:
        results = results + [False, False
                            ] + [atom.HasProp('_ChiralityPossible')]

    return np.array(results)


def bond_features(bond, use_chirality=False):
  from rdkit import Chem
  bt = bond.GetBondType()
  bond_feats = [
      bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE,
      bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC,
      bond.GetIsConjugated(),
      bond.IsInRing()
  ]
  if use_chirality:
    bond_feats = bond_feats + one_of_k_encoding_unk(
        str(bond.GetStereo()),
        ["STEREONONE", "STEREOANY", "STEREOZ", "STEREOE"])
  return np.array(bond_feats)

#################
# pen added
#################
def get_bond_pair(mol):
  bonds = mol.GetBonds()
  res = [[],[]]
  for bond in bonds:
    res[0] += [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
    res[1] += [bond.GetEndAtomIdx(), bond.GetBeginAtomIdx()]
  return res

def mol2vec(mol):
  atoms = mol.GetAtoms()
  bonds = mol.GetBonds()
  node_f = [atom_features(atom) for atom in atoms]
  edge_index = get_bond_pair(mol)
  # each bond appears twice in edge_index (once per direction),
  # so append its feature vector twice to keep edge_attr aligned with edge_index
  edge_attr = []
  for bond in bonds:
    edge_attr.append(bond_features(bond, use_chirality=False))
    edge_attr.append(bond_features(bond, use_chirality=False))
  data = Data(x=torch.tensor(node_f, dtype=torch.float),
              edge_index=torch.tensor(edge_index, dtype=torch.long),
              edge_attr=torch.tensor(edge_attr, dtype=torch.float)
              )
  return data
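
Before moving on, a quick sanity check of the converter might look like this (a sketch I would append at the bottom of the mol2graph module; for ethanol it should report 3 atoms with 75 features and 4 directed edges):

if __name__ == '__main__':
  # convert a single molecule and inspect the resulting tensor shapes
  test_mol = Chem.MolFromSmiles('CCO')
  data = mol2vec(test_mol)
  print(data)  # expected shapes: x=[3, 75], edge_index=[2, 4], edge_attr=[4, 6]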

That completes the mol2graph definitions. Next I made a GCN class. The following code is written in a Jupyter notebook, so I call some methods for interactive visualization. First, load the packages and the test data; I used the solubility dataset that is provided with the RDKit.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from torch.utils.data import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from torch_geometric.data import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

plt.style.use("ggplot")

train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}

Then convert the molecules to graphs with the function defined above and attach the labels for training. The data loaders are defined next.

train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)
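
To confirm that the loaders behave as expected, one can pull a single mini-batch and inspect it (a small sketch using the loaders defined above):

batch = next(iter(train_loader))
print(batch.num_graphs)  # 64 molecules per mini-batch (batch_size=64, drop_last=True)
print(batch.x.shape)     # [total number of atoms in the batch, 75 node features]
print(batch.y.shape)     # one solubility class label per molecule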

Then I defined the model architecture for the GCN. The implementation is quite different from the original article.

n_features = 75
# definenet
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if you defined cache=True, the shape of batch must be same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
        
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x       

After defining the model, I trained it and evaluated its performance.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
def train(epoch):
    model.train()
    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_X)
def test(loader):
    model.eval()
    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)
hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    hist["loss"].append(train_loss)
    hist["acc"].append(train_acc)
    hist["test_acc"].append(test_acc)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')
ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")
plt.xlabel("epoch")
ax.legend()

After training, I got the following plot of the training history.

It does not look too bad. pytorch_geometric has many algorithms for graph-based deep learning.

I think it is a very cool and useful package for chemoinformaticians. ;)

All the code for this post is uploaded at the URL below.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/gcn.ipynb

Enjoyed SPARQLTHON and the weekend #memo

This week I participated in SPARQLTHON #78, which was held at the National Institute of Genetics.

I am not familiar with SPARQL, so it was difficult to develop something with it, but I tried SPARQLWrapper, a package for using SPARQL from Python, and used it to retrieve information from ChEMBL RDF.
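
I did not paste the hackathon code here, but a minimal SPARQLWrapper sketch of the kind of query I tried would look like the following. The endpoint URL is an assumption on my part (the EBI RDF platform hosts ChEMBL RDF, but please check the current address), and the query is just a toy that fetches a few triples:

from SPARQLWrapper import SPARQLWrapper, JSON

# assumed endpoint; verify against the current EBI RDF platform documentation
sparql = SPARQLWrapper('https://www.ebi.ac.uk/rdf/services/sparql')
# toy query: fetch a handful of triples just to confirm the connection works
sparql.setQuery("""
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['s']['value'], row['p']['value'], row['o']['value'])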

It was a good opportunity for me because I could focus on coding, and after the hackathon I had many fruitful discussions with the other participants. I was surprised to hear that traditional analytical methods are often more useful than deep learning in genetics: well-established rules such as Mendel's laws, grounded in the physical mechanisms that govern inheritance, can be stronger than machine learning methods. Deep learning is useful, but I have to think carefully about the cases where it will actually be effective.

And this weekend my kids and I went to a cherry-blossom viewing party. In Japan, the current season is perfect for it: there are many cherry blossoms, they are really beautiful, and the weather is getting warmer. I like this season, except for suffering from hay fever.

Here are some photos of the cherry blossoms. Tomorrow is the start of the new fiscal year. I will enjoy it and do my best.