Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use Pandas for data analysis. RDKit provides useful method named PandasTools. The method can load sdf and return data as pandas dataframe. By using dataframe, It isn’t needed to do something with for loop.

I found an interesting information in rdkit issues. A package named pyjanitor.

The package wraps pandas and provides useful method not only basic data handling but also chemoinformatics.

The package can be installed with conda from conda-forge channel. I used the library. Following code is example on my jupyter notebook. At first, read libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA
plt.style.use('ggplot')

Then load sdf data which is provided from rdkit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)

pyjanitor has chemoinformatics function, one is morgan_fingerprint. It can calculate data and add the result in current dataframe. To do it, it is very easy ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get fingerprint information as pandas dataframe which shape is (num of compounds, num of bits).

Conduct PCA with fingerprint.

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)

janitor.chemistry has smi2mol method. It converts SMILES to mol objects in pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df.head(2)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has add_column method. To use the method, I can add multiple columns with chain method.

# add_column function is not native pandas method.
# df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetAtoms) 
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])

It is easy to make fingerprint dataframe with janitor. And pyjanitor has other useful methods for pandas data handling.

code
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/pyjanitor_test.ipynb
pyjanitor
https://pyjanitor.readthedocs.io/index.html


Advertisements

New version of openforcefield supports RDKit! #RDKit #chemoinformatics #openforcefield

In this morning I found great news on my tiwtter TL that openforcefield supports RDKit! Older version of openforcefiled is required openeyeTK which is commercial license for industry. I had interested in the package but could not install because I’m not academia and our company does not OETK license now.

https://open-forcefield-toolkit.readthedocs.io/en/latest/releasehistory.html

I can’t wait to use it so I installed the package and check it immediately.
Following code is ran on google colab. If would like to run on local env, you can skip several steps.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local
# install rdkit via conda command
!time conda install -c conda-forge rdkit -y

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
from rdkit import Chem
from rdkit import rdBase
print(rdBase.rdkitVersion)

Then install openforcefield. Openforcefield is required openmm. Cuda version of colab was 10.0 so I did not need change of openmm install version. So, just type install ‘conda install openforcefield’. ;)

! conda config --add channels omnia --add channels conda-forge
! conda install -y openforcefield

Now ready, let’s call openforcefield.

from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.utils import RDKitToolkitWrapper
from openforcefield.typing.engines.smirnoff import ForceField
from simtk import openmm
from simtk import unit

Following code was borrowed from original repo.

def get_energy(system, positions):
    """
    Return the potential energy.

    Parameters
    ----------
    system : simtk.openmm.System
        The system to check
    positions : simtk.unit.Quantity of dimension (natoms,3) with units of length
        The positions to use
    Returns
    ---------
    energy
    """

    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    context.setPositions(positions)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Load molecule with openFF method and calculate energy.

mol = Molecule.from_smiles('CCO')
mol.generate_conformers()
positions = mol.conformers[0]
#load Force Field
ff = ForceField('smirnoff99Frosst.offxml')
topology = mol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
> 2.3399880921635527 kcal/mol

Next calculate energy from rdkit mol object.

mol = Chem.MolFromSmiles('c1ccccc1')
rdw = RDKitToolkitWrapper()
ofmol = rdw.from_rdkit(mol)
ofmol.generate_conformers()
positions = ofmol.conformers[0]
topology = ofmol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
print(orig_energy)
>7.987402593200566 kcal/mol

This is very simple example of OpenFF usage with RDKit. The package seems powerful for CADD. I would like to learn the package more and molecular dynamics.

Whole code of the post can check from my repo.

And I really thank to developer of RDKit, Openforcefield and many open source chemoinformatics tools.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/OpenFF_test.ipynb

Make Graph convolution model with geometric deep learning extension library for PyTorch #RDKit #chemoinformatics #pytorch

In the chemoinformatics area, QSAR by using molecular graph as input is very hot topic. Examples of major implementations are deepchem and chainer-chemistry I think. I also have interest about Graph based QSAR model building. Recently I am using pytorch for my task of deeplearning so I would like to build model with pytorch. Fortunately very elegant package is provided for pytorch named ‘pytorch_geometric‘. PyTorch Geometric is a geometric deep learning extension library for PyTorch.

Today I tried to build GCN model with the package.

At first I defined function of mol to graph which convert molecule to graph vector. Most part of the code borrowed from DeepChem.

Data structure of torch_geometry is described in this URL. I defined molecular graph as undirected graph.

from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
import multiprocessing
import logging
import torch
from torch_geometric.data import Data

# following code was borrowed from deepchem
# https://raw.githubusercontent.com/deepchem/deepchem/master/deepchem/feat/graph_features.py


def one_of_k_encoding(x, allowable_set):
  if x not in allowable_set:
    raise Exception("input {0} not in allowable set{1}:".format(
        x, allowable_set))
  return list(map(lambda s: x == s, allowable_set))


def one_of_k_encoding_unk(x, allowable_set):
  """Maps inputs not in the allowable set to the last element."""
  if x not in allowable_set:
    x = allowable_set[-1]
  return list(map(lambda s: x == s, allowable_set))


def get_intervals(l):
  """For list of lists, gets the cumulative products of the lengths"""
  intervals = len(l) * [0]
  # Initalize with 1
  intervals[0] = 1
  for k in range(1, len(l)):
    intervals[k] = (len(l[k]) + 1) * intervals[k - 1]

  return intervals


def safe_index(l, e):
  """Gets the index of e in l, providing an index of len(l) if not found"""
  try:
    return l.index(e)
  except:
    return len(l)


possible_atom_list = [
    'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
    'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
]
possible_numH_list = [0, 1, 2, 3, 4]
possible_valence_list = [0, 1, 2, 3, 4, 5, 6]
possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3]
possible_hybridization_list = [
    Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
    Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.SP3D,
    Chem.rdchem.HybridizationType.SP3D2
]
possible_number_radical_e_list = [0, 1, 2]
possible_chirality_list = ['R', 'S']

reference_lists = [
    possible_atom_list, possible_numH_list, possible_valence_list,
    possible_formal_charge_list, possible_number_radical_e_list,
    possible_hybridization_list, possible_chirality_list
]

intervals = get_intervals(reference_lists)


def get_feature_list(atom):
  features = 6 * [0]
  features[0] = safe_index(possible_atom_list, atom.GetSymbol())
  features[1] = safe_index(possible_numH_list, atom.GetTotalNumHs())
  features[2] = safe_index(possible_valence_list, atom.GetImplicitValence())
  features[3] = safe_index(possible_formal_charge_list, atom.GetFormalCharge())
  features[4] = safe_index(possible_number_radical_e_list,
                           atom.GetNumRadicalElectrons())
  features[5] = safe_index(possible_hybridization_list, atom.GetHybridization())
  return features


def features_to_id(features, intervals):
  """Convert list of features into index using spacings provided in intervals"""
  id = 0
  for k in range(len(intervals)):
    id += features[k] * intervals[k]

  # Allow 0 index to correspond to null molecule 1
  id = id + 1
  return id


def id_to_features(id, intervals):
  features = 6 * [0]

  # Correct for null
  id -= 1

  for k in range(0, 6 - 1):
    # print(6-k-1, id)
    features[6 - k - 1] = id // intervals[6 - k - 1]
    id -= features[6 - k - 1] * intervals[6 - k - 1]
  # Correct for last one
  features[0] = id
  return features


def atom_to_id(atom):
  """Return a unique id corresponding to the atom type"""
  features = get_feature_list(atom)
  return features_to_id(features, intervals)


def atom_features(atom,
                  bool_id_feat=False,
                  explicit_H=False,
                  use_chirality=False):
  if bool_id_feat:
    return np.array([atom_to_id(atom)])
  else:
    from rdkit import Chem
    results = one_of_k_encoding_unk(
      atom.GetSymbol(),
      [
        'C',
        'N',
        'O',
        'S',
        'F',
        'Si',
        'P',
        'Cl',
        'Br',
        'Mg',
        'Na',
        'Ca',
        'Fe',
        'As',
        'Al',
        'I',
        'B',
        'V',
        'K',
        'Tl',
        'Yb',
        'Sb',
        'Sn',
        'Ag',
        'Pd',
        'Co',
        'Se',
        'Ti',
        'Zn',
        'H',  # H?
        'Li',
        'Ge',
        'Cu',
        'Au',
        'Ni',
        'Cd',
        'In',
        'Mn',
        'Zr',
        'Cr',
        'Pt',
        'Hg',
        'Pb',
        'Unknown'
      ]) + one_of_k_encoding(atom.GetDegree(),
                             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + \
              one_of_k_encoding_unk(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5, 6]) + \
              [atom.GetFormalCharge(), atom.GetNumRadicalElectrons()] + \
              one_of_k_encoding_unk(atom.GetHybridization(), [
                Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
                Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.
                                    SP3D, Chem.rdchem.HybridizationType.SP3D2
              ]) + [atom.GetIsAromatic()]
    # In case of explicit hydrogen(QM8, QM9), avoid calling `GetTotalNumHs`
    if not explicit_H:
      results = results + one_of_k_encoding_unk(atom.GetTotalNumHs(),
                                                [0, 1, 2, 3, 4])
    if use_chirality:
      try:
        results = results + one_of_k_encoding_unk(
            atom.GetProp('_CIPCode'),
            ['R', 'S']) + [atom.HasProp('_ChiralityPossible')]
      except:
        results = results + [False, False
                            ] + [atom.HasProp('_ChiralityPossible')]

    return np.array(results)


def bond_features(bond, use_chirality=False):
  from rdkit import Chem
  bt = bond.GetBondType()
  bond_feats = [
      bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE,
      bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC,
      bond.GetIsConjugated(),
      bond.IsInRing()
  ]
  if use_chirality:
    bond_feats = bond_feats + one_of_k_encoding_unk(
        str(bond.GetStereo()),
        ["STEREONONE", "STEREOANY", "STEREOZ", "STEREOE"])
  return np.array(bond_feats)

#################
# pen added
#################
def get_bond_pair(mol):
  bonds = mol.GetBonds()
  res = [[],[]]
  for bond in bonds:
    res[0] += [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
    res[1] += [bond.GetEndAtomIdx(), bond.GetBeginAtomIdx()]
  return res

def mol2vec(mol):
  atoms = mol.GetAtoms()
  bonds = mol.GetBonds()
  node_f= [atom_features(atom) for atom in atoms]
  edge_index = get_bond_pair(mol)
  edge_attr = [bond_features(bond, use_chirality=False) for bond in bonds]
  for bond in bonds:
    edge_attr.append(bond_features(bond))
  data = Data(x=torch.tensor(node_f, dtype=torch.float),
              edge_index=torch.tensor(edge_index, dtype=torch.long),
              edge_attr=torch.tensor(edge_attr,dtype=torch.float)
              )
  return data

Now finished define mol2graph. I tried to make GCN class. Following code is written on ipython-notebook. So I call some method for interactive visualization. At first load packages and load test data. I used solubility dataset which is provided from RDKit.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from torch.utils.data import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from torch_geometric.data import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

plt.style.use("ggplot")

train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}

Then convert molecule to graph with the defined function and added label for training. Defined data loader next.

train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)

Then I defined model architecture for GCN. The implementation is very deferent for original article.

n_features = 75
# definenet
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if you defined cache=True, the shape of batch must be same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
        
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x       

After defined model, I tried to train the model and evaluate the performance.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
def train(epoch):
    model.train()
    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss.backward()
        loss_all += loss.item() * data.num_graphs
        optimizer.step()
    return loss_all / len(train_X)
def test(loader):
    model.eval()
    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)
hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    hist["loss"].append(train_loss)
    hist["acc"].append(train_acc)
    hist["test_acc"].append(test_acc)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')
ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")
plt.xlabel("epoch")
ax.legend()

After training, I could get following image.

It seems not so bad. Pytorch_geometry has many algorithms for graph based deep learning.

I think it is very cool and useful package for chemoinformatian. ;)

All code of the post is uploaded the URL below.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/gcn.ipynb

Generate possible heteroaromatic cores from query molecule #RDKit #chemoinformatics

Hetero shuffling is the approach which replace atoms of scaffold and generate new molecule with atom replaced scaffold. For example benzene as core, examples of shuffled cores will be pyridine, pyrimidine etc.

The approach is often used medicinal chemistry project to improve ADMET properties, biological activities and also used for substance patent claim strategy. Native RDKit can’t do it directly. So I would like to code to do that.

Let’s try. At first always import required packages. Following code wrote on Jupyter-Notebook.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import Fragments
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import numpy as np
import copy
from rdkit.Chem import AllChem
import itertools
rdBase.DisableLog('rdApp.err

Then I defined function named checkmol which extract atoms which is not inringsize 5 and check atom symbol is not sulphur or oxygen. Because normally S or O contained six membered aromatic ring is not used for building blocks. And my code generates all combination of aromatic atoms, so I need the check function.

def checkmol(mol):
    arom_atoms = mol.GetAromaticAtoms()
    symbols = [atom.GetSymbol() for atom in arom_atoms if not atom.IsInRingSize(5)]
    if symbols == []:
        return True
    elif 'O' in symbols or 'S' in symbols:
        return False
    else:
        return True

Next I defined main function. HeteroShuffle class is required two arguments one is query molecule which would like to change scaffold and second one is query which is target core for shuffling.

The make_connectors function generates reaction objects. The objects memorize the position where the rgroups are attached the query core.

The re_construction function reconstructs molecule from r-groups and new atom shuffled core.

The generate_mols function generate shuffled heteroaromatic scaffold. The function generates possible combinations of atoms with ‘itertools.product’ method. And checkmol function filters undesirable molecules.

class HeteroShuffle():
    
    def __init__(self, mol, query):
        self.mol = mol
        self.query = query
        self.subs = Chem.ReplaceCore(self.mol, self.query)
        self.core = Chem.ReplaceSidechains(self.mol, self.query)
        self.target_atomic_nums = [6, 7, 8, 16]
    
    
    def make_connectors(self):
        n = len(Chem.MolToSmiles(self.subs).split('.'))
        map_no = n+1
        self.rxn_dict = {}
        for i in range(n):
            self.rxn_dict[i+1] = AllChem.ReactionFromSmarts('[{0}*][*:{1}].[{0}*][*:{2}]>>[*:{1}][*:{2}]'.format(i+1, map_no, map_no+1))
        return self.rxn_dict

    def re_construct_mol(self, core):
        '''
        re construct mols from given substructures and core
        '''
        keys = self.rxn_dict.keys()
        ps = [[core]]
        for key in keys:
            ps = self.rxn_dict[key].RunReactants([ps[0][0], self.subs])
        mol = ps[0][0]
        try:
            smi = Chem.MolToSmiles(mol)
            mol = Chem.MolFromSmiles(smi)
            Chem.SanitizeMol(mol)
            return mol
        except:
            return None

    def get_target_atoms(self):
        '''
        get target atoms for replace
        target atoms means atoms which don't have anyatom(*) in neighbors
        '''
        atoms = []
        for atom in self.core.GetAromaticAtoms():
            neighbors = [a.GetSymbol() for a in atom.GetNeighbors()]
            if '*' not in neighbors and atom.GetSymbol() !='*':
                atoms.append(atom)
        print(len(atoms))
        return atoms
    
    def generate_mols(self):
        atoms = self.get_target_atoms()
        idxs = [atom.GetIdx() for atom in atoms]
        combinations = itertools.product(self.target_atomic_nums, repeat=len(idxs))
        smiles_set = set()
        self.make_connectors()
        for combination in combinations:
            target = copy.deepcopy(self.core)
            #print(Chem.MolToSmiles(target))
            for i, idx in enumerate(idxs):
                target.GetAtomWithIdx(idx).SetAtomicNum(combination[i])
            smi = Chem.MolToSmiles(target)
            #smi = smi.replace('sH','s').replace('oH','o').replace('cH3','c')
            #print('rep '+smi)
            target = Chem.MolFromSmiles(smi)
            if target != None:
                n_attachment = len([atom for atom in target.GetAtoms() if atom.GetAtomicNum() == 0])
                n_aromatic_atoms = len(list(target.GetAromaticAtoms()))
                if target.GetNumAtoms() - n_attachment == n_aromatic_atoms:
                    try:
                        mol = self.re_construct_mol(target)  
                        if checkmol(mol):
                            smiles_set.add(Chem.MolToSmiles(mol))
                    except:
                        pass
        mols = [Chem.MolFromSmiles(smi) for smi in smiles_set]
        return mols

Now ready. Let’s test the code. First example is fused six-membered hetero cycles as a query.

# Gefitinib
mol1 = Chem.MolFromSmiles('COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4')
core1 = Chem.MolFromSmiles('c1ccc2c(c1)cncn2')
ht=HeteroSuffle(mol1, core1)
res=ht.generate_mols()
Draw.MolsToGridImage(res, molsPerRow=5)

Second one is a molecule with five membered ring as a query.

#  Oxaprozin
mol2 = Chem.MolFromSmiles('OC(=O)CCC1=NC(=C(O1)C1=CC=CC=C1)C1=CC=CC=C1')
core2 =  Chem.MolFromSmiles('c1cnco1')
ht=HeteroSuffle(mol2, core2)
res=ht.generate_mols()
Draw.MolsToGridImage(res, molsPerRow=5)

It worked. The code generates all possible combinations include undesired rings (i.e. oxygen contained six membered heteroaromatic systems). But it seems easy to generate scaffold diverse molecules.

All code can check from following URL. The code is a piece of the code of thin book for chemoinformatics, py4chemoinformatcs.

https://nbviewer.jupyter.org/github/Mishima-syk/py4chemoinformatics/blob/master/notebooks/ch05_hetero_shuffle.ipynb

Now we started the book translation from Japanese to English with contributors.

Draw molecular network on Jupyter notebook with rdkit and cytoscape.js-2 #RDKit #cytoscape

Yesterday, I posted about cyjupyter. And got comment how to render the compound name on each node.

I think it is possible to use context attribute in style settings.

Let’s try. Code is almost same as yesterday

import networkx as nx
from networkx.readwrite import cytoscape_data
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdChemReactions
from rdkit.Chem.Draw import rdMolDraw2D
import cyjupyter
from cyjupyter import Cytoscape
import os
from urllib import parse
mmp_path = os.path.join(RDConfig.RDContribDir, "mmpa/data/sample_mmps_default.csv")
lines = []
with open(mmp_path, 'r') as f:
    for line in f:
        lines.append(line.rstrip().split(','))
nodes = set()
for line in lines:
    nodes.add(line[0])
    nodes.add(line[1])
nodes = list(nodes)
print(str(len(nodes))+' mols')
def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    try:
        Chem.rdmolops.Kekulize(mol)
    except:
        pass
    drawer = rdMolDraw2D.MolDraw2DSVG(400, 200)
    AllChem.Compute2DCoords(mol)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg

def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Networkx is easy to make network with many attributes. I set smiles as name attribute.

g = nx.Graph()
for node in nodes:
    g.add_node(node, img=smi2image(node), name=node)
for line in lines:
    g.add_edge(line[0], line[1])
stobj=[
  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'width':800,
      'height':400,
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
      'background-image':'data(img)',
      'background-fit':'contain',
      'content':'data(name)',
      "font-size": "60px",
                    },
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            },
            'selector': 'edge'}
            ],
  }]

OK let’s draw network!

cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='cose')
cyobj

Now, SMILES strings can be shown as node name.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/cytoscapejs_py-Copy1.ipynb

Draw molecular network on Jupyter notebook with rdkit and cytoscape.js #RDKit #cytoscape

I use cytsocape and cytoscape.js when I would like to draw molecular network. Molecular network can be made from similarity, matched molecular pair etc. In cytoscape there is a plug in for drawing chemical structures named ‘chemviz‘. There is no plugin for cytoscape.js. So it is needed for drawing function which draw chemical structures as node.

Recent version of RDKit provides SVG rendering function of chemical structure. Today I tried to molecular network which shows structures as nodes.

For convenience, I used cyjupyter which is an interactive network visualizer for Jupyter notebook. Referred the discussion, cytoscape.js can render the node with SVG as background-image. I used parse.quote function instead of encodeURLComponent because I wrote the code with python. ;)

Let’s try!
At first, I imported some libraries for drawing network and read sample data which got from rdkit/contrib/mmpa/data.

import networkx as nx
from networkx.readwrite import cytoscape_data
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdChemReactions
from rdkit.Chem.Draw import rdMolDraw2D
import cyjupyter
from cyjupyter import Cytoscape
import os
from urllib import parse
mmp_path = os.path.join(RDConfig.RDContribDir, "mmpa/data/sample_mmps_default.csv")
lines = []
with open(mmp_path, 'r') as f:
    for line in f:
        lines.append(line.rstrip().split(','))
nodes = set()
for line in lines:
    nodes.add(line[0])
    nodes.add(line[1])
nodes = list(nodes)
print(str(len(nodes))+' mols')

Data format of sample_mmps_default.csv is molA, molB, id_mol_A, id_mol_B, transform. So I used molA and molB for input.

Next I defined the converter which transform SMIELS to image. Default setting of parse.quote is safe=”/”. I used safe=”” because I would like to escape the option.

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    try:
        Chem.rdmolops.Kekulize(mol)
    except:
        pass
    drawer = rdMolDraw2D.MolDraw2DSVG(400, 200)
    AllChem.Compute2DCoords(mol)
    drawer.DrawMolecule(mol)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg

def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Next, make MMP network with networkx and convert the data to cytoscape readable format. It is very simple. Please see below.

g = nx.Graph()
for node in nodes:
    g.add_node(node, img=smi2image(node))
for line in lines:
    g.add_edge(line[0], line[1])
cy_g = cytoscape_data(g)

Now almost there. Define style settings. Key point of the part is setting of background-image of nodes. It can get each node img attribute with ‘data(img)’.

stobj=[
  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'width':800,
      'height':400,
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
      'background-image':'data(img)',
      'background-fit':'contain'
                    },
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            },
            'selector': 'edge'}
            ],
  }]

OK let’s draw network!

cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='cose')
cyobj

It worked fine. The example is very simple and small. Network can have many attribute on not only nodes but also edges. And use can set many style settings.

Rendering structure on the fly and visualize it as node is useful and interesting tips I think.

I uploaded the code above on my github repo.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/cytoscapejs_py.ipynb

Visualize HOMO LUMO with psi4 #RDKit #psi4 #psikit

Now thin wrapper of psi4 named psikit can generate cube file which has frontier orbital information. After calling getMOview, I would like to check the orbital shape.

Psi4 provides vmd_cube.py script which generates cool view on VMD. To run the script on python3, it is needed to change line 332 from ‘for k,v in options.iteritems():’ to ‘for k,v in options.items():’.

After changed the code. Ready to visualize molecular orbital! Let’s check it.
I run the following code on jupyter notebook.

import os
import sys
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
sys.path.append('/Users/iwatobipen/develop/chemoenv/psikit/psikit/')
from psikit import Psikit
pk = Psikit()
pk.read_from_smiles('COc1ccncc1')
pk.optimize()
>Optimizer: Optimization complete!
>-360.5913889506649

Then call getMOview() to generate MO cube files.

Next, run the vmd_cube.py. The script has many options. To check the option, vmd_cube.py -h gives help.

I run the code interactive mode. After run the code, I could see MO on VMD.

To run the code, reader need install VMD in your PC. Of course the script generates image of each MO.

I think it is interesting that visualize MO of protein-ligand complex because it will show interaction of protein and ligand. It is very useful for understanding of the interaction. Any comments or suggestions are greatly appreciated. ;)