Handling chemoinformatics data with pandas #RDKit #chemoinformatics

I often use Pandas for data analysis. RDKit provides useful method named PandasTools. The method can load sdf and return data as pandas dataframe. By using dataframe, It isn’t needed to do something with for loop.

I found an interesting information in rdkit issues. A package named pyjanitor.

The package wraps pandas and provides useful method not only basic data handling but also chemoinformatics.

The package can be installed with conda from conda-forge channel. I used the library. Following code is example on my jupyter notebook. At first, read libraries for the trial.

%matplotlib inline
import matplotlib.pyplot as plt
import os
from rdkit import Chem
from rdkit import RDConfig

import pandas as pd
import janitor
from janitor import chemistry

from rdkit.Chem import PandasTools
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load sdf data which is provided from rdkit.

path = os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf')
df = PandasTools.LoadSDF(path)

pyjanitor has chemoinformatics function, one is morgan_fingerprint. It can calculate data and add the result in current dataframe. To do it, it is very easy ;)

fp1=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='bits')

Now I can get fingerprint information as pandas dataframe which shape is (num of compounds, num of bits).

Conduct PCA with fingerprint.

fp2=chemistry.morgan_fingerprint(df, mols_col='ROMol', radius=2, nbits=512, kind='counts')
pca = PCA(n_components=3)
pca_res = pca.fit_transform(fp2)

janitor.chemistry has smi2mol method. It converts SMILES to mol objects in pandas dataframe.

df['SMILES'] = df.ROMol.apply(Chem.MolToSmiles)
df = chemistry.smiles2mol(df, smiles_col='SMILES', mols_col='newROMol', progressbar='notebook')

And janitor has add_column method. To use the method, I can add multiple columns with chain method.

# add_column function is not native pandas method.
# df['NumAtm'] = df.ROMol.apply(Chem.Mol.GetAtoms) 
df = df.add_column('NumAtm', [Chem.Mol.GetNumAtoms(mol) for mol in df.ROMol])

It is easy to make fingerprint dataframe with janitor. And pyjanitor has other useful methods for pandas data handling.


New version of openforcefield supports RDKit! #RDKit #chemoinformatics #openforcefield

In this morning I found great news on my tiwtter TL that openforcefield supports RDKit! Older version of openforcefiled is required openeyeTK which is commercial license for industry. I had interested in the package but could not install because I’m not academia and our company does not OETK license now.


I can’t wait to use it so I installed the package and check it immediately.
Following code is ran on google colab. If would like to run on local env, you can skip several steps.

# install conda
!wget https://repo.continuum.io/miniconda/Miniconda3-4.4.10-Linux-x86_64.sh
!chmod ./Miniconda3-4.4.10-Linux-x86_64.sh
!time bash ./Miniconda3-4.4.10-Linux-x86_64.sh -b -f -p /usr/local
# install rdkit via conda command
!time conda install -c conda-forge rdkit -y

import sys
import os
from rdkit import Chem
from rdkit import rdBase

Then install openforcefield. Openforcefield is required openmm. Cuda version of colab was 10.0 so I did not need change of openmm install version. So, just type install ‘conda install openforcefield’. ;)

! conda config --add channels omnia --add channels conda-forge
! conda install -y openforcefield

Now ready, let’s call openforcefield.

from openforcefield.topology import Molecule
from openforcefield.topology import Topology
from openforcefield.utils import RDKitToolkitWrapper
from openforcefield.typing.engines.smirnoff import ForceField
from simtk import openmm
from simtk import unit

Following code was borrowed from original repo.

def get_energy(system, positions):
    Return the potential energy.

    system : simtk.openmm.System
        The system to check
    positions : simtk.unit.Quantity of dimension (natoms,3) with units of length
        The positions to use

    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    context = openmm.Context(system, integrator)
    state = context.getState(getEnergy=True)
    energy = state.getPotentialEnergy().in_units_of(unit.kilocalories_per_mole)
    return energy

Load molecule with openFF method and calculate energy.

mol = Molecule.from_smiles('CCO')
positions = mol.conformers[0]
#load Force Field
ff = ForceField('smirnoff99Frosst.offxml')
topology = mol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
> 2.3399880921635527 kcal/mol

Next calculate energy from rdkit mol object.

mol = Chem.MolFromSmiles('c1ccccc1')
rdw = RDKitToolkitWrapper()
ofmol = rdw.from_rdkit(mol)
positions = ofmol.conformers[0]
topology = ofmol.to_topology()
orginal_system = ff.create_openmm_system(topology)
orig_energy = get_energy(orginal_system, positions)
>7.987402593200566 kcal/mol

This is very simple example of OpenFF usage with RDKit. The package seems powerful for CADD. I would like to learn the package more and molecular dynamics.

Whole code of the post can check from my repo.

And I really thank to developer of RDKit, Openforcefield and many open source chemoinformatics tools.


Make Graph convolution model with geometric deep learning extension library for PyTorch #RDKit #chemoinformatics #pytorch

In the chemoinformatics area, QSAR by using molecular graph as input is very hot topic. Examples of major implementations are deepchem and chainer-chemistry I think. I also have interest about Graph based QSAR model building. Recently I am using pytorch for my task of deeplearning so I would like to build model with pytorch. Fortunately very elegant package is provided for pytorch named ‘pytorch_geometric‘. PyTorch Geometric is a geometric deep learning extension library for PyTorch.

Today I tried to build GCN model with the package.

At first I defined function of mol to graph which convert molecule to graph vector. Most part of the code borrowed from DeepChem.

Data structure of torch_geometry is described in this URL. I defined molecular graph as undirected graph.

from __future__ import division
from __future__ import unicode_literals
import numpy as np
from rdkit import Chem
import multiprocessing
import logging
import torch
from torch_geometric.data import Data

# following code was borrowed from deepchem
# https://raw.githubusercontent.com/deepchem/deepchem/master/deepchem/feat/graph_features.py

def one_of_k_encoding(x, allowable_set):
  if x not in allowable_set:
    raise Exception("input {0} not in allowable set{1}:".format(
        x, allowable_set))
  return list(map(lambda s: x == s, allowable_set))

def one_of_k_encoding_unk(x, allowable_set):
  """Maps inputs not in the allowable set to the last element."""
  if x not in allowable_set:
    x = allowable_set[-1]
  return list(map(lambda s: x == s, allowable_set))

def get_intervals(l):
  """For list of lists, gets the cumulative products of the lengths"""
  intervals = len(l) * [0]
  # Initalize with 1
  intervals[0] = 1
  for k in range(1, len(l)):
    intervals[k] = (len(l[k]) + 1) * intervals[k - 1]

  return intervals

def safe_index(l, e):
  """Gets the index of e in l, providing an index of len(l) if not found"""
    return l.index(e)
    return len(l)

possible_atom_list = [
    'C', 'N', 'O', 'S', 'F', 'P', 'Cl', 'Mg', 'Na', 'Br', 'Fe', 'Ca', 'Cu',
    'Mc', 'Pd', 'Pb', 'K', 'I', 'Al', 'Ni', 'Mn'
possible_numH_list = [0, 1, 2, 3, 4]
possible_valence_list = [0, 1, 2, 3, 4, 5, 6]
possible_formal_charge_list = [-3, -2, -1, 0, 1, 2, 3]
possible_hybridization_list = [
    Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
    Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.SP3D,
possible_number_radical_e_list = [0, 1, 2]
possible_chirality_list = ['R', 'S']

reference_lists = [
    possible_atom_list, possible_numH_list, possible_valence_list,
    possible_formal_charge_list, possible_number_radical_e_list,
    possible_hybridization_list, possible_chirality_list

intervals = get_intervals(reference_lists)

def get_feature_list(atom):
  features = 6 * [0]
  features[0] = safe_index(possible_atom_list, atom.GetSymbol())
  features[1] = safe_index(possible_numH_list, atom.GetTotalNumHs())
  features[2] = safe_index(possible_valence_list, atom.GetImplicitValence())
  features[3] = safe_index(possible_formal_charge_list, atom.GetFormalCharge())
  features[4] = safe_index(possible_number_radical_e_list,
  features[5] = safe_index(possible_hybridization_list, atom.GetHybridization())
  return features

def features_to_id(features, intervals):
  """Convert list of features into index using spacings provided in intervals"""
  id = 0
  for k in range(len(intervals)):
    id += features[k] * intervals[k]

  # Allow 0 index to correspond to null molecule 1
  id = id + 1
  return id

def id_to_features(id, intervals):
  features = 6 * [0]

  # Correct for null
  id -= 1

  for k in range(0, 6 - 1):
    # print(6-k-1, id)
    features[6 - k - 1] = id // intervals[6 - k - 1]
    id -= features[6 - k - 1] * intervals[6 - k - 1]
  # Correct for last one
  features[0] = id
  return features

def atom_to_id(atom):
  """Return a unique id corresponding to the atom type"""
  features = get_feature_list(atom)
  return features_to_id(features, intervals)

def atom_features(atom,
  if bool_id_feat:
    return np.array([atom_to_id(atom)])
    from rdkit import Chem
    results = one_of_k_encoding_unk(
        'H',  # H?
      ]) + one_of_k_encoding(atom.GetDegree(),
                             [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + \
              one_of_k_encoding_unk(atom.GetImplicitValence(), [0, 1, 2, 3, 4, 5, 6]) + \
              [atom.GetFormalCharge(), atom.GetNumRadicalElectrons()] + \
              one_of_k_encoding_unk(atom.GetHybridization(), [
                Chem.rdchem.HybridizationType.SP, Chem.rdchem.HybridizationType.SP2,
                Chem.rdchem.HybridizationType.SP3, Chem.rdchem.HybridizationType.
                                    SP3D, Chem.rdchem.HybridizationType.SP3D2
              ]) + [atom.GetIsAromatic()]
    # In case of explicit hydrogen(QM8, QM9), avoid calling `GetTotalNumHs`
    if not explicit_H:
      results = results + one_of_k_encoding_unk(atom.GetTotalNumHs(),
                                                [0, 1, 2, 3, 4])
    if use_chirality:
        results = results + one_of_k_encoding_unk(
            ['R', 'S']) + [atom.HasProp('_ChiralityPossible')]
        results = results + [False, False
                            ] + [atom.HasProp('_ChiralityPossible')]

    return np.array(results)

def bond_features(bond, use_chirality=False):
  from rdkit import Chem
  bt = bond.GetBondType()
  bond_feats = [
      bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE,
      bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC,
  if use_chirality:
    bond_feats = bond_feats + one_of_k_encoding_unk(
  return np.array(bond_feats)

# pen added
def get_bond_pair(mol):
  bonds = mol.GetBonds()
  res = [[],[]]
  for bond in bonds:
    res[0] += [bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()]
    res[1] += [bond.GetEndAtomIdx(), bond.GetBeginAtomIdx()]
  return res

def mol2vec(mol):
  atoms = mol.GetAtoms()
  bonds = mol.GetBonds()
  node_f= [atom_features(atom) for atom in atoms]
  edge_index = get_bond_pair(mol)
  edge_attr = [bond_features(bond, use_chirality=False) for bond in bonds]
  for bond in bonds:
  data = Data(x=torch.tensor(node_f, dtype=torch.float),
              edge_index=torch.tensor(edge_index, dtype=torch.long),
  return data

Now finished define mol2graph. I tried to make GCN class. Following code is written on ipython-notebook. So I call some method for interactive visualization. At first load packages and load test data. I used solubility dataset which is provided from RDKit.

%matplotlib inline
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
import torch
import torch.nn.functional as F
from torch.nn import Linear
from torch.nn import BatchNorm1d
from torch.utils.data import Dataset
from torch_geometric.nn import GCNConv
from torch_geometric.nn import ChebConv
from torch_geometric.nn import global_add_pool, global_mean_pool
from torch_geometric.data import DataLoader
from torch_scatter import scatter_mean
import mol2graph
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw


train_mols = [m for m in Chem.SDMolSupplier('solubility.train.sdf')]
test_mols = [m for m in Chem.SDMolSupplier('solubility.test.sdf')]
sol_cls_dict = {'(A) low':0, '(B) medium':1, '(C) high':2}

Then convert molecule to graph with the defined function and added label for training. Defined data loader next.

train_X = [mol2graph.mol2vec(m) for m in train_mols]
for i, data in enumerate(train_X):
    y = sol_cls_dict[train_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)

test_X = [mol2graph.mol2vec(m) for m in test_mols]
for i, data in enumerate(test_X):
    y = sol_cls_dict[test_mols[i].GetProp('SOL_classification')]
    data.y = torch.tensor([y], dtype=torch.long)
train_loader = DataLoader(train_X, batch_size=64, shuffle=True, drop_last=True)
test_loader = DataLoader(test_X, batch_size=64, shuffle=True, drop_last=True)

Then I defined model architecture for GCN. The implementation is very deferent for original article.

n_features = 75
# definenet
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = GCNConv(n_features, 128, cached=False) # if you defined cache=True, the shape of batch must be same!
        self.bn1 = BatchNorm1d(128)
        self.conv2 = GCNConv(128, 64, cached=False)
        self.bn2 = BatchNorm1d(64)
        self.fc1 = Linear(64, 64)
        self.bn3 = BatchNorm1d(64)
        self.fc2 = Linear(64, 64)
        self.fc3 = Linear(64, 3)
    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = F.relu(self.conv1(x, edge_index))
        x = self.bn1(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.bn2(x)
        x = global_add_pool(x, data.batch)
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = F.dropout(x, p=0.2, training=self.training)
        x = self.fc3(x)
        x = F.log_softmax(x, dim=1)
        return x       

After defined model, I tried to train the model and evaluate the performance.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
def train(epoch):
    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        output = model(data)
        loss = F.nll_loss(output, data.y)
        loss_all += loss.item() * data.num_graphs
    return loss_all / len(train_X)
def test(loader):
    correct = 0
    for data in loader:
        data = data.to(device)
        output = model(data)
        pred = output.max(dim=1)[1]
        correct += pred.eq(data.y).sum().item()
    return correct / len(loader.dataset)
hist = {"loss":[], "acc":[], "test_acc":[]}
for epoch in range(1, 101):
    train_loss = train(epoch)
    train_acc = test(train_loader)
    test_acc = test(test_loader)
    print(f'Epoch: {epoch}, Train loss: {train_loss:.3}, Train_acc: {train_acc:.3}, Test_acc: {test_acc:.3}')
ax = plt.subplot(1,1,1)
ax.plot([e for e in range(1,101)], hist["loss"], label="train_loss")
ax.plot([e for e in range(1,101)], hist["acc"], label="train_acc")
ax.plot([e for e in range(1,101)], hist["test_acc"], label="test_acc")

After training, I could get following image.

It seems not so bad. Pytorch_geometry has many algorithms for graph based deep learning.

I think it is very cool and useful package for chemoinformatian. ;)

All code of the post is uploaded the URL below.


Generate possible heteroaromatic cores from query molecule #RDKit #chemoinformatics

Hetero shuffling is the approach which replace atoms of scaffold and generate new molecule with atom replaced scaffold. For example benzene as core, examples of shuffled cores will be pyridine, pyrimidine etc.

The approach is often used medicinal chemistry project to improve ADMET properties, biological activities and also used for substance patent claim strategy. Native RDKit can’t do it directly. So I would like to code to do that.

Let’s try. At first always import required packages. Following code wrote on Jupyter-Notebook.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import Fragments
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import numpy as np
import copy
from rdkit.Chem import AllChem
import itertools

Then I defined function named checkmol which extract atoms which is not inringsize 5 and check atom symbol is not sulphur or oxygen. Because normally S or O contained six membered aromatic ring is not used for building blocks. And my code generates all combination of aromatic atoms, so I need the check function.

def checkmol(mol):
    arom_atoms = mol.GetAromaticAtoms()
    symbols = [atom.GetSymbol() for atom in arom_atoms if not atom.IsInRingSize(5)]
    if symbols == []:
        return True
    elif 'O' in symbols or 'S' in symbols:
        return False
        return True

Next I defined main function. HeteroShuffle class is required two arguments one is query molecule which would like to change scaffold and second one is query which is target core for shuffling.

The make_connectors function generates reaction objects. The objects memorize the position where the rgroups are attached the query core.

The re_construction function reconstructs molecule from r-groups and new atom shuffled core.

The generate_mols function generate shuffled heteroaromatic scaffold. The function generates possible combinations of atoms with ‘itertools.product’ method. And checkmol function filters undesirable molecules.

class HeteroShuffle():
    def __init__(self, mol, query):
        self.mol = mol
        self.query = query
        self.subs = Chem.ReplaceCore(self.mol, self.query)
        self.core = Chem.ReplaceSidechains(self.mol, self.query)
        self.target_atomic_nums = [6, 7, 8, 16]
    def make_connectors(self):
        n = len(Chem.MolToSmiles(self.subs).split('.'))
        map_no = n+1
        self.rxn_dict = {}
        for i in range(n):
            self.rxn_dict[i+1] = AllChem.ReactionFromSmarts('[{0}*][*:{1}].[{0}*][*:{2}]>>[*:{1}][*:{2}]'.format(i+1, map_no, map_no+1))
        return self.rxn_dict

    def re_construct_mol(self, core):
        re construct mols from given substructures and core
        keys = self.rxn_dict.keys()
        ps = [[core]]
        for key in keys:
            ps = self.rxn_dict[key].RunReactants([ps[0][0], self.subs])
        mol = ps[0][0]
            smi = Chem.MolToSmiles(mol)
            mol = Chem.MolFromSmiles(smi)
            return mol
            return None

    def get_target_atoms(self):
        get target atoms for replace
        target atoms means atoms which don't have anyatom(*) in neighbors
        atoms = []
        for atom in self.core.GetAromaticAtoms():
            neighbors = [a.GetSymbol() for a in atom.GetNeighbors()]
            if '*' not in neighbors and atom.GetSymbol() !='*':
        return atoms
    def generate_mols(self):
        atoms = self.get_target_atoms()
        idxs = [atom.GetIdx() for atom in atoms]
        combinations = itertools.product(self.target_atomic_nums, repeat=len(idxs))
        smiles_set = set()
        for combination in combinations:
            target = copy.deepcopy(self.core)
            for i, idx in enumerate(idxs):
            smi = Chem.MolToSmiles(target)
            #smi = smi.replace('sH','s').replace('oH','o').replace('cH3','c')
            #print('rep '+smi)
            target = Chem.MolFromSmiles(smi)
            if target != None:
                n_attachment = len([atom for atom in target.GetAtoms() if atom.GetAtomicNum() == 0])
                n_aromatic_atoms = len(list(target.GetAromaticAtoms()))
                if target.GetNumAtoms() - n_attachment == n_aromatic_atoms:
                        mol = self.re_construct_mol(target)  
                        if checkmol(mol):
        mols = [Chem.MolFromSmiles(smi) for smi in smiles_set]
        return mols

Now ready. Let’s test the code. First example is fused six-membered hetero cycles as a query.

# Gefitinib
mol1 = Chem.MolFromSmiles('COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4')
core1 = Chem.MolFromSmiles('c1ccc2c(c1)cncn2')
ht=HeteroSuffle(mol1, core1)
Draw.MolsToGridImage(res, molsPerRow=5)

Second one is a molecule with five membered ring as a query.

#  Oxaprozin
mol2 = Chem.MolFromSmiles('OC(=O)CCC1=NC(=C(O1)C1=CC=CC=C1)C1=CC=CC=C1')
core2 =  Chem.MolFromSmiles('c1cnco1')
ht=HeteroSuffle(mol2, core2)
Draw.MolsToGridImage(res, molsPerRow=5)

It worked. The code generates all possible combinations include undesired rings (i.e. oxygen contained six membered heteroaromatic systems). But it seems easy to generate scaffold diverse molecules.

All code can check from following URL. The code is a piece of the code of thin book for chemoinformatics, py4chemoinformatcs.


Now we started the book translation from Japanese to English with contributors.

Draw molecular network on Jupyter notebook with rdkit and cytoscape.js-2 #RDKit #cytoscape

Yesterday, I posted about cyjupyter. And got comment how to render the compound name on each node.

I think it is possible to use context attribute in style settings.

Let’s try. Code is almost same as yesterday

import networkx as nx
from networkx.readwrite import cytoscape_data
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdChemReactions
from rdkit.Chem.Draw import rdMolDraw2D
import cyjupyter
from cyjupyter import Cytoscape
import os
from urllib import parse
mmp_path = os.path.join(RDConfig.RDContribDir, "mmpa/data/sample_mmps_default.csv")
lines = []
with open(mmp_path, 'r') as f:
    for line in f:
nodes = set()
for line in lines:
nodes = list(nodes)
print(str(len(nodes))+' mols')
def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    drawer = rdMolDraw2D.MolDraw2DSVG(400, 200)
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg

def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Networkx is easy to make network with many attributes. I set smiles as name attribute.

g = nx.Graph()
for node in nodes:
    g.add_node(node, img=smi2image(node), name=node)
for line in lines:
    g.add_edge(line[0], line[1])
  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
      "font-size": "60px",
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            'selector': 'edge'}

OK let’s draw network!

cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='cose')

Now, SMILES strings can be shown as node name.


Draw molecular network on Jupyter notebook with rdkit and cytoscape.js #RDKit #cytoscape

I use cytsocape and cytoscape.js when I would like to draw molecular network. Molecular network can be made from similarity, matched molecular pair etc. In cytoscape there is a plug in for drawing chemical structures named ‘chemviz‘. There is no plugin for cytoscape.js. So it is needed for drawing function which draw chemical structures as node.

Recent version of RDKit provides SVG rendering function of chemical structure. Today I tried to molecular network which shows structures as nodes.

For convenience, I used cyjupyter which is an interactive network visualizer for Jupyter notebook. Referred the discussion, cytoscape.js can render the node with SVG as background-image. I used parse.quote function instead of encodeURLComponent because I wrote the code with python. ;)

Let’s try!
At first, I imported some libraries for drawing network and read sample data which got from rdkit/contrib/mmpa/data.

import networkx as nx
from networkx.readwrite import cytoscape_data
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdChemReactions
from rdkit.Chem.Draw import rdMolDraw2D
import cyjupyter
from cyjupyter import Cytoscape
import os
from urllib import parse
mmp_path = os.path.join(RDConfig.RDContribDir, "mmpa/data/sample_mmps_default.csv")
lines = []
with open(mmp_path, 'r') as f:
    for line in f:
nodes = set()
for line in lines:
nodes = list(nodes)
print(str(len(nodes))+' mols')

Data format of sample_mmps_default.csv is molA, molB, id_mol_A, id_mol_B, transform. So I used molA and molB for input.

Next I defined the converter which transform SMIELS to image. Default setting of parse.quote is safe=”/”. I used safe=”” because I would like to escape the option.

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    drawer = rdMolDraw2D.MolDraw2DSVG(400, 200)
    svg = drawer.GetDrawingText().replace("svg:", "")
    return svg

def smi2image(smi):
    svg_string = smi2svg(smi)
    impath = 'data:image/svg+xml;charset=utf-8,' + parse.quote(svg_string, safe="")
    return impath

Next, make MMP network with networkx and convert the data to cytoscape readable format. It is very simple. Please see below.

g = nx.Graph()
for node in nodes:
    g.add_node(node, img=smi2image(node))
for line in lines:
    g.add_edge(line[0], line[1])
cy_g = cytoscape_data(g)

Now almost there. Define style settings. Key point of the part is setting of background-image of nodes. It can get each node img attribute with ‘data(img)’.

  {'style': [{'css': {
      'background-color': 'blue',
      'shape' : 'rectangle',
      'border-color': 'rgb(0,0,0)',
      'border-opacity': 1.0,
      'border-width': 0.0,
      'color': '#4579e8',
    'selector': 'node'},
            {'css': {
                'width': 20.0,
            'selector': 'edge'}

OK let’s draw network!

cyobj=Cytoscape(data=cy_g, visual_style=stobj[0]['style'], layout_name='cose')

It worked fine. The example is very simple and small. Network can have many attribute on not only nodes but also edges. And use can set many style settings.

Rendering structure on the fly and visualize it as node is useful and interesting tips I think.

I uploaded the code above on my github repo.


Make MMP network and send to cytoscape #chemoinfo

Recently I use cytoscape in my laboratory. You know Cytoscape is nice tool for network visualization.
I often make data with python and import data from cytoscape. The work flow is not so bad but I am thinking that it will be nice if python can communicate with cytoscape.
Fortunately cytocape has REST plugin called cyREST and also python has py2cytoscape to do it!
It sounds nice. I tried to use these libraries.
At first I installed cyREST to my cytoscape (v3.5.1). appmanager => cyREST >
And also I installed chemviz for drawing chemical structure in cytoscape.
Then install py2cytoscape via pip. ;-)
You can access localhost:1234/v1 from web blowser when cytoscape launched if cyREST is successfully installed.

I drew simple MMP network. Code is below.
First, I made MMP from SMILES file by using RDKit MMP script.

$ cat testdata.smi
Oc1ccccc1 phenol
Oc1ccccc1O catechol
Oc1ccccc1N 2-aminophenol
Oc1ccccc1Cl 2-chlorophenol
Nc1ccccc1N o-phenylenediamine
Nc1cc(O)ccc1N amidol
Oc1cc(O)ccc1O hydroxyquinol
Nc1ccccc1 phenylamine
C1CCCC1N cyclopentanol
$ python rfrag.py < testdata.smi> testdata.frag
$ python indexing.py < testdata.frag > testmmp.txt -r 0.2
$ cat testmmp.txt

Now I got MMP data I used the data to make edge of my network and testdata.smi is used to make node data.

Next code is example for communication between python and cytoscape.
At first, import CyRestClient and make connection. Default URL is localhost and port is 1234. But if user would like to use another IP and Port, user can modify from cytoscape.
Edit => Preferences => add => rest.url xxxxx, rest.port xxxx

I used python-igraph for making graph but py2cytoscape handle data generated by networkx, geohi and something.

import igraph
from py2cytoscape.data.cyrest_client import CyRestClient
import py2cytoscape

cy = CyRestClient()
G = igraph.Graph()

with open('testdata.smi', 'r') as vertexis:
    for v in vertexis:
        G.add_vertex(v.split(' ')[0], molname=v.split(' ')[1])

with open("testmmp.txt", "r") as edges:
    for edge in edges:
        G.add_edge(edge.split(",")[0], edge.split(",")[1], transform=edge.split(",")[4])

After making network, go to next step.
network_create_from_igraph method receives data from igraph and send to cytoscape.
Then I set network layout ‘force-directed’.
Finally I set some view style and update the graph settings.

g_cy = cy.network.create_from_igraph(G)
cy.layout.apply(name='force-directed', network=g_cy)

mystyle = cy.style.create('mystyle')

defaults = {
    'NODE_HIGHT': 100,
    'NODE_WIDTH': 100,
    'NODE_LABEL_COLOR': '#555555',
    'EDGE_WIDTH': 20,

cy.style.apply(mystyle, network=g_cy)

View screenshots.
Before run the code, there is no network in cytoscape.

After run the code I could see MMP network without chemical structure.

Finally I set chemviz setting and run paint structure from menu, I could see structure on each node.

And also each node and edge has their own attribute that is set by igraph.
It interesting and useful because all work is done by using only python!

This example is one way python => cytoscape. But the library can send data in both directions.
There are nice documents written in Japanese such like a following URL.