Embed interactive plot in jupyter notebook with panel #chemoinformatics #RDKit #memo #panel

As you know, Jupyter Notebook is a very useful tool for data scientists: it lets you analyze scientific data with a nice view, and there are lots of packages for data visualization. I often use matplotlib and seaborn for my tasks. However, a few days ago I found an interesting package named Panel, which is a high-level app and dashboarding framework for Python. I posted about another package, Dash, before, but I had never used Panel. So I tried to use Panel with RDKit.

Panel can be installed via conda from the pyviz channel or via pip; I installed it with conda.

After installing Panel, I tested it.

At first I tried to visualize an RDKit mol object. Import the packages and load the molecules.

import numpy as np
import pandas as pd
import os
import panel as pn
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import matplotlib.pyplot as plt
plt.style.use('ggplot')

sdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/cdk2.sdf')
mols = [m for m in Chem.SDMolSupplier(sdf)]
for m in mols:
    AllChem.Compute2DCoords(m)

Then I made an IntSlider to set the index of the molecule to render. The following example uses the depends decorator to build an interactive view.

pn.extension()
slider = pn.widgets.IntSlider(start=0, end=len(mols)-1, step=1, value=0)  # end=len(mols)-1 avoids an index error at the slider maximum
@pn.depends(slider.param.value)
def callback(value):
    return Draw.MolToImage(mols[value])
col = pn.Column(slider, callback)
col

Now the molecule is rendered with a slider widget.

It seems to work well. Next I tried to add a range slider to draw several molecules. The code is almost the same as above.

pn.extension()
rangeslider = pn.widgets.IntRangeSlider(start=0, end=len(mols), step=1)
@pn.depends(rangeslider.param.value)
def callback(value):
    return Draw.MolsToGridImage(mols[value[0]: value[1]], molsPerRow=5)
pn.Column(rangeslider, callback)

IntRangeSlider returns a tuple of the user-defined values, so I could select the range of molecules to render in the notebook.

In the next example, I tried to make an interactive scatter plot of molecular properties.

To do this, I calculated molecular descriptors with the RDKit Descriptors module.

from rdkit.Chem import Descriptors
from collections import defaultdict
dlist = Descriptors._descList
desc_dict = defaultdict(list)
for mol in mols:
    for k, f in dlist:
        desc_dict[k].append(f(mol))
df = pd.DataFrame(desc_dict)

The following example uses Select widgets to choose the x and y axes and a FloatSlider to set the alpha of the scatter plot.

from matplotlib.figure import Figure
from matplotlib.backend_bases import FigureCanvasBase
columns = df.columns.to_list()
x = pn.widgets.Select(options=columns, name='x_', value='MolWt')
y = pn.widgets.Select(options=columns, name='y_', value='qed')
alpha = pn.widgets.FloatSlider(name='alpha', value=0.5)

@pn.depends(x.param.value, y.param.value, alpha.param.value)
def get_plot(x_v, y_v, a):
    with plt.xkcd():
        fig = Figure()
        ax = fig.subplots()
        FigureCanvasBase(fig)
        ax.set_xlabel(x_v)
        ax.set_ylabel(y_v)
        ax.scatter(df[x_v], df[y_v], c='blue', alpha = a)
        return fig
pn.Column(pn.Row(x, y, alpha), get_plot)

I updated my matplotlib version to 3.1.3. Matplotlib can make plots with an xkcd taste. ;) It's easy: just build the plot inside the with plt.xkcd(): block.

Now the scatter plot can interactively change the axes and the alpha parameter. I would like to know how to get the index of each point and its value; if I could do that, I could render the molecule when I mouse over a point.
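Matplotlib's pick events can report the index of the point under the cursor, which is the missing piece for that idea. The helper below is a minimal sketch of my own (not part of the original notebook): a plain nearest-point lookup that could back such a callback, with the pick-event wiring shown in comments.

```python
import numpy as np

def nearest_point_index(xs, ys, x, y):
    """Return the index of the scatter point closest to (x, y)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return int(np.argmin((xs - x) ** 2 + (ys - y) ** 2))

# With a matplotlib scatter, the same information comes from a pick event:
#   sc = ax.scatter(xs, ys, picker=True)
#   fig.canvas.mpl_connect('pick_event', lambda event: print(event.ind))

xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 4.0]
print(nearest_point_index(xs, ys, 0.9, 1.1))  # nearest to the point (1.0, 1.0)
```

With the returned index, df.iloc[idx] would give the descriptor row, and the corresponding mol could then be drawn next to the plot.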

I uploaded today's code to my repo.

https://github.com/iwatobipen/playground/blob/master/Panel_interactive_plot.ipynb

The interactive plot can't be viewed on the github repo or nbviewer, so if you are interested in the package I recommend trying it on your own PC. ;)

Example code of DGL for chemoinformatics task #DGL #chemoinformatics #RDKit #memo

There are many publications about graph-based approaches in the chemoinformatics area. I can't cover all of them, but I still have an interest in this area. I think pytorch_geometric (PyG) and the deep graph library (DGL) are very attractive and useful packages for chemoinformaticians.

I have written some posts about DGL and PyG before. Recent DGL is more chemoinformatics friendly, so I used DGL for GCN model building today.

At first I tried to use DGL from skorch, but it failed, so I used DGL on its own for model building. As usual I used the solubility data for the test, used RDKit for compound structure handling, and used GCNClassifier as the model.
To use DGL for chemoinformatics, you should install mdtraj first: mol_to_graph tries to import that package, and if mdtraj can't be found you don't get any error message, but the mol_to_graph related modules can't be used.
https://github.com/dmlc/dgl/blob/master/python/dgl/data/chem/utils/mol_to_graph.py
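A quick stdlib-only way to confirm the dependency is present before importing the DGL chemistry utilities (the helper name is my own):

```python
import importlib.util

def has_module(name):
    """Return True if the named package can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None

print(has_module('mdtraj'))
```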

import os
from rdkit import Chem
from rdkit import RDPaths
import numpy as np

import torch
import dgl
if torch.cuda.is_available():
    print('use GPU')
    device='cuda'
else:
    print('use CPU')
    device='cpu'

from dgl.model_zoo.chem import GCNClassifier
from dgl.data.chem.utils import mol_to_graph
from dgl.data.chem.utils import mol_to_complete_graph
from dgl.data.chem import CanonicalAtomFeaturizer
from dgl.data.chem import CanonicalBondFeaturizer
from torch import nn

import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from torch.nn import CrossEntropyLoss

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

After the data is loaded, the RDKit mol objects are converted to graph objects. I used the canonical atom featurizer for the task; the features used in this module are the same as deepchem's.

It is easy to make a graph object with DGL: just call mol_to_complete_graph. Of course you can use smiles_to_complete_graph for SMILES strings. ;)

atom_featurizer = CanonicalAtomFeaturizer()
n_feats = atom_featurizer.feat_size('h')
print(n_feats)
> 74

ncls = 3
train_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in trainmols]
train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_g = [mol_to_complete_graph(m, node_featurizer=atom_featurizer) for m in testmols]
test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)
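The int64 cast matters: torch's CrossEntropyLoss, used below for training, expects raw logits of shape (N, C) and integer class indices of shape (N,). A tiny self-contained check with toy numbers (not the solubility data):

```python
import torch
from torch.nn import CrossEntropyLoss

# CrossEntropyLoss wants raw (unnormalized) logits of shape (N, C)
# and int64 class indices of shape (N,); it applies log-softmax internally.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2], dtype=torch.int64)
loss = CrossEntropyLoss()(logits, labels)
print(loss.item())
```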

OK, let's define the GCN model. GCNClassifier takes the gcn_hidden_feats argument as a list: to add n GCN layers, pass a list with n hidden-size parameters. I added 2 GCN layers with 60 and 20 hidden units in the following model.

# define GCN NET with 2 GCN layers
gcn_net = GCNClassifier(in_feats=n_feats,
                    gcn_hidden_feats=[60,20],
                    n_tasks=ncls,
                    classifier_hidden_feats=10,
                    dropout=0.5,)
gcn_net = gcn_net.to(device)

Then define a collate function for the DataLoader.

def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    batched_graph.set_n_initializer(dgl.init.zero_initializer)
    batched_graph.set_e_initializer(dgl.init.zero_initializer)
    return batched_graph, torch.tensor(labels)
train_data = list(zip(train_g, train_y))
train_loader = DataLoader(train_data, batch_size=128, shuffle=True, collate_fn=collate, drop_last=True)

Next, train the model. This example is a multiclass classification task, so I used CrossEntropyLoss as the loss function.

loss_fn = CrossEntropyLoss()
optimizer = torch.optim.Adam(gcn_net.parameters(), lr=0.01)
gcn_net.train()
epoch_losses = []
epoch_accuracies = []
for epoch in range(1,201):
    epoch_loss = 0
    epoch_acc = 0
    for i, (bg, labels) in enumerate(train_loader):
        labels = labels.to(device)
        atom_feats = bg.ndata.pop('h').to(device)
        atom_feats, labels = atom_feats.to(device), labels.to(device)
        pred = gcn_net(bg, atom_feats)
        loss = loss_fn(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
        pred_cls = pred.argmax(-1).detach().to('cpu').numpy()
        true_label = labels.to('cpu').numpy()
        epoch_acc += sum(true_label==pred_cls) / true_label.shape[0]
    epoch_acc /= (i + 1)
    epoch_loss /= (i + 1)
    if epoch % 20 == 0:
        print(f"epoch: {epoch}, LOSS: {epoch_loss:.3f}, ACC: {epoch_acc:.3f}")
    epoch_accuracies.append(epoch_acc)
    epoch_losses.append(epoch_loss)

> epoch: 20, LOSS: 0.384, ACC: 0.834
> epoch: 40, LOSS: 0.327, ACC: 0.842
> epoch: 60, LOSS: 0.290, ACC: 0.874
> epoch: 80, LOSS: 0.257, ACC: 0.893
> epoch: 100, LOSS: 0.244, ACC: 0.896
> epoch: 120, LOSS: 0.249, ACC: 0.899
> epoch: 140, LOSS: 0.222, ACC: 0.903
> epoch: 160, LOSS: 0.186, ACC: 0.928
> epoch: 180, LOSS: 0.158, ACC: 0.941
> epoch: 200, LOSS: 0.178, ACC: 0.928

Plot the learning process with matplotlib.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot([i for i in range(1, 201)], epoch_losses, c='b', alpha=0.6, label='loss')
plt.plot([i for i in range(1, 201)], epoch_accuracies, c='r', alpha=0.6, label='acc')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('loss/acc')

The learning curve indicates that the training process works fine.

In summary, DGL is a useful package for chemoinformatics. I hope more chemistry-related example code will be provided in the original repository. ;)

Today’s code can be found in my repository and gist.

https://github.com/iwatobipen/playground/blob/master/gcn_dgl_test.ipynb

Use pytorch for QSAR model building more simply like scikit-learn #pytorch #chemoinformatics #RDKit

I often use pytorch as my deep learning framework. I like pytorch because it is very flexible, and many recent articles use it for their implementations.

But to build and train a model, I need to write the training loop myself. So it would be nice if I could train a pytorch model just by calling fit like in scikit-learn, wouldn't it?

Fortunately, by using skorch, the training process becomes that simple.

Skorch is a scikit-learn compatible neural network library that wraps pytorch. Wow, it's cool! Skorch can be installed via pip.

I installed it and then tried to build a QSAR model with skorch.
At first, import the packages and make the datasets for training and test. For a classification task the features should be float32 and the class indices int64; float64 features will cause an error.

import os
from rdkit import Chem
from rdkit import RDPaths
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
import numpy as np

trainsdf = os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.train.sdf')
testsdf =  os.path.join(RDPaths.RDDocsDir, 'Book/data/solubility.test.sdf')
prop_dict = {
    "(A) low": 0,
    "(B) medium": 1,
    "(C) high": 2
}

def fp2arr(fp):
    arr = np.zeros((0,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


nbits = 1024
ncls = 3

trainmols = [m for m in Chem.SDMolSupplier(trainsdf)]
testmols = [m for m in Chem.SDMolSupplier(testsdf)]

train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in trainmols]
train_X = np.array([fp2arr(fp) for fp in train_fps], dtype=np.float32)

train_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in trainmols])
train_y = np.array(train_y, dtype=np.int64)

test_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=nbits) for m in testmols]
test_X = np.array([fp2arr(fp) for fp in test_fps], dtype=np.float32)

test_y = np.array([prop_dict[m.GetProp('SOL_classification')] for m in testmols])
test_y = np.array(test_y, dtype=np.int64)

Then import pytorch and skorch, and check whether a GPU is available.

import torch
if torch.cuda.is_available():
    print('use GPU')
    device='cuda'
else:
    print('use CPU')
    device='cpu'

from torch import nn
import torch.nn.functional as F
from skorch import NeuralNetClassifier

Next, define a simple multi-layer perceptron model. The following example is a multiclass classification, so I used a softmax layer as the last layer.

class SimpleMLP(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(SimpleMLP, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, ncls)
    
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = F.softmax(self.output(x), dim=-1)
        return x

Next, make a NeuralNetClassifier with the defined model. It is very easy to use a GPU: just pass the device='cuda' option to the object, that's all! ;-) After that, let's train the model with the fit method.

net = NeuralNetClassifier(SimpleMLP,
                         max_epochs=100,
                         lr=0.1,
                         iterator_train__shuffle=True,
                         device=device
                         )

net.fit(train_X, train_y)

 epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        1.1241       0.3902        1.1102  0.1851
      ----    snip---------
    100        0.0831       0.6146        1.0327  0.0251
Out[10]:
[initialized](
  module_=SimpleMLP(
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=3, bias=True)
  ),
)

OK, let's check the model performance.

pred = net.predict(test_X)
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(test_y, pred)
array([[83, 18,  1],
       [35, 72,  8],
       [ 0, 11, 29]])

print(classification_report(test_y, pred))

              precision    recall  f1-score   support

           0       0.70      0.81      0.75       102
           1       0.71      0.63      0.67       115
           2       0.76      0.72      0.74        40

    accuracy                           0.72       257
   macro avg       0.73      0.72      0.72       257
weighted avg       0.72      0.72      0.71       257

It seems to work fine.

Then I tried to build a regression model; NeuralNetRegressor is used for that. The model structure is almost the same as above, but the last layer is a little bit different.

from skorch import NeuralNetRegressor
class MLP_regressor(nn.Module):
    def __init__(self, num_units=512, nonlin=F.relu):
        super(MLP_regressor, self).__init__()
        self.dense0 = nn.Linear(nbits, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, 10)
        self.output = nn.Linear(10, 1)
    
    def forward(self, x, **kwargs):
        x = self.nonlin(self.dense0(x))
        x = self.dropout(x)
        x = F.relu(self.dense1(x))
        x = self.output(x)
        return x

regressor = NeuralNetRegressor(MLP_regressor,
                              max_epochs=100,
                              iterator_train__shuffle=True,
                              device=device)

Then make the dataset for the regression task.

train_y_reg = np.array([float(m.GetProp('SOL')) for m in trainmols], dtype=np.float32).reshape(-1,1)
test_y_reg = np.array([float(m.GetProp('SOL')) for m in testmols], dtype=np.float32).reshape(-1,1)
regressor.fit(train_X, train_y_reg)
  epoch    train_loss    valid_loss     dur
-------  ------------  ------------  ------
      1        8.3883       17.1863  0.0306
          ------snip------
    100        0.3729        2.7019  0.0245
[initialized](
  module_=MLP_regressor(
    (dense0): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (dense1): Linear(in_features=512, out_features=10, bias=True)
    (output): Linear(in_features=10, out_features=1, bias=True)
  ),
)

Check the model.

reg_y = regressor.predict(test_X)
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(np.arange(-9,2), np.arange(-9,2), color='pink', alpha=0.7)
plt.scatter(reg_y, test_y_reg)
plt.xlabel('predicted')
plt.ylabel('exp')
from sklearn.metrics import r2_score
print(f"R2 score of testset was {r2_score(test_y_reg, reg_y):.2f}")
> R2 score of testset was 0.65

The current models have room for improvement because I didn't optimize the model structure or hyperparameters. ;)

In summary, skorch is a very useful package for anyone who would like to write model building and training code quickly.

I just uploaded today's code to my github repo.

And it can run on binder.


Make molecule mesh data #RDKit #chemoinformatics #meshlab

I have an interest in building predictive models with 3D compound information. Pytorch3d and open3d seem like attractive packages for this. However, to use these packages, I need to convert molecular information to 3D data such as point clouds.

At first I tried openbabel, because recent versions of openbabel can convert a molecule from MDL mol format to a VDW point cloud xyz format. It worked well, but it was difficult to add color information such as the atom kind.

So I gave up on openbabel and shifted to pymol.

I installed open source pymol from source code into my python3.7 env and tried it. Fortunately pymol can save surface 3D data in wrl format. OK, sounds good ..... Wait! Unfortunately open3d can't read wrl format files, so I needed to convert the wrl file to ply (point cloud) format.

Finally I found the solution: use meshlab! It is easy to install meshlab on linux. Just use apt. ;)

$sudo apt -yV install meshlab

Type meshlab in the terminal and the meshlab application will launch. The app can also convert file formats, but I would like to do that from python, so I called meshlabserver from subprocess.
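Before shelling out, it is worth checking that the binary is actually on the PATH. A small stdlib-only helper (my addition, not in the original post):

```python
import shutil

def find_tool(name):
    """Return the full path to a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)

print(find_tool('meshlabserver'))
```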

meshlabserver is useful for batch processing. The following code is an example of reading 3D mesh data generated from rdkit.

At first, load a molecule from SMILES and generate a 3D conformer, then save it in MDL mol format. Next, load the file from pymol in library mode.

Then call 'show surface' and change the quality of the surface.

Finally, save the surface in wrl format. It is easy: just call 'save hogehoge.wrl' in pymol. ;)

import subprocess
import open3d as o3d
from rdkit import Chem
from rdkit.Chem import AllChem
import pymol
mol = Chem.MolFromSmiles('c1ccncc1C')
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, randomSeed=794) # to generate same coordination of molecule
Chem.MolToMolFile(mol, 'mol.mol')
pymol.cmd.load('mol.mol')
pymol.cmd.show('surface')
pymol.cmd.set('surface_quality', '3')
pymol.cmd.save('surface.wrl')

Now I could get the molecular surface data as wrl. Then convert the file format: call the meshlabserver command with the -i inputfile, -o outputfile and -om vc options.

-om vc means adding vertex color information to the output file. In the following code I would like to get ply format, so -o is surface.ply.

subprocess.call('meshlabserver -i surface.wrl -o surface.ply -om vc'.split())

Ready. Open3d can read ply format. (pytorch3d can read this file format too!)

giom=o3d.io.read_point_cloud('surface.ply')
giom.has_colors()
> True

Open3d has visualization method.

o3d.visualization.draw_geometries([giom])

After calling the command above, another window opens and I can see an interactive 3d point cloud image of the query molecule.

I tried several surface quality settings. With the default pymol setting, very sparse data is generated.

Now ready to dive into 3D information based deep learning. ;)

Use RDKit from NIM language #RDKit #Nim #memo

Recently I bought a book titled 'Nim in Action' and started to learn the Nim language. I had never touched a language like Nim, which requires compilation to run the code.

I was interested in nim-lang because many documents say Nim is speedy and efficient, and the grammar looks like python. Also, an rdkit binding is available! The binding is still under development, but it's good news for chemoinformaticians. Thanks to Axel Pahl for his great work!

Now rdkit-nim offers some chemoinformatics functions, so let's try to use it.

At first I installed nim-lang. rdkit-nim requires nim version 1.1.1 or higher, which isn't the stable version, so I installed choosenim and switched the version to devel.

$ curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Add path to bashrc
# export PATH=/home/username/.nimble/bin:$PATH
# Then switch to devel from stable version.
$ choosenim devel
$ nim -v
Nim Compiler Version 1.1.1 [Linux: amd64]
Compiled at 2020-02-05
Copyright (c) 2006-2019 by Andreas Rumpf

Next, install rdkit_nim. The procedure is described in the README.md of the original repository.

#Add rdkit path to bashrc
# export RDKIT_NIM_CONDA=/home/username/anaconda3/envs/myenv

$ git clone https://github.com/apahl/rdkit_nim.git
$ cd rdkit_nim
$ nimble install
$ nimble test
  Executing task test in /home/username/dev/nim_dev/rdkit_nim/rdkit.nimble
    [mol.nim]             passed.
    [qed.nim]             passed.
    [sss.nim]             passed.

    All tests passed.

OK, all tests passed. ;-)

Let's build a simple example which calculates the QED of given molecules.

Code is below.

#calc_qed.nim
import rdkit / [mol, qed]
let smiles_list = [
    "CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N",
    "CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N"
    ]

for smi in smiles_list:
  echo(smi)
  let m = molFromSmiles(smi)
  let a = qedDefault(m)
  echo("QED")
  echo(a)

Current rdkit-nim supports the mol, descriptors, sss and qed modules. The code above imports the rdkit/mol and rdkit/qed modules.

Before building the code, a nim script file is required.

#calc_qed.nims

import os # `/`

let
  fileName = "calc_qed"
  condaPath = getEnv("RDKIT_NIM_CONDA")

task build, "Building default cpp target...":
  switch("verbosity", "0")
  # switch("hint[Conf]", "off")
  switch("hints", "off")
  switch("out", "test2/bin" / toExe(fileName))
  switch("run")
  switch("passL", "-lstdc++")
  switch("passL", "-L" & condaPath & "/lib")
  switch("passL", "-lRDKitGraphMol")
  switch("passL", "-lRDKitDescriptors")
  switch("passL", "-lRDKitSmilesParse")
  switch("passL", "-lRDKitChemTransforms")
  switch("passL", "-lRDKitSubstructMatch")
  switch("cincludes", condaPath & "/include/rdkit")
  switch("cincludes", condaPath & "/include")
  setcommand "cpp"

Almost there, let’s compile the code.

$ nim build calc_qed.nim
Hint: used config file '/home/username/.choosenim/toolchains/nim-#devel/config/nim.cfg' [Conf]
Hint: used config file '/home/username/.choosenim/toolchains/nim-#devel/config/config.nims' [Conf]
Hint: used config file '/home/username/dev/nim_dev/rdkit_nim/test2/calc_qed.nims' [Conf]
CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N
QED
0.9284511020269033
CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N
QED
0.5330554587335657

After that I got the compiled binary, test2/bin/calc_qed.

Let's run it.

$ ./test2/bin/calc_qed 
CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N
QED
0.9284511020269033
CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N
QED
0.5330554587335657

Works fine!

In this code I embedded the molecules as SMILES, but if the code loaded SMILES from a text file, the calculation targets could be changed dynamically. That would be a more practical application.

I'm still new to Nim, so I would like to learn it more and use the package to solve my chemoinformatics tasks!

The original repo provides many examples, and users can learn many things from the site. Please check it!

Scaffold growing with RNN #RDKit #Pytorch #Chemoinformatics

My favorite molecular generator is REINVENT, a SMILES-RNN based generator, because it is very flexible and easy to modify.

And recently the same group at Astrazeneca published a new version of REINVENT; its title is SMILES-Based Deep Generative Scaffold Decorator for De-Novo Drug Design.

It seems very exciting to me! There are many molecular generators these days, but there are few implementations for scaffold growing. Some approaches use a conditional RNN or a conditional graph-based approach, but they can't specify the positions of the growing substituents. However, this implementation can decorate user-defined positions.

Fortunately the code is available from github!

Their approach has two models: one is a scaffold generator and the other is a scaffold decorator. The scaffold decorator is trained with parts of compounds which pass the Rule of 3. I think this is an important point, because the idea prevents generation of too-large, undruggable compounds.

OK, let's test the code. First, this code uses pyspark, so you need to install spark and pyspark. Note that pyspark works with java version 8, not version 11, so please check your java version. Another key point is that an attention mechanism is used in the decorator model; with attention, the model can find the positions where substituents should be attached.

I think it is very reasonable and efficient approach for scaffold decoration.

You can read more details in the original article, so I'll skip the explanation and go straight to the code!

Are you ready? Let’s go.

The following code is almost the same as the README.md in the original repo. I used quinoline as an example scaffold and used the sample trained model for convenience.

$> git clone 
$> cd reinvent-scaffold-decorator
$> echo "c1c([*:0])cc2c(c1)c([*:1])ccn2" > scaffold.smi
$> ./sample_scaffolds.py -m drd2_decorator/models/model.trained.50 -i scaffold.smi -o generated_molecules.parquet -r 32 -n 32 -d multi

The code above takes a few minutes in my env (GPU: GTX 1650); a GPU machine is recommended. After running the code, the generated molecules are stored in a parquet file.

The file can be read by using pyspark. The result image is below. As you can see, every decorated molecule has two substituents at the defined positions. I posted about scaffold and linker cutting methods before; it seems interesting to apply that code here, I think. ;)

And the whole code can be found in my gist.

RDKit + deep learning framework is good friend for chemoinformatics! Let’s enjoy!

Cut molecule to ring and linker with RDKit #RDKit #chemoinformatics #memo

Sometimes chemists analyze a molecule as small parts such as core, linker and substituents.

RDKit has functions for molecular decomposition: RECAP, BRICS and rdMMPA. They are useful, but these functions can't directly extract the core and linker from a molecule.

I was interested in how to do it and gave it a try.

In the following code, the core is defined as the rings in the molecule, and the linker is defined as the chain which connects two rings.

I defined a main function named 'getLinkerbond', which extracts the single bonds that connect a ring atom to a chain atom, as well as bonds that directly connect two different rings.

Then Chem.FragmentOnBonds is called to get fragments by cutting at those linker bonds.

Let's see the code. First, import some libraries and define the test molecules.

import copy
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from IPython.display import display

mol1 = Chem.MolFromSmiles("C1CCNCC1")
mol2 = Chem.MolFromSmiles("c1ccccc1")
mol3 = Chem.MolFromSmiles("C1CC1CC")
mol4 = Chem.MolFromSmiles("C1CC1CCC2CCOC2")
mol5 = Chem.MolFromSmiles("C1CCC2CCCCC2C1")
mol6 = Chem.MolFromSmiles("OC1CC2(C1)CCCC2")
mol7 = Chem.MolFromSmiles("CC1CCN(CC1N(C)C2=NC=NC3=C2C=CN3)C(=O)CC#N")
mol8 = Chem.MolFromSmiles("c1ccccc1c2ccccc2OC")
mol9 = Chem.MolFromSmiles("CC1=C(C=C(C=C1)NC(=O)C2=CC=C(C=C2)CN3CCN(CC3)C)NC4=NC=CC(=N4)C5=CN=CC=C5")
mol10 = Chem.MolFromSmiles("C1CCCC1CCCOCOCOCOCOCc1ccccc1")
mols = [mol1,mol2,mol3,mol4,mol5,mol6,mol7,mol8,mol9,mol10]

Then define the main function. I wrote a helper function for it: is_in_samering checks whether the two atoms of a bond belong to the same ring.

def is_in_samering(idx1, idx2, bond_rings):
    for bond_ring in bond_rings:
        if idx1 in bond_ring and idx2 in bond_ring:
            return True
    return False

In the following code, I set an atom prop first to keep the original atom index, because GetScaffoldForMol returns a new molecule with new atom indices and I would like to use the original indices in this function.

def getLinkerbond(mol, useScaffold=True):
    res = []
    for atom in mol.GetAtoms():
        atom.SetIntProp("orig_idx", atom.GetIdx())
    for bond in mol.GetBonds():
        bond.SetIntProp("orig_idx", bond.GetIdx())
    
    if useScaffold:
        mol = MurckoScaffold.GetScaffoldForMol(mol)
        
    ring_info = mol.GetRingInfo()
    bond_rings = ring_info.BondRings()
    ring_bonds = set()
    for ring_bond_idxs in bond_rings:
        for idx in ring_bond_idxs:
            ring_bonds.add(idx)
    all_bonds_idx = [bond.GetIdx() for bond in mol.GetBonds()]
    none_ring_bonds = set(all_bonds_idx) - ring_bonds
    for bond_idx in none_ring_bonds:
        bgn_idx = mol.GetBondWithIdx(bond_idx).GetBeginAtomIdx()
        end_idx = mol.GetBondWithIdx(bond_idx).GetEndAtomIdx()
        if mol.GetBondWithIdx(bond_idx).GetBondTypeAsDouble() == 1.0:
            if mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 1:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
            elif not is_in_samering(bgn_idx, end_idx, bond_rings) and mol.GetAtomWithIdx(bgn_idx).IsInRing()+mol.GetAtomWithIdx(end_idx).IsInRing() == 2:
                bond = mol.GetBondWithIdx(bond_idx)
                orig_idx = bond.GetIntProp("orig_idx")
                res.append(orig_idx)
    return res

Check the function.

for mol in mols:
    bonds = getLinkerbond(mol)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

The function cuts on the Murcko scaffold structure by default, so bonds that do not connect rings are not cut.

Next is the case of useScaffold=False. In this case the cutting is applied to the whole molecule, not just the Murcko scaffold, so bonds to terminal substituents on rings are cut as well.

for mol in mols:
    bonds = getLinkerbond(mol, False)
    if bonds:
        res = Chem.FragmentOnBonds(mol, bonds)
        display(res)
    else:
        display(mol)

Works fine.

I think it is useful for molecular decomposition at the linker site.
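Once the linker bonds are cut, Chem.GetMolFrags splits the result into separate pieces. A minimal self-contained sketch on anisole (my own toy example, which finds ring-to-chain bonds directly instead of calling getLinkerbond):

```python
from rdkit import Chem

# anisole: one ring with an O-CH3 chain attached
mol = Chem.MolFromSmiles('c1ccccc1OC')
# bonds where exactly one end atom is in a ring
cut_bonds = [b.GetIdx() for b in mol.GetBonds()
             if b.GetBeginAtom().IsInRing() != b.GetEndAtom().IsInRing()]
fragmented = Chem.FragmentOnBonds(mol, cut_bonds)
frags = Chem.GetMolFrags(fragmented, asMols=True)
print([Chem.MolToSmiles(f) for f in frags])
```

Each fragment keeps a dummy atom marking the cut position, which is handy for reassembly or matched pair analysis.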

Today's whole code is uploaded to my gist. Any comments and advice will be highly appreciated. ;)