Make interactive dashboard with Dash2 #chemoinformatcs #RDKit

Still I am playing with dash. And I would like to write document about py4chemoinfomatics in this weekend (tonight).

Before I posted about how to make interactive 2D plot with Dash. And today I added file up loader and structure renderer to the code.

All code can find following URL.
https://github.com/iwatobipen/chemo_dash

For usage in closed environment, I make local css server in the code. Dash uses flask for web framework so, user can easy to make additional page like flask. Here is whole code of the demo.
I defined smiles to svg image and supply the SVG string to dash_dangerously_set_inner_html (dhtml). DHTML is not secure because the function can directly embed HTML tag to html.Div. I used the method because I would like to draw mol SVG on the fly.

import flask
import base64
import io
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_dangerously_set_inner_html as dhtml
import dash_table
from dash.dependencies import Input, Output

import os
import argparse
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import rdDepictor
from sklearn.decomposition import PCA

import plotly.graph_objs as go
descdict = dict(Descriptors._descList)

app = dash.Dash(__name__)

css_directory = os.getcwd()+'/static/'
stylesheets = ['bWLwgP.css']
static_css_route = '/static/'

def smi2svg(smi):
    mol = Chem.MolFromSmiles(smi)
    rdDepictor.Compute2DCoords(mol)
    mc = Chem.Mol(mol.ToBinary())
    Chem.Kekulize(mc)
    drawer = Draw.MolDraw2DSVG(300,300)
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText().replace('svg:','')
    return svg

def parse_contents(contents):
    content_type, content_string = contents.split(',')
    decoded = base64.b64decode(content_string)
    with open('./static/temp.sdf', 'w') as f:
        f.write(decoded.decode('utf-8'))
    mols = [mol for mol in Chem.SDMolSupplier('./static/temp.sdf')]
    return mols

@app.server.route('{}'.format(static_css_route))
def serve_stylesheet(stylesheet):
    if stylesheet not in stylesheets:
        raise Exception()
    return flask.send_from_directory(css_directory, stylesheet)

for stylesheet in stylesheets:
    app.css.append_css({"external_url": "/static/{}".format(stylesheet)})


app.layout = html.Div(children=[
    html.H1(children='Hello Chemoinfo'),
    dcc.Upload(
        id='upload-data',
        children=html.Div(
            ['Drag and Drop SDF or ',
            html.A('Select Files')
            ]),
            style={
                    'width': '80%',
                    'height': '60px',
                    'lineHeight': '60px',
                    'borderWidth': '1px',
                    'borderStyle': 'dashed',
                    'borderRadius': '5px',
                    'textAlign': 'center',
                    'margin': '10px'
            },
            multiple=True
        ),
    
    html.Div(children='''
    Dash : sample plot
    '''),

    html.Div([dcc.Dropdown(id='x-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
              dcc.Dropdown(id='y-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
                           ]),
    html.Div([
        html.Div([html.Div(id="molimg")], className="four columns"),
        html.Div([dcc.Graph(id='example-graph')], className="eight columns")
        ], className="row"),
    #html.Div([dcc.Graph(id='chemical-space')])
    ])

@app.callback(
    Output('example-graph', 'figure'),
     
    [Input('upload-data', 'contents'),
     Input('x-column', 'value'),
     Input('y-column', 'value')]
)
def update_graph(contents, x_column_name, y_column_name):
    mols = parse_contents(contents[0])
    for i, mol in enumerate(mols):
        AllChem.Compute2DCoords(mol)
    x = [descdict[x_column_name](mol) for mol in mols]
    y = [descdict[y_column_name](mol) for mol in mols]

    return {'data':[go.Scatter(
        x=x,
        y=y,
        #text=['mol_{}'.format(i) for i in range(len(mols))],
        text=[Chem.MolToSmiles(mol) for mol in mols],
        mode='markers',
        marker={
            'size':15,
            'opacity':0.5
        }
    )],
    'layout':go.Layout(
        xaxis={'title':x_column_name},
        yaxis={'title':y_column_name}
    )}


@app.callback(
    Output('molimg', 'children'),
    [Input('example-graph', 'hoverData'),
    ]
)
def update_img(hoverData1):
    try:
        svg = smi2svg(hoverData1['points'][0]['text'])
    except:
        svg = 'Select molecule'
    return dhtml.DangerouslySetInnerHTML(svg)

if __name__=='__main__':
    app.run_server(debug=True)

Let’s run the app

iwatobipen$ python app.py

Official document supplies some examples for multiple type of file uploading. My code is not smart because I could not handle SDMolSupplier with StringIO so I saved temp.sdf and reload it. Any advice and comments are appreciated about how to handle uploaded sdf dynamically in the web app. And also I would like to know how to handle multiple out put in call back function.

Advertisements

Make interactive dashboard with Dash #chemoinformatcs #RDKit

I am happy that I could have fruitful discussions in mishima.syk #13 held on last Saturday.
And I knew the new package for data visualization named dash.
From the document, dash can provide interactive dashboard for data analysis. I read the documents and wrote sample code.

Following code provides 2D scatter plot with two descriptors which are calculated with RDKit. Callback function of Dash can make interactive response.

It is worth to know that by using dash, I can make cool web app with out JS coding!

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output

import os
import argparse
from rdkit import Chem
from rdkit.Chem import Descriptors
import plotly.graph_objs as go
descdict = dict(Descriptors._descList)
app = dash.Dash(__name__)

def getParser():
    parser = argparse.ArgumentParser()
    parser.add_argument('--sdf', default='cdk2.sdf', type=str)
    return parser

app.layout = html.Div(children=[
    html.H1(children='Hello Chemoinfo'),
    html.Div(children='''
    Dash : sample plot
    '''),

    html.Div([dcc.Dropdown(id='x-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
              dcc.Dropdown(id='y-column',
                           value='MaxEStateIndex',
                           options=[{'label': key, 'value': key} for key in descdict.keys()],
                           style={'width':'48%', 'display':'inline-block'}),
                           ]),
    dcc.Graph(id='example-graph'),
    ])

@app.callback(
    Output('example-graph', 'figure'),
    [Input('x-column', 'value'),
     Input('y-column', 'value')]
)
def update_graph(x_column_name, y_column_name):
    args = getParser().parse_args()
    mols = [mol for mol in Chem.SDMolSupplier(args.sdf) if mol != None]
    x = [descdict[x_column_name](mol) for mol in mols]
    y = [descdict[y_column_name](mol) for mol in mols]
    return {'data':[go.Scatter(
        x=x,
        y=y,
        text=[Chem.MolToSmiles(mol) for mol in mols],
        mode='markers',
        marker={
            'size':15,
            'opacity':0.5
        }
    )],
    'layout':go.Layout(
        xaxis={'title':x_column_name},
        yaxis={'title':y_column_name}
    )}
if __name__=='__main__':
    app.run_server(debug=True)

Let’s run the app

iwatobipen$ python app.py

You can see interactive plot. This is very simple example for dash usage. I would like to make more content rich dashboard for chemoinformatics near the future.

Try GCN QSPR with pytorch based graph library #RDKit #Pytorch #dgl

Recently many machine learning articles use pytorch for their implementation. And I found very attractive package for graph based deep learning, named ‘DGL;Deep Graph Library’. The package supports pytorch and mxnet for backend. The author provides not only package but also very nice documentation. I read the document and try GCN for QSPR with DGL.

https://www.dgl.ai/pages/about.html

The package can make graph object from networkx and can convert graph object to networkx object.
Following code example is based their example about batched graph convolution.
https://docs.dgl.ai/tutorials/basics/4_batch.html
At first, import packages which will use.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim

import networkx as nx
import copy
import os
from rdkit import Chem
from rdkit.Chem import RDConfig
import numpy as np

Next define mol2graph function. Following example only node features used bond features did not used.

# Following code borrowed from dgl's junction tree example.
ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']

ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM =60
BOND_FDIM = 5 
MAX_NB = 10

PAPER = os.getenv('PAPER', False)

def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]

# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])

def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))

def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []

    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)

        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)

            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs

Next defined the original collate function for data loader. And defined msg and reduce function.
msg function get message from neighbor node and reduce function aggregates the massages.

msg = fn.copy_src(src="h", out="m")
def collate(sample):
    graphs, labels = map(list,zip(*sample))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)
def reduce(nodes):
    # summazation by avarage is different part
    accum = torch.mean(nodes.mailbox['m'], 1)
    return {'h': accum}

Then defined the network. By using the dgl user can easily access node and features. For example graph.ndata[‘name’] method can access node features named ‘name’. NodeApplyModule is used for calculation of each node.
It is worth to know that by using torch nn.ModuleList, user can write network like ‘Keras’.

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    
    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}
    

class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h =  g.ndata.pop('h')
        #print(h.shape)
        return h
    
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

Let’s load data. I used solubility dataset in RDKit

solcls = {'(A) low':0, '(B) medium':1, '(C) high':2}
train_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.train.sdf'))]
train_y = [solcls[m.GetProp('SOL_classification')] for m in train_mols]
test_mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/solubility.test.sdf'))]
test_y = [solcls[m.GetProp('SOL_classification')] for m in test_mols]
train_graphs = mol2dgl_single(train_mols)
test_graphs = mol2dgl_single(test_mols)

dataset = list(zip(train_graphs, train_y))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)

Finally run train and check the performance.

model = Classifier(ATOM_FDIM, 256, len(solcls))
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)        
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)

>Epoch 20, loss 0.6104
>Epoch 40, loss 0.5616
>Epoch 60, loss 0.5348
>Epoch 80, loss 0.5095
>Epoch 100, loss 0.4915
>Epoch 120, loss 0.5163
>Epoch 140, loss 0.5348
>Epoch 160, loss 0.4385
>Epoch 180, loss 0.4421
>Epoch 200, loss 0.4318
plt.plot(epoch_losses, c='b')
model.eval()
test_bg = dgl.batch(test_graphs)
test_y_tensor = torch.tensor(test_y).float().view(-1,1)
test_bg.set_e_initializer(dgl.init.zero_initializer)
test_bg.set_n_initializer(dgl.init.zero_initializer)
logit = model(test_bg)
probs = torch.softmax(logit, 1).detach().numpy()
pred_y = np.argmax(probs,1)

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)
>0.7587548638132295
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.70      0.86      0.78       102
           1       0.79      0.64      0.71       115
           2       0.87      0.82      0.85        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.79      0.78      0.78       257
weighted avg       0.77      0.76      0.76       257

Hmm not so bad. OK I tried random forest next. I would like to use RDKit new descriptor, ‘dragon-type descriptor’. So I used it ;)

https://github.com/rdkit/UGM_2018/blob/master/Notebooks/Landrum_Whats_New.ipynb

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
# generate 3D conf
train_mols2 = copy.deepcopy(train_mols)
test_mols2 = copy.deepcopy(test_mols)

ps = AllChem.ETKDGv2()
for m in train_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
for m in test_mols2:
    m = Chem.AddHs(m)
    AllChem.EmbedMolecule(m,ps)
def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

For convenience I use only one conformer the above code.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)
rf_pred_y = rfc.predict(test_X)
accuracy_score(test_y, rf_pred_y)
>0.7587548638132295
print(classification_report(test_y, rf_pred_y))
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       102
           1       0.79      0.66      0.72       115
           2       0.67      0.75      0.71        40

   micro avg       0.76      0.76      0.76       257
   macro avg       0.74      0.76      0.75       257
weighted avg       0.76      0.76      0.76       257

Random Forest showed same performance with GCN in this test.
DGL is very useful package for graph based deeplearning I think. Their original repo provides many example codes! Reader who is interested in pls check the repo.

I uploaded today’s code my git hub and it can check following URL.

https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/GCN_chemo.ipynb

Draw HOMO LUMO with psikit-2 #chemoinformatics

Yesterday I wrote a post about psikit function for HOMO LUMO drawing. And @fmkz__ gave me very suggestive comment.

He performed QM calculation about tetrazole and disclosed the its distribution of negative charge.

I had interested in his suggestion and I tried it. By using psikit, I calculated energy of acetic acid and 5-methyl-1H-tetrazole and got HOMO LUMO cube files from the results.

Results is below. At first HOMO and LUMO of acetic acid.

HOMO of acetic acid
LUMO of acetic acid

These results shows that carbonyl C=O has large MO and it means carbonyl is most reactive position of the acetic acid. And also LUMO around the alpha carbon is large. It indicates that the site is reactive with electrophile I think.

Next, let’s see result of tetrazole. Tetrazole is often used in medicinal chemistry as a bioisoster of carboxylic acid.

HOMO of tetrazole
LUMO of tetrazole

These results shows 3-N position has large MO compared to other nitrogen atoms. I think it indicates that 3-N position is more reactive than other nitrogen. And I found good example to explain the result in slide share. URL is below.

https://www.slideshare.net/ShrikantPharande/tetrazole-and-triazole-67637231
Page 21 of the slide shows example of 5-sub tetrazoles alkylation.
The page shows example of reaction between 5-Me-1H-tetrazole and trityl-Cl. And most of alkylation occurred at n-3 position.

Borrowed from slide share

It is very exciting for me because I could know that QM is very useful tool for reactivity prediction. The result motivates me to enhance of psikit. ;-)

Any suggestion and advices are highly appreciated.

Draw HOMO LUMO with psikit #RDKit #Psi4 #PyMol

Now I and @fmkz___ are developing a thin wrapper library for Psi4 and RDKit named psikit.
Today I added new function for viewing molecular orbital HOMO and LUMO with pymol. Psi4 has the function which can output cube format file of MO named ‘cubeprop’. So I try to implement the function in to psikit.
By using the library, you can get HOMO/LUMO cube file easily.
Example is below.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from psikit import Psikit
smi = 'COc1cccnc1'
mol = Chem.MolFromSmiles(smi)
AllChem.Compute2DCoords(mol)
mol

Just call ‘getMOview’ after optimization. It took a little bit long time on my PC.

pk = Psikit()
pk.read_from_smiles(smi)
%time pk.optimize()
>Optimizer: Optimization complete!
>CPU times: user 1min 54s, sys: 3.97 s, total: 1min 58s
>Wall time: 31.5 s
>-360.58490537900167
# get MO view!
pk.getMOview()

After executing the method, psikit makes 5 files named target.mol, Psi_a_n_n-A.cube, Psi_b_n_n-A.cube, Psi_a_n+1_n+1-A.cube, Psi_b_n+1_n+1-A.cube. First file is mol file. Second and third is information of its HOMO and forth, fifth is information of LUMO.

Then load all files from pymol. Change HOMO/LUMO object show setting as dot. I could get following image.

Optimized molecule
Molecule and HOMO
Molecule and LUMO
Molecule and HOMO/LUMO

The function and pymol seem to work fine. But I want to compare data between this result and another QM’s result.
Current version of psikit is still under development and some bugs. I would like to fix and tune the code.

Call Knime from Jupyter notebook! #Chemoinformatics #RDKit #Knime

I read exiting blog post yesterday! URL is below.
https://www.knime.com/knime-blog-general
@dr_greg_landrum developed very cool tools which can call knime from jupyter notebook and can execute jupyter notebool from knime.

Details of the tool is described in the Knime blog post. I am interested the tool and I can’t wait to try it in myself. So I used it from my mac book pro. At first I installed python knime package via pypi.

iwatobipen$ pip install knime

Now ready. I tried to make sample work flow. My workflow receives SMILES strings from jupyter and calculates RDKit descriptors, normalize then and return the result to notebook.

After that, I build regression model for solubility and apply it to test data. Dataset is supplied from rdkit Book/data folder.

OK, let’s go to the code. Following code is referenced Gregs blog post URL is above. First import packages.

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import knime
style.use('ggplot')

Next, prepare dataset for training and test. Type of data which passes Knime is pandas dataframe.

train_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
test_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = [m for m in Chem.SDMolSupplier(train_path)]
train_y = np.asarray([m.GetProp('SOL') for m in train_mols], dtype=np.float32)
train_table = {'smiles':[Chem.MolToSmiles(m) for m in train_mols]}
train_df = pd.DataFrame(train_table)

test_mols =  [m for m in Chem.SDMolSupplier(test_path)]
test_y = np.asanyarray([m.GetProp('SOL') for m in test_mols], dtype=np.float32)
test_table = {'smiles':[Chem.MolToSmiles(m) for m in test_mols]}
test_df = pd.DataFrame(test_table)

Next, define Knime executable path and workspace path. My env is Mac so it is a little bit different to original blog post.

#My Knime env uploaded to 3.7 from 3.6.
knime.executable_path = '/Applications/KNIME 3.6.1.app/Contents/MacOS/Knime'
workspace = '/Users/iwatobipen/knime-workspace/'

Then check workflow. I made following workflow in advance. And I could see image of the WF on notebook. ;)

workflow = 'jupyter_integration'
knime.Workflow(workflow_path=workflow, workspace_path=workspace)

Now ready, let’s run the WF for descriptor calculation!

# training data
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = train_df
    wf.execute()
train_x = wf.data_table_outputs[0]

# test data
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = test_df
    wf.execute()
test_x = wf.data_table_outputs[0]

I could get dataset for build regression model and test. So I fit SVR of sklearn and test the model performance.

svr = SVR()
svr.gamma = 'auto'
svr.fit(train_x, train_y)

pred = svr.predict(test_x)
print(r2_score(test_y, pred))
print(mean_squared_error(test_y, pred))
>0.703289693583441
>1.2055499665537373

It seems not so bat. Check the performance with matplotlib.

a, b = min(test_y), max(test_y)
data = np.linspace(a,b, num=100)
plt.scatter(pred, test_y, c='b', alpha=0.5)
plt.plot(data, data, c='r')
plt.xlabel('pred')
plt.ylabel('actual')

The model can predict solubility of test molecules with high accuracy. In summary, integration Knime and Jupyter notebook has high potential for chemoinformatics I think. Because jupyter has flexibility and knime is powerful tool for routine work.
Whole code can view from following URL.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/Knime_test.ipynb

Plot Chemical space with d3js based library #RDKit #Chemoinformatics

Recently I posted making interactive plot on jupyter notebook.
https://iwatobipen.wordpress.com/2018/12/09/make-interactive-chemical-space-plot-in-jupyter-notebook-cheminformatics-altair/
I used altair for doing it. Today, I used d3js and matplotlib based package to make scatter plot.
mlpd3 is another tool for making interactive plot with python.
https://mpld3.github.io/index.html

In the original site, many examples are provided. It seems easy to make any kinds of plot with tooltip. ;) So, I want to try whether the module can SVG image as tooltip.
From original site document mlpd3 supports following version of python. The mpld3 project is compatible with Python 2.6-2.7 and 3.3-3.4. But I want to run my code on python3.6. So I installed mpld3 to py3.6 env.
And current version of mpld3 causes error. I referred SOF and installed bug fixed version from github repo.


python -m pip install --user "git+https://github.com/javadba/mpld3@display_fix"

Now ready. Let’s write code! To run the code on jupyter notebook, mpld3.enable_notebook() option is recommended.

%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mpld3
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem import DataStructs
from sklearn.decomposition import PCA
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from mpld3 import plugins
mpld3.enable_notebook()

Then perform PCA with morgan fingerprint. This is not so difficult. And make moltosvg function for tooltip. I browed the function from rdkit blog post. I used svg strings as tooltip.

def fp2arr(fp):
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp,arr)
    return arr

# Original code is described in the rdkit blog post.
# http://rdkit.blogspot.com/2015/02/new-drawing-code.html

def moltosvg(mol,molSize=(450,15),kekulize=True):
    mc = Chem.Mol(mol.ToBinary())
    if kekulize:
        try:
            Chem.Kekulize(mc)
        except:
            mc = Chem.Mol(mol.ToBinary())
    if not mc.GetNumConformers():
        rdDepictor.Compute2DCoords(mc)
    drawer = rdMolDraw2D.MolDraw2DSVG(molSize[0],molSize[1])
    drawer.DrawMolecule(mc)
    drawer.FinishDrawing()
    svg = drawer.GetDrawingText()
    return svg.replace('svg:','')

fpgen = rdFingerprintGenerator.GetMorganGenerator(2)
mols = [m for m in Chem.SDMolSupplier(os.path.join(RDConfig.RDDocsDir,'Book/data/cdk2.sdf'))]
for m in mols:
    AllChem.Compute2DCoords(m)
fps = [fpgen.GetFingerprint(m) for m in mols]
X = np.asarray([fp2arr(fp) for fp in fps])

pca = PCA(n_components=3)
res = pca.fit_transform(X)
# make set of SVG
svgs = [moltosvg(m) for m in mols]

Finally make scatter plot. This can do with almost same way to matplotlib. To use mlpd3, user do not need to write Javascript for making tooltip. Of course you can write your own javascript for implementation of more complex features .

fig, ax = plt.subplots()
ax.set_xlabel('PCA1')
ax.set_ylabel('PCA2')
ax.set_title('Viz chemical space!')
points = ax.scatter(res[:,0], res[:,1])
# This is key point for making tooltip!
tooltip = plugins.PointHTMLTooltip(points, svgs)
plugins.connect(fig, tooltip)

After running the code, I could get interactive plot which has chemical structures as a tooltip. ;)
I would like to make frame work for chemical space visualizer.

Scatter plot

Whole code can view from nb viewer.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/tooltip_test.ipynb