Call Knime from Jupyter notebook! #Chemoinformatics #RDKit #Knime

I read an exciting blog post yesterday! The URL is below.
@dr_greg_landrum developed very cool tools that can call Knime from a Jupyter notebook and can execute a Jupyter notebook from Knime.

Details of the tools are described in the Knime blog post. I was interested in them and couldn’t wait to try them myself, so I used them on my MacBook Pro. First, I installed the Python knime package from PyPI.

iwatobipen$ pip install knime

Now I was ready. I made a sample workflow. It receives SMILES strings from Jupyter, calculates RDKit descriptors, normalizes them, and returns the result to the notebook.
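To make the normalization step concrete, here is a minimal pure-Python sketch of what a normalizer does to a descriptor table. The post does not say which normalization the workflow uses, so min-max scaling to [0, 1] is an assumption. The function name `min_max_normalize` and the toy descriptor values are hypothetical and only for illustration.

```python
def min_max_normalize(columns):
    """Scale each descriptor column to the range [0, 1] (assumed normalization)."""
    normalized = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        normalized[name] = [0.0 if span == 0 else (v - lo) / span for v in values]
    return normalized


# Toy descriptor table (illustrative values, not taken from the solubility dataset).
descriptors = {
    "MolWt": [46.07, 78.11, 180.16],
    "NumHDonors": [1, 0, 1],
}
scaled = min_max_normalize(descriptors)
print(scaled["MolWt"])
print(scaled["NumHDonors"])
```

In the actual setup this step runs inside Knime, so the notebook only sees the already-normalized table.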

After that, I built a regression model for solubility and applied it to the test data. The dataset comes from the Book/data folder of the RDKit docs.

OK, let’s go to the code. The following code is based on Greg’s blog post (URL above). First, import the packages.

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import style
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import RDConfig
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score
import knime

Next, prepare the datasets for training and testing. Data is passed to Knime as a pandas DataFrame.

train_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
test_path = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')

train_mols = [m for m in Chem.SDMolSupplier(train_path)]
train_y = np.asarray([m.GetProp('SOL') for m in train_mols], dtype=np.float32)
train_table = {'smiles':[Chem.MolToSmiles(m) for m in train_mols]}
train_df = pd.DataFrame(train_table)

test_mols = [m for m in Chem.SDMolSupplier(test_path)]
test_y = np.asarray([m.GetProp('SOL') for m in test_mols], dtype=np.float32)
test_table = {'smiles':[Chem.MolToSmiles(m) for m in test_mols]}
test_df = pd.DataFrame(test_table)

Next, define the Knime executable path and the workspace path. My environment is a Mac, so this is a little different from the original blog post.

# My Knime env was updated from 3.6 to 3.7.
knime.executable_path = '/Applications/KNIME'
workspace = '/Users/iwatobipen/knime-workspace/'

Then check the workflow. I made the following workflow in advance, and I could see an image of the workflow in the notebook. ;)

workflow = 'jupyter_integration'
knime.Workflow(workflow_path=workflow, workspace_path=workspace)

Now I was ready. Let’s run the workflow to calculate the descriptors!

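# training data: pass the SMILES table in, run the workflow, read the result
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = train_df
    wf.execute()  # without this call the workflow never runs and no output is produced
train_x = wf.data_table_outputs[0]

# test data: same workflow, different input table
with knime.Workflow(workflow_path=workflow, workspace_path=workspace) as wf:
    wf.data_table_inputs[0] = test_df
    wf.execute()
test_x = wf.data_table_outputs[0]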

Now I had the datasets for building and testing a regression model. So I fit scikit-learn’s SVR and tested the model performance.

svr = SVR(gamma='auto')
svr.fit(train_x, train_y)

pred = svr.predict(test_x)
print(r2_score(test_y, pred))
print(mean_squared_error(test_y, pred))
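For reference, the two metrics printed above can be written out by hand. This is a minimal pure-Python sketch of the standard definitions of mean squared error and the coefficient of determination (R²). The helper names and the toy values are hypothetical and are not results from the solubility model.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]
mse = mean_squared_error_manual(y_true, y_pred)
r2 = r2_score_manual(y_true, y_pred)
print(mse, r2)
```

An R² close to 1 means the predictions explain most of the variance in the observed values, while the MSE is in squared units of the target.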

It doesn’t seem so bad. Let’s check the performance with matplotlib.

a, b = min(test_y), max(test_y)
data = np.linspace(a,b, num=100)
plt.scatter(pred, test_y, c='b', alpha=0.5)
plt.plot(data, data, c='r')

The model predicts the solubility of the test molecules with high accuracy. In summary, I think integrating Knime and Jupyter notebook has high potential for chemoinformatics, because Jupyter is flexible and Knime is a powerful tool for routine work.
The whole code can be viewed at the following URL.

Make interactive plot with Knime #RDKit #Chemoinformatics #Knime

Dalia Goldman gave a very cool presentation about Knime at the RDKit UGM 2018.

She demonstrated interactive analysis with RDKit Knime nodes and JavaScript nodes. I was really interested, but it was difficult to build the workflow by myself at that time.

BTW, I need to learn Knime for data preparation this week. So I learned about Knime and how to make an interactive plot with it.

The following sample is very simple but shows the power of Knime.
The example makes an interactive chemical space plot with Knime. The whole workflow is below. The Knime version is 3.7.

First, load an SDF, calculate descriptors and fingerprints with the RDKit nodes, and then split the fingerprints into bit columns with the FingerprintExpander node. Then run PCA on the calculated fingerprints.
Then convert the molecules to SVG with the Open Babel node for visualization.
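The fingerprint expansion and PCA steps can be sketched in Python to show what the nodes compute. This is a plain SVD-based PCA with NumPy under stated assumptions, not the implementation of the Knime PCA node. The function names and the toy 8-bit fingerprints are hypothetical.

```python
import numpy as np


def expand_fingerprint(on_bits, n_bits):
    """Turn a collection of set-bit indices into a 0/1 row of length n_bits."""
    row = np.zeros(n_bits, dtype=float)
    row[list(on_bits)] = 1.0
    return row


def pca_project(matrix, n_components=2):
    """Project the rows onto the top principal components using SVD."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy fingerprints given as sets of on-bit indices (illustrative, 8 bits only).
toy_bits = [{0, 3, 5}, {0, 3, 6}, {1, 2, 7}, {1, 2, 4}]
X = np.array([expand_fingerprint(bits, n_bits=8) for bits in toy_bits])
coords = pca_project(X, n_components=2)
print(coords.shape)
```

Each row of `coords` is one molecule in the 2D chemical space, which is what the scatter plot displays.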

The key point of this flow is the wrapped metanode!
This node is built from two JS nodes, ‘Scatter plot’ and ‘Card view’.

After making the metanode, I defined the visual layout. The setting can be opened from the button at the right of the menu bar.

And I set the card view option to show selected only, along with several scatter plot options.

Now I was ready. Then run the workflow! I can view the compounds selected in the chemical space.
The image is below.

The new version of Knime is a powerful tool not only for data analysis but also for data visualization. ;-)

Visualize chemical space using Knime rdkit node

Usually I use Python to analyze and visualize chemical space, because I love coding. ;-)
I know that a workflow tool is also a useful way to do that.

So I tried to plot chemical space using Knime. Knime is one of the well-known workflow tools, and lots of nodes have been developed for it.

I made a very simple workflow to do PCA. My workflow is as follows.

First, the flow reads SMILES strings from an Excel file and converts the SMILES to RDKit molecules.
Then it calculates Morgan fingerprints using the RDKit Fingerprint node. The node can also calculate various other fingerprints such as MACCS and topological ones.
Next, it expands the bit vector into 1024 bit columns.
Then it does PCA and makes a scatter plot. The plotting node is implemented in the Erlwood chemoinformatics nodes.
When I opened the scatter plot view, I got the following dynamic scatter plot.
The node makes it easy to select each column, and the user can set color or size by their own criteria. It can also show structures as labels. Wow, cool!

And I set up the activity cliff viewer.
The node needs two inputs: one is SMILES and the other is a similarity distance matrix.
The N x N distance matrix is generated using the distance matrix calculate node.
Finally, I ran the flow and got a network view of the activity cliffs.
Edges colored green indicate activity cliffs (in my case, delta pIC50 >= 1.0 and similarity >= 0.5).
Hmm, but the image seems difficult to use for understanding the SAR. Cytoscape is a more suitable tool for visualizing networks.
Did I make a mistake???

The activity cliffs table looks good.
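The activity-cliff criterion used above (delta pIC50 >= 1.0 and similarity >= 0.5) can be sketched in a few lines of Python. This is a minimal illustration, not the logic of the Knime activity cliff node. It assumes Tanimoto similarity on fingerprints represented as sets of on-bit indices, and the compound names, bit sets, and pIC50 values are made up.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.5, min_delta=1.0):
    """Return pairs that are structurally similar but differ strongly in activity."""
    cliffs = []
    for (name_a, bits_a, act_a), (name_b, bits_b, act_b) in combinations(compounds, 2):
        similarity = tanimoto(bits_a, bits_b)
        delta = abs(act_a - act_b)
        if similarity >= min_similarity and delta >= min_delta:
            cliffs.append((name_a, name_b, similarity, delta))
    return cliffs


# Hypothetical compounds: (name, on-bits of a fingerprint, pIC50).
compounds = [
    ("cpd1", {1, 2, 3, 4}, 7.5),
    ("cpd2", {1, 2, 3, 5}, 6.0),
    ("cpd3", {1, 2, 3, 4, 6}, 7.4),
    ("cpd4", {7, 8, 9}, 5.0),
]
cliffs = find_activity_cliffs(compounds)
for name_a, name_b, similarity, delta in cliffs:
    print(name_a, name_b, round(similarity, 2), round(delta, 2))
```

Each returned pair corresponds to one green edge in the network view. Pairs that are similar but have close activities, or dissimilar pairs, are skipped.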

Knime is a powerful tool for medchem.


