Make interactive plot with Knime #RDKit #Chemoinformatics #Knime

Dalia Goldman provided very cool presentation in RDKit UGM 2018 about Knime.
https://github.com/rdkit/UGM_2018/blob/master/Presentations/Goldmann_KNIMEandRDKit.pdf

She demonstrated interactive analysis with RDKit knime node and Javascript node. I was really interested but it was difficult to build the workflow by myself at that time.

BTW, I need to learn knime for data preparation in this week. So I learned about knime and how to make interactive plot with knime.

Following sample is very simple but shows power of knime.
The example is making interactive chemical space plot with knime. All work flow is below. Version of knime is 3.7.

At frist load SDF and calculate descriptors and fingerprint with RDKit node and the split fingerprint with FingerprintExnpander. Then conduct PCA with calculated FP.
Then convert molecule to SVG with openbabel node for visualization.

Key point of this flow is wrapped metanode!
This node is constructed from two JS node ‘Scatter plot’ and ‘Card view’.

After the making metanode, I defined visual layout. The setting can call from right botton of menue bar

And I set card view option as show selected only and several scatter plot option.

Now ready. Then run the work flow! I can view selected compounds from chemical space.
Image is below.

New version of Knime is powerful tool for not only data analysis but also data visualization. ;-)

Advertisements

Make interactive chemical space plot in jupyter notebook #cheminformatics #Altair

I often use seaborn for data visualization. With the library, user can make beautiful visualization.
BTW, today I tried to use another library that can make interactive plot in jupyter notebook.
Name of the library is ‘altair’.
https://altair-viz.github.io/index.html
The library can be installed from pip or conda and this package based vega and vega-lite. Vega is a python package for data visualization.

Regarding the tutorial, Altair can make beautiful plow with very simple code. I wrote an example that plot chemical space of cdk2.sdf which is provided by RDKit.

Following code is conducted in Google colab.
At first install rdkit in my environment. Fortunately Altair can call directly without installation to colab.

!wget https://repo.anaconda.com/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.1-Linux-x86_64.sh
!time bash ./Miniconda3-4.5.1-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

After installation of RDKit, append rdkit path to sys path,

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')

Now ready. Let’s import some libraries.

import pandas as pd
import numpy as np
import altair as alt

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load dataset

moldf = PandasTools.LoadSDF(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))
moldf['SMILES'] = moldf.ROMol.apply(Chem.MolToSmiles)
def mol2fparr(mol):
    arr = np.zeros((0,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

The conduct PCA with molecular fingerprint for plot chemical space.
And make new dataset. cdk2.sdf has Cluster number, so I used the annotation for coloring.

pca = PCA(n_components=2)
X = np.asarray([mol2fparr(mol) for mol in moldf.ROMol])
print(X.shape)
res = pca.fit_transform(X)
print(res.shape)
moldf['PCA1'] = res[:,0]
moldf['PCA2'] = res[:,1]
moldf2 = moldf[['ID', 'PCA1', 'PCA2', 'SMILES' ]]
moldf2['Cluster'] = ["{:0=2}".format(int(cls)) for cls in moldf.loc[:,'Cluster']]

To make scatter plot in Altair, it is easy just call ‘alt.Cahrt.mark_point(pandas data frame)’
mark_* is the method which can access many kids of plots.
From the document, following plots are provided.

Mark Name Method Description Example
area mark_area() A filled area plot. Simple Stacked Area Chart
bar mark_bar() A bar plot. Simple Bar Chart
circle mark_circle() A scatter plot with filled circles. One Dot Per Zipcode
geoshape mark_geoshape() A geographic shape Choropleth Map
line mark_line() A line plot. Simple Line Chart
point mark_point() A scatter plot with configurable point shapes. Multi-panel Scatter Plot with Linked Brushing
rect mark_rect() A filled rectangle, used for heatmaps Simple Heatmap
rule mark_rule() A vertical or horizontal line spanning the axis. Candlestick Chart
square mark_square() A scatter plot with filled squares. N/A
text mark_text() A scatter plot with points represented by text. Bar Chart with Labels
tick mark_tick() A vertical or horizontal tick mark. Simple Strip Plot

Now I would like to make scatter plot, so I call mark_point.
“interactive()” method returns interactive plot in jupyter.
So After run the code, I can see interactive plot in notebook the plot returns tooltip when mouse over the point.

alt.Chart(moldf2).mark_point().encode(
           x = 'PCA1',
           y = 'PCA2',
           color = 'Cluster',
           tooltip = ['ID', 'SMILES']).interactive()

This library is interesting for me because it is easy to implement tooltip. I tried to embed SVG image to tooltip but it did not work. I would like to know how to embed image to the tooltip if it possible.

How to visualize your data is important because it receives different impression from different visualization even if data is same.

Reader who is interested in the post can found whole code from google colab or github. ;-)
https://colab.research.google.com/drive/1hKcWRBcQG51eGsbpDBF2gl6CoZsmVvTs
https://github.com/iwatobipen/playground/blob/master/plot_chemicalspace.ipynb

Build stacking Classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posed about the ML method named ‘blending’ somedays ago. And reader recommended me that how about try to use “mlxtend”.
When I learned ensemble learning package in python I had found it but never used.
So try to use the library to build model.
Mlxtend is easy to install and good document is provided from following URL.
http://rasbt.github.io/mlxtend/

Following code is example for stacking.
In ipython notebook…
Use base.csv for test and load some functions.

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 for objective value.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of compounds is below.


Then calculate molecular fingerprint. And make binary activity array as y_bin.

X = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
y_bin = np.asarray(y>7, dtype=np.int)

Then load some classifier model from sklearn and split data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

To make stacking classifier, it is very simple just call StackingClassifier and set classifier and meta_classifier as arguments.
I use SVC as meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)

Then let’s learn the data!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx,test_idx) in enumerate(skf.split(x_train, y_train)):
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):
        clf.fit(x_train[train_idx],y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print(clflist[i])
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))
        print("="*20)

> output
round 0
RF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 0
GNB
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
====================
round 0
XGB
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
====================
round 0
SVC
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
====================
round 0
SCLF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 1
RF
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
====================
round 1
GNB
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
====================
round 1
XGB
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
====================
round 1
SVC
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106
===================
.....

Reader who is interested in stacking, you can find nice document here
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#example-1-simple-stacked-classification

And my all code is uploaded to myrepo on github.
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/mlxtend_test.ipynb

Mlxtend has many functions for building, analyzing and visualizing machine learning model and data. I will use the package more and more.

Change properties of approved oral drugs

When I learned drug discovery long time ago, I read the article about Role of five which is a rule of thumb to evaluate druglikeness.
https://www.sciencedirect.com/science/article/pii/S0169409X00001290?via%3Dihub
You can read nice review about the druglikess scores in following URL.
( Written in Japanese ;-) )
View story at Medium.com

By the way, recently there are many articles which are describing about beyond the Role of five . I read an article published from JMC, reported by researcher of NIB.
https://www.ncbi.nlm.nih.gov/pubmed/30212196

The author analyzed the change over time of properties such as MW, logP, HBA, HBD, TPSA and NumAromaticRing.
In page B, the author reported comparison between experimental and calculated (StarDrop, Pomona, Moka, Crippen) value of LogP which is determined in Novartis. And the experiments shows that StarDrop gave the lowest variability between measured and calculated logP. Most of tools overestimate in high log P area but stardrop does not. It worth to know that clogp is largely affected by computational tools.

In Table5 shows analysis of FDA approved oral NCEs from 1998 to 2017. Mw, RotBm and #ArRNG showed significance.
For example 90th percentile of Mw is 571.0 compared to 1997 90th percentile 470.3. It dramatically increased!
And also 90th percentile of #ArRNG is 4 compared to 1997 90th percentile 3.
Mw is changed dramatically I think. Fig 6 shows Number of ratio of MW in approved drug in each time period. This figure shows tendency to decrease of ratio of the drug which has MW <= 500.
By the way, HBD is not changed.

The author discussed about why size (MW) is matter and described that higher molecular weight is not be indicator of druglikeness but be estimator of synthetic accessibility.
To synthesize molecule with high molecular weight, it needs more building blocks and / or longer synthetic steps. So it is needed more effort to SAR expansion in medicinal chemistry projects.

Ro5 is easy to understand and reasonable role for medicinal chemistry but it is changing. Recently there are many modalities for drug discovery.
Changing role is needed long time and many efforts…..
More details are described in the article. It is informative for me.

Vote Vote Vote #chemoinformatics

Somedays ago, I posted about ensemble classification method named ‘blending’. The method is not implemented in scikit-learn. So I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifer’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
Following explanation from sklearn document.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
The function has two modes, one is hard and the other is soft.
From document…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

I used the classifier for QSAR modeling.

Following code, I used four classifier and BASE.csv from molecule net as test dataset.
Code is very simple! Just pass defined dictionary to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example voting method does not outperform to random forest, support vector classifier. But it is worth to know that sklearn provides useful feature for ensemble learning I think. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize pharmacophore in RDKit #RDKit

RDKit has pharmacophore feature assignment function. The function can retrieve molecular features based on pre-defined ph4core.
And RDKit IPythonconsole can draw molecules on ipython notebook.
Today I tried to visualize ph4core on notebook.
Code is very simple.

from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import rdBase
from rdkit.RDPaths import RDDocsDir
from rdkit.RDPaths import RDDataDir
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import os
print(rdBase.rdkitVersion)
IPythonConsole.ipython_useSVG=True
> 2018.09.1

First, load feature definition.

fdefFile = os.path.join(RDDataDir,'BaseFeatures.fdef')
featFact = ChemicalFeatures.BuildFeatureFactory(fdefFile)

Then calculate pharmacophore. And compute 2D cordes.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDDocsDir,"Book/data/cdk2.sdf"))]
featslists = [featFact.GetFeaturesForMol(mol) for mol in mols]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Next I defined drawing function. To highlight ph4core, highlightAtomLists and legends are used as optional arguments of MolsToGridImage.

def drawp4core(mol, feats):
    atoms_list = {}
    for feat in feats:
        atom_ids = feat.GetAtomIds()
        feat_type = feat.GetType()
        atoms_list[feat_type] = atom_ids
    return Draw.MolsToGridImage([mol]*len(atoms_list), legends=list(atoms_list.keys()), highlightAtomLists=list(atoms_list.values()))

Test the function.

im = drawp4core(mols[1], featslists[1])
im

I could get following image.

The function worked and it could pick up pharmacophore of given molecule. ;-)

Applicable Domain on Deep Neural Networks #JCIM #chemoinformatics

I read interesting article from JCIM.
Dissecting Machine-Learning Prediction of Molecular Activity: Is an
Applicability Domain Needed for Quantitative Structure−Activity
Relationship Models Based on Deep Neural Networks?

URL is below.
https://pubs.acs.org/doi/10.1021/acs.jcim.8b00348

The pros of DNN is feature extraction. And there are many articles which use DNN for molecular activity prediction. BTW, is it true that DNN is outperform any other machine learning methods?

The authors of the article analyzed the performance of DNN. They used ECFP4 as an input feature and predicted biological activities extracted from CHEMBL DB.
Their approach was reasonable, they built model with training set and check the performance with test data and evaluate RMSE in several layers which are defined by molecular similarity. Layer1 means that dataset is similar to training data and Layer6 means that dataset is not similar to training set.

They analyzed performance of predictive method such as KNN, RF and DNN and their analysis revealed that DNN showed similar performance with RF and KNN. And also Fig 5 shows that DNN can not predict objective value when query molecule is not similar to training set.
It indicates that DNN does not learn feature of molecule from finger print but learned pattern of fingerprint.
More details are described in this article.
In the real drug discovery project, MedChem sometime designs not seed similar compounds. For the chemist, this is not so special. But it is difficult point for AI to learn sense of MedChem.

Biological activity prediction is challenging area I think, And ECFP is still de fact standard for chemoinformatics. I would like to develop new concept of molecular descriptors.