Make original sklearn classifier #sklearn #chemoinfo

I posted and wrote code about ‘blending’ which is one of the strategy for ensemble learning. But the code had many hard coded part so it was difficult to use in my job. In this post, I tried to make new classification class of sklearn for ensemble learning and test the code.

At first, most of part came from my old post.
To make blending classifier, the class succeed to BaseEstimator, ClassifierMixin and TransformerMixin. And defined fit, predict and predict_proba method.

Main code is below.

from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.base import TransformerMixin
from sklearn.base import clone
import numpy as np
import six
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

class BlendingClassifier(BaseEstimator, ClassifierMixin):
    
    def __init__(self, l1_clfs, l2_clf, 
                 n_hold=5, test_size=0.2, verbose=0,
                 use_clones=True, random_state=794):
        self.l1_clfs = l1_clfs
        self.l2_clf = l2_clf

        self.n_hold = n_hold
        self.test_size = test_size
        self.verbose = verbose
        self.use_clones = use_clones
        self.random_state = random_state
        self.num_cls = None

    def fit(self, X, y):
        skf = StratifiedKFold(n_splits=self.n_hold, random_state=self.random_state)
        if self.use_clones:
            self.l1_clfs_ = [clone(clf) for clf in self.l1_clfs]
            self.l2_clf_ = clone(self.l2_clf)
        else:
            self.l1_clfs_ = self.l1_clfs
            self.l2_clf_ = self.l2_clf

        self.num_cls = len(set(y))
        if self.verbose > 0:
            print("Fitting {} l1_classifiers...".format(len(self.l1_clfs)))
            print("{} classes classification".format(self.num_cls))
        
        dataset_blend_train = np.zeros((X.shape[0],len(self.l1_clfs_), self.num_cls))

        for j, clf in enumerate(self.l1_clfs_):
            for i, (train_idx, test_idx) in enumerate(skf.split(X, y)):
                if self.verbose > 0:
                    print('{}-{}th hold, {} classifier'.format(j+1, i+1, type(clf)))
                train_i_x, train_i_y  = X[train_idx], y[train_idx]
                test_i_x, test_i_y = X[test_idx], y[test_idx]
                clf.fit(train_i_x, train_i_y)
                dataset_blend_train[test_idx, j, :] = clf.predict_proba(test_i_x)
 
        if self.verbose > 0:
            print('--- Blending ---')
            print(dataset_blend_train.shape)
        
        dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
        self.l2_clf_.fit(dataset_blend_train, y)
        return self

    def predict(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict(l1_output)

    def predict_proba(self, X):
        l1_output = np.zeros((X.shape[0], len(self.l1_clfs_), self.num_cls))
        for i, clf in enumerate(self.l1_clfs_):
            pred_y = clf.predict_proba(X)
            l1_output[:, i, :] = pred_y
        l1_output = l1_output.reshape((X.shape[0], -1))
        return self.l2_clf_.predict_proba(l1_output)

Now ready, let’s test the code! The code works like this …

from blending_classification import BlendingClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
et = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
gbc = GradientBoostingClassifier(learning_rate=0.01)
xgbc = XGBClassifier(n_estimators=100, n_jobs=-1)
# To use SVC, probability option must be True
svc = SVC(probability=True, gamma='auto')
# The class is two layers blending classifier.
# layer one is set of classifier.
# layer two is final classifier which uses output of layer one.

l1_clfs = [rf, et, gbc, xgbc]
l2_clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
blendclf = BlendingClassifier(l1_clfs, l2_clf, verbose=1)

blendclf.fit(train_X, train_y)
pred_y = blendclf.predict(test_X)
print(classification_report(test_y, pred_y))
              precision    recall  f1-score   support

           0       0.76      0.79      0.78       102
           1       0.69      0.61      0.65       115
           2       0.56      0.70      0.62        40

   micro avg       0.70      0.70      0.70       257
   macro avg       0.67      0.70      0.68       257
weighted avg       0.70      0.70      0.70       257
cm = confusion_matrix(test_y, pred_y)
plot_confusion_matrix(cm)

Next, run same task with RandomForest.

mono_rf = RandomForestClassifier(n_estimators=100, n_jobs=-1) mono_rf.fit(train_X, train_y) pred_y2 = mono_rf.predict(test_X)  print(classification_report(test_y, pred_y2))
              precision    recall  f1-score   support

           0       0.81      0.81      0.81       102
           1       0.77      0.75      0.76       115
           2       0.72      0.78      0.75        40

   micro avg       0.78      0.78      0.78       257
   macro avg       0.77      0.78      0.77       257
weighted avg       0.78      0.78      0.78       257
cm2 = confusion_matrix(test_y, pred_y2)
plot_confusion_matrix(cm2)

Hmm…… RandomForest shows better performance than blending classifier. Ensemble method seems return robust model I think but it is needed to parameter tuning. (I know it same for every machine learning method ;-) )

Any way, I wrote blending classifier class today and up load the code to my repo.
Any comments and/or suggestions are appreciated.

*All code can found following repo.
https://github.com/iwatobipen/skensemble

*And note book can see following URL.
https://nbviewer.jupyter.org/github/iwatobipen/skensemble/blob/master/solubility.ipynb

Two days to 2019!

Analysis and visualize tool kit for FMO #FMO

I and my daughter got the flu last week and now we are staying in my home…
Now I read some articles and found interesting work for FMO.
URL is below.
https://pubs.acs.org/doi/10.1021/acs.jcim.8b00649
FMO means ‘Fragment Molecular Orbital’ that is powerful method for protein-ligand interaction energy calculation. Evotec which is a drug discovery alliance and development partnership company published many attractive articles about utilization of FMO for Drug design.
Some of examples are …
https://pubs.acs.org/doi/abs/10.1021/acs.jmedchem.6b00045
https://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00644
https://link.springer.com/protocol/10.1007%2F978-1-4939-7465-8_8
FMO is useful but lack of efficient analytical tool, it is difficult to understand the results for none specialist I think.

The authors developed new analysis and visualization tool kit for FMO RbAnalysisFMO and Pymol plugin written by Ruby and Python. Both tool kit s are freely available.
http://dfns.u-shizuoka-ken.ac.jp/labs/proeng/custom20.html/

This tool kit can analyze two modes, ‘All-pairs’ and ‘Selected-pairs’. All-pairs mode can analyze all pair IEIE or PIE and ‘Selected-piars’ mode can analyze IEIE or PIE with user defined amino acid residues. And also it can analyze PIEDA (pair interaction energy decomposition analysis). PIEDA decomposes total interaction energy to four components “ES/electro static”, “DS/dispersion”, “CT/charge transfer” and “EX/exchange”. It is useful to understand of details of the interaction so visualize these component is useful.

The authors showed two examples for the tool kit usage.
One is the case of BC2L-C. They conducted FMO between BC2L-C and seleno methylated fucose(SFU), and revealed new interaction that can not indicated as an interacting residue of SFU in the PDB record.

Understanding protein-ligand interaction is key for drug design so efficient analytical and visualization tools are important. The tool kit looks useful not only computational chemist but also medicinal chemist.

メドケムxAI 創薬化学者の今後は明るいのか? #souyakuAC2018

これは"創薬アドベントカレンダー2018" 20日目の記事になります。

去年もサイエンス色0の駄文を書いたiwatobipenです。今年もエントリーしたもののネタがないなーと思って色々彷徨った結果、論文を紹介しつつ自分の業務周りにフォーカスしようということにしました。(またサイエンスじゃないのかよ!)
今回紹介するのはメドケム〜合成メドケムっぽいネタ。
まずは、Drug Discovery Todayから
Medicinal chemistry in drug discovery in big pharma: past, present and future
内容はタイトルの通り、GSKにて長年メドケムをやられてきた著者らによる大手製薬企業(海外)のこれまでとこれからですに関する記事です。

論文中のTable1”summary of the topics covered from the past to the present and then the future.”から何個か抜粋してみます。

以下の文章のリストは過去=>現在=>これからの順で書いています。なお、あくまで大きな製薬企業の例なので国内の中規模以下の製薬企業には当てはまらないことも多いと思います。

# 合成
- ハイスキルの人材がマニュアルで実施
- 50%は派遣スタッフや委託
- 90%以上は派遣スタッフや委託

# 合成反応
- 基本的な反応セット
- 同じ反応セット+Pd反応
- 現在の反応セット+CHActivationやBioconversion

# テクノロジー
- ホットプレートで攪拌、Evap
- +マイクロ波、パラレル反応キット
- ?

# Leads
- 様々なソースから
- Role of 5によるDrug like(経口投与を意識した)な構造から
- Role of 5を超えたスペース、様々な投与経路

# Compounds design and SAR
- 論文ベース、暗黙知、経験に基づく試行錯誤
- in silicoツールを活用して効率的に進める。
- より人がやるよりコンピューターが行うようなり、データを活用した洗練された手法になる。

# What medicinal Chemists do
- ラボでの合成が主、創薬化学が少し
- 合成より、創薬化学寄りに。薬探索者のフィールドはラボからオフィスへ
- 創薬探索と真のスペシャリストは基本オフィスへ。

# Communication
- Face to face がメイン
- 電子媒体の利用が増え、会わないでも議論できるように
- SNSのようなツールでリアルタイムにどこでも議論。

。。。論文中はもっと様々なトピックスが予測されています。

 私は、大学で勉強してきた有機合成の知識とスキルを生かし、新しい分子を合成することを仕事にと思って今の職につきましたが、昨今の製薬企業において、合成だけでを生業にしていきていくといのは難しくなっているのかもしれません。極論を言えば、お金を出せば、有機合成化学者がいなくても化合物が作れるわけです(うまく行くとはいってないですよ。)。様々な化合物にまつわるデータを解析し、SARを考え次の構造展開をデザインする創薬化学者のタスクに関しても、人の考えているタスクの多くがコンピュータに変わるのではないかと予想されています。
 文中では”Compound design and SAR: a transition from art to a process?”というセッションで詳細が述べられております。Drug design/SARは同じデータを見ても人により異なるデザインが出てくる部分であり、想像力が求められる部分もあるかと思います。しかし、コンピューターによって効率的に行われるようになるというのはあながち遠い未来の話ではないのかもしれません。同じくArtとも捉えられてきた有機合成経路探索に関しては、すでにAIが現場に投入されていますね。
https://www.chemistryworld.com/news/merck-kgaa-to-buy-chematica/3007276.article
ちなみに、これは有料のサービスとして他の企業も使えるようになりました。これを対外的にも利用できるようにしたのは、ビジネスとして成り立つクオリティと判断したからなのかな、実力が気になります。
https://www.sigmaaldrich.com/chemistry/chemical-synthesis/synthesis-software.html


 色々なサービスや技術進歩により創薬化学者の仕事はだいぶ変わりそうです。その一方で意外と昔から変わっていないのは有機合成です。
 今後、勢い衰えずComputer scienceがさらに発展し、デザインがAIによってなされるようになった時SARの案はあるものの、物作りが間に合わないといったことがありえます。
 マイクロ波、フロー合成、ナノスケール合成、など新しいテクノロジーは出てきていますが、やっぱり今だにナスフラスコ+スタラーバー、ワークアップは分液ロート振って、エバポしてカラム、(できれば再結晶)、などの作業をやることも多いのではないでしょうか。この辺に対しての取り組みの論文を何個か。
 
 ちょっと古い論文ですが下記のタイトルでLillyの研究者らが、WebベースのUIを備えたリモート合成ラボの報告をしています。この文献によれば、研究員は自分のオフィスから必要な反応条件などを入れるとあとはフルオートでロボットが化合物合成を進めてくれるようです。2013年の論文です。その時もびっくりしましたが今見ても驚きです。ここまでの設備を保有している製薬企業はそんなにないでしょう。Robot化のメリットは効率という観点のみではなく、危険な反応、化学物質の研究者への暴露を避けるといった安全面から見ても有用であると思います。
A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st Century
Globalで一拠点上記のようなラボがあるとして、Logisticsなども含め化合物の合成開始から受領までのターンオーバーがSARサイクルの時間に乗ってくるのかは興味があるところです。

 上記のように合成を完全に自動化したという話の次は、デザイン〜合成〜評価全てを閉鎖系でぐるぐるやるという話の紹介です。
Cyclofluidicという企業はUCB/Pfizerの出資により設立された企業で、CycloOpsというテクノロジーでリード最適化を効率的に実行します。最近Publish論文は下記から読めます。
Design, Synthesis, and Testing of Potent, Selective Hepsin Inhibitors via Application of an Automated Closed-Loop Optimization Platform

Abstract Image
 CycloOpsの肝はフロー合成という技術です。これを利用し化合物を合成し、そのまま評価系に流し込みます。フロー合成とはこれまでの古典的なナスフラスコでのバッチ合成と異なり、細い流路の中に試薬を流し込み反応させるテクノロジーです。バッチ合成の課題点を克服できるポテンシャルを秘めており大変注目されています。同社のテクノロジーは、先のLillyの例と比較すると創薬のPDCAサイクル、DMTAサイクルがぐるぐる回るといった点は大変魅力的です。一方で装置の流路が詰まるとどうにもならないので、使える試薬や反応に一定の制限があります。

 2つ事例を見てきましたが、冒頭の#Leadsの部分にあるように近年Beyond Ro5ということで、これまで以上に広い探索スペースから化合物を探索するようになっています。ですので化合物の物性素性もさらに多様性が増しているはずです。多様な化合物合成全てに適用可能な全自動合成マシンの登場はもう少し時間がかかりそうです、(私の予想では)。ただ、比較的汎用的な部分は自動化すぐいけるんじゃないかなと期待しています。ペプチドや核酸はもう自動合成装置がありますしね。
Robotics個人的に熱いところです。詳しい方興味ある方いれば是非いろいろ勉強させていただきたい。

デザイン〜合成の部分にフォーカスしてつらつら書いてみました。
で、まとめると
1、メドケムの業務の多くを引き受けることができるCROがたくさんあるし価格競争力、技術力も高い。
2、CompChemの台頭によりSARの解析、化合物デザインもAIができるようになってきている。合成経路なども提案できる仕組みもできてきた。
3、合成も自動化しているところもある。

 業務の外注化に関しては、賛否両論あります。特に最初はいいけど企業の継続性を考えた場合に、技術や知識、人材育成の面ではマイナスになりうるのではというものです。私もここは同意なのですが、これは実は日本のような終身雇用が基本のスタイルだからなのかなと最近考えています。
海外は雇用の流動性が高いのでファーマ=>CRO、CRO=>ファーマといったキャリアパスが多くあり意外と空洞化って起こらないのかなと(私見)。
 また、私は自分の今の業務のデザイン、合成がコンピューター、ロボットに変わるのは問題全くないと思っています。まあ、私的に固執する理由はありませんしね。勿論それが新薬創出の加速になるのならですけど。
 病で苦しむ患者の皆様に1日も早く新薬を届ける。という目的を達成するために目指すべきはなんなんでしょう。Drug designer・Drug hunterなどと表現されることもあった創薬化学者の行うべき仕事内容は今後どうなるのでしょう。
などと解答のない文章を書いておしまいにしたいと思います。

 5年後メドケムの仕事がどうなっているのか、私には正直想像つきません。皆さんどう思われますか。

もうすぐクリスマス!素敵なクリスマスが皆様に来ますように!


 

Make interactive MMP network with Knime #Knime #chemoinformatics

I posted an example that shows making interactive scatter plot with Knime. And I would like to try MMP network with Knime. I often make network view via python package such as igraph, networkx and py2cytoscape etc. BTW, today I want to learn how to do that on knime.

Recent version of Knime is provided several JS visualization nodes. So, I used Network viewrer(Js) node. And sample data is got from my githubrepo. https://github.com/iwatobipen/mmp_example

Now ready, I made very simple MMP generation workflow with Erlwood chemoinformatics node. The details of following screen shot is … 1 load data, 2 convert smiles to mol, 3 convert mol to rdkit mol obj, 4 Conduct MMP.

Workflow-sc1

And Node8 Network creator makes empty network and Node 9 object inserter binds MMP data as network. I mapped ID-L and ID-R for node paring and Set row ID as Edge ID.

Now making network is done. Then make interactive JS plot in wrapped meta node. Screen shot of expanded meta node is below.

Two openbabel nodes convert moleculeL and moleculeR to SVG image from SMILES. Then connect Network viewer node and Table View.

Finally after running the flow, I could get interactive network view and Compound pair is shown when edges are selected.

Make interactive plot with Knime #RDKit #Chemoinformatics #Knime

Dalia Goldman provided very cool presentation in RDKit UGM 2018 about Knime.

Click to access Goldmann_KNIMEandRDKit.pdf

She demonstrated interactive analysis with RDKit knime node and Javascript node. I was really interested but it was difficult to build the workflow by myself at that time.

BTW, I need to learn knime for data preparation in this week. So I learned about knime and how to make interactive plot with knime.

Following sample is very simple but shows power of knime.
The example is making interactive chemical space plot with knime. All work flow is below. Version of knime is 3.7.

At frist load SDF and calculate descriptors and fingerprint with RDKit node and the split fingerprint with FingerprintExnpander. Then conduct PCA with calculated FP.
Then convert molecule to SVG with openbabel node for visualization.

Key point of this flow is wrapped metanode!
This node is constructed from two JS node ‘Scatter plot’ and ‘Card view’.

After the making metanode, I defined visual layout. The setting can call from right botton of menue bar

And I set card view option as show selected only and several scatter plot option.

Now ready. Then run the work flow! I can view selected compounds from chemical space.
Image is below.

New version of Knime is powerful tool for not only data analysis but also data visualization. ;-)

Make interactive chemical space plot in jupyter notebook #cheminformatics #Altair

I often use seaborn for data visualization. With the library, user can make beautiful visualization.
BTW, today I tried to use another library that can make interactive plot in jupyter notebook.
Name of the library is ‘altair’.
https://altair-viz.github.io/index.html
The library can be installed from pip or conda and this package based vega and vega-lite. Vega is a python package for data visualization.

Regarding the tutorial, Altair can make beautiful plow with very simple code. I wrote an example that plot chemical space of cdk2.sdf which is provided by RDKit.

Following code is conducted in Google colab.
At first install rdkit in my environment. Fortunately Altair can call directly without installation to colab.

!wget https://repo.anaconda.com/miniconda/Miniconda3-4.5.1-Linux-x86_64.sh
!chmod +x Miniconda3-4.5.1-Linux-x86_64.sh
!time bash ./Miniconda3-4.5.1-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

After installation of RDKit, append rdkit path to sys path,

import sys
import os
sys.path.append('/usr/local/lib/python3.6/site-packages/')

Now ready. Let’s import some libraries.

import pandas as pd
import numpy as np
import altair as alt

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import PandasTools
from rdkit.Chem import RDConfig
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

from sklearn.decomposition import PCA

Then load dataset

moldf = PandasTools.LoadSDF(os.path.join(RDConfig.RDDocsDir, 'Book/data/cdk2.sdf'))
moldf['SMILES'] = moldf.ROMol.apply(Chem.MolToSmiles)
def mol2fparr(mol):
    arr = np.zeros((0,))
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

The conduct PCA with molecular fingerprint for plot chemical space.
And make new dataset. cdk2.sdf has Cluster number, so I used the annotation for coloring.

pca = PCA(n_components=2)
X = np.asarray([mol2fparr(mol) for mol in moldf.ROMol])
print(X.shape)
res = pca.fit_transform(X)
print(res.shape)
moldf['PCA1'] = res[:,0]
moldf['PCA2'] = res[:,1]
moldf2 = moldf[['ID', 'PCA1', 'PCA2', 'SMILES' ]]
moldf2['Cluster'] = ["{:0=2}".format(int(cls)) for cls in moldf.loc[:,'Cluster']]

To make scatter plot in Altair, it is easy just call ‘alt.Cahrt.mark_point(pandas data frame)’
mark_* is the method which can access many kids of plots.
From the document, following plots are provided.

Mark Name Method Description Example
area mark_area() A filled area plot. Simple Stacked Area Chart
bar mark_bar() A bar plot. Simple Bar Chart
circle mark_circle() A scatter plot with filled circles. One Dot Per Zipcode
geoshape mark_geoshape() A geographic shape Choropleth Map
line mark_line() A line plot. Simple Line Chart
point mark_point() A scatter plot with configurable point shapes. Multi-panel Scatter Plot with Linked Brushing
rect mark_rect() A filled rectangle, used for heatmaps Simple Heatmap
rule mark_rule() A vertical or horizontal line spanning the axis. Candlestick Chart
square mark_square() A scatter plot with filled squares. N/A
text mark_text() A scatter plot with points represented by text. Bar Chart with Labels
tick mark_tick() A vertical or horizontal tick mark. Simple Strip Plot

Now I would like to make scatter plot, so I call mark_point.
“interactive()” method returns interactive plot in jupyter.
So After run the code, I can see interactive plot in notebook the plot returns tooltip when mouse over the point.

alt.Chart(moldf2).mark_point().encode(
           x = 'PCA1',
           y = 'PCA2',
           color = 'Cluster',
           tooltip = ['ID', 'SMILES']).interactive()

This library is interesting for me because it is easy to implement tooltip. I tried to embed SVG image to tooltip but it did not work. I would like to know how to embed image to the tooltip if it possible.

How to visualize your data is important because it receives different impression from different visualization even if data is same.

Reader who is interested in the post can found whole code from google colab or github. ;-)
https://colab.research.google.com/drive/1hKcWRBcQG51eGsbpDBF2gl6CoZsmVvTs
https://github.com/iwatobipen/playground/blob/master/plot_chemicalspace.ipynb

Build stacking Classification QSAR model with mlxtend #chemoinformatics #mlxtend #RDKit

I posed about the ML method named ‘blending’ somedays ago. And reader recommended me that how about try to use “mlxtend”.
When I learned ensemble learning package in python I had found it but never used.
So try to use the library to build model.
Mlxtend is easy to install and good document is provided from following URL.
http://rasbt.github.io/mlxtend/

Following code is example for stacking.
In ipython notebook…
Use base.csv for test and load some functions.

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
import numpy as np
import pandas as pd

df = pd.read_csv("bace.csv")

The dataset has pIC50 for objective value.

mols = [Chem.MolFromSmiles(smi) for smi in df.mol]
fps = [AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=1024) for mol in mols]
pIC50 = [i for i in df.pIC50]
Draw.MolsToGridImage(mols[:10], legends=["pIC50 "+str(i) for i in pIC50[:10]], molsPerRow=5)

Images of compounds is below.


Then calculate molecular fingerprint. And make binary activity array as y_bin.

X = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X.append(arr)
X = np.array(X)
y = np.array(pIC50)
y_bin = np.asarray(y>7, dtype=np.int)

Then load some classifier model from sklearn and split data for training and testing.

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from mlxtend.classifier import StackingClassifier
from mlxtend.plotting import plot_decision_regions
from mlxtend.plotting import plot_confusion_matrix
import numpy as np
x_train, x_test, y_train, y_test = train_test_split(X,y_bin, test_size=0.2)

To make stacking classifier, it is very simple just call StackingClassifier and set classifier and meta_classifier as arguments.
I use SVC as meta_classifier.

clf1 = RandomForestClassifier(random_state=794)
clf2 = GaussianNB()
clf3 = XGBClassifier(random_state=0)
clf4 = SVC(random_state=0)
clflist = ["RF", "GNB", "XGB", "SVC", "SCLF"]

sclf = StackingClassifier(classifiers=[clf1,clf2,clf3], meta_classifier=clf4)

Then let’s learn the data!

skf = StratifiedKFold(n_splits=5)
for j, (train_idx,test_idx) in enumerate(skf.split(x_train, y_train)):
    for i, clf in enumerate([clf1, clf2, clf3, clf4, sclf]):
        clf.fit(x_train[train_idx],y_train[train_idx])
        ypred = clf.predict(x_train[test_idx])
        acc = accuracy_score(y_train[test_idx], ypred)
        b_acc = balanced_accuracy_score(y_train[test_idx], ypred)
        print("round {}".format(j))
        print(clflist[i])
        print("accuracy {}".format(acc))
        print("balanced accuracy {}".format(b_acc))
        print("="*20)

> output
round 0
RF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 0
GNB
accuracy 0.6625514403292181
balanced accuracy 0.680450351191296
====================
round 0
XGB
accuracy 0.8271604938271605
balanced accuracy 0.8136275995042005
====================
round 0
SVC
accuracy 0.7325102880658436
balanced accuracy 0.7072717256576229
====================
round 0
SCLF
accuracy 0.8148148148148148
balanced accuracy 0.8026786943947115
====================
round 1
RF
accuracy 0.7603305785123967
balanced accuracy 0.7534683684794672
====================
round 1
GNB
accuracy 0.640495867768595
balanced accuracy 0.6634988901220866
====================
round 1
XGB
accuracy 0.8140495867768595
balanced accuracy 0.8127081021087681
====================
round 1
SVC
accuracy 0.756198347107438
balanced accuracy 0.7414678135405106
===================
.....

Reader who is interested in stacking, you can find nice document here
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#example-1-simple-stacked-classification

And my all code is uploaded to myrepo on github.
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/mlxtend_test.ipynb

Mlxtend has many functions for building, analyzing and visualizing machine learning model and data. I will use the package more and more.

Change properties of approved oral drugs

When I learned drug discovery long time ago, I read the article about Role of five which is a rule of thumb to evaluate druglikeness.
https://www.sciencedirect.com/science/article/pii/S0169409X00001290?via%3Dihub
You can read nice review about the druglikess scores in following URL.
( Written in Japanese ;-) )
View at Medium.com

By the way, recently there are many articles which are describing about beyond the Role of five . I read an article published from JMC, reported by researcher of NIB.
https://www.ncbi.nlm.nih.gov/pubmed/30212196

The author analyzed the change over time of properties such as MW, logP, HBA, HBD, TPSA and NumAromaticRing.
In page B, the author reported comparison between experimental and calculated (StarDrop, Pomona, Moka, Crippen) value of LogP which is determined in Novartis. And the experiments shows that StarDrop gave the lowest variability between measured and calculated logP. Most of tools overestimate in high log P area but stardrop does not. It worth to know that clogp is largely affected by computational tools.

In Table5 shows analysis of FDA approved oral NCEs from 1998 to 2017. Mw, RotBm and #ArRNG showed significance.
For example 90th percentile of Mw is 571.0 compared to 1997 90th percentile 470.3. It dramatically increased!
And also 90th percentile of #ArRNG is 4 compared to 1997 90th percentile 3.
Mw is changed dramatically I think. Fig 6 shows Number of ratio of MW in approved drug in each time period. This figure shows tendency to decrease of ratio of the drug which has MW <= 500.
By the way, HBD is not changed.

The author discussed about why size (MW) is matter and described that higher molecular weight is not be indicator of druglikeness but be estimator of synthetic accessibility.
To synthesize molecule with high molecular weight, it needs more building blocks and / or longer synthetic steps. So it is needed more effort to SAR expansion in medicinal chemistry projects.

Ro5 is easy to understand and reasonable role for medicinal chemistry but it is changing. Recently there are many modalities for drug discovery.
Changing role is needed long time and many efforts…..
More details are described in the article. It is informative for me.

Vote Vote Vote #chemoinformatics

Somedays ago, I posted about ensemble classification method named ‘blending’. The method is not implemented in scikit-learn. So I am implementing the function now.
By the way, scikit-learn has an ensemble classification method named ‘VotingClassifer’.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier
Following explanation from sklearn document.

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing model in order to balance out their individual weaknesses.

The classifier can combine many classifiers very easily.
The function has two modes, one is hard and the other is soft.
From document…
If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.

I used the classifier for QSAR modeling.

Following code, I used four classifier and BASE.csv from molecule net as test dataset.
Code is very simple! Just pass defined dictionary to VotingClassifier.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

voting_clf = VotingClassifier(estimators=[("RF", clf_dict["RF"]),
                                        ("GBC", clf_dict["GBC"]),
                                        ("XGB", clf_dict["XGB"]),
                                        ("SVC", clf_dict["SVC"])        
                                    ], voting='hard')

dataset = np.load("train.npz")['arr_0']
X = dataset[:,:-1]
y = dataset[:,-1]
idx = np.random.permutation(y.size)
X = X[idx]
y = y[idx]
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)
nfolds = 10
skf = StratifiedKFold(nfolds)
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(test_X)
print("Voting!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

rf_clf = RandomForestClassifier(n_estimators=100)
 
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(test_X)
print("Random Forest!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred))

svc_clf = SVC(probability=True, gamma='auto')
for i, (train, val) in enumerate(skf.split(train_X, train_y)):
    #print('fold {}'.format(i))
    X_train = train_X[train]
    y_train = train_y[train]
    X_val = train_X[val]
    y_val = train_y[val]
    svc_clf.fit(X_train, y_train)
y_pred = svc_clf.predict(test_X)
print("SV!")
print(confusion_matrix(test_y, y_pred))
print(classification_report(test_y, y_pred)) 

Then run the code.
In this example voting method does not outperform to random forest, support vector classifier. But it is worth to know that sklearn provides useful feature for ensemble learning I think. ;-)

iwatobipen$ python voting.py 
Voting!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Random Forest!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

SV!
[[10  0  0]
 [ 0 11  0]
 [ 0  1  8]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        10
         1.0       0.92      1.00      0.96        11
         2.0       1.00      0.89      0.94         9

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30

Visualize pharmacophore in RDKit #RDKit

RDKit has pharmacophore feature assignment function. The function can retrieve molecular features based on pre-defined ph4core.
And RDKit IPythonconsole can draw molecules on ipython notebook.
Today I tried to visualize ph4core on notebook.
Code is very simple.

from rdkit import Chem
from rdkit.Chem import ChemicalFeatures
from rdkit import rdBase
from rdkit.RDPaths import RDDocsDir
from rdkit.RDPaths import RDDataDir
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
import os
print(rdBase.rdkitVersion)
IPythonConsole.ipython_useSVG=True
> 2018.09.1

First, load feature definition.

fdefFile = os.path.join(RDDataDir,'BaseFeatures.fdef')
featFact = ChemicalFeatures.BuildFeatureFactory(fdefFile)

Then calculate pharmacophore. And compute 2D cordes.

mols = [m for m in Chem.SDMolSupplier(os.path.join(RDDocsDir,"Book/data/cdk2.sdf"))]
featslists = [featFact.GetFeaturesForMol(mol) for mol in mols]
for mol in mols:
    AllChem.Compute2DCoords(mol)

Next I defined drawing function. To highlight ph4core, highlightAtomLists and legends are used as optional arguments of MolsToGridImage.

def drawp4core(mol, feats):
    atoms_list = {}
    for feat in feats:
        atom_ids = feat.GetAtomIds()
        feat_type = feat.GetType()
        atoms_list[feat_type] = atom_ids
    return Draw.MolsToGridImage([mol]*len(atoms_list), legends=list(atoms_list.keys()), highlightAtomLists=list(atoms_list.values()))

Test the function.

im = drawp4core(mols[1], featslists[1])
im

I could get following image.

The function worked and it could pick up pharmacophore of given molecule. ;-)

Applicable Domain on Deep Neural Networks #JCIM #chemoinformatics

I read interesting article from JCIM.
Dissecting Machine-Learning Prediction of Molecular Activity: Is an
Applicability Domain Needed for Quantitative Structure−Activity
Relationship Models Based on Deep Neural Networks?

URL is below.
https://pubs.acs.org/doi/10.1021/acs.jcim.8b00348

The pros of DNN is feature extraction. And there are many articles which use DNN for molecular activity prediction. BTW, is it true that DNN is outperform any other machine learning methods?

The authors of the article analyzed the performance of DNN. They used ECFP4 as an input feature and predicted biological activities extracted from CHEMBL DB.
Their approach was reasonable, they built model with training set and check the performance with test data and evaluate RMSE in several layers which are defined by molecular similarity. Layer1 means that dataset is similar to training data and Layer6 means that dataset is not similar to training set.

They analyzed performance of predictive method such as KNN, RF and DNN and their analysis revealed that DNN showed similar performance with RF and KNN. And also Fig 5 shows that DNN can not predict objective value when query molecule is not similar to training set.
It indicates that DNN does not learn feature of molecule from finger print but learned pattern of fingerprint.
More details are described in this article.
In the real drug discovery project, MedChem sometime designs not seed similar compounds. For the chemist, this is not so special. But it is difficult point for AI to learn sense of MedChem.

Biological activity prediction is challenging area I think, And ECFP is still de fact standard for chemoinformatics. I would like to develop new concept of molecular descriptors.

Generate possible list of SMLIES with RDKit #RDKit

In the computer vision, it is often used data augmentation technique for getting large data set. On the other hand, Canonical SMILES representations are used in chemoinformatics area.
RDKit UGM in last year, Dr. Esben proposed new approach for RNN with SMILES. He expanded 602 training molecules to almost 8000 molecules with different smiles representation technique.

Click to access Bjerrum_RDKitUGM_Smiles_Enumeration_for_RNN.pdf

This approach seems works well.
In the UGM hackathon at this year, this random smiles generate function is implemented and it can call from new version of RDKit!
I appreciate rdkit developers!

It is very easy to use, pls see the code below.

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
print(rdBase.rdkitVersion)
>2018.09.1

I used kinase inhibitor as an example.

testsmi = 'CC(C1=C(C=CC(=C1Cl)F)Cl)OC2=C(N=CC(=C2)C3=CN(N=C3)C4CCNCC4)N'
mol = Chem.MolFromSmiles(testsmi)
mol


Default output of MolToSmiles is canonical manner.

print(Chem.MolToSmiles(mol))
>CC(Oc1cc(-c2cnn(C3CCNCC3)c2)cnc1N)c1c(Cl)ccc(F)c1Cl

But if you set MolToSmiles with doRandom=True option, the function return random but valid SMILES.

mols = []
for _ in range(50):
  smi = Chem.MolToSmiles(mol, doRandom=True)
  print(smi)
  m = Chem.MolFromSmiles(smi)
  mols.append(m)

>Fc1c(Cl)c(C(Oc2cc(-c3cn(nc3)C3CCNCC3)cnc2N)C)c(cc1)Cl
>O(c1cc(-c2cnn(c2)C2CCNCC2)cnc1N)C(c1c(Cl)c(ccc1Cl)F)C
>--snip--
>c1(N)ncc(-c2cnn(C3CCNCC3)c2)cc1OC(c1c(Cl)ccc(F)c1Cl)C
#check molecules
Draw.MolsToGridImage(mols, molsPerRow = 10)

Different SMILES but same molecule!

There are many deep learning approaches which use SMIELS as input. It is useful for these models to augment input data I think.

I uploaded my example code on google colab and github my repo.

Colab
https://colab.research.google.com/drive/1dMmgCpskrfI1afh3qmPdv8ixkGTuOCDH

github
https://github.com/iwatobipen/playground/blob/master/random_smiles_rdkit.ipynb

nbviewer
http://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/random_smiles_rdkit.ipynb

Tracking progress of machine learning #MachineLearning

To conduct machine learning it is needed to optimize hyper parameters.
For example scikit-learn provides grid search method. And you know there are several packages to do that such as hyperopt or gyopt etc. How do you mange builded models? It is difficult for me.
Recently I am interested in mlflow . MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It tackles three primary functions.
MLflow can track each model of hyper parameters and serve the models and also it can provide good web UI.
I used it in very simple example.
Code is below.
At first I got sample data and visualize the data set with PCA.

# on the ipython notebook
%matplotlib inline
!wget https://raw.githubusercontent.com/mlflow/mlflow/master/examples/sklearn_elasticnet_wine/wine-quality.csv -P ./data/
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
import pandas as pd
import numpy as np

from sklearn.decomposition import PCA
data = pd.read_csv('./data/wine-quality.csv')

cmap = plt.get_cmap("Blues", len(data.quality.unique()))
pca = PCA()
wine_pca = pca.fit_transform(data.iloc[:,:-1])
plt.scatter(wine_pca[:,0], wine_pca[:,1], c=data.quality, cmap=cmap)
plt.xlim(np.min(wine_pca[:,0]), np.max(wine_pca[:,0]))
plt.ylim(np.min(wine_pca[:,1]), np.max(wine_pca[:,1]))
plt.colorbar()

Next train function is defined.
mlflow.log_param function can track scoring parameters and mlflow.sklearn.log_model can store the model.
After running the code, mlruns folder is generated in current directory and stored data.


def train():

    import os
    import warnings
    import sys
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    import mlflow
    import mlflow.sklearn

    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2

    warnings.filterwarnings("ignore")
    np.random.seed(40)
    data = pd.read_csv('./data/wine-quality.csv')
    train, test = train_test_split(data)

    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]
    param = {'C':[0.01, 0.1, 1, 10, 100, 1000, 10000 ],
             'gamma':[1.0, 1e-1, 1e-2, 1e-3, 1e-4]}
    for c in param['C']:
        for g in param['gamma']:
            with mlflow.start_run():
                print(c,g)
                skf = StratifiedKFold(n_splits=5)
                svr = SVR(C=c, gamma=g)score = cross_val_score(svr, train_x, train_y, cv=skf, n_jobs=-1)
                svr.fit(train_x, train_y)
                predicted_qualities = svr.predict(test_x)
                (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
                print("  RMSE: %s" % rmse)
                print("  MAE: %s" % mae)
                print("  R2: %s" % r2)
                mlflow.log_param("C", c)
                mlflow.log_param("gamma", g)
                mlflow.log_metric("r2", r2)
                mlflow.log_metric("rmse", rmse)
                mlflow.log_metric("mae", mae)
                mlflow.sklearn.log_model(svr, "model")

Run the function.

train()
0.01 1.0
  RMSE: 0.876717955304063
  MAE: 0.6586558965180616
  R2: 0.007250505904323745
0.01 0.1
  RMSE: 0.872902609847314
  MAE: 0.6523680676966712
  R2: 0.015872299345786156
--snip--
10000 0.0001
  RMSE: 0.7902872331540974
  MAE: 0.570097398346025
  R2: 0.19334133272639453
'''

After running the code, MLflow can provide very useful webUI. To access the UI, just type following command from terminal ;-).
And then access http://127.0.0.1:5000/#/.

iwatobipen$ mlflow server

I can check the summary of the training with metrics like below.

And each model is stored. I can see details and access each model like below.

It is useful to manage many experiments and many models I think.

Ensemble learning with scikit-learn and XGBoost #machine learning

I often post about the topics of deep learning. But today I would like to post about ensemble learning.
There are lots of documents describes Ensemble learning. And I think following document is very informative for me.

Kaggle Ensembling Guide


I interested one of the method, named ‘blending’.
Regarding above URL, the merit of ‘blending’ are …

Blending has a few benefits:

It is simpler than stacking.
It wards against an information leak: The generalizers and stackers use different data.
You do not need to share a seed for stratified folds with your teammates. Anyone can throw models in the ‘blender’ and the blender decides if it wants to keep that model or not.

There are two layers in blending. First layer is a set of multiple classifiers that is trained with training data. And second layer is a classifier that is trained with output of the test set of the first layer.
I tried to write a code for blending. Following code I used scikit-learn and XGBoost.

First, import libraries and define the dictionary for my conveniense.

import click
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
l1_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

l2_clf_dict = {'RF': RandomForestClassifier(n_estimators=100),
        'ETC': ExtraTreesClassifier(n_estimators=100),
        'GBC': GradientBoostingClassifier(learning_rate=0.05),
        'XGB': XGBClassifier(n_estimators=100),
        'SVC': SVC(probability=True, gamma='auto')}

Then defined model build function. Following code can be applied multiple classification problem.
The code seems a little bit complicated and it can only return set of trained classifiers. I would like to improve the code in the near the future.

@click.command()
@click.option('--l1', default='all', type=str, help='models of first layers input format is csv. RF/ETC/GBC/XGB/SVC')
@click.option('--l2', default='XGB', type=str, help='model of second layer default is XGB')
@click.option('--nfolds', default=10, type=int, help='number of KFolds default is 10')
@click.option('--traindata', default='train.npz', type=str, help='data for training')
def buildmodel(l1, l2, nfolds, traindata):
    skf = StratifiedKFold(nfolds)
    dataset = np.load(traindata)['arr_0']
    X = dataset[:,:-1]
    y = dataset[:,-1]
    idx = np.random.permutation(y.size)
    X = X[idx]
    y = y[idx]
    num_cls = len(set(y))
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=794)

    if l1 == 'all':
        l1 = list(l1_clf_dict.keys())
        clfs = list(l1_clf_dict.values())
    else:
        clfs = [l1_clf_dict[clf] for clf in l1.split(',')]

    dataset_blend_train = np.zeros((train_X.shape[0], len(clfs), num_cls ))
    dataset_blend_test = np.zeros((test_X.shape[0], len(clfs), num_cls ))

    for j, clf in enumerate(clfs):
        dataset_blend_test_j = np.zeros((test_X.shape[0], nfolds, num_cls))
        for i, (train, val) in enumerate(skf.split(train_X, train_y)):
            print('fold {}'.format(i))
            X_train = train_X[train]
            y_train = train_y[train]
            X_val = train_X[val]
            y_val = train_y[val]
            clf.fit(X_train, y_train)
            # use clfs predicted value for next layer's training
            y_pred = clf.predict_proba(X_val)
            dataset_blend_train[val, j, :] = y_pred
            dataset_blend_test_j[:, i, :] = clf.predict_proba(test_X)
        dataset_blend_test[:, j, :] = dataset_blend_test_j.mean(1)
    l2_clf = l2_clf_dict[l2]
    print('Blending')
    print(dataset_blend_train.shape)
    dataset_blend_train = dataset_blend_train.reshape((dataset_blend_train.shape[0], -1))
    l2_clf.fit(dataset_blend_train, train_y)

    dataset_blend_test = dataset_blend_test.reshape((dataset_blend_test.shape[0], -1))
    y_pred = l2_clf.predict_proba(dataset_blend_test)
    y_pred
    print(classification_report(test_y, np.argmax(y_pred, 1)))
    print(confusion_matrix(test_y, np.argmax(y_pred, 1)))

    print("*"*50)
    for i, key in enumerate(l1):
        print('layer 1 {}'.format(l1[i]))
        l1_pred = clfs[i].predict_proba(test_X)
        print(classification_report(test_y, np.argmax(l1_pred, 1)))
        print(confusion_matrix(test_y, np.argmax(l1_pred, 1)))  
        print("*"*50) 
    return (clfs, l2_clf)

if __name__=='__main__':
    buildmodel()

Now I just finished making base code.
Let’s make sample code and run it.

# make data
import numpy as np
from sklearn.datasets import load_iris 
x = load_iris().data
y = load_iris().target
data = np.hstack((x, y.reshape(y.size, 1)))
np.savez('train.npz', data)

Run the code :)

iwatobipen$ python blending.py
train.npz
fold 0
fold 1
fold 2
--snip--
fold 7
fold 8
fold 9
Blending
(120, 5, 3)
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      0.83      0.91         6
         2.0       0.92      1.00      0.96        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.97      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  5  1]
 [ 0  0 11]]
**************************************************
layer 1 RF
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.86      1.00      0.92         6
         2.0       1.00      0.91      0.95        11

   micro avg       0.97      0.97      0.97        30
   macro avg       0.95      0.97      0.96        30
weighted avg       0.97      0.97      0.97        30

[[13  0  0]
 [ 0  6  0]
 [ 0  1 10]]
**************************************************
layer 1 ETC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.71      0.83      0.77         6
         2.0       0.90      0.82      0.86        11

   micro avg       0.90      0.90      0.90        30
   macro avg       0.87      0.88      0.88        30
weighted avg       0.91      0.90      0.90        30

[[13  0  0]
 [ 0  5  1]
 [ 0  2  9]]
**************************************************
layer 1 GBC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 XGB
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       0.75      1.00      0.86         6
         2.0       1.00      0.82      0.90        11

   micro avg       0.93      0.93      0.93        30
   macro avg       0.92      0.94      0.92        30
weighted avg       0.95      0.93      0.93        30

[[13  0  0]
 [ 0  6  0]
 [ 0  2  9]]
**************************************************
layer 1 SVC
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        13
         1.0       1.00      1.00      1.00         6
         2.0       1.00      1.00      1.00        11

   micro avg       1.00      1.00      1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

[[13  0  0]
 [ 0  6  0]
 [ 0  0 11]]
**************************************************

All classifiers in the first layer showed good performance. This is very simple case and very small data to estimate of merit of the blending. I will check another data from now.

Ref.
http://www.chioka.in/stacking-blending-and-stacked-generalization/

Visualize important features of machine leaning #RDKit

As you know, rdkit2018 09 01 has very exiting method named ‘DrawMorganBit’ and ‘DrawMorganBits’. It can render the bit information of fingerprint.
It is described the following blog post.
http://rdkit.blogspot.com/2018/10/using-new-fingerprint-bit-rendering-code.html
And if you can read Japanese, Excellent posts are provided.
View at Medium.com
https://future-chem.com/rdkit-fp-visualize/

What I want to do in the blog post is that visualize important features of random forest classifier.
Fingerprint based predictor is sometime difficult to understand feature importance of each bit. I think that using the new method, I can visualize important features of active molecules.

Let’s try it.
At frist import several packages.

import os
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import DataStructs
from rdkit.Chem import RDConfig
from rdkit.Chem import rdBase
print(rdBase.rdkitVersion)

Then load dataset and define mol2fp function.

trainpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf')
testpath = os.path.join(RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf')
solclass = {'(A) low':0, '(B) medium': 1, '(C) high': 2}
n_splits = 10
def mol2fp(mol,nBits=1024):
    bitInfo={}
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, bitInfo=bitInfo)
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr, bitInfo
trainmols = [m for m in Chem.SDMolSupplier(trainpath)]
testmols = [m for m in Chem.SDMolSupplier(testpath)]

Then get X and Y.
In the blogpost I used solubility classification data.

trainX = np.array([mol2fp(m)[0] for m in trainmols])
trainY = np.array([solclass[m.GetProp('SOL_classification')] for m in trainmols], dtype=np.int)

testX = np.array([mol2fp(m)[0] for m in testmols])
testY = np.array([solclass[m.GetProp('SOL_classification')] for m in testmols], dtype=np.int)

Then train RandomForestClassifier.
I used StratifiedKFold instead of KFold.
The difference of two methods is well described following URL.
https://www.kaggle.com/ogrellier/kfold-or-stratifiedkfold

rfclf = RandomForestClassifier(n_estimators=50)
skf = StratifiedKFold(n_splits=n_splits)
for train, test in skf.split(trainX, trainY):
    trainx = trainX[train]
    trainy = trainY[train]
    testx = trainX[test]
    testy = trainY[test]
    rfclf.fit(trainx, trainy)
    predy = rfclf.predict(testx)
    print(accuracy_score(testy, predy))

Then get important features. It seems a little bit tricky. And picked index of high solubility molecule.

fimportance = rfclf.feature_importances_
fimportance_dict = dict(zip(range(1024), fimportance))
sorteddata = sorted(fimportance_dict.items(), key=lambda x:-x[1])
top50feat = [x[0] for x in sorteddata][:50]
solblemol_idx = []
for i, v in enumerate(testY):
    if v == 2:
        solblemol_idx.append(i)

Now ready.
Test!!!!

testidx = solblemol_idx[-1] 
testmols[testidx]

rfclf.predict([testX[testidx]])
array([2])

Get fingerprint and bit information and extract important features from onBits.

fp, bitinfo = mol2fp(testmols[testidx])
onbit = [bit for bit in bitinfo.keys()]
importantonbits = list(set(onbit) & set(top50feat))

Finally visualize important bit information with structures.

tpls = [(testmols[testidx], x, bitinfo) for x in importantonbits]
Draw.DrawMorganBits(tpls, legends = [str(x) for x in importantonbits])

This classifier can detect substructures with oxygen atom and sp3 carbon.
The example molecule has amino functional groups but the predictor does not detect nitrogen as an important group.
I think this is because the model is not well optimized.
But it is useful that visualize features of molecule.
All code is uploaded to google colab.
Readers who are interested the code. Can run the code on the colab!
https://colab.research.google.com/drive/16NmSqHMD3Tw562PeCh0u22OB9oPyqCsw

convert rdkit mol object to schrodinger’s mol object #RDKit #Chemoinformatics

I posted a memo about how to read maestro file format from RDKit. It means that rdkitter can use “mae” format from RDKit. ;-)
BTW, schrodinger’s site provides API for python. I would like to know the way to communicate rdkit from schrodinger python API.
https://www.schrodinger.com/pythonapi

I read the API in lunch break and tested some methods. Fortunately it worked well.
Example is below.
“rdkit_adapter” method provides a tool to connect rdkit-schrodinger’s python.

At first read sample sdf from rdkit.

from schrodinger.thirdparty import rdkit_adapter
from rdkit import Chem
mols = [ mol for mol in Chem.SDMolSupplier("solubility.sdf") if mol != None]

Then sanitize mol and convert rdkit mol object to schrodinger mol object. It can do with rdkit_adapter.from_rdkit method like this.

for mol in mols:
    Chem.SanitizeMol(mol)
schromols = []
for mol in mols:
    try:
        shmol = rdkit_adapter.from_rdkit(mol)
        schromols.append(shmol)
    except:
        pass

Now I could convert rdkit mol objects to schrodinger mol objects. Next I tried conversion from schrodinger mol objects to rdkit mol objects. I used rdkit_adapter.to_rdkit method.

rdmols = []
for mol in schromols:
    try:
        rdmol = rdkit_adapter.to_rdkit(mol)
        rdmols.append(rdmol)
    except:
        pass

I could not show output because I wrote this blog post from my personal computer. But above code worked.
And there are many methods are provided to schrodinger mol obj. Hmm it seems exciting for me. I check the API more deeply!

Read maestro format file from RDKit

RDKitter knows that Schrodinger contributes RDKit I think.
https://www.schrodinger.com/news/schr%C3%B6dinger-contributes-rdkit

Schrodinger provides many computational tools for drug discovery, that is not only GUI tool but also python API. Many tool can call from python and also RDKit. And RDKit can read maestro file vise versa.
It is easy to do it like reading SDFiles.
I am writing the blog post on my personal PC and I do not have schrodinger software’s license. So I got test files from schrodingers github repository.
https://github.com/schrodinger

On ipython notebook

!wget https://raw.githubusercontent.com/schrodinger/maeparser/master/test/test.mae
!wget https://github.com/schrodinger/maeparser/raw/master/test/test2.maegz
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem import rdmolfiles
from rdkit.Chem.Draw import rdDepictor
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import gzip
rdDepictor.SetPreferCoordGen(True)
rdBase.rdkitVersion
>'2018.09.1'

Read mols from mae format.

maemols = rdmolfiles.MaeMolSupplier("test.mae")
mols = [m for m in maemols]
Draw.MolsToGridImage(mols)

Read mols form maegz

maemols2 = rdmolfiles.MaeMolSupplier(gzip.open("test2.maegz"))
mols2 = [m for m in maemols2]
Draw.MolsToGridImage(mols2)

Both cases work fine.
If user want to integrate rdkit and schrodinger tools, the method will be useful.
Note book uploaded my github repo.
https://nbviewer.jupyter.org/github/iwatobipen/playground/blob/master/maeparser.ipynb

Run rdkit and deep learning on Google Colab! #RDKit

If you can not use GPU on your PC, it is worth to know that you can use GPU and/or TPU on google colab. Now you can use google colab no fee.
So, I would like to use rdkit on google colab and run deep learning on the app.
Today I tried it.
At first I installed RDKit on the instance. It is not so difficult.
Now default version of python is 3.7. Fortunately conda-forge provides us rdkit-py37 recipe!

The code is below.

!wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
!chmod +x Miniconda3-latest-Linux-x86_64.sh
!time bash ./Miniconda3-latest-Linux-x86_64.sh -b -f -p /usr/local
!time conda install -q -y -c conda-forge rdkit

Then append rdkit path to current python system path.

%matplotlib inline
import matplotlib.pyplot as plt
import sys
import os
sys.path.append('/usr/local/lib/python3.7/site-packages/')

Now I can import rdkit! Next import rdkit and load dataset.

import numpy as np
import pandas as pd

from rdkit import Chem
from rdkit.Chem import DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem import RDConfig
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole

trainsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.train.sdf'))
testsdf = Chem.SDMolSupplier(os.path.join( RDConfig.RDDocsDir, 'Book/data/solubility.test.sdf'))
train_mols = [mol for mol in trainsdf if mol != None]
test_mols = [mol for mol in testsdf if mol != None]
sol_class = {"(A) low":0, "(B) medium":1, "(C) high": 2}

Next, define mol to fingerprint function and make dataset for deep learning.

# 2048 bit vector
def mol2arr(mol):
  arr = np.zeros((1,))
  fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
  DataStructs.ConvertToNumpyArray(fp, arr)
  return arr

In this blog post, I use keras for convenience.

import tensorflow as tf
import tensorflow.keras as keras
from keras import Model
from keras.layers import Activation, Dense, Dropout, Input
from keras.utils import np_utils


trainX = np.array([mol2arr(mol) for mol in train_mols])
trainY = [sol_class[mol.GetProp("SOL_classification")] for mol in train_mols]
trainY = np_utils.to_categorical(trainY)

testX = np.array([mol2arr(mol) for mol in test_mols])
testY = [sol_class[mol.GetProp("SOL_classification")] for mol in test_mols]
testY = np_utils.to_categorical(testY)

Define model, and I would like to check intermediate layer, so I also defined model_int.

inputs = Input(shape=(2048,), name='input')
x = Dense(100, activation='relu', name='Layer1')(inputs)
x = Dense(20, activation='relu', name='Layer2')(x)
x = Dense(2, activation='relu', name='Layer3')(x)
predictions = Dense(3, activation='softmax', name='output')(x)

model = Model(inputs=inputs, outputs = predictions)
model.compile(optimizer=tf.train.RMSPropOptimizer(0.001),
                           loss='categorical_crossentropy',
                           metrics=['accuracy'])
model.summary()


model_int = Model(inputs=inputs, outputs=model.get_layer(name='Layer3').output)
model_int.summary()

OK, let’s train model!

epochs = 50
hist = model.fit(np.array(trainX), trainY, batch_size=32, epochs=epochs)
plt.plot(range(epochs), hist.history['acc'])
plt.plot(range(epochs), hist.history['loss'])

50 epochs is too long for the training.
Blue: training_acc, green: training_loss

Next check the model performance.

from sklearn.metrics import classification_report
predY = model.predict(testX)
predY = [np.argmax(y) for y in predY]
Y = [np.argmax(y) for y in testY]
print(classification_report(Y, predY))
"""
             precision    recall  f1-score   support

          0       0.67      0.80      0.73       102
          1       0.75      0.61      0.67       115
          2       0.73      0.75      0.74        40

avg / total       0.72      0.71      0.71       257
"""

Hmm it is not so bad…

And I would like to visualize relation ship between last layer and label.

intdata = model_int.predict(testX)
colormap = {0:"red", 1:"blue", 2:"green"}
c = [colormap[i] for i in Y]
plt.scatter(intdata[:,0], intdata[:,1], c=c)

The image is below.

It is not so clear. Am I something wrong? There is a space for improve.
But it is worth to know that google colab is useful for test your code of machine learning.

Whole code can be checked from the following URL.
https://colab.research.google.com/drive/1_m15X_1UoEzPRkFi9vditQ9xM9hemzhv

Make predictive models with small data and visualize it #Chemoinformatics

I enjoyed chemoinformatics conference held in Kumamoto in this week.
The first day of the conference, I could hear about very interesting lecture. That was very basic data handling and visualization tutorial but useful for newbie of chemoinformatics.
I would like to reproduce the code example, so I tried it.

First, visualize training data. It important to know the properties of training data.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVC, SVR
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
points = [(1.5, 2), (2, 1), (2, 3), (3, 5), (4, 3), (7, 6), (9,  10)]
label = [1, -1, 1, 1, -1, -1, 1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'
train_color = list(map(color_map, label))
# check data
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0,15)
plt.ylim(0,15)

Hmm, it seems linear relationship between x and y.

Next, made test data and helper function for visualize data.

test_x = np.linspace(0, 10, 20)
test_y = np.linspace(0, 10, 20)
xx, yy = np.meshgrid(test_x, test_y)
test_x = xx.ravel()
test_y = yy.ravel()
n_data = len(test_x)
test_data = [(test_x[i], test_y[i]) for i in range(n_data)]

def makeplot(test_x, test_y, predict_data):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.scatter(train_x, train_y, c=train_color)
    color = list(map(color_map, predict_data))
    ax.scatter(test_x, test_y, c = color, alpha=0.3)
    fig.show()

OK let’s build model!

# Linear Regression
model = LinearRegression()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)


Oh, Simple linear regressor works very well. ;-)
OK, next how about random forest ?

#Random forest
model = RandomForestClassifier(random_state=np.random.seed(794))
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)


The result is quite different from linear regressor.

Next I checked non linear data.

points = [(1, 5), (2, 10), (2, 3), (3, 5), (5, 4), (7, 6), (9,  10), (11,2), (7, 3)]
label = [1, 1, 1, 1, -1, -1, 1, 1, -1]
def color_map(label):
    if label > 0:
        return 'blue'
    return 'red'

train_color = list(map(color_map, label))
train_x = [ i[0] for i in points]
train_y = [ i[1] for i in points]
plt.scatter(x=train_x, y=train_y, c=train_color)
plt.xlim(0,15)
plt.ylim(0,15)

In this case, linear model does not work well.

#Ridge
model = Ridge()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

How about RF and SVR?

#RandomForest
model = RandomForestRegressor(random_state=np.random.seed(794))
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

#SVR
model = SVR()
predictor = model.fit(points, label)
reg = predictor.predict(test_data)
makeplot(test_x, test_y, reg)

RF

SVR

Non linear regressor can fit non linear data but shows different output.
Model selection is important and to select model, it is needed check training data carefully.
All my experiments can check from google colab.

https://colab.research.google.com/drive/1ywqRlcjEPm7pLP-IeawPTsclb9siuFI4

Any comments and suggestions are appreciated.

standardization of tautomers #RDKit

One of the hot topic of new version of RDKit is an integration of MolVS which is tool for molecular standardization.
Molecular standardization is important for not only chemist but also chemoinformatist. Because tautomer shows different representation of molecule and it will be affect accuracy of QSAR models.
I wrote molecular standardization tools named ‘MolVS’ before and MolVS is an another library at the time. Now we can call molvs from native RDKit.
I used 2-hydroxy prydine as an example.

from rdkit import Chem
from rdkit import rdBase
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

rdBase.rdkitVersion

from rdkit.Chem import MolStandardize

smi1 = 'c1cccc(O)n1'
mol1 = Chem.MolFromSmiles(smi1)
smi2 = 'C1=CC(=O)NC=C1'
mol2 = Chem.MolFromSmiles(smi2)

Draw.MolsToGridImage([mol1, mol2])

Same formula but different structure.

Standardization method is very simple.

stsmi1 = MolStandardize.canonicalize_tautomer_smiles(smi1)
stsmi2 = MolStandardize.canonicalize_tautomer_smiles(smi2)

Draw.MolsToGridImage([Chem.MolFromSmiles(stsmi1), Chem.MolFromSmiles(stsmi2)])

Also it is easy to get possible tautomers from a smiles. And MolStandarize class has many method. It is very useful for data preprocessing I think.

tautomers = MolStandardize.enumerate_tautomers_smiles(smi1)
print(tautomers)
>{'O=c1cccc[nH]1', 'Oc1ccccn1'}
dir(MolStandardize)
['MolVSError',
 'StandardizeError',
 'Standardizer',
 'ValidateError',
 'Validator',
,,,,
 'canonicalize_tautomer_smiles',
 'charge',
 'division',
 'enumerate_tautomers_smiles',
 'errors',
 'fragment',
 'log',
 'logging',
 'metal',
 'normalize',
 'print_function',
 'standardize',
 'standardize_smiles',
 'tautomer',
 'unicode_literals',
 'utils',
 'validate',
 'validate_smiles',
 'validations']

I uploaded the snippet to my repo. It can read from following URL.
https://nbviewer.jupyter.org/github/iwatobipen/chemo_info/blob/master/rdkit_notebook/new_fp_generator.ipynb

P.S.
I will go to Kumamoto to participate chemoinformatics conference tomorrow. I hope I can have many fruitful discussions.